Computational Methods To Study The Structure and Dynamics of Biomolecules and Biomolecular Processes


Springer Series on Bio- and Neurosystems 8

Adam Liwo Editor

Computational Methods
to Study the Structure
and Dynamics of
Biomolecules and
Biomolecular Processes
From Bioinformatics to Molecular
Quantum Mechanics
Second Edition
Springer Series on Bio- and Neurosystems

Volume 8

Series editor
Nikola Kasabov, Knowledge Engineering and Discovery Research Institute,
Auckland University of Technology, Penrose, New Zealand
The Springer Series on Bio- and Neurosystems publishes fundamental principles
and state-of-the-art research at the intersection of biology, neuroscience, informa-
tion processing and the engineering sciences. The series covers general informatics
methods and techniques, together with their use to answer biological or medical
questions. Of interest are both basics and new developments on traditional methods
such as machine learning, artificial neural networks, statistical methods, nonlinear
dynamics, information processing methods, and image and signal processing. New
findings in biology and neuroscience obtained through informatics and engineering
methods, topics in systems biology, medicine, neuroscience and ecology, as well as
engineering applications such as robotic rehabilitation, health information tech-
nologies, and many more, are also examined. The main target group includes
informaticians and engineers interested in biology, neuroscience and medicine, as
well as biologists and neuroscientists using computational and engineering tools.
Volumes published in the series include monographs, edited volumes, and selected
conference proceedings. Books purposely devoted to supporting education at the
graduate and post-graduate levels in bio- and neuroinformatics, computational
biology and neuroscience, systems biology, systems neuroscience and other related
areas are of particular interest.

More information about this series at http://www.springer.com/series/15821


Editor
Adam Liwo
Faculty of Chemistry
University of Gdańsk
Gdańsk, Poland

ISSN 2520-8535 ISSN 2520-8543 (electronic)


Springer Series on Bio- and Neurosystems
ISBN 978-3-319-95842-2 ISBN 978-3-319-95843-9 (eBook)
https://doi.org/10.1007/978-3-319-95843-9

Library of Congress Control Number: 2018948713

1st edition: © Springer-Verlag Berlin Heidelberg 2014


2nd edition: © Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface to the Second Edition

In silico studies of biomolecular systems are now routinely performed to aid
experiment, as well as to gain knowledge of these systems and the processes that
occur in them when experiment is too costly and laborious
or yields only fragmentary information (e.g., the details of protein dynamics). Such
studies constitute a truly interdisciplinary field comprising quantum
mechanics, molecular physics, molecular biology, numerical mathematics, and
computer science, which makes it virtually impossible for anyone to be an expert in
all these diverse domains.
The motivation behind the shape and structure of the book, starting from its first
edition published 4 years ago, was the old Latin proverb “Verba
docent, exempla trahunt” or, slightly rephrased, it is best to learn by looking at
good examples. Therefore, this book is a collection of chapters written by leading
scientists in the field, who are developers of the methods or experts in applying
existing methods to solve concrete problems. As in the first edition, the
chapters are grouped into four thematic sections (methodology, applications of
molecular simulations, bioinformatics, and molecular quantum mechanics), plus the
introduction written by Harold A. Scheraga, one of the pioneers of the
application of theoretical methods to the study of biological systems. The book is
addressed both to end users and to method developers: researchers who are starting to
apply or develop computational methods can learn, from the case studies
reported in the successive chapters, how to proceed and how to avoid errors, while
advanced researchers in the field can pick up good solutions.
The considerable attention that the first edition of the book received was the motivation
to work on the second one. Because the field is advancing rapidly, many chapters were
updated and often extended in scope. These are the chapters authored by the scientists
from the laboratories of Andrzej Koliński, Mariusz Makowski, Joanna Trylska,
Ulrich Hansmann, Marek Cieplak, Marta Pasenkiewicz-Gierula, Sławomir Filipek,
Anders Irbäck, Patrick Senet, Istvan Simon, Irena Roterman, and Giovanni La Penna.
Two more chapters have been added, one about all-atom MD studies of peptide
aggregation, authored by Maksim Kouza, Andrzej Kolinski, Irina Buhimschi, and
Andrzej Kloczkowski, and another one, pertaining to the bioinformatics section,
about protein secondary structure assignment and dihedral angle prediction,
authored by Eshel Faraggi and Andrzej Kloczkowski. With these significant modifications
and additions, the book will, hopefully, continue to be useful to the scientific
community.

Gdańsk, Poland Adam Liwo


April 2018
Preface to the First Edition

Since the second half of the twentieth century, machine computations have played a
steadily increasing role in science and engineering. Computer simulations are
particularly important in studying biological systems at the molecular level, because
they are often the only way to get an idea of the behavior of the whole system. The
differences in timescale and size scale, as well as in the required accuracy of
description, demand the use of different approaches, from comparative analysis of
sequence and structural databases or analysis of the networks of interdependence
between cell components and processes, through coarse-grained modeling where
individual molecules come into play, although at an approximate level, to atomically
detailed simulations and, finally, molecular quantum mechanics.
Aside from contributing to our understanding of the complex machinery of living
cells and organisms, the computation of the three-dimensional structure and dynamic
behavior of biomacromolecules and their complexes with ligands is slowly
becoming an alternative to expensive screening experiments, which are vital in the
search for lead compounds in drug design. The variety of available techniques has made
it necessary to set up systems for testing the progress of existing approaches and
the quality of new ones. For the prediction of protein structure, such a system,
known as the Community-Wide Experiment on the Critical Assessment of Techniques
for Protein Structure Prediction (CASP; see http://www.predictioncenter.org), was
established in 1994 by John Moult and colleagues, and the
10th edition of this experiment was held in 2012. Similar systems to test the
performance of protein-docking algorithms (CAPRI), the prediction of crystal structures
of small organic molecules, and the prediction of RNA structures
were established later, following the successful example of CASP. Consequently,
the computational techniques are constantly subject to rigorous verification.
This book provides an overview of modern computer-based techniques for
calculating the structure, properties, and dynamics of biomolecules and biomolecular
processes. Its 22 chapters have been contributed by leading scientists from all
over the world and address computer simulation techniques for studying biological
phenomena from the perspective of both methodology and applications. The
chapters are grouped into four thematic sections: the methodology of molecular
simulations, applications of molecular simulations, bioinformatics methods and the use
of experimental information in molecular simulations, and selected applications of
molecular quantum mechanics.
The introductory chapter (Chapter 1) has been written by Harold A. Scheraga,
one of the pioneers of simulation studies of biomacromolecules, whose
Empirical Conformational Energy Program for Peptides (ECEPP) was the very first
software to compute stable conformations of polypeptide chains by using a
physics-based force field. This chapter traces the evolution of investigations
of the structures of proteins and other biomolecules, from early physicochemical
experiments on these molecules, such as pH titrations and hydrodynamic measurements,
through the atomically detailed models of Pauling and Corey and the early
description of protein energy surfaces by using hard-sphere potentials, to the
development of modern force fields, both atomically detailed and coarse-grained,
the latter having the advantage of being able to treat larger systems.
Implicit treatment of solvent effects, including electrostatic effects at the level of the
Poisson–Boltzmann equation, is also discussed. Selected applications of
both all-atom and coarse-grained force fields to biologically related problems
are described. In summary, this chapter is an excellent introduction to the problems
addressed in detail in other chapters of the book.
Chapters 2–9 address the methodology of molecular simulations. Chapters 2 and 3
from the group of Andrzej Kolinski, one of the leading developers of coarse-
grained models for protein structure and dynamics, discuss the coarse-grained
models of protein structure in the context of applications in protein structure pre-
diction and simulations of protein dynamics, respectively. Various types of
coarse-grained models such as knowledge-based models, including the SICHO and
CABS models developed in the Kolinski group, the physics-based models,
including the UNRES model developed in the Liwo and Scheraga groups, and
simpler models such as the elastic network models and their applications, are
discussed. These two chapters are an excellent summary of the state of the art and
future perspectives of the coarse-grained approaches to protein structure. Chapter 4,
written by Mariusz Makowski, describes the development of his fully physics-based
coarse-grained potentials for side chain–side chain interactions. This chapter
enables the reader to get an idea as to how much effort is required to develop a
reliable physics-based coarse-grained force field and can be very instructive to those
new in coarse graining. Chapter 5, from the Joanna Trylska group, is a compre-
hensive review of the coarse-grained models of nucleic acids and protein–DNA
complexes and their applications. Chapter 6, written by Yury Vorobjev, addresses
the problem of implicit treatment of protein–solvent interactions. This chapter
introduces a rigorous thermodynamic treatment of the solvent contribution to the
free energy of proteins in solution. Models of the electrostatic contribution to the
free energy, which are based on the Poisson–Boltzmann equation and its solution
through volume integration and the dielectric-surface integration developed by the
author, as well as the simplified generalized Born model, are discussed. The
computation of the free energy of cavity formation is also discussed. Applications
for the calculation of pKa values of ionizable groups in proteins and simulations of
protein conformation are presented. Chapter 7 from the Yuko Okamoto group
discusses optimization of force field parameters.
The last two chapters of the simulation methods part of the book are devoted to
techniques for conformational search and dynamics. Chapter 8 from the group of
Ulrich Hansmann, who is one of the leading developers of conformational sampling
techniques, discusses approaches for the enhancement of the capability of Monte
Carlo and molecular dynamics methods to search the conformational space. The
theory and applications of generalized ensemble sampling methods, including the
widely used replica-exchange method and multicanonical sampling, are discussed.
In Chapter 9, written by Alfredo Cardenas, methods for constructing an entire
trajectory from short, independently simulated fragments are discussed, with
emphasis on the milestoning method developed largely by the author and Ron Elber.
These approaches make it possible to parallelize the otherwise serial task of computing a
dynamic trajectory of a system by first converting the initial-value problem
into minimization of the action of the system, which is a parallelizable boundary-value
problem, and then determining the timescale of subsequent events by
using, e.g., the milestoning method. Such an approach is likely to become a viable
alternative to, if not a replacement for, molecular dynamics because of its potential to be
implemented on distributed computing architectures.
The next section of the book, composed of Chapters 10–15, is devoted to biological
applications of molecular simulation techniques. In Chapter 10, written by
Marek Cieplak, the application of structure-based (Gō-like) models of proteins to
simulating the mechanostability of virus capsids is discussed. A comprehensive review
of modeling lipid membranes by means of all-atom molecular dynamics is provided
in Chapter 11 from the Marta Pasenkiewicz-Gierula group. This chapter is followed
by a review of the molecular modeling of membrane proteins (Chapter 12) contributed by the
Sławomir Filipek group. Chapters 13 and 14, from the Anders Irbäck and Sylwia
Rodziewicz-Motowidło groups, respectively, discuss simulations of amyloid formation.
Finally, Chapter 15 from the Patrick Senet group discusses the application
of molecular dynamics to the study of functionally important motions of the human
Hsp70 chaperone. A procedure for verification of the calculated dynamic profiles
based on neutron-scattering measurements is also outlined.
Chapters 16–19 describe examples of the use of structural databases or experimental
information in molecular simulations, a topic commonly termed bioinformatics.
Chapter 16, contributed by the Istvan Simon group, addresses the important
issue of intrinsically disordered proteins, the discovery of which has overthrown the
old paradigm that a protein must have a well-defined 3D structure to exert its
biological function. The authors give a comprehensive overview of bioinformatics
methods for the prediction of intrinsically disordered regions from amino acid
sequence of a protein. The importance of the topic is best demonstrated by the fact
that blind prediction of intrinsically disordered regions in proteins is a separate
category in recent CASP experiments. In Chapter 17 from the Bogdan Lesyng
group, techniques for finding the alignment (similarities) between protein structures
are discussed and a new method thereof is introduced based on local descriptors. In
Chapter 18, contributed by the Irena Roterman group, a new method for the
x Preface to the First Edition

simulation of protein-folding pathways is described, which is based on sampling


from locally allowed conformational states, the probability function of which is
determined from protein structure statistics in the first stage, chain-energy opti-
mization in the second stage, and finally minimization of the solvent-exposed
surface of nonpolar and maximization of that of polar residues in the last stage.
Chapter 19, contributed by the Jorge Vila group, describes the use of 13C chemical
shifts in modeling protein structure and verification of the quality of the structures
determined by other experimental techniques.
The last section of the book, composed of Chapters 20–22, is devoted to the
application of molecular quantum mechanics. In Chapter 20, contributed by
Giovanni La Penna, various quantum mechanical approaches to calculating the
structures and energetics of peptides and proteins in the presence of metal ions are
reviewed, and the use of explicit and continuum representations of the solvent is discussed.
Chapters 21 and 22 are contributed by the Ewa Broclawik group and
address quantum mechanical approaches to studying redox reactions at non-heme
enzymatic centers (Chapter 21) and the electronic properties of the active forms of
porphyrins (Chapter 22). These two chapters describe the use of the most advanced
computational approaches of molecular quantum mechanics, including coupled
cluster and complete active space perturbation theory (CASPT2).
One purpose of constructing this book was to provide an overview, even a
sketchy one, of the constantly growing field of molecular modeling of biological
systems. The other purpose, especially aimed at younger readers, was to present
modern theory and applications, described by scientists who are actively
working on the subject, in a single book. The reader therefore has an opportunity to
see the theory behind the simulations. This is an important issue nowadays,
when methods are often referred to by software names and versions rather than by
the physics and algorithms. This tendency is very likely to continue, for the same
reason that an average car user does not need to know the details of how the engine
works or what the fuel is composed of. However, to drive a car safely, one must know and
apply traffic rules. Without some knowledge of what lies behind molecular modeling
software, it is only too easy to stray beyond its scope of application or to use it in
a wrong way and draw unjustified conclusions. Hopefully, this book will serve
readers as a collection of stories told by experienced drivers and will provide useful
examples and advice for driving along the still bumpy roads of biomolecular
simulations.

January 2013 Adam Liwo


Acknowledgements

This book is a collaborative effort of many people whose direct contribution or
support enabled it to materialize. I would like to thank all the authors for their
excellent contributions, which often contain their own new results not yet published
elsewhere. In particular, I would like to thank Prof. Harold Scheraga, my postdoctoral
advisor over 20 years ago, for contributing his chapter, which provides an
excellent overview of the field of molecular simulations. I would also like to
express special thanks to those authors who continued to support the book by
updating and extending their chapters. The book would never have appeared without an
invitation from Springer and, in particular, stimulation and encouragement from my
Springer contact, Leontina di Cecco, whom I would like to thank at this point.
Finally, I would like to thank my wife Kasia and my daughter Asia for their
understanding and stimulation during the time I was involved in editing this book.

Contents

Part I Introduction
Simulations of the Folding of Proteins: A Historical Perspective
Harold A. Scheraga

Part II Molecular Simulations: Methodology
Protein Structure Prediction Using Coarse-Grained Models
Maciej Blaszczyk, Dominik Gront, Sebastian Kmiecik, Mateusz Kurcinski,
Michal Kolinski, Maciej Pawel Ciemny, Katarzyna Ziolkowska,
Marta Panek and Andrzej Kolinski
Protein Dynamics Simulations Using Coarse-Grained Models
Sebastian Kmiecik, Jacek Wabik, Michal Kolinski, Maksim Kouza
and Andrzej Kolinski
Physics-Based Modeling of Side Chain–Side Chain Interactions in the UNRES Force Field
Mariusz Makowski
Modeling Nucleic Acids at the Residue–Level Resolution
Filip Leonarski and Joanna Trylska
Modeling of Electrostatic Effects in Macromolecules
Yury N. Vorobjev
Optimizations of Protein Force Fields
Yoshitake Sakae and Yuko Okamoto
Enhanced Sampling for Biomolecular Simulations
Workalemahu Berhanu, Ping Jiang and Ulrich H. E. Hansmann
Determination of Kinetics and Thermodynamics of Biomolecular Processes with Trajectory Fragments
Alfredo E. Cardenas

Part III Molecular Simulations: Applications
Mechanostability of Virus Capsids and Their Proteins in Structure-Based Coarse-Grained Models
Marek Cieplak
Computer Modelling of the Lipid Matrix of Biomembranes
Marta Pasenkiewicz-Gierula and Michał Markiewicz
Modeling of Membrane Proteins
Dorota Latek, Bartosz Trzaskowski, Szymon Niewieczerzał,
Przemysław Miszta, Krzysztof Młynarczyk, Aleksander Dębiński,
Wojciech Puławski, Shuguang Yuan, Agnieszka Sztyler, Urszula Orzeł,
Jakub Jakowiecki and Sławomir Filipek
Peptide Folding in Cellular Environments: A Monte Carlo and Markov Modeling Approach
Daniel Nilsson, Sandipan Mohanty and Anders Irbäck
Molecular Dynamics Studies on Amyloidogenic Proteins
Sylwia Rodziewicz-Motowidło, Emilia Sikorska and Justyna Iwaszkiewicz
Raman and Infrared Spectra of Acoustical, Functional Modes of Proteins from All-Atom and Coarse-Grained Normal Mode Analysis
Adrien Nicolaï, Patrice Delarue and Patrick Senet
Explicit-Solvent All-Atom Molecular Dynamics of Peptide Aggregation
Maksim Kouza, Andrzej Kolinski, Irina Alexandra Buhimschi
and Andrzej Kloczkowski

Part IV Use of Structural Database or Experimental Information in Modeling Protein Structure and Dynamics
Bioinformatical Approaches to Unstructured/Disordered Proteins and Their Complexes
Bálint Mészáros, Zsuzsanna Dosztányi, Erzsébet Fichó, Csaba Magyar
and István Simon
Theoretical and Computational Aspects of Protein Structural Alignment
Paweł Daniluk and Bogdan Lesyng
Fuzzy Oil Drop Model Application – From Globular Proteins to Amyloids
M. Banach, L. Konieczny and I. Roterman
¹³C Chemical Shifts in Proteins: A Rich Source of Encoded Structural Information
Jorge A. Vila and Yelena A. Arnautova
Protein Secondary Structure Assignments and Their Usefulness for Dihedral Angle Prediction
Eshel Faraggi and Andrzej Kloczkowski

Part V Applications of Molecular Quantum Mechanics
When Water Plays an Active Role in Electronic Structure. Insights from First-Principles Molecular Dynamics Simulations of Biological Systems
Giovanni La Penna and Oliviero Andreussi
Electronic Properties of Iron Sites and Their Active Forms in Porphyrin-Type Architectures
Mariusz Radoń and Ewa Broclawik
Bioinorganic Reaction Mechanisms – Quantum Chemistry Approach
Tomasz Borowski and Ewa Broclawik
Index
Part I
Introduction
Simulations of the Folding of Proteins:
A Historical Perspective

Harold A. Scheraga

Abstract Highlights of the evolutionary development of the physical approach to
biology during the last 80 years are traced in this chapter. The historical sequence of
events that led to the introduction of modern simulation methods to treat biological
processes is described in detail.

1 Introduction

The physical approach to biology, ultimately culminating in molecular simulations,
began to be formulated in the two decades preceding the appearance of a book
by Cohn and Edsall [1]. Cohn headed a physical chemistry laboratory at Harvard
Medical School devoted to the study of the effects of ionic strength and pH on
protein solubility, and of the nature of proteins as acids and bases, making use of K.
Linderstrøm-Lang’s theoretical treatment of the titration curve of a protein [2], which was
based on the simultaneously published Debye-Hückel theory [3]. The book by Cohn
and Edsall [1] summarized this early work.
With Svedberg’s development of the ultracentrifuge [4], it became clear that purified
globular proteins, which were soluble in water or saline solutions, were well-defined
macromolecules with molecular weights of many thousands. As a result,
interest began to be focused, for example by Neurath [5] and Oncley [6], on the use
of hydrodynamic measurements, such as viscosity, diffusion, sedimentation velocity,
and flow birefringence of proteins in aqueous solution, to determine the physical size
and shape of such macromolecules, which were assumed to be rigid and which varied
considerably from the near-spherical serum albumin to the very asymmetric rod-like fibrinogen.
According to Edsall [7], Spiegel-Adolf, and Anson and Mirsky, in the 1920s
and 1930s, demonstrated the reversibility of the denaturation of serum albumin and
hemoglobin, respectively, and Hsien Wu published the first good theory of protein
denaturation. Much later, Anfinsen [8] provided convincing evidence for the refolding
of unfolded bovine pancreatic ribonuclease A (RNase A) to the native conformation
in experiments that were later followed up in many other laboratories with many
other proteins.
Later, Scheraga implemented the use of the theory to determine rotational diffusion
constants from flow-birefringence measurements [9] and, with Mandelkern [10],
made use of Flory’s theory of the hydrodynamic properties of solutions of synthetic
polymers [11, 12] to modify Neurath’s and Oncley’s treatment. They showed
that proteins of various asymmetric shapes were not rigid molecules, that they had
asymmetries different from those computed by Neurath and Oncley, and that their
hydrodynamic properties depend not only on their asymmetries but also on their
flexible volumes, which swell considerably upon application of increasing amounts
of denaturing agents such as urea.

H. A. Scheraga
Baker Laboratory of Chemistry and Chemical Biology, Cornell University,
Ithaca, NY 14853-1301, USA
e-mail: has5@cornell.edu

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_1

2 Molecular Treatment of Protein Molecules

Though yielding useful information about globular proteins, the hydrodynamic
experiments [10] could not provide atomic details of these macromolecules. However,
the situation improved considerably with Pauling and coworkers’ proposal
of the α and β structures of proteins, based on intramolecular backbone hydrogen
bonds [13, 14], and with Sanger’s determination of the amino-acid sequence [15]
and disulfide-bond location [16] of insulin, which demonstrated that a protein was
a macromolecule with a unique sequence and covalent structure. These landmark
results were soon followed by Perutz’s crystal structure of hemoglobin [17] and
Kendrew’s crystal structure of myoglobin [18], which clearly verified Pauling’s proposal
of the α-helix.
Whereas Pauling and co-workers had focused on the role of backbone hydrogen
bonds, Laskowski and Scheraga examined the effect of side-chain hydrogen bonds
on the pK’s of the ionizable groups of polar residues [19] and on the reactivity of
covalent bonds, such as peptide bonds and disulfide bonds [20]. The role of nonpolar
side chains in hydrophobic interactions, involving the critical importance of the
aqueous solvent, was later discussed in terms of a statistical mechanical model by
Némethy and Scheraga [21], with further improvements in the treatment by Griffith
and Scheraga [22]. Subsequently, with the development of molecular mechanics (see
Sect. 4), this model was verified by simulations of aqueous solutions of methane [23]
and of other nonpolar solutes [24]. These results confirm the statement of Kendrew
[25] that “it is the spatial relations between the side chains which determine the
chemical behaviour and biological specificity of the protein molecule as a whole”.
Several experiments [26] provided verification of the theoretically computed thermodynamic
parameters for hydrophobic interactions between nonpolar side chains
[21], and for the effect of side-chain hydrogen bonds on protein-protein association
of fibrin monomer in the blood-clotting process [27]. In further studies of the effects
of side-chain hydrogen bonds on pK’s, a series of physicochemical and biochemical
experiments on RNase A located three tyrosyl-aspartate hydrogen bonds [28], which
were found in the subsequently determined crystal structure of this protein [29]. This
led to the attempt to use such distance constraints, initially to avoid steric overlaps
but ultimately to develop empirical force fields, to compute the three-dimensional
structures of proteins [30, 31]. At the same time, Ramachandran and coworkers [32]
developed their famous ϕ, ψ diagram, based on steric overlaps computed with a
hard-sphere potential. The initial computations to determine protein structure with
this potential [30, 31] produced a large set of possible three-dimensional structures
of an octapeptide loop of RNase A [30], and the size of this set was reduced by elim-
inating those with steric overlaps based on the Ramachandran diagram. This set of
remaining structures fell in the regions of the Ramachandran diagram for Pauling’s
β-structures and right- and left-handed α-helical structures.
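The elimination of sterically overlapping structures can be illustrated with a minimal sketch. It assumes a hard-sphere potential that is infinite whenever two non-bonded atoms come closer than the sum of their contact radii and zero otherwise. The radii, the function names, and the toy two-atom "conformations" are illustrative assumptions, not values or code from the original studies [30–32].

```python
import math
from itertools import combinations

# Illustrative contact radii in angstroms (assumed values, not from the chapter).
CONTACT_RADIUS = {"C": 1.5, "N": 1.35, "O": 1.35, "H": 1.0}


def hard_sphere_energy(atoms, bonded=frozenset()):
    """Return math.inf if any non-bonded atom pair overlaps, else 0.0.

    atoms  -- list of (element, (x, y, z)) tuples
    bonded -- set of frozenset({i, j}) index pairs excluded from the check
    """
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        if frozenset((i, j)) in bonded:
            continue  # covalently bonded pairs are allowed to be close
        min_dist = CONTACT_RADIUS[el_i] + CONTACT_RADIUS[el_j]
        if math.dist(pos_i, pos_j) < min_dist:
            return math.inf  # steric overlap: conformation is disallowed
    return 0.0


def filter_conformations(conformations, bonded=frozenset()):
    """Keep only the conformations that are free of steric overlaps."""
    return [c for c in conformations if hard_sphere_energy(c, bonded) == 0.0]


# Toy example: a C...O pair at 2.0 A overlaps (limit 2.85 A); at 3.0 A it does not.
clash = [("C", (0.0, 0.0, 0.0)), ("O", (2.0, 0.0, 0.0))]
ok = [("C", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0))]
allowed = filter_conformations([clash, ok])
```

Because the energy takes only the values zero and infinity, such a filter can reject impossible conformations but cannot rank the allowed ones, which is the limitation taken up in the next section.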

3 Computational Results with a Hard-Sphere Potential

Several interesting conclusions about the conformations of polypeptide chains were
derived from computations based on the use of the hard-sphere potential [33–35];
for example, much of the structural character of proteins, such as the distribution
of the torsional angles ϕ, ψ, χ1 for various residues, results from steric repulsion
between pairs of atoms. Nevertheless, it was recognized very early that a hard-
sphere potential is inadequate to determine stable conformations of a macromolecule
[31]. Hence, much effort was devoted to developing more detailed empirical potential
energy functions for the interactions between pairs of atoms (an all-atom model) to
compute the three-dimensional structures of proteins, initially neglecting the role of
the aqueous solvent, but later including the effects of hydration. An early treatment of
synthetic polymers with an empirical potential was already introduced by de Santis
and coworkers [36].
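The hard-sphere criterion discussed above can be illustrated with a short sketch. It is not the procedure of refs. [30–35]: the contact distances in `MIN_CONTACT` are illustrative placeholder values, and the function names and the `excluded_pairs` argument (for bonded pairs that should not be tested) are assumptions introduced here.

```python
import numpy as np

# Hypothetical minimum contact distances in angstroms (illustrative values only).
MIN_CONTACT = {
    ("C", "C"): 3.0, ("C", "N"): 2.9, ("C", "O"): 2.8,
    ("N", "N"): 2.7, ("N", "O"): 2.7, ("O", "O"): 2.7,
}

def min_contact(elem_a, elem_b):
    """Look up the contact distance for an unordered pair of element symbols."""
    return MIN_CONTACT[tuple(sorted((elem_a, elem_b)))]

def hard_sphere_energy(coords, elements, excluded_pairs=frozenset()):
    """Return 0.0 if no tested pair overlaps, otherwise infinity.

    coords         : (n, 3) array-like of Cartesian coordinates
    elements       : sequence of n element symbols ("C", "N" or "O")
    excluded_pairs : set of (i, j) index pairs with i < j that are skipped
    """
    coords = np.asarray(coords, dtype=float)
    n = len(elements)
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in excluded_pairs:
                continue
            distance = np.linalg.norm(coords[i] - coords[j])
            if distance < min_contact(elements[i], elements[j]):
                return float("inf")  # steric overlap: conformation disallowed
    return 0.0  # no overlap: conformation sterically allowed
```

Because the energy is either zero or infinite, such a potential can only separate allowed from disallowed conformations; it cannot rank the allowed ones, which is the inadequacy noted in the text.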

4 Calculations with Empirical Potential Functions

This was followed by a series of attempts by Brant and Flory [37], Ooi et al. [38],
Gibson and Scheraga [39, 40], Scott et al. [41], Yan et al. [42, 43], Momany et al.
[44, 45], Levitt and Lifson [46], and Hagler et al. [47] to derive improved all-atom
potential functions. Our effort in this regard led to our Empirical Conformational
Energy Program for Peptides (ECEPP) [48], which was subsequently upgraded sev-
eral times as ECEPP/2 [49, 50], ECEPP/3 [51], and ECEPP-05 [52]. Several other
all-atom empirical potentials have since been introduced, for example, CHARMM
[53], AMBER [54], and GROMOS [55]. Efforts continue in many laboratories to
improve the current potentials. These potential functions are augmented by either
explicit or continuum treatments of hydration, e.g., those of Jorgensen et al. [56];
Ooi et al. [57] and Vila et al. [58].
Solvent-mediated electrostatic interactions, which are based on the Poisson-
Boltzmann method, have received considerable attention. This includes Honig’s [59,
60] and Vorobjev’s [61] algorithms for solving the Poisson-Boltzmann equation, as
well as the generalized Born model, which provides an approximate solution of
this equation [62–66].
The multi-dimensional all-atom conformational energy space of a protein con-
tains numerous metastable states with intervening barriers in addition to the global
minimum, which is considered as the representative of the native structure accord-
ing to Anfinsen’s thermodynamic hypothesis [8]. Consequently, global-energy-
minimization procedures, including a menu of algorithms to surmount high-energy
metastable states [67] were developed; Monte Carlo (MC) [67] and molecular dynam-
ics (MD) [68–70] searches were also introduced to identify native structures. To
obtain a large amount of computer time for these searches, Shirts and Pande co-
opted computers world-wide to use their otherwise-idle time [71]. Recently, a dedi-
cated machine (ANTON) has been developed for rapid calculations of MD folding
trajectories of proteins [72].
With the early-developed all-atom potential functions, several types of global
optimization calculations were carried out primarily to determine polypeptide and
protein properties. These include the identification of the right- or left-handed prefer-
ence of the α-helices [38, 42, 43], the structures of the linear pentapeptide methionine
enkephalin [73, 74], the cyclic decapeptide gramicidin S [75] with a ring-closing con-
straint [76], later validated by 2D NMR experiments of Mirau and Bovey [77], the
36-residue villin headpiece [78], the 46-residue protein A [79] with an interesting
result that the folding pathway includes a metastable mirror image, the origin of
which is currently under investigation in terms of ¹³Cα chemical shifts [80, 81],
triple-helical collagen models with sequences poly(Gly-X-Y) [82–85] where X and
Y are largely proline and hydroxyproline, respectively, an enzyme-substrate complex
[86, 87], and crystalline cellulose [88].

5 Coarse-Grained Treatment of Proteins

Whereas an all-atom approach could be used for simulating the folding of protein A
[79], the presently-available computer facilities cannot be used for larger proteins.
Therefore, a coarse-grained approach is used [89, 90] to extend the computational
ability to proteins of up to several hundred amino acid residues. Early
efforts to use such an approach, but applied to small proteins, are those of Levitt and
Warshel [91] and Pincus and Scheraga [92].
As cited by Sieradzan et al. [93], a UNited RESidue (UNRES) model was devel-
oped in our laboratory to compute the structures of large native proteins [93–114]. In
the UNRES model, a polypeptide chain is represented as a sequence of α-carbon (Cα)
atoms with attached united side chains (SC’s) and united peptide groups (p’s)
positioned halfway between two consecutive Cα’s. Only the united side chains and united
peptide groups act as interaction sites, while the Cα atoms assist only in the definition
of geometry (Fig. 1). The effective energy function is defined as the restricted free
energy (RFE) or the potential of mean force (PMF) of the chain constrained to a
given coarse-grained conformation along with the surrounding solvent [109]. This
effective energy function is expressed by Eq. (1).

Fig. 1 The UNRES model of polypeptide chains. The interaction sites are peptide-group centers
(p) and side-chain centers (SC) attached to the corresponding α carbons with different Cα···SC
bond lengths, $d_{SC}$. The peptide groups are represented as gray circles, and the side chains are
represented as gray ellipsoids of different sizes. The α-carbon atoms are represented by small open
circles. The geometry of the chain can be described either by the virtual-bond vectors ($dC_i$ from
$C^{\alpha}_i$ to $C^{\alpha}_{i+1}$, $i = 1, 2, \ldots, n-1$, and $dX_i$ from $C^{\alpha}_i$ to $SC_i$, $i = 1, 2, \ldots, n-1$),
represented by thick lines, where n is the number of residues, or in terms of virtual-bond lengths,
backbone virtual-bond angles $\theta_i$, $i = 1, 2, \ldots, n-2$, backbone virtual-bond-dihedral angles
$\gamma_i$, $i = 1, 2, \ldots, n-3$, and the angles $\alpha_i$ and $\beta_i$, $i = 1, 2, \ldots, n-1$, that describe the
location of a side chain with respect to the coordinate frame defined by $C^{\alpha}_{i-1}$, $C^{\alpha}_i$, and
$C^{\alpha}_{i+1}$. Reprinted with permission from J. Chem. Phys., 115, 2323–2347 (2001). Copyright 2001
American Institute of Physics
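$$
\begin{aligned}
U ={}& w_{SC}\sum_{i<j} U_{SC_i SC_j} + w_{SCp}\sum_{i\neq j} U_{SC_i p_j}
      + w^{VDW}_{pp}\sum_{i<j-1} U^{VDW}_{p_i p_j}
      + w^{el}_{pp}\, f_2(T)\sum_{j<i-1} U^{el}_{p_i p_j} \\
     &+ w_{tor}\, f_2(T)\sum_i U_{tor}(\gamma_i)
      + w_{tord}\, f_3(T)\sum_i U_{tord}(\gamma_i, \gamma_{i+1}) \\
     &+ w_b\sum_i U_b(\theta_i)
      + w_{rot}\sum_i U_{rot}(\alpha_{SC_i}, \beta_{SC_i})
      + w_{bond}\sum_i U_{bond}(d_i) \\
     &+ w^{(3)}_{corr}\, f_3(T)\, U^{(3)}_{corr}
      + w^{(4)}_{corr}\, f_4(T)\, U^{(4)}_{corr}
      + w^{(5)}_{corr}\, f_5(T)\, U^{(5)}_{corr}
      + w^{(6)}_{corr}\, f_6(T)\, U^{(6)}_{corr} \\
     &+ w^{(3)}_{turn}\, f_3(T)\, U^{(3)}_{turn}
      + w^{(4)}_{turn}\, f_4(T)\, U^{(4)}_{turn}
      + w^{(6)}_{turn}\, f_6(T)\, U^{(6)}_{turn}
\end{aligned}
\tag{1}
$$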

where the U’s are energy terms, $\theta_i$ is the backbone virtual-bond angle, $\gamma_i$ is the
backbone virtual-bond-dihedral angle, $\alpha_i$ and $\beta_i$ are the angles defining the location of
the center of the united side chain of residue i (Fig. 1), and $d_i$ is the length of the
ith virtual bond, which is either a Cα···Cα virtual bond or a Cα···SC virtual bond. Each
energy term is multiplied by an appropriate weight, $w_x$, and the terms corresponding
to factors of order higher than 1 are additionally multiplied by the respective
temperature factors [107], which reflect the dependence of the first generalized-cumulant
term in those factors on temperature, as discussed in refs. [107] and [115]. The factors
$f_n$ are defined by Eq. (2).
 
$$
f_n(T) = \frac{\ln\left[\exp(1) + \exp(-1)\right]}
              {\ln\left\{\exp\left[(T/T_\circ)^{n-1}\right] + \exp\left[-(T/T_\circ)^{n-1}\right]\right\}}
\tag{2}
$$

where $T_\circ = 300$ K. The term $U_{SC_i SC_j}$ represents the mean free energy of the
hydrophobic (hydrophilic) interactions between the side chains, which implicitly contains the
contributions from the interactions of the side chain with the solvent. The term $U_{SC_i p_j}$
denotes the excluded-volume potential of the side-chain–peptide-group interactions.
The peptide-group interaction potential is split into two parts: the Lennard-Jones
interaction energy between peptide-group centers ($U^{VDW}_{p_i p_j}$) and the average
electrostatic energy between peptide-group dipoles ($U^{el}_{p_i p_j}$); the second of these terms
accounts for the tendency to form backbone hydrogen bonds between peptide groups
$p_i$ and $p_j$. The terms $U_{tor}$, $U_{tord}$, $U_b$, $U_{rot}$, and $U_{bond}$ are the
virtual-bond-dihedral-angle torsional terms, virtual-bond-dihedral-angle double-torsional terms,
virtual-bond-angle bending terms, side-chain rotamer terms, and virtual-bond-deformation terms,
respectively; these terms account for the local properties of the polypeptide chain. The terms
$U^{(m)}_{corr}$ represent correlation or multibody contributions from the coupling between
backbone-local and backbone-electrostatic interactions, and the terms $U^{(m)}_{turn}$ are
correlation contributions involving m consecutive peptide groups; they are, therefore,
termed turn contributions. The multibody terms are indispensable for reproduction
of regular α-helical and β-sheet structures [98, 101, 116].
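The temperature factors of Eq. (2) are simple to evaluate numerically. The sketch below is only an illustration of that formula; the function name `f_n` and the constant name `T0` are chosen here, and the logarithm in the denominator is rewritten in an algebraically equivalent form to avoid overflow at high temperature.

```python
import math

T0 = 300.0  # reference temperature in kelvin, as stated after Eq. (2)

def f_n(T, n):
    """Temperature factor f_n(T) of Eq. (2) for absolute temperature T and order n."""
    x = (T / T0) ** (n - 1)
    numerator = math.log(math.exp(1.0) + math.exp(-1.0))
    # ln[exp(x) + exp(-x)] = x + ln[1 + exp(-2x)] for x >= 0 (avoids overflow)
    denominator = x + math.log1p(math.exp(-2.0 * x))
    return numerator / denominator
```

By construction the factor equals 1 at the reference temperature for every n, falls below 1 above it, and exceeds 1 below it (for n > 1), so the higher-order terms of Eq. (1) are weakened on heating.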
The energy-term weights are determined by force-field calibration to reproduce
the structure and folding thermodynamics of selected training proteins [107, 116].
Initially, the UNRES surface was searched with a conformational space annealing
(CSA) algorithm [117] to identify the region of the global minimum at constant
temperature, i.e., with constant values of the $f_n(T)$ terms of Eq. (1). The
UNRES representation of this region was then converted to an all-atom one [118, 119], and
the global-minimization search was continued with an all-atom potential. Alternatively,
a procedure developed by Elber and coworkers [120, 121], in which the action is
minimized with appropriate constraints, can convert UNRES trajectories to all-atom
trajectories. Later, canonical molecular dynamics was used (see Sect. 7) to identify
the global-minimum conformations in the UNRES representation. The $f_n(T)$ terms
were also included in Eq. (1) [107, 122], based on Kubo’s cumulant series in powers
of $(RT)^{-1}$ [123], in order to introduce the entropy and thereby evaluate
thermodynamic quantities and proper folding temperatures as well as the native structure and
stable intermediates leading to it.
Early calculations were carried out for single-chain proteins. Subsequently, the
methodology was extended to apply UNRES to molecular dynamics calculations
(see Sect. 7) of multi-chain proteins [124].

6 Application of UNRES for Computation of Structure

Successful early applications to compute structure with UNRES and CSA [95, 97,
117] encouraged us to submit predicted protein structures for evaluation in the CASP
(Critical Assessment of Structure Prediction) blind tests. An example of our initial
submissions [125] to CASP is shown in Fig. 2.

7 Application of UNRES for Computation of Folding Pathways

In canonical MD applications to simulate protein folding, it is necessary to take
very small time steps, of the order of a femtosecond, in advancing the time to obtain
stable trajectories for evaluating atomic velocities and coordinates. Except for very
small ones, proteins fold on a time scale of milliseconds or longer. Currently
available algorithms and computer hardware, aside from the recently introduced
ANTON [72], cannot accommodate such long-time simulations. Therefore, with the
success in obtaining the structure of HDEA [125], MD simulations were attempted
with UNRES [105, 106]. With this coarse-grained force field, the fast degrees of
motion are averaged out, making it possible to maintain a stable trajectory further
into the time scale.
The Langevin equation (Eq. 3) is solved [105, 106] with a version of the stochastic
velocity Verlet algorithm [107] to obtain the generalized coordinates q and velocities
q̇ as a function of time, where the qs are composed of the virtual-bond vectors of
Fig. 1.

Fig. 2 Structure of HDEA from a CASP blind test. Superposition of the crystal (red) and predicted
(yellow) structures. Top: Helices 3, 4 and 5 (between residues D25 and I85) are indicated as H-3,
H-4 and H-5, respectively. Bottom: Helices 2 and 3 (between residues W16 and K42) are indicated
as H-2 and H-3, respectively. Reproduced with permission from Figs. 1 and 2 of ref. [125].
Copyright 1999 Proceedings of the National Academy of Sciences, U.S.A.

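$$
\left(\mathbf{A}^T \mathbf{M} \mathbf{A} + \mathbf{H}\right)\ddot{\mathbf{q}}
  = -\nabla_{\mathbf{q}} U(\mathbf{q})
    - \mathbf{A}^T \boldsymbol{\Gamma} \mathbf{A}\,\dot{\mathbf{q}}
    + \mathbf{A}^T \mathbf{f}_{rand}
\tag{3}
$$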

The quantity $(\mathbf{A}^T \mathbf{M} \mathbf{A} + \mathbf{H})$ is the inertia matrix, where A is the matrix of a linear
transformation from the space of generalized coordinates and velocities (q and q̇) to
the space of the Cartesian coordinates and velocities of the interacting sites, M is the
diagonal matrix of the masses of the interacting sites, and H is the part of the inertia
matrix that corresponds to the internal (stretching) motions of the virtual bonds. The
quantity U is the UNRES potential energy (Eq. 1), $\nabla_{\mathbf{q}} U$ is the gradient of U with
respect to q, Γ is the friction matrix (with elements given by Stokes’ law), and the random forces are

$$
\mathbf{f}_{rand} = \sqrt{\frac{2RT}{\delta t}}\; \mathbf{A}^T \boldsymbol{\Gamma}^{1/2}\, \mathbf{N}(0, 1)
\tag{4}
$$

where R is the universal gas constant, T is the absolute temperature, δt is the inte-
gration time step, and N(0, 1) is a 3D vector whose components are sampled inde-
pendently from a normal distribution with zero mean and unit variance. Together,
the last two terms (friction and random forces) of Eq. (3) constitute a thermostat that
maintains the average temperature at the preset value.
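The roles of the friction and random-force terms can be seen in a much simplified sketch. It is not the UNRES integrator of refs. [105–107]: it assumes Cartesian coordinates (A equal to the identity matrix, H omitted), a scalar or per-coordinate friction coefficient `gamma`, and a plain velocity-Verlet-like update; the function names, the unit choice for `R`, and the arguments are all assumptions made for illustration.

```python
import numpy as np

R = 8.314462618e-3  # universal gas constant in kJ/(mol K); unit choice is an assumption

def langevin_step(x, v, force, mass, gamma, T, dt, rng):
    """Advance positions x and velocities v by one time step dt.

    force : callable returning -dU/dx at given positions
    gamma : friction coefficient(s), a diagonal stand-in for the friction matrix
    """
    # Random force in the spirit of Eq. (4), specialised to A = I
    f_rand = np.sqrt(2.0 * R * T * gamma / dt) * rng.standard_normal(x.shape)
    a = (force(x) - gamma * v + f_rand) / mass
    v_half = v + 0.5 * dt * a            # first half-kick
    x_new = x + dt * v_half              # drift
    a_new = (force(x_new) - gamma * v_half + f_rand) / mass
    v_new = v_half + 0.5 * dt * a_new    # second half-kick
    return x_new, v_new

def run(x0, v0, force, mass, gamma, T, dt, n_steps, seed=0):
    """Integrate n_steps and return the final positions and velocities."""
    rng = np.random.default_rng(seed)
    x, v = np.array(x0, dtype=float), np.array(v0, dtype=float)
    for _ in range(n_steps):
        x, v = langevin_step(x, v, force, mass, gamma, T, dt, rng)
    return x, v
```

With `gamma = 0` and `T = 0` the update reduces to ordinary velocity Verlet, and with friction but no random force the energy simply decays; the balance between the two terms is what sets the average temperature.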
Applications of Eq. (3) to several single-chain [104] and multiple-chain [124] protein
systems led to the conclusion [107] that the UNRES/MD approach can facilitate
microsecond and, possibly, millisecond simulations of protein folding and,
consequently, simulation of the folding process of proteins in real time.
In addition to the computation of protein structure, UNRES/MD has been applied
to the calculation of folding kinetics [126] and, with the introduction of temperature
dependence in Eq. (1) [107], UNRES/MD has been applied to the calculation of
thermodynamic properties.
To speed up and extend the exploration of conformational space, replica exchange
molecular dynamics (REMD) and multiplexed replica exchange molecular dynamics
(MREMD) have been introduced.
In the REMD method [127–129], M canonical MD simulations are carried
out simultaneously, each one at a different temperature. Initially the temperatures
increase with the sequential number of the simulation (trajectory). After every M
steps, an exchange of temperatures between neighboring trajectories is attempted,
the decision about the exchange being based on a Metropolis criterion in the
quantity Δ. If Δ ≤ 0, the two temperatures are exchanged; otherwise, the exchange is
performed with probability exp(−Δ).
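The exchange test can be sketched as follows. The text does not give the expression for Δ, so the form used here, the standard replica-exchange expression Δ = (β_j − β_i)(U_i − U_j) with β = 1/(RT), is an assumption, as are the function names and the unit choice for `R`.

```python
import math
import random

R = 8.314462618e-3  # universal gas constant in kJ/(mol K); unit choice is an assumption

def exchange_delta(T_i, U_i, T_j, U_j):
    """Delta for swapping the temperatures of two neighboring trajectories.

    U_i and U_j are the current potential energies of the trajectories
    running at temperatures T_i and T_j, respectively.
    """
    beta_i = 1.0 / (R * T_i)
    beta_j = 1.0 / (R * T_j)
    return (beta_j - beta_i) * (U_i - U_j)

def attempt_exchange(T_i, U_i, T_j, U_j, rng=random):
    """Return True if the temperature exchange is accepted."""
    delta = exchange_delta(T_i, U_i, T_j, U_j)
    if delta <= 0.0:
        return True  # always accepted
    return rng.random() < math.exp(-delta)  # accepted with probability exp(-delta)
```

With this sign convention an exchange is always accepted when the trajectory at the higher temperature currently has the lower energy, which moves low-energy conformations toward the low-temperature end of the ladder.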
The multiplexed variant (MREMD) developed by Rhee and Pande [130] differs
from the original REMD method in that several trajectories are run at a given tem-
perature. Each set of trajectories run at a different temperature constitutes a layer.
Exchanges are attempted not only within a single layer but also between layers. It has
been demonstrated [131] that MREMD increases the power of REMD considerably,
and convergence of the thermodynamic quantities is achieved much faster.

8 Application of UNRES to Biological Problems

Many biological problems involve proteins of large size that cannot be simulated with
an all-atom force field. For this reason, UNRES has been used to
treat the following biological problems: the aggregation of Aβ [132], the structure
of PICK1 [133], and the opening and closing of the Hsp70 chaperone [134].
Figure 3 illustrates the starting configuration in which one monomer was removed
from the native structure of a 7-chain fibril of Aβ and arranged in an extended confor-
mation. At various times in the simulation, the monomer undergoes conformational
changes, including formation of a partial α-helix, and ends up in a hairpin conforma-
tion bound as the native structure of the fibril. In Fig. 4, even though the PDZ domain
of PICK1 was started from opposite sides of the BAR domains in the simulation, both
starting structures end in the same stable configuration, namely, near the center of the
concave surface of the BAR domains. UNRES does not include parameters for ADP and
ATP, which participate in the configurational changes of the substrate binding domain
(SBD) and the nucleotide binding domain (NBD). Therefore, the simulations of the opening
and closing of an Hsp70 chaperone were carried out
by constraining the NBD to the structures with ADP and ATP, respectively,

Fig. 3 Selected snapshots, between t = 0.01 and 14.70 ns, along a representative trajectory of an
initially fully-extended monomer at t = 0 binding to a 7-chain fibril of Aβ. After forming an
α-helical portion at t = 0.14 ns, the monomer docks at t = 0.26 ns, with native orientation. At t =
1.8 ns, the N-terminal strand is locked into the template. Meanwhile the C-terminus, which is still
free to move, bends and makes a β strand with itself. This conformation is very stable but, at t =
14.49 ns, the β strand is finally disrupted. Shortly after that, at t = 14.7 ns, the monomer binds with
the native conformation. Reprinted from J. Mol. Biol., 404 (3), A. Rojas, A. Liwo, D. Browne, H.A.
Scheraga, Mechanism of Fiber Assembly: Treatment of Aβ Peptide Aggregation with a Coarse-
Grained United-Residue Force Field, 537–552 (2010), with permission from Elsevier

bound. SBD is split into the β-sheet (SBD-β) and α-helical (SBD-α) subdomains,
and NBD consists of NBD-I and NBD-II subdomains. Binding of ATP to the NBD
facilitates a transformation (Fig. 5) in which a substrate protein is released from the
SBD.

9 Coarse-Grained Model of Nucleic Acids

In a landmark experiment, Paul Doty and Julius Marmur provided experimental sup-
port for the Watson and Crick double-helical model of DNA by demonstrating that
DNA could be unfolded and then re-folded thermally [135]. In order to simulate these
processes, and also to be able to treat protein-DNA interactions (in connection with
UNRES), it is necessary to have a coarse-grained model of nucleic acids. Several
coarse-grained models of nucleic acids have already been reported [136–141]. We
have also formulated a coarse-grained model of nucleic acid bases [142]. Each base
is represented by 3–5 interaction centers. The interactions between bases are divided
into a van der Waals component modeled with a Lennard-Jones 12-6 energy function

Fig. 4 Initial structures with the PDZ domain pulled away from the BAR domains, selected for
subsequent UNRES simulations of the PICK1 dual-BAR construct. During the simulation, both
initial structures of the PDZ domain end in the same stable configuration near the center of the
concave surface of the BAR domains. Reprinted from J. Mol. Biol., 405 (1), Y. He, A. Liwo, H.
Weinstein, H.A. Scheraga, PDZ Binding to the BAR Domain of PICK1 is Elucidated by
Coarse-Grained Molecular Dynamics, 298–314 (2011), with permission from Elsevier

and electrostatic components modeled with a multipole-multipole interaction energy.
A model with a three-center cytosine, four-center guanine, four-center thymine, and
five-center adenine satisfactorily reproduces the canonical Watson-Crick hydrogen
bonding and stacking interaction energies of the all-atom AMBER model. The
computation time with the coarse-grained model is reduced sevenfold compared with
that of the all-atom model [143].
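For readers unfamiliar with the Lennard-Jones 12-6 form mentioned above, a minimal sketch follows. It shows only the generic functional form; the parameter names `epsilon` (well depth) and `sigma` (zero-crossing distance) are the usual textbook symbols and the values in the usage note are arbitrary, not parameters of the model in ref. [142].

```python
def lennard_jones_12_6(r, epsilon, sigma):
    """Generic 12-6 energy: 4*epsilon*[(sigma/r)**12 - (sigma/r)**6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lj_minimum_distance(sigma):
    """Separation at which the 12-6 energy reaches its minimum, -epsilon."""
    return 2.0 ** (1.0 / 6.0) * sigma
```

For example, with `epsilon = 0.5` and `sigma = 3.4` the energy vanishes at `r = 3.4` and reaches its minimum value of `-0.5` at `r = lj_minimum_distance(3.4)`.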
Further work has been carried out to develop a complete coarse-grained model of
nucleic acids, with which it has been possible to demonstrate that averaged interac-
tions between base dipoles together with backbone stiffness but without specific base
pairing, are sufficient to form double-helical structures of DNA and RNA molecules
[143]. Additionally, local interactions determine helix handedness and direction of
strand packing. This result, and earlier research on reduced protein models, suggests

Fig. 5 Illustration of the rotation of NBD-I with respect to NBD-II of Hsp70, which brings SBD-
β close to the backside of NBD-II. a Hsp70 structure in which NBD-I crosses NBD-II; b NBD
structure in which NBD-I moves closer to a parallel orientation with respect to NBD-II; c The
structure after 10 ps of simulations, in which NBD-I is nearly parallel to NBD-II. The “switch”
(SW) α-helix, which runs from E369 to G380, rotates with respect to NBD-II, following the rotation
of NBD-I, with which it is associated through interactions with the E171–Y179 “holder” (HO) α-
helix. The SW α-helix is connected to SBD-β by a linker. Consequently, its motion switches the
orientation of SBD-β from the top of NBD to the back side of this domain and brings it closer to
NBD-I; it also brings the linker segment closer to the β-sheet of NBD-II and enables it to join it
as a β-strand in the Hsp70 structure, thus fixing SBD at a short distance from the NBD. Reprinted
with permission from the Journal of Chemical Theory and Computation, 8, 1750–1764 (2012).
Copyright 2012 American Chemical Society

that mean-field multipole-multipole interactions are the principal factors responsible
for the formation of regular structure of biomolecules.

10 Simulation of Large Complexes

Other simulations of large structures besides Aβ, PICK1 and Hsp70 have been carried
out. Klaus Schulten has treated proteins in the lipid environment of a membrane, the
structure of tobacco mosaic virus, and several other viruses [144]. The feasibility
of such simulations was enhanced by the development in Schulten’s group of the
NAMD parallelized software [144].
Brooks and coworkers obtained an atomically detailed picture of functionally
important structural rearrangements that occur during translocation by combin-
ing structural data for a ribosome from X-ray crystallography and cryo-electron
microscopy with dynamic models based on elastic network normal mode analysis
[145].
Jernigan and coworkers used a coarse-grained elastic network model to explore
how well conformational transitions in proteins can be predicted by normal mode
motions [146]. They concluded that the applicability of an elastic network model to
explore conformational changes depends strongly on how collective the transition is.
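As a rough illustration of how an elastic network yields normal modes, the sketch below builds a Gaussian network model, one common and particularly simple elastic network. It is not necessarily the model used in refs. [145, 146]; the 7 Å cutoff, the unit spring constant, and the function names are assumptions made for this example.

```python
import numpy as np

def kirchhoff_matrix(coords, cutoff=7.0):
    """Connectivity (Kirchhoff) matrix: sites closer than cutoff are joined by a spring."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contact = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))  # diagonal = number of contacts
    return kirchhoff

def slowest_modes(coords, cutoff=7.0, n_modes=3):
    """Return eigenvalues and eigenvectors of the n_modes slowest non-trivial modes."""
    eigenvalues, eigenvectors = np.linalg.eigh(kirchhoff_matrix(coords, cutoff))
    # Index 0 is the trivial zero mode of a connected network and is skipped.
    return eigenvalues[1:n_modes + 1], eigenvectors[:, 1:n_modes + 1]
```

The modes with the smallest non-zero eigenvalues are the most collective ones, which is why the usefulness of such models depends on how collective the transition of interest is.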

11 Simulations of Structural Fluctuations

In view of the possible involvement of structural fluctuations during protein
folding, molecular dynamics (MD) simulations have been carried out to determine the
time dependence, initially, of fluctuations of proteins in their native state [147–149].
From such MD simulations, it was found that the mean-square displacement of the
backbone dihedral angles γ of each amino-acid residue of Fig. 1 of a native protein
increases as a power law of time, $t^{\alpha}$, with an exponent α between 0.08 and 0.39 at
300 K [147], i.e., the motion is subdiffusive (α = 1 corresponds to Brownian
diffusion). Residues with low exponents, e.g., 0.08, are located mainly in well-defined
secondary elements, adopt one conformational substate, and move on a single-minimum
free-energy profile. Residues with high exponents, e.g., 0.39, are found in loops/turns
and chain ends and exist in multiple conformational substates, i.e., they move on
multiple-minima free-energy profiles. By also examining the time dependence of
the fluctuations of the backbone N–H bonds, it was possible to determine how the
fluctuations vary with the free-energy profiles along the amino acid sequence [148],
as observed in NMR studies of subdiffusion of fluorescent probes within a protein
molecule. Considerations of the correlations of the fluctuations of the backbone and
those of the β carbon of the side chain show that both types of subdiffusion motion
occur on similar free energy profiles [149]. These results are a possible indication
of the role of the flexible regions of proteins for biological function and folding,
and simulations are presently in progress to examine the time dependence of the
fluctuations along a folding trajectory.
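The exponent α discussed above can be estimated from a mean-square-displacement curve by a straight-line fit on a log-log scale. The sketch below shows that generic procedure under the assumption of a pure power law; it is not the analysis protocol of refs. [147–149], and the function name and the synthetic data in the usage note are illustrative only.

```python
import numpy as np

def fit_power_law(times, msd):
    """Fit msd = prefactor * times**alpha by linear regression of log(msd) on log(times).

    Returns (alpha, prefactor). All times and msd values must be positive.
    """
    log_t = np.log(np.asarray(times, dtype=float))
    log_msd = np.log(np.asarray(msd, dtype=float))
    alpha, log_prefactor = np.polyfit(log_t, log_msd, 1)
    return alpha, np.exp(log_prefactor)
```

Applied to synthetic data such as `msd = 2.0 * t**0.39`, the fit recovers the exponent 0.39; for simulation data the fitted α would then be compared with 1, the value for Brownian diffusion.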

12 Conclusions

The foregoing brief historical perspective has traced the development of our consider-
ations of the physical interactions in proteins, and the applications of this information
to the formulation of simulation approaches to treat biological processes. In the near
future, we may expect to see simulations of protein-protein and protein-DNA inter-
actions and the treatment of very large biological complexes that make up the living
cell. Hopefully, this will facilitate the treatment of many diseases that originate from
malfunction of protein systems.
Further reading about simulations of biological systems can be found in the rest
of this book, in Gregory Voth’s book [90], and in a new two-volume treatise edited
by Tamar Schlick [150].

References

1. Cohn, E.J., Edsall, J.T.: Proteins, Amino Acids and Peptides as Ions and Dipolar Ions. Reinhold
publishers, New York (1943)
2. Linderstrøm-Lang, K.U.: On the ionisation state of proteins. Compt. Rend. Trav. Lab.
Carlsberg 15, 1–29 (1924)
3. Debye, P., Hückel, E.: Zur Theorie der Elektrolyte. Phys. Z. 24, 185–206 (1923)
4. Svedberg, T., Pedersen, K.O.: The Ultracentrifuge. Clarendon Press, Oxford (1940)
5. Neurath, H., Saum, A.M.: The denaturation of serum albumin: diffusion and viscosity mea-
surements of serum albumin in the presence of urea. J. Biol. Chem. 128, 347–362 (1939)
6. Oncley, J.L.: Evidence from physical chemistry regarding the size and shape of protein
molecules from ultra-centrifugation, diffusion, viscosity, dielectric dispersion, and double
refraction of flow. Annals N.Y. Acad. Sci. 41, 121–150 (1941)
7. Edsall, J.T.: On the laboratory that produced the book, proteins, amino acids and peptides.
AIChE J. 44, 949–953 (1995)
8. Anfinsen, C.B.: Principles that govern the folding of protein chains. Science 181, 223–230
(1973)
9. Scheraga, H.A., Edsall, J.T., Gadd Jr., J.O.: Double refraction of flow: numerical evaluation
of extinction angle and birefringence as a function of velocity gradient. J. Chem. Phys. 19,
1101–1108 (1951)
10. Scheraga, H.A., Mandelkern, L.: Consideration of the hydrodynamic properties of proteins.
J. Am. Chem. Soc. 75, 179–184 (1953)
11. Flory, P.J., Fox Jr., T.G.: Treatment of intrinsic viscosities. J. Am. Chem. Soc. 73, 1904–1908
(1951)
12. Mandelkern, L., Krigbaum, W.R., Scheraga, H.A., Flory, P.J.: Sedimentation behavior of
flexible chain molecules: polyisobutylene. J. Chem. Phys. 20, 1392–1397 (1952)
13. Pauling, L., Corey, R.B., Branson, H.R.: The structure of proteins: two hydrogen-bonded
helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. U.S.A. 37, 205–211
(1951)
14. Pauling, L., Corey, R.B.: Configurations of polypeptide chains with favored orientations
around single bonds: Two new pleated sheets. Proc. Natl. Acad. Sci. U.S.A. 37, 729–740
(1951)
15. Sanger, F.: The arrangement of amino acids in proteins. Adv. Protein Chem. 7, 1–66 (1952)
16. Ryle, A.P., Sanger, F., Smith, I.F., Kitai, R.: The disulfide bonds of insulin. Biochem. J. 60,
542–556 (1955)
17. Perutz, M.F., Rossmann, M.G., Cullis, A.F., Muirhead, H., Will, G., North, A.C.T.: Structure
of haemoglobin, a three-dimensional Fourier synthesis at 5.5 Å resolution, obtained by x-ray
analysis. Nature 185, 416–422 (1960)
18. Kendrew, J.C., Dickerson, R.E., Strandberg, B.E., Hart, R.G., Davies, D.R., Phillips, D.C.,
Shore, V.C.: Structure of myoglobin, a three-dimensional Fourier synthesis at 2 Å resolution.
Nature 185, 422–427 (1960)
19. Laskowski Jr., M., Scheraga, H.A.: Thermodynamic considerations of protein reactions. I.
Modified reactivity of polar groups. J. Am. Chem. Soc. 76, 6305–6319 (1954)
20. Laskowski Jr., M., Scheraga, H.A.: Thermodynamic considerations of protein reactions. II.
Modified reactivity of primary valence bonds. J. Am. Chem. Soc. 78, 5793–5798 (1956)
21. Némethy, G., Scheraga, H.A.: The structure of water and hydrophobic bonding in proteins.
III. The thermodynamic properties of hydrophobic bonds in proteins. J. Phys. Chem. 66,
1773–1789 (1962). Erratum: J. Phys. Chem. 67:2888 (1963)
22. Griffith, J.H., Scheraga, H.A.: Statistical thermodynamics of aqueous solutions. I. Water
structure, solutions with non-polar solutes, and hydrophobic interactions. J. Mol. Struct. 682,
97–113 (2004)
23. Owicki, J.C., Scheraga, H.A.: Monte Carlo calculations in the isothermal isobaric ensemble.
2. Dilute aqueous solution of methane. J. Am. Chem. Soc. 99, 7413–7418 (1977)
24. Rapaport, D.C., Scheraga, H.A.: Hydration of inert solutes. A molecular dynamics study. J.
Phys. Chem. 86, 873–880 (1982)
25. Kendrew, J.C.: The structure of globular proteins. Comp. Biochem. Physiol. 4, 249–252 (1962)
26. Scheraga, H.A.: Theory of hydrophobic interactions. J. Biomol. Struct. Dyn. 16, 447–460
(1998)
27. Sturtevant, J.M., Laskowski Jr., M., Donnelly, T.H., Scheraga, H.A.: Equilibria in the
fibrinogen-fibrin conversion. III. Heats of polymerization and clotting of fibrin monomer.
J. Am. Chem. Soc. 77, 6168–6172 (1955)
28. Scheraga, H.A.: Structural studies of pancreatic ribonuclease. Fed. Proc. 26, 1380–1387
(1967)
29. Wlodawer, A., Svensson, L.A., Sjölin, L., Gilliland, G.L.: Structure of phosphate-free
ribonuclease A refined at 1.26 Å. Biochemistry 27, 2705–2717 (1988)
30. Némethy, G., Scheraga, H.A.: Theoretical determination of sterically allowed conformations
of a polypeptide chain by a computer method. Biopolymers 3, 155–184 (1965)
31. Scheraga, H.A.: Calculations of conformations of polypeptides. Adv. Phys. Org. Chem. 6,
103–184 (1968)
32. Ramachandran, G.N., Ramakrishnan, C., Sasisekharan, V.: Stereochemistry of polypeptide
chain configurations. J. Mol. Biol. 7, 95–99 (1963)
33. Scheraga, H.A., Leach, S.J., Scott, R.A., Némethy, G.: Intramolecular forces and protein
conformation. Disc. Faraday Soc. 40, 268–277 (1965)
34. Némethy, G., Leach, S.J., Scheraga, H.A.: The influence of amino acid side chains on the free
energy of helix coil transitions. J. Phys. Chem. 70, 998–1004 (1966)
35. Leach, S.J., Némethy, G., Scheraga, H.A.: Computation of the sterically allowed conforma-
tions of peptides. Biopolymers 4, 369–407 (1966)
36. de Santis, P., Giglio, E., Liquori, A.M., Ripamonti, A.: Stability of helical conformations of
simple linear polymers. J. Polym. Sci. Part A 1, 1383–1404 (1963)
37. Brant, D.A., Flory, P.J.: The configuration of random polypeptide chains. II. Theory. J. Am.
Chem. Soc. 87, 2791–2800 (1965)
38. Ooi, T., Scott, R.A., Vanderkooi, G., Scheraga, H.A.: Conformational analysis of macro-
molecules. IV. Helical structures of poly-L-alanine, poly-L-valine, poly-β-methyl L-aspartate,
poly-γ-methyl-L-glutamate, and poly-L-tyrosine. J. Chem. Phys. 46, 4410–4426 (1967)
39. Gibson, K.D., Scheraga, H.A.: Minimization of polypeptide energy. II. Preliminary structures
of oxytocin, vasopressin and an octapeptide from ribonuclease. Proc. Natl. Acad. Sci. U.S.A.
58, 1317–1323 (1967)
40. Gibson, K.D., Scheraga, H.A.: Minimization of polypeptide energy. VII. Second derivatives
and statistical weights of energy minima for deca–L–alanine. Proc. Natl. Acad. Sci. U.S.A.
63, 242–245 (1969)
41. Scott, R.A., Vanderkooi, G., Tuttle, R.W., Shames, P.M., Scheraga, H.A.: Minimization of
polypeptide energy. III. Application of a rapid energy minimization technique to the
calculation of preliminary structures of gramicidin–S. Proc. Natl. Acad. Sci. U.S.A. 58, 2204–2211 (1967)
42. Yan, J.F., Vanderkooi, G., Scheraga, H.A.: Conformational analysis of macromolecules. V.
Helical structures of poly–L–aspartic acid and poly–L–glutamic acid, and related compounds.
J. Chem. Phys. 49, 2713–2726 (1968)
43. Yan, J.F., Momany, F.A., Scheraga, H.A.: Conformational analysis of macromolecules. VI.
Helical Structures of o–, m–, and p–chlorobenzyl esters of poly–L–aspartic acid. J. Am. Chem.
Soc. 92, 1109–1115 (1970)
44. Momany, F.A., Vanderkooi, G., Scheraga, H.A.: Determination of intermolecular potentials
from crystal data. I. General theory and application to crystalline benzene at several temper-
atures. Proc. Natl. Acad. Sci. U.S.A. 61, 429–436 (1968)
45. Momany, F.A., McGuire, R.F., Yan, J.F., Scheraga, H.A.: Energy parameters in polypeptides.
IV. Semiempirical molecular orbital calculations of conformational dependence of energy and
partial charge in di– and tripeptides. J. Phys. Chem. 75, 2286–2297 (1971)
46. Levitt, M., Lifson, S.: Refinement of protein conformations using a macromolecular energy
minimization procedure. J. Mol. Biol. 46, 269–279 (1969)
47. Hagler, A.T., Huler, E., Lifson, S.: Energy functions for peptides and proteins. I. Derivation
of a consistent force field including the hydrogen bond from amide crystals. J. Am. Chem.
Soc. 96, 5319–5327 (1974)
48. Momany, F.A., McGuire, R.F., Burgess, A.W., Scheraga, H.A.: Energy parameters in polypep-
tides. VII. Geometric parameters, partial atomic charges, nonbonded interactions, hydrogen
bond interactions, and intrinsic torsional potentials for the naturally occurring amino acids.
J. Phys. Chem. 79, 2361–2381 (1975)
49. Némethy, G., Pottle, M.S., Scheraga, H.A.: Energy parameters in polypeptides. 9. Updating
of geometrical parameters, nonbonded interactions, and hydrogen bond interactions for the
naturally occurring amino acids. J. Phys. Chem. 87, 1883–1887 (1983)
50. Sippl, M.J., Némethy, G., Scheraga, H.A.: Intermolecular potentials from crystal data. 6.
Determination of empirical potentials for O–H···O=C hydrogen bonds from packing
configurations. J. Phys. Chem. 88, 6231–6233 (1984)
Simulations of the Folding of Proteins: A Historical Perspective 19

51. Némethy, G., Gibson, K.D., Palmer, K.A., Yoon, C.N., Paterlini, G., Zagari, A., Rumsey, S.,
Scheraga, H.A.: Energy parameters in polypeptides. 10. Improved geometrical parameters
and nonbonded interactions for use in the ECEPP/3 algorithm, with application to
proline-containing peptides. J. Phys. Chem. 96, 6472–6484 (1992)
52. Arnautova, Y.A., Jagielska, A., Scheraga, H.A.: A new force field (ECEPP-05) for peptides,
proteins and organic molecules. J. Phys. Chem. B 110, 5025–5044 (2006)
53. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M.:
CHARMM: a program for macromolecular energy, minimization, and dynamics calculations.
J. Comput. Chem. 4, 187–217 (1983)
54. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz Jr., K.M., Ferguson, D.M.,
Spellmeyer, D.C., Fox, T., Caldwell, J.W., Kollman, P.A.: A second generation force field
for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117,
5179–5197 (1995)
55. Scott, W.R.P., Huenenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda,
A.E., Huber, T., Krueger, P., van Gunsteren, W.F.: The GROMOS biomolecular simulation
program package. J. Phys. Chem. A 103, 3596–3607 (1999)
56. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., Klein, M.L.: Comparison of
simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983)
57. Ooi, T., Oobatake, M., Némethy, G., Scheraga, H.A.: Accessible surface areas as a measure
of the thermodynamic parameters of hydration of peptides. Proc. Natl. Acad. Sci. U.S.A. 84,
3086–3090 (1987). Erratum: ibid., 84, 6015 (1987)
58. Vila, J., Williams, R.L., Vasquez, M., Scheraga, H.A.: Empirical solvation models can be used
to differentiate native from near native conformations of bovine pancreatic trypsin inhibitor.
Proteins: Struct. Funct. Genet. 10, 199–218 (1991)
59. Nicholls, A., Honig, B.: A rapid finite difference algorithm, utilizing successive over-
relaxation to solve the Poisson-Boltzmann equation. J. Comp. Chem. 12, 435–445 (1991)
60. Nicholls, A., Sharp, K.A., Honig, B.: Protein folding and association: Insights from the inter-
facial and thermodynamic properties of hydrocarbons. Proteins: Struct. Funct. Genet. 11,
281–296 (1991)
61. Vorobjev, Y.N., Vila, J.A., Scheraga, H.A.: FAMBE-pH: a fast and accurate method to compute
the total solvation free energies of proteins. J. Phys. Chem. B 112, 11122–11136 (2008)
62. Still, W.C., Tempczyk, A., Hawley, R.C., Hendrickson, T.: Semianalytical treatment of
solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6129 (1990)
63. Bashford, D., Case, D.: Generalized Born models of macromolecular solvation effects. Annu.
Rev. Phys. Chem. 51, 129–152 (2000)
64. Ferrara, P., Apostolakis, J., Caflisch, A.: Evolution of a fast implicit solvent model for molec-
ular dynamics simulations. Proteins 46, 24–33 (2002)
65. Bursulaya, B., Brooks III, C.L.: Comparative study of folding free energy landscape of a
three-stranded β-sheet protein with explicit and implicit solvent models. J. Phys. Chem. B 104,
12378–12383 (2000)
66. Im, W., Lee, M., Brooks, C.: Generalized Born model with a simple smoothing function. J.
Comp. Chem. 24, 1691–1702 (2003)
67. Scheraga, H.A., Pillardy, J., Liwo, A., Lee, J., Czaplewski, C., Ripoll, D.R., Wedemeyer, W.J.,
Arnautova, Y.A.: Evolution of physics-based methodology for exploring the conformational
energy landscape of proteins. J. Comput. Chem. 23, 28–34 (2002)
68. Alder, B.J., Wainwright, T.: Molecular dynamics by electronic computers. In: Prigogine, I.
(ed.) Proceedings of the International Symposium on Transport Processes in Statistical
Mechanics, pp. 97–131. Interscience, New York (1957)
69. McCammon, J.A., Gelin, B.R., Karplus, M.: Dynamics of folded proteins. Nature 267,
585–590 (1977)
70. Scheraga, H.A., Khalili, M., Liwo, A.: Protein folding dynamics: overview of molecular
simulation techniques. Annu. Rev. Phys. Chem. 58, 57–83 (2007)
71. Shirts, M., Pande, V.S.: Screen savers of the world unite! Science 290, 1903–1904 (2000)
20 H. A. Scheraga

72. Shaw, D.E., et al.: Structure and dynamics of an unfolded protein examined by molecular
dynamics simulation. J. Am. Chem. Soc. 134, 3787–3791 (2012)
73. Li, Z., Scheraga, H.A.: Monte Carlo–minimization approach to the multiple–minima problem
in protein folding. Proc. Natl. Acad. Sci. U.S.A. 84, 6611–6615 (1987)
74. Hansmann, U.H.E., Masuya, M., Okamoto, Y.: Characteristic temperatures of folding of a
small peptide. Proc. Natl. Acad. Sci. U.S.A. 94, 10652–10656 (1997)
75. Dygert, M., Gō, N., Scheraga, H.A.: Use of a symmetry condition to compute the conformation
of gramicidin S. Macromolecules 8, 750–761 (1975)
76. Gō, N., Scheraga, H.A.: Ring closure in chain molecules with Cn, I or S2n symmetry. Macro-
molecules 6, 273–281 (1973)
77. Mirau, R.A., Bovey, F.A.: 2D and 3D NMR studies of polypeptide structure and function.
Abstracts, 199th ACS meeting. Polymer Division, Boston, 206 (1990)
78. Ripoll, D.R., Vila, J.A., Scheraga, H.A.: Folding of the villin headpiece subdomain from
random structures. Analysis of the charge distribution as a function of pH. J. Mol. Biol. 339,
915–925 (2004)
79. Vila, J.A., Ripoll, D.R., Scheraga, H.A.: Atomically detailed folding simulation of the B
domain of staphylococcal protein A from random structures. Proc. Natl. Acad. Sci. U.S.A.
100, 14812–14816 (2003)
80. Vila, J.A., Arnautova, Y.A., Martin, O.A., Scheraga, H.A.: Quantum-mechanics-derived ¹³Cα
chemical shift server (CheShift) for protein structure validation. Proc. Natl. Acad. Sci. U.S.A.
106, 16972–16977 (2009)
81. Vila, J.A., Scheraga, H.A.: Assessing the accuracy of protein structures by quantum mechanical
computations of ¹³Cα chemical shifts. Acc. Chem. Res. 42, 1545–1553 (2009)
82. Miller, M.H., Scheraga, H.A.: Calculation of the structures of collagen models. Role
of interchain interactions in determining the triple–helical coiled–coil conformation. I.
Poly(glycyl–prolyl–prolyl). J. Polym. Sci.: Polym. Symp. 54, 171–200 (1976)
83. Miller, M.H., Némethy, G., Scheraga, H.A.: Calculation of the structures of collagen models.
Role of interchain interactions in determining the triple–helical coiled–coil conformation. 2.
Poly(glycyl–prolyl–hydroxyprolyl). Macromolecules 13, 470–478 (1980)
84. Miller, M.H., Némethy, G., Scheraga, H.A.: Calculation of the structures of collagen models.
Role of interchain interactions in determining the triple-helical-coiled coil conformation. 3.
Poly(glycyl-prolyl-alanyl). Macromolecules 13, 910–913 (1980)
85. Némethy, G., Miller, M.H., Scheraga, H.A.: Calculation of the structures of collagen models.
Role of interchain interactions in determining the triple–helical coiled–coil conformation. 4.
Poly(glycyl–alanyl–prolyl). Macromolecules 13, 914–919 (1980)
86. Pincus, M.R., Scheraga, H.A.: Conformational energy calculations of enzyme–substrate and
enzyme–inhibitor complexes of lysozyme. 2. Calculation of the structures of complexes with
a flexible enzyme. Macromolecules 12, 633–644 (1979)
87. Smith-Gill, S.J., Rupley, J.A., Pincus, M.R., Carty, R.P., Scheraga, H.A.: Experimental iden-
tification of a theoretically predicted “left–sided” binding mode for (GlcNAc)6 in the active
site of lysozyme. Biochemistry 23, 993–997 (1984)
88. Simon, I., Glasser, L., Scheraga, H.A., Manley, R.S.J.: Structure of cellulose. 2. Low–energy
crystalline arrangements. Macromolecules 21, 990–998 (1988)
89. Kolinski, A.: Protein modeling and structure prediction with a reduced representation. Acta
Biochim. Pol. 51, 349–371 (2004)
90. Voth, G.A.: Coarse-graining of Condensed Phase and Biomolecular Systems. CRC Press,
Boca Raton, FL (2009)
91. Levitt, M., Warshel, A.: Computer simulation of protein folding. Nature 253, 694–698 (1975)
92. Pincus, M.R., Scheraga, H.A.: An approximate treatment of long–range interactions in pro-
teins. J. Phys. Chem. 81, 1579–1583 (1977)
93. Sieradzan, A.K., Hansmann, U.H.E., Scheraga, H.A., Liwo, A.: Extension of UNRES force
field to treat polypeptide chains with D-amino-acid residues. J. Chem. Theory Comput. 8,
4746–4757 (2012)
94. Liwo, A., Pincus, M.R., Wawak, R.J., Rackovsky, S., Scheraga, H.A.: Prediction of pro-
tein conformation on the basis of a search for compact structures; test on avian pancreatic
polypeptide. Protein Sci. 2, 1715–1731 (1993)
95. Liwo, A., Oldziej, S., Pincus, M.R., Wawak, R.J., Rackovsky, S., Scheraga, H.A.: A united-
residue force field for off-lattice protein-structure simulations. I. Functional forms and param-
eters of long-range side-chain interaction potentials from protein crystal data. J. Comput.
Chem. 18, 849–873 (1997)
96. Liwo, A., Pincus, M.R., Wawak, R.J., Rackovsky, S., Oldziej, S., Scheraga, H.A.: A united-
residue force field for off-lattice protein-structure simulations. II. Parameterization of short-
range interactions and determination of weights of energy terms by Z-score optimization. J.
Comput. Chem. 18, 874–887 (1997)
97. Liwo, A., Kazmierkiewicz, R., Czaplewski, C., Groth, M., Oldziej, S., Wawak, R.J., Rack-
ovsky, S., Pincus, M.R., Scheraga, H.A.: United-residue force field for off-lattice protein-
structure simulations. III. Origin of backbone hydrogen-bonding cooperativity in united-
residue potentials. J. Comput. Chem. 19, 259–276 (1998)
98. Liwo, A., Czaplewski, C., Pillardy, J., Scheraga, H.A.: Cumulant-based expressions for the
multibody terms for the correlation between local and electrostatic interactions in the united-
residue force field. J. Chem. Phys. 115, 2323–2347 (2001)
99. Liwo, A., Arlukowicz, P., Czaplewski, C., Ołdziej, S., Pillardy, J., Scheraga, H.A.: A method
for optimizing potential-energy functions by a hierarchical design of the potential-energy
landscape: Application to the UNRES force field. Proc. Natl. Acad. Sci. U.S.A. 99, 1937–1942
(2002)
100. Liwo, A., Ołdziej, S., Czaplewski, C., Kozłowska, U., Scheraga, H.A.: Parameterization of
backbone-electrostatic and multibody contributions to the UNRES force field for protein-
structure prediction from ab initio energy surfaces of model systems. J. Phys. Chem. B 108,
9421–9438 (2004)
101. Liwo, A., Arłukowicz, P., Ołdziej, S., Czaplewski, C., Makowski, M., Scheraga, H.A.: Opti-
mization of the UNRES force field by hierarchical design of the potential-energy landscape. 1.
Tests of the approach using simple lattice protein models. J. Phys. Chem. B 108, 16918–16933
(2004)
102. Ołdziej, S., Liwo, A., Czaplewski, C., Pillardy, J., Scheraga, H.A.: Optimization of the UNRES
force field by hierarchical design of the potential-energy landscape. 2. Off-lattice tests of the
method with single proteins. J. Phys. Chem. B 108, 16934–16949 (2004)
103. Ołdziej, S., Lagiewka, J., Liwo, A., Czaplewski, C., Chinchio, M., Nanias, M., Scheraga,
H.A.: Optimization of the UNRES force field by hierarchical design of the potential-energy
landscape. 3. Use of many proteins in optimization. J. Phys. Chem. B 108, 16950–16959
(2004)
104. Liwo, A., Khalili, M., Scheraga, H.A.: Ab initio simulations of protein-folding pathways by
molecular dynamics with the united-residue model of polypeptide chains. Proc. Natl. Acad.
Sci. U.S.A. 102, 2362–2367 (2005)
105. Khalili, M., Liwo, A., Rakowski, F., Grochowski, P., Scheraga, H.A.: Molecular dynamics
with the united-residue model of polypeptide chains. I. Lagrange equations of motion and
tests of numerical stability in the microcanonical mode. J. Phys. Chem. B 109, 13785–13797
(2005)
106. Khalili, M., Liwo, A., Jagielska, A., Scheraga, H.A.: Molecular dynamics with the united-
residue model of polypeptide chains. II. Langevin and Berendsen-bath dynamics and tests on
model α-helical systems. J. Phys. Chem. B 109, 13798–13810 (2005)
107. Liwo, A., Khalili, M., Czaplewski, C., Kalinowski, S., Ołdziej, S., Wachucik, K., Scheraga,
H.A.: Modification and optimization of the united-residue (UNRES) potential-energy function
for canonical simulations. I. Temperature dependence of the effective energy function and
tests of the optimization method with single training proteins. J. Phys. Chem. B 111, 260–285
(2007)
108. Kozlowska, U., Liwo, A., Scheraga, H.A.: Determination of virtual-bond-angle potentials
of mean force for coarse-grained simulations of protein structure and folding from ab initio
energy surfaces of terminally-blocked glycine, alanine, and proline. J. Phys.: Condens. Matter
19, 285203-1–285203-15 (2007)
109. Liwo, A., Czaplewski, C., Ołdziej, S., Rojas, A.V., Kazmierkiewicz, R., Makowski, M.,
Murarka, R.K., Scheraga, H.A.: Simulation of protein structure and dynamics with the coarse-
grained UNRES force field. In: Voth, G.A. (ed.) Coarse-Graining of Condensed Phase and
Biomolecular Systems, pp. 107–122. CRC Press, Boca Raton, FL (2008)
110. Ołdziej, S., Czaplewski, C., Liwo, A., Scheraga, H.A.: Towards temperature dependent coarse-
grained potential of side-chain interactions for protein folding simulations, BIBE. In: IEEE
International Conference on Bioinformatics and Bioengineering, pp. 263–266 (2010)
111. Liwo, A., Ołdziej, S., Czaplewski, C., Kleinerman, D.S., Blood, P., Scheraga, H.A.: Imple-
mentation of molecular dynamics and its extensions with the coarse-grained UNRES force
field on massively parallel systems; towards millisecond-scale simulations of protein struc-
ture, dynamics, and thermodynamics. J. Chem. Theor. Comput. 6, 890–909 (2010)
112. Maisuradze, G.G., Senet, P., Czaplewski, C., Liwo, A., Scheraga, H.A.: Investigation of protein
folding by coarse-grained molecular dynamics with the UNRES force field. J. Phys. Chem.
A 114, 4471–4485 (2010)
113. Makowski, M., Liwo, A., Scheraga, H.A.: Simple physics-based analytical formulas for the
potentials of mean force of the interaction of amino-acid side chains in water. VI. Oppositely-
charged side chains. J. Phys. Chem. B 115, 6130–6137 (2011)
114. Sieradzan, A.K., Scheraga, H.A., Liwo, A.: Determination of effective potentials for the
stretching of Cα···Cα virtual bonds in polypeptide chains for coarse-grained simulations of
proteins from ab initio energy surfaces of N-methylacetamide and N-acetylpyrrolidine. J.
Chem. Theor. Comput. 8, 1334–1343 (2012)
115. Kolinski, A., Skolnick, J.: Discretized model of proteins: I. Monte Carlo study of cooperativity
in homopolypeptides. J. Chem. Phys. 97, 9412–9426 (1992)
116. He, Y., Xiao, Y., Liwo, A., Scheraga, H.A.: Exploring the parameter space of the coarse-
grained UNRES force field by random search: Selecting a transferable medium-resolution
force field. J. Comput. Chem. 30, 2127–2135 (2009)
117. Lee, J., Scheraga, H.A., Rackovsky, S.: New optimization method for conformational
energy calculations on polypeptides: conformational space annealing. J. Comput. Chem. 18,
1222–1232 (1997)
118. Kazmierkiewicz, R., Liwo, A., Scheraga, H.A.: Energy-based reconstruction of a protein
backbone from its α-carbon trace by a Monte-Carlo method. J. Comput. Chem. 23, 715–723
(2002)
119. Kazmierkiewicz, R., Liwo, A., Scheraga, H.A.: Addition of side chains to a known backbone
with defined side-chain centroids. Biophys. Chem. 100, 261–280 (2003). Erratum: Biophys.
Chem. 106, 91 (2003)
120. Elber, R., Ghosh, A., Cardenas, A.: Long time dynamics of complex systems. Acc. Chem.
Res. 35, 396–403 (2002)
121. Ghosh, A., Elber, R., Scheraga, H.A.: An atomically detailed study of the folding path-
ways of protein A with the stochastic difference equation. Proc. Natl. Acad. Sci. U.S.A. 99,
10394–10398 (2002)
122. Shen, H., Liwo, A., Scheraga, H.A.: An improved functional form for the temperature scaling
factors of the components of the mesoscopic UNRES force field for simulations of protein
structure and dynamics. J. Phys. Chem. B 113, 8738–8744 (2009)
123. Kubo, R.: Generalized cumulant expansion method. J. Phys. Soc. Jpn. 17, 1100–1120 (1962)
124. Rojas, A.V., Liwo, A., Scheraga, H.A.: Molecular dynamics with the united-residue (UNRES)
force field. Ab initio folding simulations of multi-chain proteins. J. Phys. Chem. B 111,
293–309 (2007)
125. Liwo, A., Lee, J., Ripoll, D.R., Pillardy, J., Scheraga, H.A.: Protein structure prediction
by global optimization of a potential energy function. Proc. Natl. Acad. Sci. U.S.A. 96,
5482–5485 (1999)
126. Khalili, M., Liwo, A., Scheraga, H.A.: Kinetic studies of folding of the B-domain of staphylo-
coccal protein A with molecular dynamics and a united-residue (UNRES) model of polypep-
tide chains. J. Mol. Biol. 355, 536–547 (2006)
127. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo simulations of spin-glasses. Phys. Rev.
Lett. 57, 2607–2609 (1986)
128. Sugita, Y., Okamoto, Y.: Replica-exchange molecular dynamics method for protein folding.
Chem. Phys. Lett. 314, 141–151 (1999)
129. Nanias, M., Chinchio, M., Ołdziej, S., Czaplewski, C., Scheraga, H.A.: Protein struc-
ture prediction with the UNRES force-field using Replica-Exchange Monte Carlo-with-
Minimization; Comparison with MCM, CSA and CFMC. J. Comput. Chem. 26, 1472–1486
(2005)
130. Rhee, Y.M., Pande, V.S.: Multiplexed-replica exchange molecular dynamics method for pro-
tein folding simulation. Biophys. J. 84, 775–786 (2003)
131. Czaplewski, C., Kalinowski, S., Liwo, A., Scheraga, H.A.: Application of multiplexed replica
exchange molecular dynamics to the UNRES force field: tests with α and α + β proteins. J.
Chem. Theor. Comput. 5, 627–640 (2009)
132. Rojas, A., Liwo, A., Browne, D., Scheraga, H.A.: Mechanism of fiber assembly; treatment
of Aβ-peptide aggregation with a coarse-grained united-residue force field. J. Mol. Biol. 404,
537–552 (2010)
133. He, Y., Liwo, A., Weinstein, H., Scheraga, H.A.: PDZ binding to the BAR domain of PICK1
is elucidated by coarse-grained molecular dynamics. J. Mol. Biol. 405, 298–314 (2011)
134. Golas, E., Maisuradze, G.G., Senet, P., Ołdziej, S., Czaplewski, C., Scheraga, H.A., Liwo,
A.: Simulation of the opening and closing of Hsp70 chaperones by coarse-grained molecular
dynamics. J. Chem. Theor. Comput. 8, 1750–1764 (2012)
135. Marmur, J., Doty, P.: Thermal renaturation of deoxyribonucleic acids. J. Mol. Biol. 3, 585–594
(1961)
136. Peyrard, M., Bishop, A.R.: Statistical mechanics of a nonlinear model for DNA denaturation.
Phys. Rev. Lett. 62, 2755–2758 (1989)
137. Olson, W.K.: Simulating DNA at low resolution. Curr. Opinion Struct. Biol. 6, 242–256 (1996)
138. Hyeon, C., Thirumalai, D.: Mechanical unfolding of RNA hairpins. Proc. Natl. Acad. Sci.
U.S.A. 102, 6789–6794 (2005)
139. Knotts 4th, T., Rathore, N., Schwartz, D.C., de Pablo, J.J.: A coarse grain model for DNA. J.
Chem. Phys. 126, 084901 (2007)
140. Voltz, K., Trylska, J., Tozzini, V., Kurkal-Siebert, V., Langowski, J., Smith, J.: Coarse-
grained force field for the nucleosome from self-consistent multiscaling. J. Comput. Chem.
29, 1429–1439 (2008)
141. Ouldridge, T.E., Louis, A.A., Doye, J.P.K.: DNA Nanotweezers studied with a coarse-grained
model of DNA. Phys. Rev. Lett. 104, 178101-1–178101-4 (2010)
142. Maciejczyk, M., Spasic, A., Liwo, A., Scheraga, H.A.: Coarse-grained model of nucleic acid
bases. J. Comput. Chem. 31, 1644–1655 (2010)
143. He, Y., Maciejczyk, M., Ołdziej, S., Scheraga, H.A., Liwo, A.: Mean-field interactions
between nucleic-acid-base dipoles drive formation of the double helix. Phys. Rev. Lett. 110,
098101 (2013)
144. Pollack, L.: Fashioning NAMD, a history of risk and reward: Klaus Schulten Reminisces.
In: Schlick, T. (ed.) Innovations in Biomolecular Modeling and Simulations, vol. 1. Royal
Society of Chemistry, Cambridge, UK (2012)
145. Tama, F., Valle, M., Frank, J., Brooks III, C.L.: Dynamic reorganization of the functionally
active ribosome explored by normal mode analysis and cryo-electron microscopy. Proc. Natl.
Acad. Sci. U.S.A. 100, 9319–9323 (2003)
146. Yang, L., Song, G., Jernigan, R.L.: How well can we understand large-scale protein motions
using normal modes of elastic network models? Biophys. J. 93, 920–929 (2007)
147. Senet, P., Maisuradze, G.G., Foulie, C., Delarue, P., Scheraga, H.A.: How main-chains of
proteins explore the free-energy landscape in native states. Proc. Natl. Acad. Sci. U.S.A. 105,
19708–19713 (2008)
148. Cote, Y., Senet, P., Delarue, P., Maisuradze, G.G., Scheraga, H.A.: Nonexponential decay of
internal rotational correlation functions of native proteins and self-similar structural fluctua-
tions. Proc. Natl. Acad. Sci. U.S.A. 107, 19844–19849 (2010)
149. Cote, Y., Senet, P., Delarue, P., Maisuradze, G.G., Scheraga, H.A.: Anomalous diffusion and
dynamical correlation between the side chains and the main chain of proteins in their native
state. Proc. Natl. Acad. Sci. U.S.A. 109, 10346–10351 (2012)
150. Schlick, T. (ed.): Innovations in Biomolecular Modeling and Simulations, vols. 1 and 2. Royal
Society of Chemistry, Cambridge, UK (2012)
Part II
Molecular Simulations: Methodology
Protein Structure Prediction Using
Coarse-Grained Models

Maciej Blaszczyk, Dominik Gront, Sebastian Kmiecik, Mateusz Kurcinski,
Michal Kolinski, Maciej Pawel Ciemny, Katarzyna Ziolkowska,
Marta Panek and Andrzej Kolinski

M. Blaszczyk · D. Gront · S. Kmiecik · M. Kurcinski · M. P. Ciemny · K. Ziolkowska · M. Panek ·
A. Kolinski (✉)
Faculty of Chemistry, Biological and Chemical Research Centre,
University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland
e-mail: kolinski@chem.uw.edu.pl
M. Kolinski
Bioinformatics Laboratory, Mossakowski Medical Research Centre,
Polish Academy of Sciences, Warsaw, Poland
M. P. Ciemny
Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland
© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio- and Neurosystems 8,
https://doi.org/10.1007/978-3-319-95843-9_2

Abstract The knowledge of the three-dimensional structure of proteins is crucial
for understanding many important biological processes. Most of the biologically
relevant protein systems are too large for classical, atomistic molecular modeling tools.
In such cases, coarse-grained (CG) models offer various opportunities for efficient
conformational sampling and thus prediction of the three-dimensional structure. A
variety of CG models have been proposed, each based on a similar framework
consisting of a set of conceptual components such as protein representation, force field,
sampling, etc. In this chapter we discuss these components, highlighting ideas which
have proven to be the most successful. As CG methods are usually part of multistage
procedures, we also describe approaches used for the incorporation of homology
data and all-atom reconstruction methods.

1 Introduction

1.1 Why Do We Need CG Models?

Proteins are key components of all life processes. Thus, the development of relatively
cheap and automatic methods for determining amino acid sequences of proteins
raised hope for a breakthrough in many branches of science, including pharmacy and
biotechnology. However, the knowledge of sequence is insufficient for the majority of
applications, and it is necessary to determine three-dimensional structures of proteins
and their complexes. Unfortunately, determining the three-dimensional structures has
turned out to be a much more demanding problem than studying their sequences. Currently,
the UniProt database [111, 137] contains over 60 million protein sequences,
while the total number of known protein structures, or their complexes, in the RCSB
Protein Data Bank [114] is over 130 thousand. Just a few years ago these numbers were
significantly smaller. In May 2012 the RefSeq database [110] contained fewer than 16
million sequences, while the total number of known structures in the Protein Data Bank
[7] was about 80 thousand. The main reason for the growing disproportion is that
experimental structure determination can be both expensive and time-consuming.
Additionally, in many cases it simply cannot succeed.
Theoretical methods are a potential alternative to the experimental techniques.
Over half a century ago, Anfinsen and coworkers [6] showed that the three-dimensional
structure of bovine pancreatic ribonuclease is determined exclusively
by its sequence. This statement was later generalized to the great majority of
globular proteins. Thus, the problem of finding the native state of a protein can be
regarded as a problem of free energy minimization [5]. However, due to the number
of atoms in proteins, their conformational space is often defined by as
many as 10⁴–10⁶ degrees of freedom (DOFs). Therefore, to be able to study the behavior
of biomolecules computationally, such as large internal motions or conformational
changes, we have to reduce the number of DOFs by at least one order of magnitude.
This goal can be achieved by building models in which some of the atoms are ignored
or grouped to form united pseudo-atoms. Models that follow this paradigm are often
called reduced or coarse-grained (CG) models [58].
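The grouping of atoms into united pseudo-atoms can be sketched in a few lines. The example below is only an illustration under stated assumptions: the input format (a list of per-residue dictionaries mapping atom names to coordinates), the function names and the two-bead mapping (Cα plus the centroid of the side-chain atoms) are hypothetical choices and do not reproduce any specific published CG model.

```python
import numpy as np

BACKBONE = {"N", "CA", "C", "O"}  # backbone heavy-atom names (assumed naming)

def coarse_grain(residues):
    """Map each residue to a CA bead plus one side-chain centroid bead.

    `residues` is a list of dicts {atom_name: (x, y, z)}; this input
    format is a hypothetical convention used only for this sketch.
    """
    beads = []
    for atoms in residues:
        beads.append(np.asarray(atoms["CA"], dtype=float))
        side = [np.asarray(xyz, dtype=float)
                for name, xyz in atoms.items() if name not in BACKBONE]
        if side:  # residues without side-chain atoms get a single bead
            beads.append(np.mean(side, axis=0))
    return np.array(beads)

def cartesian_dofs(n_sites):
    """Number of Cartesian degrees of freedom for n_sites point particles."""
    return 3 * n_sites

# Toy two-residue fragment with heavy atoms only (coordinates are made up).
toy = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
     "C": (2.0, 1.4, 0.0), "O": (1.3, 2.4, 0.0)},
    {"N": (3.3, 1.5, 0.0), "CA": (4.0, 2.8, 0.0), "C": (5.5, 2.6, 0.0),
     "O": (6.0, 1.5, 0.0), "CB": (3.6, 3.6, 1.2)},
]
cg = coarse_grain(toy)
n_atoms = sum(len(r) for r in toy)
print(cartesian_dofs(n_atoms), "->", cartesian_dofs(len(cg)))
```

For real proteins the reduction is larger than in this toy case, because hydrogens and long side chains are also absorbed into the beads.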
An ideal coarse-graining approach should be able to simplify an atomistic system
without losing its important features, such as structural details, characteristic
interactions and internal dynamics. Although many CG methods have been created, none
of them fully satisfies these requirements. Currently, CG models can be regarded as
part of multiscale procedures rather than stand-alone protein structure prediction methods
(Fig. 1). Nevertheless, the idea of CG models was a milestone in protein structure
prediction by computational methods [58].

1.2 History of CG Models

Coarse graining has been used in biomolecular modeling since the very beginning of this
discipline. In their seminal work, Levitt and Warshel [87] started protein simulations
with a model in which each residue in a peptide chain was represented by its alpha
carbon (Cα) and a united atom substituting for its side chain. Since then, a huge variety
of models have been proposed that cover the whole range of complexity: from the
most simplistic Cα-only approaches to all-atom representation [3, 19, 22, 45, 60, 65,
68, 74, 113, 125, 126, 127]. Between these two extreme representations we can find
models with or without residue side chains. Each side chain may in turn be represented
by one or more interacting centers. A few different representations of the protein backbone
have also been proposed. Finally, discretization may be used to impose additional
limits on the space of possible conformations the model can adopt. A quick glimpse
at the review articles [20, 41, 58, 67, 140] suggests that most likely all the choices
of CG representations have already been explored. This review certainly does not
describe all these solutions. Instead, we introduce several important concepts of CG
modeling of proteins and other biomolecules and describe the way they have evolved
in the past few decades.

Fig. 1 Protein structure prediction stages in CG modeling. The diagram presents a general pipeline
for multiscale modeling (CG merged with all-atom) and depicts major differences between easy
and difficult modeling tasks. Easy or medium-difficulty cases, if necessary, require only limited CG
sampling of the conformational space, usually to fill small gaps and quite small uncertainties in
available experimental or homology inference data. Extensive CG sampling is required for difficult
cases when knowledge about the expected structure is limited

One such inspiring idea was the lattice model [71]. Restricting atomic coordinates
to a grid became a very straightforward and simple way to discretize the
conformational space. The size of the search space was greatly reduced and many of its
local minima vanished. Atomic coordinates became integer values, which opened
many possibilities for the use of hash tables. Most importantly, the Cartesian space itself
could be stored in a three-dimensional array, which resulted in O(1) (constant) time
complexity of collision detection. Due to these advantages, lattice models
were at least one or two orders of magnitude faster than their continuous-space
counterparts. Low-resolution grids, however, have a few serious drawbacks. First of
all, simple lattice (cubic lattice, face-centered lattice, etc.) representations of protein
structures (usually limited to Cα traces, or at least to a few atom centers) were of
relatively low resolution, with average errors of such representations of 2–6 Å. Even
more risky aspects of low-resolution lattice models are related to lattice anisotropy.
Depending on the orientation with respect to the fixed lattice, the resulting model chains
changed local geometry and local resolution. Moreover, for essentially all types of
interactions the local energy may change with chain orientation. Consequently, the
models were highly degenerate, with energetic preferences that changed with the
orientation of protein fragments.
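The constant-time collision check mentioned above can be illustrated with a minimal sketch. The class name, the cubic box and the boolean occupancy array are assumptions made for this example only; actual lattice programs use their own data layouts and excluded-volume rules.

```python
import numpy as np

class LatticeChain:
    """Chain of beads on a simple cubic lattice with an occupancy grid."""

    def __init__(self, box_size):
        # One boolean per lattice node: True means the node holds a bead.
        self.occupied = np.zeros((box_size, box_size, box_size), dtype=bool)
        self.beads = []

    def is_free(self, site):
        """O(1) collision test: a bounds check plus one array lookup."""
        x, y, z = site
        n = self.occupied.shape[0]
        inside = 0 <= x < n and 0 <= y < n and 0 <= z < n
        return bool(inside and not self.occupied[x, y, z])

    def add_bead(self, site):
        """Place a bead if the node is free; return whether it was placed."""
        site = tuple(site)
        if not self.is_free(site):
            return False
        self.occupied[site] = True
        self.beads.append(site)
        return True

chain = LatticeChain(box_size=8)
chain.add_bead((1, 1, 1))
chain.add_bead((1, 1, 2))
print(chain.add_bead((1, 1, 1)))  # the node is already taken
```

The cost of the check does not depend on the chain length, whereas a continuous-space model without a neighbor list has to compare a new position against every other bead.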
The first detailed analysis of these problems was published by Godzik and coworkers
[34]. Higher-resolution reduced protein models, for example Chess-Knight lattice
models, led to more accurate representations and smaller anisotropy effects. Probably
the best functionality of lattice models was achieved by using higher resolutions (more
lattice steps per residue) and allowing the Cα-Cα distances in the models to fluctuate.
This led to a large number of single-bead orientations, higher resolution (1–2 Å
with respect to experimental structures) and essentially negligible anisotropy. Obviously,
the higher resolution of such models led to a much larger number of allowed
structures, which caused somewhat higher computational cost. With increasing computing
power, the higher-resolution lattice models could still deliver fast simulations
of protein folding, multibody interactions and related problems. The most important
advantage of high-resolution CG protein models (with slightly fluctuating, nominally fixed
distances in protein chains, usually the Cα-Cα distance) is their computational efficiency.
In comparison to continuous models, the high-resolution lattice models could
be simulated much faster.
Another very prolific concept in protein structure modeling is the use of frag-
ments, that is, short peptides extracted from known protein structures. This concept
was originally introduced by Jones and Thirup [50] as a crystallographic method
for rapid model building based on experimental electron density. The authors also
discussed potential applications of short protein fragments in purely theoretical mod-
eling approaches, which in practice was applied in the late 1990s by J.R. Gunn as
well as by Baker and coworkers [13]. The latter application soon became the famous
Rosetta program [113], one of the most successful methods in ab initio structure pre-
diction. Later, fragment-based sampling was applied to numerous protein modeling
approaches [145, 147]. It should be noted, however, that fragment-based sampling
introduces a very strong bias in the dynamics of the sampled chain which makes it
unsuitable for studying numerous research problems.
Multibody Force Fields
Interactions within a CG model cannot be directly learned from a physical system.
Therefore, they have to be established in the form of a mean field potential. Such a
potential can be derived either from statistics extracted from known protein structures
[124] or from Molecular Dynamics simulations [90]. Knowledge-based force field
models have been actively developed for the past few decades, which resulted in the
remarkable improvement of their performance. Among the most important elements
we note the proper choice of the reference state [151] and multibody terms. It has
recently been shown that two-body potentials are not capable of recognizing all native
folds against large datasets of decoy structures [138]. They also cannot properly
mimic the cooperativity of the protein folding process [21]. Multibody potentials
remediate these problems to some extent and perform significantly better than two-
body terms.

2 Necessary Components of CG Models

2.1 Protein Representation and Coordinates

When undertaking a biomolecular modeling study of a particular system, the level
of coarse graining must be defined. This includes defining atoms that are explicitly
present in the model, atoms that are grouped into united pseudo-atoms and finally
atoms that are ignored. By reducing the number of interacting centers we can reduce
the cost of energy evaluation. The number of modeled atoms is also related to the
number of degrees of freedom (DOFs) of the modeled system, although the depen-
dence is not straightforward.
In most cases, the Cα atom is explicitly defined and serves as the most important
point within a residue (perhaps the SICHO model [63, 72, 139] is the only medium-
resolution exception to this rule). As for the other backbone atoms, there are three
commonly used approaches: Cα only (with N, C and O neglected), all atoms present
[48, 113, 145] and a virtual point method. The major issue resulting from remov-
ing peptide plate atoms is the problem with the accurate definition of a hydrogen
bond between two residues. Cα-only backbone representations attempt to define the
hydrogen bonding potential on Cα coordinates; however, such definitions are rather
inaccurate [24]. At the same time, hydrogen bonds are crucial for maintaining the
proper local geometry of a backbone and for forming secondary structure elements.
Thus, a relatively accurate description of this interaction is required. In a virtual point
approach, originally proposed by Levitt [86], a point is defined in the geometric cen-
ter between two subsequent alpha carbons. This approach has been implemented in
numerous applications, both in intermediate resolution lattice models [62, 64, 66,
100] and in off-lattice Cartesian space models [12, 94]. Based on its coordinates, a
hydrogen bond can be defined with reasonable accuracy. The virtual point also describes
the excluded volume effect of the neglected backbone atoms; ϕ/ψ angles, however,
cannot be defined.
Various methods differ in the coarse graining of side chains. A side chain may be
represented by just one united Side Group (SG) atom, by Cβ plus a united atom (Cβ + SG)
or by a few united atoms. The simplest case, where the whole side chain is modeled just
as a sphere, is also the most inaccurate one. At the next accuracy level, the whole side
chain is represented by an ellipsoid [90] or by a Cβ and a united sphere of all the other
side chain atoms [68]. These two representations enable satisfactory accuracy with
reasonable computational cost. The advantage of the Cβ + SG approach is that four of
the 20 biogenic amino acids (G, A, S and C) are already accurately represented and for
a few others the approximations are rather small. The most challenging, however, are
the long side chains incorporating both polar and aliphatic moieties, such as LYS or
TRP. For a better description of these entities, finer coarse graining must be defined
with more than two united atoms substituting a side chain [8, 11].
The choice of atoms used to represent a protein chain is strongly connected to
the set of the degrees of freedom to be sampled. In Cα-only, Cα + SG and similar
approaches, conformational search is done in the Cartesian space. In many cases,
however, the conformational space has fewer than 3N DOFs, because the positions
of some of these atoms depend on the others. In CABS [68], for example, each
residue comprises up to 4 atoms, but only the Cα atom is independent. All the remaining
atomic positions are unambiguously defined by the Cα trace. By contrast, in
the SICHO model [63, 72, 139], which has two interaction centers, Cα and SG, only SG
is independent and the Cα coordinates are back-calculated from the SG positions. In
another example, the Rosetta model definition [113] is based on all backbone atoms and
SG, but the conformation of a peptide chain is defined by only three degrees of freedom
per residue: the dihedral angles ϕ, ψ and ω.
Another method used to increase the efficiency of a computational
model is the discretization of the conformational space. It has been realized since
the early days of protein simulations [106] that even a small set of distinct states
allowed for a residue can represent a projected structure with reasonable accuracy.
Such a set of selected states can be easily defined when a conformation is described
by its internal coordinates. For models defined in the Cartesian space a lattice (grid)
is used to limit the search space. In practice a set of basis vectors is defined to connect
any two Cα atoms that follow each other in a protein chain. This implies that any
conformation of a chain of N residues can be uniquely written as N−1 integer indices
that refer to particular vectors in the basis set. Other atoms of the CG representation
(such as SG) may or may not be restricted to the grid.
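
The index-based encoding of a lattice chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the basis set is a toy one (all integer vectors of squared length 1 or 2), the helper names build_basis, encode and decode are hypothetical, and the position of the first Cα atom is stored separately from the N−1 indices.

```python
from itertools import product

def build_basis(min_sq, max_sq):
    """All integer 3-vectors whose squared length lies in [min_sq, max_sq]."""
    r = int(max_sq ** 0.5) + 1
    return [v for v in product(range(-r, r + 1), repeat=3)
            if min_sq <= v[0] ** 2 + v[1] ** 2 + v[2] ** 2 <= max_sq]

def encode(trace, basis):
    """Turn N lattice points into N-1 indices into the basis set."""
    index_of = {v: k for k, v in enumerate(basis)}
    return [index_of[tuple(b - a for a, b in zip(p, q))]
            for p, q in zip(trace, trace[1:])]

def decode(start, indices, basis):
    """Rebuild the lattice points from the first point and the indices."""
    trace = [tuple(start)]
    for k in indices:
        trace.append(tuple(c + d for c, d in zip(trace[-1], basis[k])))
    return trace

# Toy basis (an assumption): 6 axis vectors plus 12 face diagonals.
basis = build_basis(1, 2)
trace = [(0, 0, 0), (1, 0, 0), (1, 1, 1), (2, 1, 1)]
codes = encode(trace, basis)
restored = decode(trace[0], codes, basis)
```

Storing small integers instead of coordinates is also what makes look-up tables indexed by vector indices practical.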

2.2 Force Field

The already mentioned methods: SICHO, CABS and Rosetta [68, 60, 113] use only
three degrees of freedom per residue; however, they employ more than one center
to define the interactions of a particular residue. All these atoms, united atoms and
virtual points are used to calculate geometric properties, such as distances and planar
and dihedral angles. These properties underlie the definition of the energy function
of a system. The definition usually assumes a very complex mathematical form of
the function which we discuss in detail below. The mathematical formula must be
completed with a (possibly large) set of parameters, such as various constants, scaling
factors, etc. In the case of all-atom models used for biomolecular studies, the param-
eters can be derived from experimental data, such as small molecule measurements.
This is, however, not possible in the case of a CG model, simply because none of the
models reviewed here exists in the real world and many of their properties cannot be
measured. Therefore, the energy function for a CG model comprises at least partially
statistical potentials of mean force. The construction of such force fields has become
a research discipline on its own. Here we provide a very basic description of mean
field force fields and focus on differences between particular CG approaches.
The evaluation of the energy of a particular conformation requires computation of
relevant geometrical properties (for example distances or angles). This requires
recalculation of the Cartesian representation of a biomolecule if a CG model is defined in
internal coordinates [107]. Lattice models, on the other hand, use hashing and store
some local geometrical properties, such as planar or dihedral angles, vector products
etc., in look-up tables. Moreover, the energy function may be conveniently stored in
an array and indexed by a distance bin or vector indices.
Hydrogen bonding is one of the indispensable terms of the force field. Numerous
formulations have been proposed in the literature, based on different atom
types, which capture local geometrical properties of the main chain in different ways
[24, 35, 73, 102]. For better recapitulation of the local geometry of secondary structure
elements, correlations between neighboring hydrogen bonds may be modeled
explicitly by an additional potential [68, 88, 92].
Another very important energy component is the one corresponding to hard core
repulsion between atoms, often described as an excluded volume term. A steeply
rising function may be used to model this interaction, such as the repulsive $r^{-12}$
term of the Lennard-Jones potential. Relevant radii for united atoms are computed as an
average over all the relevant conformations of a group that has been coarse grained
into a sphere. Hard core repulsion in low- to medium-resolution on-lattice models
may be evaluated instantly by a single look-up in the 3D matrix that stores the
lattice space.
The attractive pairwise potential is established by the Boltzmann inversion of
relevant statistics extracted from known protein structures. The potential may depend
solely on the distance between interacting partners; in other approaches it takes into
account the mutual orientation of the groups and their neighborhood [12, 68].
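
A minimal sketch of the Boltzmann inversion of a distance distribution is given below. It is not the procedure of any particular force field: the observed distances are synthetic, the uniform reference state is an arbitrary assumption (the choice of the reference state is a research problem on its own), and the function name, bin width and pseudocount are hypothetical.

```python
import numpy as np

def boltzmann_inversion(observed, reference, bin_edges, kT=1.0, pseudocount=1e-6):
    """E(bin) = -kT * ln(P_observed(bin) / P_reference(bin))."""
    obs, _ = np.histogram(observed, bins=bin_edges)
    ref, _ = np.histogram(reference, bins=bin_edges)
    # The pseudocount keeps empty bins from producing log(0).
    p_obs = obs / obs.sum() + pseudocount
    p_ref = ref / ref.sum() + pseudocount
    return -kT * np.log(p_obs / p_ref)

# Synthetic data standing in for statistics from known structures (assumption).
rng = np.random.default_rng(0)
observed = rng.normal(6.0, 1.0, size=5000)     # contact distances, in Å
reference = rng.uniform(3.0, 12.0, size=5000)  # hypothetical reference state
edges = np.arange(3.0, 12.5, 1.0)              # 1 Å distance bins
centers = 0.5 * (edges[:-1] + edges[1:])
energy = boltzmann_inversion(observed, reference, edges)
```

Distances that are over-represented relative to the reference state receive negative (attractive) energies, and the resulting array can be indexed directly by distance bin.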
Local backbone conformation and secondary structure formation are controlled
by mean force potentials encoding local correlations between degrees of freedom.
Typically, the potentials also depend on the amino acid sequence and encode
propensities of particular amino acid types to form a given secondary structure. The actual
formulation of these potentials depends on how the main chain is represented in the
model. In the cases where all backbone atoms are available, Ramachandran-type
energy maps are utilized. Otherwise, local interactions depend on local distances,
for example between the i-th and (i+2)-th Cα atoms, usually denoted as $R_{13}$, as well
as on $R_{14}$ and $R_{15}$. Another choice is to define energy terms based on planar and
dihedral angles between successive Cα atoms.
A CG force field is often completed by terms that mimic solvent-induced effects and
long range electrostatics. Examples of such terms include a centrosymmetric
(compacting) potential and various environmental terms.

2.3 Conformational Sampling

Even a CG reduced system is still characterized by a very large number of degrees
of freedom (approximately $10^2$–$10^3$). Due to the high density of a molecular system
and covalent bonds between atoms (or virtual bonds that connect pseudo- and united
atoms), the potential energy hypersurface is extremely rugged (see [142] and references
therein). The motion along many of the DOFs may be impossible due to high
energy barriers. A number of methods are used to explore this space; the
most common, however, are Monte Carlo (MC) and Molecular Dynamics (MD) approaches.
Unlike other, specialized methods, these two are general and flexible. Both produce
low-energy ensembles, which elucidate protein dynamics.
The general idea of the Monte Carlo method [99] may be applied to biomolecular systems
in numerous ways. Probably the simplest one is simulated annealing [57] which uses
the Metropolis criterion [98] to construct the Boltzmann ensemble of states at an
arbitrary temperature. Simulation starts from a high-temperature conformation where
the system undergoes large configurational changes but its energy is relatively high.
Gradual cooling leads the system to adopt a conformation in a local energy
minimum. Repeating this process leads to the exploration of the energy landscape.
However, the chance of finding the global minimum of a biomolecular system is
relatively low. This problem can be alleviated to a great extent using the Replica
Exchange Monte Carlo (REMC), also known as Parallel Tempering [31, 38, 42, 132].
In this approach many (usually a few to tens of) simulations are run simultaneously.
Each simulation runs a separate non-interacting copy (called a replica) of the same
system using isothermal Metropolis MC. Occasionally two systems ($X_i$ and $X_j$)
exchange their temperatures. The system $X_i$ which has been so far simulated at $T_i$
goes to $T_j$ and $X_j$ goes from $T_j$ to $T_i$. The probability $p$ of this exchange is
determined by temperatures $T_j$ and $T_i$ as well as by the energies of the two systems,
$E_i$ and $E_j$:

$$p = \min(1, \exp(\Delta)) \qquad (1)$$

where $\Delta$ is expressed by:

$$\Delta = (1/T_i - 1/T_j)(E_i - E_j) \qquad (2)$$

Such an algorithm constructs a Markov chain over a number of Markov chain pro-
cesses. The exchange between the structures in different replicas facilitates relaxation
of structures that might otherwise be trapped in local energy minima. The density
of states of the sampled system can be recalculated by a histogram reweighting
technique [27, 28, 39, 80]. The Parallel Tempering algorithm can also be applied to
Molecular Dynamics simulations [131].
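
The replica exchange scheme, with the swap probability of Eqs. (1) and (2), can be sketched as follows. This is a toy illustration, not the implementation of any of the reviewed packages: the "system" is a single coordinate in a double-well potential, temperatures are given in energy units (Boltzmann constant set to 1, an assumption), only neighboring temperatures attempt swaps, and all function names are hypothetical. Swapping the conformations held at two temperatures is equivalent to exchanging the temperatures of the two systems.

```python
import math
import random

def swap_probability(T_i, T_j, E_i, E_j):
    """Eqs. (1) and (2): p = min(1, exp(delta))."""
    delta = (1.0 / T_i - 1.0 / T_j) * (E_i - E_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def metropolis_step(x, T, energy, rng, step=0.5):
    """One isothermal Metropolis move of a single coordinate."""
    trial = x + rng.uniform(-step, step)
    dE = energy(trial) - energy(x)
    if dE <= 0 or rng.random() < math.exp(-dE / T):
        return trial
    return x

def remc(energy, temperatures, n_sweeps, swap_every, rng):
    """xs[k] is the conformation currently simulated at temperatures[k]."""
    xs = [rng.uniform(-2.0, 2.0) for _ in temperatures]
    for sweep in range(n_sweeps):
        xs = [metropolis_step(x, T, energy, rng)
              for x, T in zip(xs, temperatures)]
        if sweep % swap_every == 0:
            i = rng.randrange(len(xs) - 1)  # pick a pair of neighbors
            j = i + 1
            p = swap_probability(temperatures[i], temperatures[j],
                                 energy(xs[i]), energy(xs[j]))
            if rng.random() < p:
                xs[i], xs[j] = xs[j], xs[i]
    return xs

def double_well(x):
    """Toy energy function with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

final = remc(double_well, [0.05, 0.2, 0.8], n_sweeps=2000, swap_every=10,
             rng=random.Random(0))
```

A swap that moves the lower-energy conformation to the lower temperature gives a non-negative exponent and is always accepted; the opposite swap is accepted only with probability exp(Δ).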
There are also many variants of Molecular Dynamics [51, 82, 101]. In its standard
formulation, the trajectory of a molecular system is calculated by solving Newton's
equations of motion at each time step. The forces on the system are computed
as the negative gradient of the potential energy function (the force field) which is dependent
on the positions of atoms. To reduce the computational complexity, some limitations
may be imposed on the range of interactions between atoms or united atoms.
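
As an illustration of how a trajectory is propagated, the sketch below integrates Newton's equations with the velocity Verlet scheme for a toy chain of three beads joined by harmonic virtual bonds. The integrator choice, the force field, the reduced units and all parameter values are assumptions made for this example only.

```python
import numpy as np

def bond_forces(pos, k=1.0, r0=1.0):
    """Forces (negative energy gradient) for a chain with harmonic bonds."""
    bond = pos[1:] - pos[:-1]
    length = np.linalg.norm(bond, axis=1, keepdims=True)
    grad = k * (length - r0) * bond / length  # dE/d(pos[i+1]) for each bond
    forces = np.zeros_like(pos)
    forces[1:] -= grad
    forces[:-1] += grad
    return forces

def total_energy(pos, vel, mass=1.0, k=1.0, r0=1.0):
    """Kinetic plus harmonic bond energy of the chain."""
    length = np.linalg.norm(pos[1:] - pos[:-1], axis=1)
    return 0.5 * mass * np.sum(vel ** 2) + 0.5 * k * np.sum((length - r0) ** 2)

def velocity_verlet(pos, vel, force_fn, mass, dt, n_steps):
    """Propagate positions and velocities for n_steps time steps."""
    f = force_fn(pos)
    for _ in range(n_steps):
        vel = vel + 0.5 * dt * f / mass  # half kick
        pos = pos + dt * vel             # drift
        f = force_fn(pos)                # forces at the new positions
        vel = vel + 0.5 * dt * f / mass  # second half kick
    return pos, vel

# Three beads with slightly stretched bonds, initially at rest.
pos0 = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [2.1, 0.0, 0.0]])
vel0 = np.zeros_like(pos0)
pos1, vel1 = velocity_verlet(pos0, vel0, bond_forces, mass=1.0, dt=0.01,
                             n_steps=1000)
e0, e1 = total_energy(pos0, vel0), total_energy(pos1, vel1)
```

With a sufficiently small time step the total energy should stay nearly constant, which is a common sanity check for both the integrator and the force routine.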
As in Monte Carlo sampling, Cartesian coordinates may be substituted by general-
ized variables. Practical examples include all-atom [1, 97] and CG simulations [93].
This approach indeed allows for a significant increase of the integration time step.
However, the applicability of this approach is limited, since force evaluation
requires a recalculation of the Cartesian coordinates of the system (which involves
a time-consuming matrix inversion) at every time step.

3 Representative CG Methods

Above, we described all the major components of a coarse-grained model. Now let
us summarize a few well-established computational models with particular emphasis
on these elements. The models differ in the level of coarse graining and the number
of degrees of freedom utilized to define a polypeptide chain. For convenience, the
key features of the models are presented in Table 1.

Table 1 Comparison of selected CG methods


| Method | Protein representation | Conformational space | Coordinate system | Sampling scheme |
|---|---|---|---|---|
| Levitt and Warshel model | Cα, SG | Continuous space | Angular | Molecular dynamics |
| CABS | Cα, Cβ, SG and virtual point at the peptide bond center | Cubic lattice with 0.61 Å spacing which restricts Cα positions | Cartesian | Replica exchange Monte Carlo |
| SICHO | SG, Cα | Cubic lattice with 1.45 Å spacing which restricts SG positions | Cartesian | Monte Carlo |
| Rosetta | All backbone atoms and SG or all-atom representation | Continuous space | Angular | Monte Carlo |
| UNRES | Cα, SG and peptide group | Continuous space | Cartesian/angular | Mesoscopic molecular dynamics and Monte Carlo |

3.1 The Original Cα + SG Model

The model originally proposed by Levitt and Warshel in 1975 [87] uses two inter-
action centers per residue: the Cα atom, which is modeled explicitly, and the side
chain, represented by an SG sphere. Each residue is allowed one degree of freedom
only: the torsion angle defined by four successive Cα atoms. Interactions between
side chains are modeled by a van der Waals type potential. The radius of each united
atom representing a side chain is calculated as the average radius of gyration of the
particular group. Another important contribution to the potential energy is the side
chain-solvent interaction estimated by the experimental free energy of transfer from
water to ethanol. The force field is completed by local interaction expressed as a
Fourier expansion function of the torsion angle defined by four Cα atoms. Classi-
cal molecular dynamics is used to sample the conformational space. Simulations of
the bovine pancreatic trypsin inhibitor sometimes produced structures resembling
the native fold, with the best structures having root-mean-square deviation from the
native in the range of 6.5 Å. In his later works [86], Levitt introduced an additional
degree of freedom for each residue, namely the planar angle between three adjoining
Cα atoms. A virtual atom has also been added in the middle of a Cα-Cα vector for a
more accurate definition of hydrogen bonding interactions.
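
The single degree of freedom per residue in this model, the torsion angle defined by four successive Cα atoms, and a Fourier-type local energy term can be illustrated as follows. This is a generic sketch: the sign convention of the angle is whatever this particular formula produces, and the Fourier coefficients are made-up numbers, not the parameters of the original force field.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle (radians) defined by four points, e.g. successive Cα atoms."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.arctan2(y, x))

def fourier_torsion_energy(theta, a, b):
    """E(theta) = sum_n [a_n cos(n theta) + b_n sin(n theta)], n = 1..len(a)."""
    n = np.arange(1, len(a) + 1)
    return float(np.sum(a * np.cos(n * theta) + b * np.sin(n * theta)))

# Four points forming a planar cis arrangement (torsion angle of zero).
cis = dihedral(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]),
               np.array([0.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0]))
# Moving the last point out of the plane gives a right-angle torsion.
perp = dihedral(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]),
                np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0]))
# Hypothetical coefficients, for illustration only.
e_cis = fourier_torsion_energy(cis, np.array([1.0, 0.5]), np.array([0.3, 0.2]))
```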

3.2 CABS

The coarse-grained representation of the CABS model [68] uses up to four interaction
centers per residue: Cα, Cβ, the center of mass of the side group and a virtual point
placed at the center of each peptide bond (see Fig. 2a). The Cα trace of the model
is restricted to an underlying cubic lattice with a spacing of 0.61 Å. In lattice units,
the distance between consecutive Cα atoms varies from $29^{1/2}$ to $49^{1/2}$. This implies
that the Cα-Cα distance is allowed to fluctuate between 3.29 and 4.27 Å. There are
800 possible orientations (lattice vectors) of the virtual Cα-Cα bond. Therefore, the model
essentially avoids any lattice-related artifacts. Cβ atoms and side chains are located
off-lattice, and their positions are calculated for each residue using the coordinates of
three consecutive Cα atoms as a reference frame. For each amino acid, two distinct
conformations are defined which mimic the averaged side chain position found in
helical and expanded conformations. The rotamer type is uniquely defined by the on-lattice
Cα trace; hence, a protein chain comprising N residues has 3N independent
degrees of freedom.
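
The numbers quoted above can be checked with a short enumeration. The sketch below lists all integer lattice vectors whose squared length lies between 29 and 49 and converts their lengths to Ångströms using the 0.61 Å spacing; it relies only on the values given in the text and does not reproduce any further restrictions the actual CABS implementation may impose.

```python
import math
from itertools import product

SPACING = 0.61  # lattice spacing in Å, as given in the text

# All integer vectors with squared length between 29 and 49 lattice units.
vectors = [v for v in product(range(-7, 8), repeat=3)
           if 29 <= sum(c * c for c in v) <= 49]
lengths = [SPACING * math.sqrt(sum(c * c for c in v)) for v in vectors]

n_vectors = len(vectors)
shortest, longest = min(lengths), max(lengths)
```

The enumeration should yield 800 vectors, with bond lengths spanning roughly 3.29 to 4.27 Å, in agreement with the description of the model.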

3.3 SICHO

The most distinctive feature of the Side Chain Only model [63, 72, 139] is the definition
of the polypeptide chain. Each residue is represented as a spherical united atom which
substitutes its side chain (see Fig. 2b). The united atoms are restricted to a cubic grid
with 1.45 Å spacing. The chain vectors representing virtual bonds between interaction
centers are of variable length, ranging from $9^{1/2}$ to $30^{1/2}$ lattice units. Unlike
in other protein models, Cα atom positions are not independent degrees of freedom.
Instead, they are uniquely defined in a local frame of three neighboring side
chains and are recalculated after any conformational change. The knowledge-based
force field is defined based on both Cα and side chain centers and includes a chain
stiffness potential, a secondary structure bias, short-range interactions, hydrogen-bond
interactions, and long-range interactions. Such deeply coarse-grained models
are computationally very efficient [130], and they can be effective in difficult tasks
of structure prediction and studies of large scale protein dynamics, provided that the
resolution of the model structures allows for atom-level reconstruction. Even lower
resolution realistic models of proteins can be designed if the crude structure
representation is compensated by specific patterns of knowledge-based statistical
potentials [23].

Fig. 2 Comparison of representations of four CG models: (a) CABS, (b) SICHO, (c) Rosetta centroid, (d) UNRES

3.4 Rosetta

Rosetta [113, 129] utilizes a library of short peptide fragments (typically 3 and 9
residues long) as a Monte Carlo move set. In practice a fragment is defined by three
internal coordinates (ϕ, ψ and ω backbone dihedral angles) per residue. Each time a
fragment is inserted, a number of consecutive DOFs (9 or 27, for 3-mers and 9-mers,
respectively) are affected in the simulated polypeptide chain. The fragments themselves
are extracted from known protein structures [40]. Such a sampling method
reduces the conformational space, changes the respective DOFs in a correlated manner
and introduces a strong bias toward protein-like geometries. Rosetta utilizes
two representations: a coarse-grained, termed “centroid” (shown in Fig. 2c) and an
all-atom one. In both representations the protein backbone is treated explicitly. In
the centroid mode, each side chain is represented by a united atom located at the
side-chain center of mass. In the high-resolution mode, atomic coordinates for all
side-chain atoms, including hydrogens, are utilized. Side chains are restricted to
discrete conformations as described by a backbone-dependent rotamer library. The
Rosetta energy function is different for the two representations and in both cases it
comprises numerous mean-field terms.
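
A fragment insertion move of the kind described above can be sketched as follows. This is not Rosetta code and does not use its API: the chain is a plain list of (ϕ, ψ, ω) tuples in degrees, the fragment library is a hypothetical dictionary mapping a window start position to candidate fragments, and the acceptance test (for example the Metropolis criterion) is left out.

```python
import random

def insert_fragment(torsions, fragment, position):
    """Return a copy of the chain with the fragment's angles copied in."""
    new = list(torsions)
    new[position:position + len(fragment)] = fragment
    return new

def fragment_move(torsions, library, rng):
    """Pick a window and one of its candidate fragments at random."""
    position = rng.choice(sorted(library))
    fragment = rng.choice(library[position])
    return insert_fragment(torsions, fragment, position), position

# Extended 12-residue chain and a one-entry toy library (assumptions).
chain = [(-150.0, 150.0, 180.0)] * 12
library = {2: [[(-60.0, -45.0, 180.0)] * 3]}  # a helical 3-mer for window 2
trial, where = fragment_move(chain, library, random.Random(0))
changed = sum(1 for old, new in zip(chain, trial) if old != new)
```

Inserting a 3-mer replaces three consecutive (ϕ, ψ, ω) triples at once, which is what makes the move both correlated and biased toward geometries seen in known structures.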

3.5 UNRES

In the UNited RESidue model [90] the protein backbone is reduced to a sequence
of Cα atoms and a united peptide group (p) connected by virtual bonds (Fig. 2d).
United side-chains are attached to the α-carbons (SG). In the most recent version of
UNRES, the positions of these atoms are defined by internal Cartesian coordinates
(vectors of the virtual bonds). Previously, planar and torsion angles were used as a set
of generalized coordinates [91]. UNRES employs a physics-based mean-field force
field for simulations of protein structure and dynamics. The energy function definition
and conformational space sampling methods have evolved over time. Initially
the effective energy function was described as a restricted free energy (RFE) function
or the potential of mean force (PMF) of polypeptide chains in water. Currently,
it is defined as an approximate cumulant expansion of the temperature-dependent
restricted free energy, whose calibration is based on protein-folding thermodynamic data.
UNRES is the only coarse-grained force field which explicitly depends on tempera-
ture and can compute thermodynamic quantities of protein folding.
In UNRES the conformational space search was initially based on the global
optimization of the potential-energy function to find the lowest-energy conforma-
tion. It was performed by stochastic Monte Carlo-based algorithms, namely Monte
Carlo plus Energy Minimization (MCM) [89] and hybrid approaches, such as Con-
formational Space Annealing (CSA) [85] which turned out to be the most effective.
Later, UNRES was extended to mesoscopic Molecular Dynamics (MD) to study
pathways and kinetics of the protein folding process. This implementation of MD
reformulates the conformational sampling as a search for the most probable confor-
mational ensembles with the lowest free energy at temperatures below the folding
transition temperature. The UNRES extension of MD can also be used to simu-
late multichain proteins. To improve the conformational space search, UNRES can
use Replica Exchange Molecular Dynamics (REMD) and Replica Exchange Monte
Carlo (REMC) sampling.
The UNRES coarse-grained model has been successfully applied to the protein
structure prediction problem [76, 78, 105], to study folding trajectories [118] and to
investigate folding process thermodynamics [77, 152].

4 Reconstruction of an All-Atom Representation, Post-processing and Analysis

A coarse grained computational model provides a description of the modeled structure
at a limited resolution. To infer the biological function of the investigated system
or to use the produced model in virtual docking procedures, it is crucial to obtain
its atomistic representation. Figure 3 shows a common, two-step all-atom chain
reconstruction approach that consists of (i) generation of backbone coordinates and
(ii) reconstruction of residue side chains.
The first group of the backbone reconstruction methods [37, 46, 96, 115] relies
on an assembly of fragments derived from the Protein Data Bank [7]. In this approach,
the most probable fragments are selected using energy-based, homology-based or
geometric criteria. Such algorithms can be fast and accurate. However, they have to
maintain large and up-to-date collections of fragments. The second group of methods
utilizes averaged knowledge about backbone geometry. The computations are performed
based on the statistics of backbone atom positions derived from representative
known protein structures.

Fig. 3 Illustration of a two-step procedure of all-atom representation reconstruction

Methods for side chain position prediction [46, 75, 115] are based on sampling
the conformational space by a rotamer library. This involves statistical clustering
of observed side chain conformations in known structures. Other algorithms use
conformer libraries which contain samples of side chains from known protein structures.
In both approaches a scoring function is required to evaluate the quality of the
sampled conformations.
The reconstruction of an all-atom representation from a reduced CG representation
of the protein is an important part of structure modeling pipelines. Such all-atom
models may be directly used for further refinement with molecular mechanics programs
[36] and are essential for later structural studies. Most of the post-processing
applications, such as structure quality assessment, protein-protein interaction
prediction, protein function analysis or ligand docking, require an all-atom model of
the protein [121, 143]. There are many tools available for such model conversion [2,
10, 43, 52, 53, 108], but only a small number of them are commonly used. Below, we
describe selected servers and applications freely accessible for use online. These
methods typically complete their computations within seconds to minutes.

4.1 BBQ

The Backbone Building from Quadrilaterals program [37] is a stand-alone application
for protein backbone reconstruction from the α-carbon trace. It is available
for download from the BioShell website (bioshell.pl). The method uses statistics of
backbone atom positions extracted from a non-redundant database of protein chains
to determine backbone coordinates. In this approach, the Cα trace is divided into
four-residue fragments (quadrilaterals). The quadrilateral conformation is described by
three internal coordinates: distances between the four Cα atoms. The coordinates for
all four-residue sets of a protein trace are discretized with a mesh size of 0.2 Å. These
three distances define a three-dimensional grid in which the average positions of C,
O, and N atoms are measured in a local Cartesian coordinate system. The protein
sequence is not taken into account in the reconstruction process. The BBQ package
was designed to be a fast, robust and maximally accurate tool for backbone atom
reconstruction.
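
The discretization step can be illustrated with a short sketch that maps each four-residue window of a Cα trace to a grid cell. This is not BBQ code: the choice of the three distances (between atoms 1 and 3, 2 and 4, and 1 and 4 of the window) is an assumption, the function names are hypothetical, and the table of average C, O and N positions that such a key would index is not included. Note also that three distances alone cannot distinguish mirror-image quadrilaterals; this is not handled here.

```python
import numpy as np

MESH = 0.2  # mesh size in Å, as given in the text

def quadrilateral_key(ca4, mesh=MESH):
    """Map four consecutive Cα positions to a cell of the 3D distance grid."""
    d13 = np.linalg.norm(ca4[2] - ca4[0])
    d24 = np.linalg.norm(ca4[3] - ca4[1])
    d14 = np.linalg.norm(ca4[3] - ca4[0])
    return tuple(int(d // mesh) for d in (d13, d24, d14))

def quadrilateral_keys(trace):
    """Keys for all overlapping four-residue windows of a Cα trace."""
    return [quadrilateral_key(trace[i:i + 4]) for i in range(len(trace) - 3)]

# Toy trace with 3.8 Å virtual bonds along the coordinate axes.
trace = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0],
                  [3.8, 3.8, 3.8], [0.0, 3.8, 3.8]])
keys = quadrilateral_keys(trace)
```

In a full implementation each key would be used to look up precomputed average backbone atom positions, expressed in a local frame built on the same four Cα atoms.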

4.2 SABBAC

This online service provides Structural Alphabet-based protein BackBone reconstruction
from Alpha-Carbon trace [96]. It is based on a specific approach to protein
structure fragment selection and assembly.
The Cα trace is encoded in the Hidden Markov Model-derived structural alphabet
which describes conformations of four-residue long fragments [14]. Then, candidate
fragments at each position of the structure are chosen from sets of coordinates pre-
computed in a local reference frame. A full-protein backbone reconstruction is done
by joining fragments using a greedy algorithm and searching for the best combination
of fragments compatible with the Cα trace. The energy criterion is used to determine
the optimality of the combination of fragments.
The SABBAC service has been proven to be fast owing to its fragment library
of reduced size. It can be accessed at http://bioserv.rpbs.jussieu.fr/SABBAC.html.
During computation, side chains can also be added to the model using Scite [30],
which is conveniently combined with the SABBAC service.

4.3 SCWRL4

The SCWRL4 algorithm [75] is a method that reconstructs side chains based on an
input all-atom protein backbone. For each residue type, the input rotamer library
provides statistics such as rotamer frequencies and average dihedral angles. Firstly,
the input backbone coordinates are checked and side-chain coordinates are built
for all rotamers and subrotamers (conformations with dihedral angles ± one stan-
dard deviation from the library). Then, self and pairwise energies are computed and
rotamers with high self-energy are removed from the reconstruction. To represent the
side-chain placement problem, SCWRL4 uses an interaction graph, where vertices
represent residues and edges indicate nonzero interactions between them. A Dead
End Elimination method is used to find the best rotamer assignment. SCWRL4 is
available at http://dunbrack.fccc.edu/scwrl4/.
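
The idea behind Dead End Elimination can be sketched on a toy problem. The example below uses the generic Goldstein elimination criterion on random self and pairwise energies; it is not the SCWRL4 implementation, which works on the interaction graph described above, and all names and energy values are hypothetical. A brute-force search is included only to confirm that the best assignment survives the pruning.

```python
import itertools
import numpy as np

def pair_energy(pair_E, i, r, j, s):
    """Interaction of rotamer r at residue i with rotamer s at residue j."""
    return pair_E[(i, j)][r, s] if i < j else pair_E[(j, i)][s, r]

def goldstein_dee(self_E, pair_E):
    """Repeatedly remove rotamers that are provably not in the best assignment."""
    n = len(self_E)
    alive = [set(range(len(e))) for e in self_E]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i] - {r}):
                    gap = self_E[i][r] - self_E[i][t]
                    for j in range(n):
                        if j != i:
                            gap += min(pair_energy(pair_E, i, r, j, s)
                                       - pair_energy(pair_E, i, t, j, s)
                                       for s in alive[j])
                    if gap > 0:  # r is worse than t in every environment
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

def brute_force(self_E, pair_E):
    """Exhaustive search for the lowest-energy assignment (toy sizes only)."""
    n = len(self_E)

    def total(assign):
        e = sum(self_E[i][assign[i]] for i in range(n))
        e += sum(pair_energy(pair_E, i, assign[i], j, assign[j])
                 for i in range(n) for j in range(i + 1, n))
        return e

    return min(itertools.product(*(range(len(e)) for e in self_E)), key=total)

# Random toy energies for three residues with 3, 4 and 3 rotamers.
rng = np.random.default_rng(1)
sizes = [3, 4, 3]
self_E = [rng.normal(size=k) for k in sizes]
pair_E = {(i, j): rng.normal(scale=0.3, size=(sizes[i], sizes[j]))
          for i in range(3) for j in range(i + 1, 3)}
alive = goldstein_dee(self_E, pair_E)
best = brute_force(self_E, pair_E)
```

The pruning never removes the last rotamer of a residue, and whatever survives can be passed to a more expensive search over a much smaller space.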

4.4 MaxSprout

This automatic database procedure [46] for generating the all-atom representation of
a protein requires the input Cα trace and amino acid sequence. The computations are
split into two basic steps: backbone reconstruction using the Cα trace and side-chain
coordinates prediction using the reconstructed backbone.
During backbone construction, a protein structure database is scanned for frag-
ments that locally fit the alpha carbon trace and candidates for a complete overlap-
ping cover of the chain are matched. The optimal continuous path is then found by a
dynamic programming algorithm which minimizes the mismatch at protein fragment
joints. Final backbone coordinates are taken from fragments superposed on the Cα
trace.
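
The search for an optimal continuous path through candidate fragments can be sketched as a standard dynamic programming recursion. The cost model below is an assumption made for illustration (a per-candidate misfit to the Cα trace plus a mismatch at each joint between neighboring windows), with made-up numbers; it does not reproduce the actual MaxSprout scoring.

```python
def best_path(fit, joint):
    """fit[k][a]: misfit of candidate a at window k.
    joint[k][a][b]: mismatch between candidate a at window k and b at k + 1.
    Returns the candidate chosen at each window and the total cost."""
    cost = [list(fit[0])]
    back = []
    for k in range(1, len(fit)):
        row, ptr = [], []
        for b in range(len(fit[k])):
            # Best predecessor for candidate b at window k.
            prev = min(range(len(fit[k - 1])),
                       key=lambda a: cost[-1][a] + joint[k - 1][a][b])
            row.append(cost[-1][prev] + joint[k - 1][prev][b] + fit[k][b])
            ptr.append(prev)
        cost.append(row)
        back.append(ptr)
    last = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [last]
    for ptr in reversed(back):  # trace the optimal choices backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, cost[-1][last]

# Three windows with two candidate fragments each (hypothetical costs).
fit = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
joint = [[[0.0, 2.0], [1.0, 0.0]],
         [[0.1, 1.0], [3.0, 0.0]]]
path, total = best_path(fit, joint)
```

The cost of the recursion grows linearly with the number of windows and quadratically with the number of candidates per window, instead of exponentially as in an exhaustive enumeration of all combinations.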
Side chain construction starts by generating sets of plausible coordinates from
a library of frequently occurring rotamers based on backbone coordinates. Sub-
sequently, all the rotamer-rotamer interaction energies are calculated. To minimize
intramolecular energy by an optimized choice of the rotamer, a simple and fast Monte
Carlo procedure with simulated annealing is used. When the lowest energy config-
uration is found, the program returns the coordinates of all-atom representation of
the protein. The MaxSprout algorithm is available on-line at http://www.ebi.ac.uk/
Tools/maxsprout/.

4.5 PULCHRA

PULCHRA (“Protein Chain Reconstruction Algorithm”) [115] is a standalone program
for the reconstruction of full-atom protein models from an input α-carbon trace
and amino acid sequence. The backbone reconstruction step in this approach is very
similar to BBQ, as both PULCHRA and BBQ implement the same algorithm. PULCHRA
uses backbone and side chain rotamer libraries, which have been generated from
representative protein crystallographic structures.
The side-chain reconstruction procedure uses the same set of distances and coordinates
as the backbone reconstruction method. For each combination of calculated distances,
there is a list of possible side-chain conformations sorted by decreasing probability
of occurrence in the PDB database. The procedure places side-chain
heavy atoms on the backbone and optimizes their positions to avoid clashes. In
the final step, hydrogen atoms can optionally be added to the full-atom representa-
tion. PULCHRA is freely available for download at http://cssb.biology.gatech.edu/
PULCHRA.

5 Combining CG Models with Comparative Modeling Methods

For very small proteins, CG methods of structure prediction may provide satisfactory
models. However, for the great majority of targets it is necessary to use additional
sources of information. The databases of known protein structures are the most easily
available among them—e.g., the Protein Data Bank (PDB) [114].
Because protein structure has been much more strongly conserved during evolution
than sequence [47], the most straightforward approaches compare the sequences of
known protein structures (templates) with the query sequence. However, the inability
to detect sequence similarity with any of the known structures does not exclude the
existence of a good template. The solution in such cases can be so-called threading
methods, which compare predicted structural features (for example, secondary
structure or burial) of the target and the template [36, 122, 128]. Regardless of
which approach is chosen to detect homology, the aim is to create an alignment that
highlights the similarities between the query and the templates.
Obviously, the level of similarity affects the correctness of template selection and
the quality of the alignments. For easy cases (high similarity), classical approaches
such as PsiBlast [4] almost always provide sufficiently accurate alignment. Therefore,
it is relatively easy to build a good model for the query. However, even in those
cases CG methods can be useful for the local sampling of some regions, such as
loops, which are not defined by the alignment. The difficulty of the problem rapidly
increases with the decreasing level of similarity, not only due to the ambiguity of
the alignments, but also because of differences in the geometry of correctly aligned
regions or suboptimal template selection. One of the most effective approaches to
those problems is the incorporation of CG models. Below we present certain strategies
that incorporate the information obtained with comparative modeling into the CG-
based protocols.

5.1 Reduction of the Sampled Conformational Space

In one of the most straightforward approaches, the query chain is allowed to move in
a tube formed by a chain of spheres surrounding the template structure [70]. In this
method, the query chain is confined within the tube by imposing energetic penalties
for any excursion outside. Thus, the disadvantage of this approach is the limited
degree of possible improvement of the initial model.
The answer to this limitation was the application of a more complex set of restraints
within GENECOMP [60], a method in which the energy function is constructed in
a way which allows two-residue shifts of the target chain along the template. This
feature enables changing the initial alignment, and thus correction of possible errors.
Additionally, the GENECOMP restraining scheme includes two types of restraints:
(i) based on the predicted contacts in the target and (ii) target distances predicted
from the fragment threading procedure.
In more recent studies, the pairwise distances observed in the templates are the
source of restraints for the CABS modeling tool [68]. For the set of templates given
by comparative modeling procedures, distances between all pairs of Cα atoms are
calculated, and the minimum and maximum distances between equivalent pairs of
residues are taken as the limits of the restraint. The restraints are included in the
CABS energy function as trapezoid-shaped potential wells, where the gradient of the
lateral sides depends on the weight of the restraint. The spatial restraints
significantly reduce the conformational space, which decreases computation time and
increases the probability of obtaining a successful model (see Fig. 4a).
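
The derivation of distance limits from several templates, and a flat-bottomed penalty with sloped sides, can be sketched as follows. This is only an illustration of the idea, not the CABS implementation: the function names, the `cap` parameter that flattens the top of the well, and the assumption that all templates are already mapped onto the query numbering without gaps are ours.

```python
import math
from itertools import combinations


def distance_bounds(templates):
    """Return {(i, j): (d_min, d_max)} over all templates.

    templates: list of Cα traces, each a list of (x, y, z) tuples of equal
    length, i.e. already mapped onto the query residue numbering (assumption).
    """
    n = len(templates[0])
    bounds = {}
    for i, j in combinations(range(n), 2):
        dists = [math.dist(t[i], t[j]) for t in templates]
        bounds[(i, j)] = (min(dists), max(dists))
    return bounds


def trapezoid_penalty(d, d_min, d_max, weight=1.0, cap=5.0):
    """Zero inside [d_min, d_max]; linear sides with slope `weight`;
    constant `cap` far outside (hypothetical parameter values)."""
    violation = max(d_min - d, 0.0, d - d_max)
    return min(weight * violation, cap)


def restraint_energy(trace, bounds, weight=1.0, cap=5.0):
    """Sum of trapezoid penalties for a query Cα trace."""
    return sum(
        trapezoid_penalty(math.dist(trace[i], trace[j]), lo, hi, weight, cap)
        for (i, j), (lo, hi) in bounds.items()
    )
```

Any conformation whose Cα-Cα distances all fall between the template extremes receives zero penalty, which is how the templates define the region of conformational space that is sampled.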

Fig. 4 Sample strategies of combining comparative modeling methods with CG models. a T0592
target from CASP9, templates (in gray) define conformational space sampled with CABS. The final
model (navy) is more similar (in terms of GDT_TS) to the native (green) than any of the templates.
b The idea of TRACER. Template scaffold is represented as spheres, query Cα trace as red lines.
Query residues within the gray sphere satisfy the three criteria of the query-template pseudo energy,
while those within the navy sphere satisfy the additional secondary structure identity criterion (see
the text for details)

5.2 Application of the Probability Density Function

A more sophisticated technique was originally used in the Modeller method [117].
In this approach, spatial restraints are defined in terms of a probability density func-
tion (PDF). The PDF used for restraining a certain feature x (distance or angle, for
instance) can be written as P (x|A, B … C). This formula gives a probability density
for x when A, B … C are known. For instance, in Rosetta [134], the feature which
is restrained is the distance between pairs of Cα atoms (r) and PDF is given as a
Gaussian and defined as P(r|G, L, B, D), where G, L, B and D are predictor variables
(see Table 2).
A Gaussian is defined by two parameters: the mean and the standard deviation. The
latter was calculated using a non-redundant database of nearly 8,000 known protein
structures. The HHSearch algorithm [128] was employed to align all pairs of
proteins. The standard deviations of r were computed for 10,000 combinations of
different G, L, B, D values, based on differences in the distances between
equivalent atoms in the aligned structures, and stored in a four-dimensional table
spanned by the values of the predictor variables.
Such a table of standard deviations enables prediction of restraints for a query
sequence aligned with the template. For each pair of Cα atoms (apart from those
closer than 10 Å or separated by fewer than 10 residues along the query sequence),
the values of the four predictor variables are calculated. Then, pairwise distance
Gaussian restraints are assigned: the mean is given by the distance between the
equivalent atoms in the template structure, and the standard deviation is taken from
the table according to the calculated predictor variables.

Table 2 Predictor variables used for deriving restraints for the ROSETTA modeling tool
Variable  Feature                               Value
G         Global alignment quality              −log(E), where E is the HHsearch E-value
L         Residue-pair alignment quality        BLOSUM62 [44] score
B         Burial in the template structure      Number of Cβ atoms within 8 Å of the template residue Cβ
D         Average distance to an alignment gap  Distance, in residues, from the aligned pair to the nearest gap in the sequence alignment
L, B and D are averaged over the pairs of aligned residues; G is constant for the given alignment

It is also possible to combine predictions from multiple templates as a weighted
mixture of Gaussians. Such restraints can be combined with the Rosetta energy
function by adding a component equal to $\sum_{i,j} -\ln P(d_{i,j})$, where the
summation is done over pairs of residues and $P(d_{i,j})$ is the probability of the
distance $d_{i,j}$ given by the calculated PDF.
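
As a minimal sketch of this restraint term, the negative log-probability of a set of Cα-Cα distances under per-pair Gaussian mixtures can be computed as shown below. The data layout (dictionaries keyed by residue pairs, one `(weight, mean, sigma)` tuple per template) and the small `floor` that guards against taking the logarithm of zero are assumptions for illustration and do not reproduce the Rosetta code.

```python
import math


def gaussian_pdf(d, mean, sigma):
    """Normal probability density at distance d."""
    z = (d - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(d, components):
    """Weighted mixture of Gaussians, one (weight, mean, sigma) per template.

    The mean would be the template distance and sigma the value looked up in
    the table of standard deviations; weights are normalised here.
    """
    total_w = sum(w for w, _, _ in components)
    return sum(w * gaussian_pdf(d, m, s) for w, m, s in components) / total_w


def restraint_score(distances, restraints, floor=1e-300):
    """Sum over restrained pairs of -ln P(d_ij).

    distances:  {(i, j): current model distance}
    restraints: {(i, j): list of (weight, mean, sigma)}
    """
    return sum(
        -math.log(max(mixture_pdf(distances[pair], comps), floor))
        for pair, comps in restraints.items()
    )
```

For a single Gaussian the term reduces to a harmonic penalty, (d − mean)² / (2σ²), plus a constant, so a larger tabulated standard deviation translates directly into a softer restraint.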

5.3 Unification of Comparative Modeling Methods with CG Models

In the above-mentioned strategies, homology inference data (usually in the form of
distance restraints) are used as input for CG methods. TRACER [69, 136] is an
approach which unifies those two steps. The method uses the CABS representation of
the protein conformational space and its force field. The most important extension
of the model is the incorporation of the α-carbon trace template, represented as a
fuzzy three-dimensional scaffold with assigned multi-featured properties (Fig. 4b).
The query chain is forced to “align” with the template by an additional
query-template similarity pseudo energy component introduced to the CABS energy
function. This component is a sum over pairs of residues of the query chain and the
template that are not further apart than a certain cut-off. The value of
query-template similarity pseudo energy for the ith query residue and the jth
template residue depends on:

(i) amino acid similarity (a quarter of the negative value of the BLOSUM62 substitution matrix entry [44]; cut-off: 4 Å)
(ii) similarity of hydrophobic/hydrophilic features (a quarter of the negative value of the product of the Kyte-Doolittle indexes [83]; cut-off: 4 Å)
(iii) similarity of the orientation and direction of the chains in the vicinity of the ith and jth residues (−1 if the angle between the flanking Cα-Cα vectors is smaller than 90°; cut-off: 4 Å)
(iv) identity of the secondary structures (helical or extended) of the fragments containing the ith and jth residues (−1 if identical; cut-off: 2.5 Å)
As in the CABS model, the conformational space is sampled by the REMC scheme. The
conformational updates include those originally applied in CABS, i.e., modifications
of small fragments (2–4 pseudo-bonds of the Cα trace), and, additionally,
rearrangements of larger parts of the chain consisting of up to 22 residues. These
larger-scale modifications enable effective sampling of the scaffold, which
corresponds to changing the alignment between the query and the template.
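
The exchange step of a replica exchange Monte Carlo (REMC) scheme can be sketched generically as below. This is the standard Metropolis criterion for swapping conformations between neighbouring temperatures, written in reduced units (k_B = 1); the function names and the bookkeeping through an `order` list are illustrative assumptions and are not taken from the CABS or TRACER code.

```python
import math
import random


def swap_probability(e_i, e_j, t_i, t_j):
    """Acceptance probability for exchanging the conformations held at
    temperatures t_i and t_j, with energies e_i and e_j (k_B = 1)."""
    delta = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(order, energies, temperatures, rng, offset=0):
    """Try exchanges between neighbouring temperature levels.

    order[k]     -- index of the conformation currently at temperatures[k]
    energies[c]  -- energy of conformation c
    offset       -- 0 or 1, to alternate between even and odd neighbour pairs
    Returns the updated order.
    """
    order = list(order)
    for k in range(offset, len(temperatures) - 1, 2):
        a, b = order[k], order[k + 1]
        p = swap_probability(energies[a], energies[b],
                             temperatures[k], temperatures[k + 1])
        if rng.random() < p:
            order[k], order[k + 1] = b, a
    return order
```

Between exchange attempts each replica would perform its own local and larger-scale conformational updates at a fixed temperature; the swaps let low-energy conformations migrate towards the coldest replica.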
TRACER significantly extends the application of comparative modeling methods,
especially to regions of very low or even undetectable sequence identity. However,
the major drawback of the current version of TRACER, in comparison to some other
methods described in this section, is its inability to use more than one template.

6 Evaluation of CG Models in CASP9

CASP (Critical Assessment of Techniques for Protein Structure Prediction) [104] is
a unique opportunity to evaluate the performance of computational methods in
protein structure prediction. CASP is a blind experiment, since the target structures are
not published until the end of the prediction period. Therefore, it is possible to fairly
assess and compare different prediction methods under the same conditions. The most
successful groups, which took part in the CASP9 experiment, employed multistage
methodologies, which typically utilize several independent methods, such as: con-
sensus homology detecting tools, modeling methods, quality assessment procedures,
optimization and refinement methods. For example, the top ranked group in CASP9,
MUFOLD [148], used techniques such as consensus constraints-based model con-
struction and the Multi Dimensional Scaling Technique—a machine learning method
for quality assessment—and, finally, model refinement by the combination of model
and template information.
On the other hand, if one ranks the methods by the number of best models (among all
predictions submitted to CASP9 as the top model; each group may send up to 5
models), the best four methods are based on CG modeling tools described before in
Sects. 1, 2 and 3 of this chapter (see Table 3). Below we briefly evaluate how the
performance of the CG-based methods depends on the difficulty of the targets.
Figure 5 compares the single-method groups presented in Table 3 (PRMLS is excluded
because it also utilizes the non-CG Modeller method), taking into account the
annotation of targets into categories: FM (Free Modeling), TBM (Template-Based
Modeling) and FM/TBM. In the FM category the leading groups (ZHANG_AB_INITIO,
BAKER) use fragment-assembling approaches (QUARK, Rosetta). In the intermediate
difficulty category, FM/TBM, the LTB (CABS) group significantly outperforms the two
other methods; however, it is necessary to note that this category contains only
three targets.

Table 3 Top groups in CASP9 in terms of the number of models with the highest GDT_TS score submitted to CASP9 as first models
Group name (number)    Method            Models with the highest GDT_TS  Mean GDT_TS (all server/human targets)  Rank in CASP9
PRMLS (65)             Rosetta/Modeller  7                               54.10                                   12
LTB (400)              CABS              5                               51.86                                   28
BAKER (172)            Rosetta           5                               51.77                                   29
ZHANG_AB_INITIO (418)  QUARK             4                               52.95                                   18

Fig. 5 Differences for three difficulty categories between mean GDT_TS for a particular group
and the mean GDT_TS for models submitted to CASP9 by all groups. [Bar chart: ΔGDT_TS (vertical
axis, 0–14) per category (FM, FM/TBM, TBM) for the BAKER, LTB and ZHANG_AB_INITIO groups]

The statistics of the TBM category show that using CG methods in easy cases
of comparative modeling is not the best choice. In this category the highest mean
model quality was achieved by the ZHANG_AB_INITIO group, which used models
provided earlier by automated prediction servers instead of using QUARK.
However, this does not mean that CG methods cannot provide successful models. For
instance, all five targets for which the best model was submitted by the LTB
group belong to the TBM category. The lower mean quality of the models results
from the low consistency of prediction quality.
CASP is a biennial experiment initiated over 20 years ago. One of the most
intriguing questions regarding this undertaking is the progress in the field.
Unfortunately, evaluating the progress is not an easy task due to the differences in
the difficulty of the targets in various CASP editions. However, a general tendency
can be observed: after dramatic improvements in the early editions, progress in the
most recent ones has been modest [79].
The latest CASP experiments confirm this relatively slow but steady progress in
theoretical structure prediction [26, 55, 56, 103]. Combinations of coarse-grained
modeling strategies with careful bioinformatics analysis of sequence similarities
and final selection/refinement prove to be the most efficient [54, 133, 146, 149].

7 Example Case: CG Prediction of Loop Conformations

The so-called loop closure problem has been a focus of research from the earliest days
of computational protein modeling [32, 33, 144]. The prediction of loop structure is
often the most difficult challenge in comparative modeling efforts [36]. The accuracy
of homology models is usually the lowest in loop regions. Since loop regions often
exhibit very low sequence conservation, they have to be modeled without a structural
template. In that case, simple homology modeling methods cannot be used. To
illustrate some of the applications of the CG approach to protein structure
prediction, we briefly review recent modeling efforts using the CABS CG model
toward the accurate prediction of protein loop conformations.
In the benchmark study of loop modeling methods [49] the performance of the
following tools was compared: MODELLER, ROSETTA, CABS and a combination
of MODELLER with CABS. MODELLER [25] is commonly considered a standard
comparative modeling package. It employs explicitly designed loop modeling strate-
gies relying on the optimization-based approach (conjugate gradients and molecu-
lar dynamics with simulated annealing). ROSETTA and CABS, in turn, employ a
knowledge-based driven search of the discretized conformational space. These meth-
ods were tested on a large set of loops of various lengths (4–25 residues). The tests
showed that classical modeling with MODELLER gives more accurate predictions
for short loops, while CG de novo modeling by CABS performs better for longer
loops. In the cases of long gaps in protein structures (~20 residues), loops were pre-
dicted by CABS with medium or medium-low resolution (RMSD on the level of
2–6 Å from the native). Results of similar quality were obtained for the structure
prediction of three extracellular loops of 13 G-protein coupled receptors (GPCRs)
by a de novo CABS procedure [59, 61]. This modelling task was particularly chal-
lenging for the de novo blind prediction method, as all three extracellular loops were
fully flexible during the prediction procedure. Still, the best resulting conformations
showed RMSD values lower than 3 Å from the experimental structure (see Fig. 6).
Previous benchmark studies, aimed at the prediction of missing protein structure
fragments, also indicated that the CG models (an early version of CABS and two
other tools based on similar principles) performed relatively well in the range of large
fragments [11].

Fig. 6 Structure prediction of GPCR loops using the de novo CABS method [59]. The picture
shows the best models for the second extracellular loop (EL2) in muscarinic acetylcholine receptor M2
(CHRM2), neurotensin receptor type 1 (NTSR1) and mu-type opioid receptor (OPRM1). The predicted
loops are shown in red (EL2) and green (EL1 and EL3), the reference loops (crystallographic
structure) and the adjacent intracellular receptor structures in silver. The resulting conformations
of the longest loop, EL2, exhibited the following RMSD values from the experimental conformation:
2.65 Å for CHRM2 (15 residues), 2.99 Å for NTSR1 (21 residues) and 1.92 Å for OPRM1 (17
residues)

As shown by Jamroz and Kolinski [49] CG models can be effectively used for
the prediction of loop structures in combination with other techniques. Namely,
top ranked models generated by MODELLER were used as multiple templates for
CABS modeling. As a result of such a hybrid procedure, the predicted models were
on average more accurate than those from the single individual methods.

8 Example Case: CG Molecular Docking of Peptides to Protein Receptors

Molecular docking is a challenging problem of structural biology and medicine [18,
29]. Subtle energetic effects usually play the main role in docking small ligands
to proteins or biomolecular complexes. In these cases, a straightforward application
of CG models may be difficult or impractical. Docking of large molecules, however,
in which conformational effects are the most important, seems to be a perfect
task for CG and/or multiscale modeling strategies. A good example is the flexible
and unrestrained docking of peptides to protein receptors. By employing the
CABS-based modeling scheme, it was possible to allow for significant fluctuations of
protein structures, unlimited flexibility of peptide ligands and an unrestrained
search for docking sites [9, 16, 17, 81, 141]. The CABS-dock protocol is very
efficient and allows for higher flexibility of the entire modeled structures than
other available tools [84, 95, 112, 119, 120, 135]. The CABS-dock method generates
moderate-resolution protein-peptide structures for a significant fraction of test
cases [81]. The resulting lower-resolution, coarse-grained structures can be easily
refined by classical MD simulations or local docking methods. An example of peptide
docking using CABS-dock is illustrated in Fig. 7.

Fig. 7 Results of docking nuclear receptor coactivator 1 (sequence: HKLVQLLTTT) to
peroxisome proliferator-activated receptor gamma (PDB code 2FVJ:A) without using prior
knowledge about the binding site. The docking was performed with the CABS-dock method
[81]. Panel a shows the 1000 lowest energy models (light blue; the best model RMSD to the
native pose is 1.43 Å), while panel b shows the top scored model (dark blue; RMSD to the
native pose is 3.46 Å) together with the experimental structure of the bound peptide (light
blue) in the close-up frame (native complex PDB code: 2FVJ). The protein receptor is
presented in surface representation

Protein-peptide docking strategies can also serve as powerful supporting tools
for protein-protein docking. Provided that a contacting structural fragment from one
of the complex components can be predicted with reasonable fidelity, it may be
extracted as a “peptide” fragment. This short linear interacting motif may be docked
to the second complex component with protein-peptide docking tools. In some
modeling cases, this fragmentary template may be successfully used to reconstruct
the entire complex [15]. This strategy for hierarchical protein-protein docking is
now being intensively studied [109], since protein-protein docking can be of great
importance for new directions in drug design [123].

9 Conclusions and Perspectives

One of the main purposes of this chapter was to demonstrate that the most
interesting CG models are based on quite complex sets of assumptions, such as protein
representation, force field, coordinate system and sampling scheme. Obviously,
the accuracy of the particular assumptions of CG protein models defines the range of
applicability of modeling procedures. It seems reasonable to state that the future
development of CG models will focus on a more accurate reconstruction of real
physical effects. Increasing computational power should lead to a considerable
decrease in the assumed simplifications of the existing models and, therefore,
provide a more accurate description of the observed physics of biomacromolecules.
Another promising direction of the development of CG models is a more effective
combination of existing CG methods with comparative modeling approaches [58,
116, 147]. Perhaps, the term “unification” would be more accurate as we believe
that the incorporation of comparative modeling methods should go further than mere
utilization of information provided by stand-alone comparative modeling tools. Such
a precursor approach has been shown in Sect. 5.3.
Finally, we expect that the development of integrative approaches which use
experimental data from various sources together with different computational
techniques, as well as CG models, will be critical. The most recent (and spectacular)
examples of integrative structure determination include the use of cryo-electron
microscopy (cryo-EM) in combination with CG modeling techniques. One of the biggest
advantages of cryo-EM experiments is that, contrary to the popular X-ray
crystallography, specimens can be observed in their native environment, which
enables the exploration of conformational states. The main problem of cryo-EM maps
is their low resolution, which can be addressed by the application of CG
computational techniques for fitting high-resolution protein structures [150]. Such
integrative approaches will probably become widespread in the near future.

Acknowledgements Maciej Blaszczyk, Sebastian Kmiecik, Katarzyna Ziolkowska and Marta
Panek acknowledge support from Foundation for Polish Science TEAM project (TEAM/2011-7/6)
co-financed by the European Regional Development Fund operated within the Innovative Econ-
omy Operational Program. We also acknowledge support from the National Science Center (NCN
Poland) Grant (MAESTRO2014/14/A/ST6/00088).

References

1. Abagyan, R.A., Mazur, A.K.: New methodology for computer-aided modelling of biomolecular
structure and dynamics. 2. Local deformations and cycles. J. Biomol. Struct. Dyn. 6, 833–845
(1989)
2. Adcock, S.A.: Peptide backbone reconstruction using dead-end elimination and a knowledge-
based forcefield. J. Comput. Chem. 25, 16–27 (2004). https://doi.org/10.1002/jcc.10314
3. Altschul, M., Simpson, K.W., Dykes, N.L., Mauldin, E.A., Reubi, J.C., Cummings, J.F.:
Evaluation of somatostatin analogues for the detection and treatment of gastrinoma in a dog.
J. Small Anim. Pract. 38, 286–291 (1997)
4. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25, 3389–3402 (1997)
5. Anfinsen, C.B.: Principles that govern the folding of protein chains. Science 181, 223–230
(1973)
6. Anfinsen, C.B., Haber, E., Sela, M., White Jr., F.H.: The kinetics of formation of native
ribonuclease during oxidation of the reduced polypeptide chain. Proc. Natl. Acad. Sci. USA
47, 1309–1314 (1961)
7. Berman, H., Henrick, K., Nakamura, H.: Announcing the worldwide protein data bank. Nat.
Struct. Biol. 10, 980 (2003). https://doi.org/10.1038/nsb1203-980
8. Betancourt, M.: A reduced protein model with accurate native-structure identification ability.
Proteins 53, 889–907 (2003)
9. Blaszczyk, M., Kurcinski, M., Kouza, M., Wieteska, L., Debinski, A., Kolinski, A., Kmiecik,
S.: Modeling of protein-peptide interactions using the CABS-dock web server for binding
site search and flexible docking. Methods 93, 72–83 (2016). https://doi.org/10.1016/j.ymeth.
2015.07.004
10. Blundell, T., et al.: 18th Sir Hans Krebs lecture. Knowledge-based protein modelling and
design. Eur. J. Biochem. 172, 513–520 (1988)
11. Boniecki, M., Rotkiewicz, P., Skolnick, J., Kolinski, A.: Protein fragment reconstruction
using various modeling techniques. J. Comput. Aided Mol. Des. 17, 725–738 (2003)
12. Buchete, N.V., Straub, J.E., Thirumalai, D.: Orientation-dependent coarse-grained potentials
derived by statistical analysis of molecular structural databases. Polymer 45, 597–608 (2004)
13. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of
sequence-structure motifs. J. Mol. Biol. 281, 565–577 (1998)
14. Camproux, A.C., Gautier, R., Tuffery, P.: A hidden Markov model derived structural alphabet
for proteins. J. Mol. Biol. 339, 591–605 (2004). https://doi.org/10.1016/j.jmb.2004.04.005
15. Ciemny, M.P., Kurcinski, M., Blaszczyk, M., Kolinski, A., Kmiecik, S.: Modeling EphB4-
EphrinB2 protein-protein interaction using flexible docking of a short linear motif. Biomed.
Eng. Online 16, 71 (2017). https://doi.org/10.1186/s12938-017-0362-7
16. Ciemny, M.P., Kurcinski, M., Kozak, K.J., Kolinski, A., Kmiecik, S.: Highly flexible protein-
peptide docking using CABS-Dock. Methods Mol. Biol. 1561, 69–94 (2017). https://doi.org/
10.1007/978-1-4939-6798-8_6
17. Ciemny, M.P., Debinski, A., Paczkowska, M., Kolinski, A., Kurcinski, M., Kmiecik, S.:
Protein-peptide molecular docking with large-scale conformational changes: the p53-MDM2
interaction. Sci. Rep. 6, 37532 (2016). https://doi.org/10.1038/srep37532
18. Ciemny, M., Kurcinski, M., Kamel, K., Kolinski, A., Alam, N., Schueler-Furman, O.,
Kmiecik, S.: Protein–peptide docking: opportunities and challenges. Drug Discov. Today
23, 1530–1537 (2018). https://doi.org/10.1016/j.drudis.2018.05.006
19. Covell, D.G.: Folding protein alpha-carbon chains into compact forms by Monte Carlo meth-
ods. Proteins 14, 409–420 (1992). https://doi.org/10.1002/prot.340140310
20. Czaplewski, C., Liwo, A., Makowski, M., Ołdziej, S., Scheraga, H.A.: Coarse-grained models
of proteins: theory and applications. In: Kolinski, A. (ed.) Multiscale approaches to protein
modeling, pp. 85–109. Springer, New York (2011)
21. Czaplewski, C., Rodziewicz-Motowidlo, S., Liwo, A., Ripoll, D.R., Wawak, R.J., Scheraga,
H.A.: Molecular simulation study of cooperativity in hydrophobic association. Protein Sci. 9,
1235–1245 (2000). https://doi.org/10.1110/ps.9.6.1235
22. Dashevskii, V.G.: [Lattice model for globular protein three-dimensional structure] Mol. Biol.
(Mosk) 14, 105–117 (1980)
23. Dawid, A.E., Gront, D., Kolinski, A.: SURPASS low-resolution coarse-grained protein model-
ing. J. Chem. Theor. Comput. 13, 5766–5779 (2017). https://doi.org/10.1021/acs.jctc.7b00642
24. De Sancho, D., Rey, A.: Evaluation of coarse grained models for hydrogen bonds in proteins.
J. Comput. Chem. 28 (2007)
25. Eswar, N., Eramian, D., Webb, B., Shen, M.Y., Sali, A.: Protein structure modeling with MOD-
ELLER. Methods Mol. Biol. 426, 145–159 (2008). https://doi.org/10.1007/978-1-60327-058-
8_8
26. Feig, M., Mirjalili, V.: Protein structure refinement via molecular-dynamics simulations: what
works and what does not? Proteins 84(Suppl 1), 282–292 (2016). https://doi.org/10.1002/prot.
24871
27. Ferrenberg, A., Landau, D.P., Swendsen, R.: Statistical errors in histogram reweighting. Phys.
Rev. E 51, 5092 (1995)
28. Ferrenberg, A., Swendsen, R.: Optimized Monte Carlo data analysis. Phys. Rev. Lett. 63,
1195–1198 (1989)
29. Fosgerau, K., Hoffmann, T.: Peptide therapeutics: current status and future directions. Drug
Discov. Today 20, 122–128 (2015). https://doi.org/10.1016/j.drudis.2014.10.003
30. Gautier, R., Camproux, A.C., Tuffery, P.: SCit: web tools for protein side chain conformation
analysis. Nucleic Acids Res. 32, W508–511 (2004). https://doi.org/10.1093/nar/gkh388
31. Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and
Statistics: Proceedings of 23rd Symposium on the Interface, pp. 156–163. Interface
Foundation, Fairfax Station (1991)
32. Go, N., Scheraga, H.: Ring closure and local conformational deformations of chain molecules.
Macromolecules 3, 178–187 (1970)
33. Go, N., Scheraga, H.A.: Ring-Closure in Chain Molecules with Cn, I, or S2n Symmetry.
Macromolecules 6, 273–281 (1973)
34. Godzik, A., Kolinski, A., Skolnick, J.: Lattice representations of globular proteins: how good
are they? J. Comput. Chem. 14, 1194–1202 (1993). https://doi.org/10.1002/jcc.540141009
35. Grishaev, A., Bax, A.: An empirical backbone–backbone hydrogen-bonding potential in
proteins and its applications to NMR structure refinement and validation. J. Am. Chem. Soc.
126, 7281–7292 (2004)
36. Gront, D., Kmiecik, S., Blaszczyk, M., Ekonomiuk, D., Koliński, A.: Optimization of protein
models. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2, 479–493 (2012).
https://doi.org/10.1002/wcms.1090
37. Gront, D., Kmiecik, S., Kolinski, A.: Backbone building from quadrilaterals: a fast and accu-
rate algorithm for protein backbone reconstruction from alpha carbon coordinates. J. Comput.
Chem. 28, 1593–1597 (2007). https://doi.org/10.1002/jcc.20624
38. Gront, D., Kolinski, A., Skolnick, J.: Comparison of three Monte Carlo conformational search
strategies for a proteinlike homopolymer model: Folding thermodynamics and identification of
low-energy structures. J. Chem. Phys. 113, 5065–5071 (2000)
39. Gront, D., Kolinski, A., Skolnick, J.: A new combination of replica exchange Monte Carlo and
histogram analysis for protein folding and thermodynamics. J. Chem. Phys. 115, 1569–1574
(2001)
40. Gront, D., Kulp, D., Vernon, R., Strauss, C., Baker, D.: Generalized fragment picking in
Rosetta: design, protocols and applications. PLoS ONE 6, e23294 (2011)
41. Guardiani, C., Livi, R., Cecconi, F.: Coarse Grained Modeling and Approaches to Protein
Folding. Curr. Bioinform. 5, 217–240 (2010)
42. Hansmann, U.: Parallel tempering algorithm for conformational studies of biological
molecules. Chem. Phys. Lett. 281, 140–150 (1997)
43. Heath, A.P., Kavraki, L.E., Clementi, C.: From coarse-grain to all-atom: toward multiscale
analysis of protein landscapes. Proteins 68, 646–661 (2007). https://doi.org/10.1002/prot.
21371
44. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc.
Natl. Acad. Sci. USA 89, 10915–10919 (1992)
45. Hinds, D.A., Levitt, M.: A lattice model for protein structure prediction at low resolution.
Proc. Natl. Acad. Sci. USA 89, 2536–2540 (1992)
46. Holm, L., Sander, C.: Database algorithm for generating protein backbone and side-chain
co-ordinates from a C alpha trace: application to model building and detection of co-ordinate
errors. J. Mol. Biol. 218, 183–194 (1991)
47. Illergard, K., Ardell, D.H., Elofsson, A.: Structure is three to ten times more conserved than
sequence–a study of structural response in protein cores. Proteins 77, 499–508 (2009). https://
doi.org/10.1002/prot.22458
48. Irbäck, A., Mohanty, S.: PROFASI: A Monte Carlo simulation package for protein folding
and aggregation. J. Comput. Chem. 27, 1548–1555 (2006)
49. Jamroz, M., Kolinski, A.: Modeling of loops in proteins: a multi-method approach. BMC
Struct. Biol. 10, 5 (2010)
50. Jones, T.A., Thirup, S.: Using known substructures in protein model building and crystallog-
raphy. EMBO J. 5, 819–822 (1986). doi: citeulike-article-id:705742
51. Karplus, M., McCammon, J.A.: Molecular dynamics simulations of biomolecules. Nat. Struct.
Biol. 9, 646–652 (2002). https://doi.org/10.1038/nsb0902-646
52. Kazmierkiewicz, R., Liwo, A., Scheraga, H.A.: Energy-based reconstruction of a protein
backbone from its alpha-carbon trace by a Monte-Carlo method. J. Comput. Chem. 23, 715–723
(2002). https://doi.org/10.1002/jcc.10068
53. Kazmierkiewicz, R., Liwo, A., Scheraga, H.A.: Addition of side chains to a known
backbone with defined side-chain centroids. Biophys. Chem. 100, 261–280 (2003). doi:
S0301462202002855 [pii]
54. Kelley, L.A., Mezulis, S., Yates, C.M., Wass, M.N., Sternberg, M.J.: The Phyre2 web portal
for protein modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).
https://doi.org/10.1038/nprot.2015.053
55. Kim, H., Kihara, D.: Protein structure prediction using residue- and fragment-environment
potentials in CASP11. Proteins 84(Suppl 1), 105–117 (2016). https://doi.org/10.1002/prot.
24920
56. Kinch, L.N., Li, W., Monastyrskyy, B., Kryshtafovych, A., Grishin, N.V.: Evaluation of free
modeling targets in CASP11 and ROLL. Proteins 84(Suppl 1), 51–66 (2016). https://doi.org/
10.1002/prot.24973
57. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science
220, 671–680 (1983). doi: citeulike-article-id:379797
58. Kmiecik, S., Gront, D., Kolinski, M., Wieteska, L., Dawid, A.E., Kolinski, A.: Coarse-grained
protein models and their applications. Chem. Rev. 116, 7898–7936 (2016). https://doi.org/10.
1021/acs.chemrev.6b00163
59. Kmiecik, S., Jamroz, M., Kolinski, M.: Structure prediction of the second extracellular loop
in G-protein-coupled receptors. Biophys. J. 106, 2408–2416 (2014). https://doi.org/10.1016/
j.bpj.2014.04.022
60. Kolinski, A., Betancourt, M.R., Kihara, D., Rotkiewicz, P., Skolnick, J.: Generalized compar-
ative modeling (GENECOMP): a combination of sequence comparison, threading, and lattice
modeling for protein structure prediction and refinement. Proteins 44, 133–149 (2001)
61. Kolinski, M., Filipek, S.: Study of a structurally similar kappa opioid receptor agonist and
antagonist pair by molecular dynamics simulations. J. Mol. Model. 16, 1567–1576 (2010).
https://doi.org/10.1007/s00894-010-0678-8
62. Kolinski, A., Galazka, W., Skolnick, J.: Computer design of idealized beta-motifs. J. Chem.
Phys. 103, 10286–10297 (1995)
63. Kolinski, A., Ilkowski, B., Skolnick, J.: Dynamics and thermodynamics of beta-hairpin assem-
bly: insights from various simulation techniques. Biophys. J. 77, 2942–2952 (1999)
64. Kolinski, A., Milik, M., Rycombel, J., Skolnick, J.: A reduced model of short-range interac-
tions in polypeptide-chains. J. Chem. Phys. 103, 4312–4323 (1995)
65. Kolinski, A., Milik, M., Skolnick, J.: Static and dynamic properties of a new lattice model of
polypeptide-chains. J. Chem. Phys. 94, 3978–3985 (1991)
66. Kolinski, A., Skolnick, J.: Monte Carlo simulations of protein folding. I. Lattice model and
interaction scheme. Proteins 18, 338–352 (1994). https://doi.org/10.1002/prot.340180405
67. Kolinski, A., Skolnick, J.: Reduced models of proteins and their applications. Polymer 45,
511–524 (2004). https://doi.org/10.1016/j.polymer.2003.10.064
68. Kolinski, A.: Protein modeling and structure prediction with a reduced representation. Acta
Biochimica. Polonica 51, 349–371 (2004). doi: citeulike-article-id:606304
69. Kolinski, A., Gront, D.: Comparative modeling without implicit sequence alignments.
Bioinformatics 23, 2522–2527 (2007). https://doi.org/10.1093/bioinformatics/btm380
70. Kolinski, A., Rotkiewicz, P., Ilkowski, B., Skolnick, J.: A method for the improvement
of threading-based protein models. Proteins 37, 592–610 (1999b).
https://doi.org/10.1002/(sici)1097-0134(19991201)37:4%3c592::aid-prot10%3e3.0.co;2-2
Protein Structure Prediction Using Coarse-Grained Models 55

71. Kolinski, A., Skolnick, J.: Lattice Models of Protein Folding, Dynamics and Thermodynamics.
Landes (1996). doi: citeulike-article-id:877252
72. Kolinski, A., Skolnick, J.: Assembly of protein structure from sparse experimental data: an
efficient Monte Carlo model. Proteins 32, 475–494 (1998).
https://doi.org/10.1002/(sici)1097-0134(19980901)32:4%3c475::aid-prot6%3e3.0.co;2-f
73. Kortemme, T., Morozov, A.V., Baker, D.: An orientation-dependent hydrogen bonding poten-
tial improves prediction of specificity and structure for proteins and protein-protein complexes.
J. Mol. Biol. 326, 1239–1259 (2003). doi: citeulike-article-id:556189
74. Krigbaum, W.R., Lin, S.F.: Monte-Carlo simulation of protein folding using a lattice model.
Macromolecules 15, 1135–1145 (1982)
75. Krivov, G.G., Shapovalov, M.V., Dunbrack Jr., R.L.: Improved prediction of protein side-
chain conformations with SCWRL4. Proteins 77, 778–795 (2009). https://doi.org/10.1002/
prot.22488
76. Krupa, P., Mozolewska, M.A., Joo, K., Lee, J., Czaplewski, C., Liwo, A.: Prediction of protein
structure by template-based modeling combined with the UNRES force field. J. Chem. Inf.
Model. 55, 1271–1281 (2015). https://doi.org/10.1021/acs.jcim.5b00117
77. Krupa, P., Sieradzan, A.K., Mozolewska, M.A., Li, H., Liwo, A., Scheraga, H.A.: Dynamics
of disulfide-bond disruption and formation in the thermal unfolding of ribonuclease A. J.
Chem. Theor. Comput. 13, 5721–5730 (2017). https://doi.org/10.1021/acs.jctc.7b00724
78. Krupa, P., et al.: Performance of protein-structure predictions with the physics-based UNRES
force field in CASP11. Bioinformatics 32, 3270–3278 (2016).
https://doi.org/10.1093/bioinformatics/btw404
79. Kryshtafovych, A., Fidelis, K., Moult, J.: CASP9 results compared to those of previous CASP
experiments. Proteins 79(Suppl 10), 196–207 (2011). https://doi.org/10.1002/prot.23182
80. Kumar, S., Rosenberg, J., Bouzida, D., Swendsen, R., Kollman, P.: Multidimensional free-
energy calculations using the weighted histogram analysis method. J. Comput. Chem. 16,
1339–1350 (1995). doi: citeulike-article-id:774417
81. Kurcinski, M., Jamroz, M., Blaszczyk, M., Kolinski, A., Kmiecik, S.: CABS-dock web server
for the flexible docking of peptides to proteins without prior knowledge of the binding site.
Nucleic Acids Res. 43, W419–424 (2015). https://doi.org/10.1093/nar/gkv456
82. Kwak, W., Hansmann, U.H.: Efficient sampling of protein structures by model hopping. Phys.
Rev. Lett. 95, 138102 (2005). https://doi.org/10.1103/PhysRevLett.95.138102
83. Kyte, J., Doolittle, R.F.: A simple method for displaying the hydropathic character of a protein.
J. Mol. Biol. 157, 105–132 (1982). doi: 0022-2836(82)90515-0 [pii]
84. Lee, H., Heo, L., Lee, M.S., Seok, C.: GalaxyPepDock: a protein-peptide docking tool based on
interaction similarity and energy optimization. Nucleic Acids Res. 43, W431–W435 (2015).
https://doi.org/10.1093/nar/gkv495
85. Lee, J., Scheraga, H.A., Rackovsky, S.: New optimization method for conformational
energy calculations on polypeptides: conformational space annealing. J. Comput. Chem. 18,
1222–1232 (1997)
86. Levitt, M.: A simplified representation of protein conformations for rapid simulation of protein
folding. J. Mol. Biol. 104, 59–107 (1976). doi: citeulike-article-id:4000523
87. Levitt, M., Warshel, A.: Computer simulation of protein folding. Nature 253, 694–698 (1975).
doi: citeulike-article-id:4275709
88. Levy-Moonshine, A., Amir, E-a. D., Keasar, C.: Enhancement of beta-sheet assembly by
cooperative hydrogen bonds potential. Bioinformatics 25, 2639–2645 (2009). doi: citeulike-
article-id:7012194
89. Li, Z., Scheraga, H.A.: Monte Carlo-minimization approach to the multiple-minima problem
in protein folding. Proc. Natl. Acad. Sci. USA 84, 6611–6615 (1987)
90. Liwo, A., He, Y., Scheraga, H.A.: Coarse-grained force field: general folding theory. Phys.
Chem. Chem. Phys. 13, 16890–16901 (2011). https://doi.org/10.1039/c1cp20752k
91. Liwo, A., et al.: Simulation of Protein Structure and Dynamics with the Coarse-Grained
UNRES Force Field. Coarse-Graining of Condensed Phase and Biomolecular Systems. CRC
Press (2008). doi: citeulike-article-id:3822586
92. Liwo, A., Czaplewski, C., Pillardy, J., Scheraga, H.: Cumulant-based expressions for the
multibody terms for the correlation between local and electrostatic interactions in the united-
residue force field. J. Chem. Phys. 115, 2323–2347 (2001). doi: citeulike-article-id:715745
93. Liwo, A., Khalili, M., Scheraga, H.: Ab initio simulations of protein-folding pathways by
molecular dynamics with the united-residue model of polypeptide chains. Proc. Natl. Acad.
Sci. U.S.A. 102, 2362–2367 (2005). doi: citeulike-article-id:1365687
94. Liwo, A., Pincus, M.R., Wawak, R.J., Rackovsky, S., Scheraga, H.A.: Prediction of protein
conformation on the basis of a search for compact structures: test on avian pancreatic polypep-
tide. Protein Sci.: Publ. Protein Soc. 2, 1715–1731 (1993). doi: citeulike-article-id:7558759
95. London, N., Raveh, B., Cohen, E., Fathi, G., Schueler-Furman, O.: Rosetta FlexPepDock
web server–high resolution modeling of peptide-protein interactions. Nucleic Acids Res. 39,
W249–W253 (2011). https://doi.org/10.1093/nar/gkr431
96. Maupetit, J., Gautier, R., Tuffery, P.: SABBAC: Online structural alphabet-based protein
BackBone reconstruction from alpha-carbon trace. Nucleic Acids Res. 34, W147–151 (2006).
https://doi.org/10.1093/nar/gkl289
97. Mazur, A.K., Dorofeev, V.E., Abagyan, R.A.: Derivation and testing of explicit equations of
motion for polymers described by internal coordinates. J. Comput. Phys. 92, 261–272 (1991).
doi: citeulike-article-id:10750684
98. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of state cal-
culations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953). doi: citeulike-
article-id:531300
99. Metropolis, N., Ulam, S.: The Monte Carlo method. J. Am. Stat. Assoc. 44, 335–341 (1949).
doi: citeulike-article-id:1886002
100. Milik, M., Kolinski, A., Skolnick, J.: Algorithm for rapid reconstruction of protein backbone
from alpha carbon coordinates. J. Comput. Chem. 18, 80–85 (1997)
101. Mitsutake, A., Sugita, Y., Okamoto, Y.: Generalized-ensemble algorithms for molecular
simulations of biopolymers. Biopolymers 60, 96–123 (2001).
https://doi.org/10.1002/1097-0282(2001)60:2%3c96::AID-BIP1007%3e3.0.CO;2-F
102. Morozov, A., Lin, S.: Accuracy and convergence of the Wang-Landau sampling algorithm.
Phys. Rev. E (Statistical, Nonlinear, and Soft Matter Physics) 76 (2007). doi: citeulike-article-
id:3802626
103. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., Tramontano, A.: Critical assessment
of methods of protein structure prediction (CASP)-round XII. Proteins (2017). https://doi.
org/10.1002/prot.25415
104. Moult, J., Fidelis, K., Kryshtafovych, A., Tramontano, A.: Critical assessment of methods
of protein structure prediction (CASP)–round IX. Proteins 79(Suppl 10), 1–5 (2011). https://
doi.org/10.1002/prot.23200
105. Mozolewska, M.A., Krupa, P., Zaborowski, B., Liwo, A., Lee, J., Joo, K., Czaplewski, C.: Use
of restraints from consensus fragments of multiple server models to enhance protein-structure
prediction capability of the UNRES force field. J. Chem. Inf. Model. 56, 2263–2279 (2016).
https://doi.org/10.1021/acs.jcim.6b00189
106. Park, B.H., Levitt, M.: The complexity and accuracy of discrete state models of protein
structure. J. Mol. Biol. 249, 493–507 (1995). doi: citeulike-article-id:5845728
107. Parsons, J., Holmes, B., Rojas, M., Tsai, J., Strauss, C.: Practical conversion from torsion
space to Cartesian space for in silico protein synthesis. J. Comput. Chem. 26, 1063–1068
(2005). doi: citeulike-article-id:1036763
108. Payne, P.W.: Reconstruction of protein conformations from estimated positions of the C-alpha
coordinates. Protein Sci. 2, 315–324 (1993)
109. Peterson, L.X., et al.: Modeling the assembly order of multimeric heteroprotein
complexes. PLoS Comput. Biol. 14, e1005937 (2018).
https://doi.org/10.1371/journal.pcbi.1005937
110. Pruitt, K.D., Tatusova, T., Maglott, D.R.: NCBI reference sequences (RefSeq): a curated
non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35,
D61–65 (2007). https://doi.org/10.1093/nar/gkl842
111. Pundir, S., Martin, M.J., O’Donovan, C.: UniProt protein knowledgebase. Methods Mol. Biol.
1558, 41–55 (2017). https://doi.org/10.1007/978-1-4939-6783-4_2
112. Raveh, B., London, N., Zimmerman, L., Schueler-Furman, O.: Rosetta FlexPepDock ab-initio:
simultaneous folding, docking and refinement of peptides onto their receptors. PLoS ONE 6,
e18934 (2011). https://doi.org/10.1371/journal.pone.0018934
113. Rohl, C., Strauss, C., Misura, K., Baker, D.: Protein structure prediction using Rosetta. In:
Numerical Computer Methods, Part D, vol. 383, pp. 66–93. Elsevier (2004). doi: citeulike-
article-id:441859
114. Rose, P.W., et al.: The RCSB protein data bank: integrative view of protein, gene and 3D
structural information. Nucleic Acids Res. 45, D271–D281 (2017). https://doi.org/10.1093/
nar/gkw1000
115. Rotkiewicz, P., Skolnick, J.: Fast procedure for reconstruction of full-atom protein models
from reduced representations. J. Comput. Chem. 29, 1460–1465 (2008). https://doi.org/10.
1002/jcc.20906
116. Sali, A., et al.: Outcome of the first wwPDB hybrid/integrative methods task force workshop.
Structure 23, 1156–1167 (2015). https://doi.org/10.1016/j.str.2015.05.013
117. Sali, A., Blundell, T.L.: Comparative protein modelling by satisfaction of spatial restraints. J.
Mol. Biol. 234, 779–815 (1993). https://doi.org/10.1006/jmbi.1993.1626
118. Scheraga, H.A., Khalili, M., Liwo, A.: Protein-folding dynamics: overview of molecular
simulation techniques. Annu. Rev. Phys. Chem. 58, 57–83 (2007). https://doi.org/10.1146/
annurev.physchem.58.032806.104614
119. Schindler, C.E., de Vries, S.J., Zacharias, M.: iATTRACT: simultaneous global and local
interface optimization for protein-protein docking refinement. Proteins 83, 248–258 (2015).
https://doi.org/10.1002/prot.24728
120. Schindler, C.E., de Vries, S.J., Zacharias, M.: Fully blind peptide-protein docking with
pepATTRACT. Structure 23, 1507–1515 (2015a). https://doi.org/10.1016/j.str.2015.05.021
121. Shenoy, S.R., Jayaram, B.: Proteins: sequence to structure and function–current status. Curr.
Protein Pept. Sci. 11, 498–514 (2010)
122. Shi, J., Blundell, T.L., Mizuguchi, K.: FUGUE: sequence-structure homology recognition
using environment-specific substitution tables and structure-dependent gap penalties. J. Mol.
Biol. 310, 243–257 (2001). https://doi.org/10.1006/jmbi.2001.4762
123. Shin, W.H., Christoffer, C.W., Kihara, D.: In silico structure-based approaches to
discover protein-protein interaction-targeting drugs. Methods 131, 22–32 (2017).
https://doi.org/10.1016/j.ymeth.2017.08.006
124. Sippl, M.J.: Boltzmann’s principle, knowledge-based mean fields and protein folding. An
approach to the computational determination of protein structures. J. Comput. Aided Mol.
Des. 7, 473–501 (1993)
125. Skolnick, J., Kolinski, A.: Dynamic Monte Carlo simulations of globular protein fold-
ing/unfolding pathways. I. Six-member, Greek key beta-barrel proteins. J. Mol. Biol. 212,
787–817 (1990a). doi:0022-2836(90)90237-G [pii]
126. Skolnick, J., Kolinski, A.: Simulations of the folding of a globular protein. Science 250,
1121–1125 (1990b). https://doi.org/10.1126/science.250.4984.1121
127. Skolnick, J., Kolinski, A., Brooks III, C.L., Godzik, A., Rey, A.: A method for predicting
protein structure from sequence. Curr. Biol. 3, 414–423 (1993). doi:0960-9822(93)90348-R
[pii]
128. Soding, J.: Protein homology detection by HMM-HMM comparison. Bioinformatics 21,
951–960 (2005). https://doi.org/10.1093/bioinformatics/bti125
129. Stein, A., Kortemme, T.: Improvements to robotics-inspired conformational sampling in
rosetta. PLoS ONE 8, e63090 (2013). https://doi.org/10.1371/journal.pone.0063090
130. Stumpff-Kane, A.W., Maksimiak, K., Lee, M.S., Feig, M.: Sampling of near-native pro-
tein conformations during protein structure refinement using a coarse-grained model, normal
modes, and molecular dynamics simulations. Proteins 70, 1345–1356 (2008). https://doi.org/
10.1002/prot.21674
131. Sugita, Y., Okamoto, Y.: Replica-exchange molecular dynamics method for protein folding.
Chem. Phys. Lett. 314, 141–151 (1999). doi:citeulike-article-id:197524
132. Swendsen, R., Wang, J.: Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett. 57,
2607–2609 (1986). doi: citeulike-article-id:773436
133. Tai, C.H., Bai, H., Taylor, T.J., Lee, B.: Assessment of template-free modeling in CASP10
and ROLL. Proteins 82(Suppl 2), 57–83 (2014). https://doi.org/10.1002/prot.24470
134. Thompson, J., Baker, D.: Incorporation of evolutionary information into Rosetta comparative
modeling. Proteins 79, 2380–2388 (2011). https://doi.org/10.1002/prot.23046
135. Trabuco, L.G., Lise, S., Petsalaki, E., Russell, R.B.: PepSite: prediction of peptide-binding
sites from protein surfaces. Nucleic Acids Res. 40, W423–W427 (2012). https://doi.org/10.
1093/nar/gks398
136. Trojanowski, S., Rutkowska, A., Kolinski, A.: TRACER. A new approach to comparative
modeling that combines threading with free-space conformational sampling. Acta Biochim.
Pol. 57, 125–133 (2010)
137. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158-D169 (2017)
https://doi.org/10.1093/nar/gkw1099
138. Vendruscolo, M., Najmanovich, R., Domany, E.: Can a pairwise contact potential
stabilize native protein folds against decoys obtained by threading? Proteins 38, 134–148
(2000). https://doi.org/10.1002/(sici)1097-0134(20000201)38:2%3c134::aid-prot3%3e3.0.co;2-a
139. Vinals, J., Kolinski, A., Skolnick, J.: Numerical study of the entropy loss of dimerization and
the folding thermodynamics of the GCN4 leucine zipper. Biophys. J. 83, 2801–2811 (2002).
https://doi.org/10.1016/s0006-3495(02)75289-2
140. Voth, G. (ed): Coarse-Graining of Condensed Phase and Biomolecular Systems. CRC Press
Taylor & Francis, Farmington, CT (2008)
141. Wabik, J., Kurcinski, M., Kolinski, A.: Coarse-grained modeling of peptide docking associated
with large conformation transitions of the binding protein: Troponin I fragment-Troponin C
system. Molecules 20, 10763–10780 (2015). https://doi.org/10.3390/molecules200610763
142. Wales, D.: Energy Landscapes: Applications to Clusters, Biomolecules and Glasses (Cam-
bridge Molecular Science). Cambridge University Press (2004). doi: citeulike-article-
id:755112
143. Wang, T., Wu, M.B., Zhang, R.H., Chen, Z.J., Hua, C., Lin, J.P., Yang, L.R.: Advances in
computational structure-based drug design and application in drug discovery. Curr. Top Med.
Chem. 16, 901–916 (2016). doi: CTMC-EPUB-69847 [pii]
144. Wedemeyer, W.J., Scheraga, H.A.: Exact analytical loop closure in proteins using polynomial
equations. J. Comput. Chem. 20, 819–844 (1999)
145. Xu, D., Zhang, J., Roy, A., Zhang, Y.: Automated protein structure modeling in CASP9
by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based
structure refinement. Proteins 79(Suppl 10), 147–160 (2011). https://doi.org/10.1002/prot.
23111
146. Yan, C.H., et al.: Minimal residual disease- and graft-vs.-host disease-guided multiple consol-
idation chemotherapy and donor lymphocyte infusion prevent second acute leukemia relapse
after allotransplant. J. Hematol. Oncol. 9, 87 (2016). https://doi.org/10.1186/s13045-016-
0319-5
147. Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., Zhang, Y.: The I-TASSER Suite: protein
structure and function prediction. Nat. Methods 12(1), 7–8 (2015). https://doi.org/10.1038/
nmeth.3213
148. Zhang, J., He, Z., Wang, Q., Barz, B., Kosztin, I., Shang, Y., Xu, D.: Prediction of protein
tertiary structures using MUFOLD. Methods Mol. Biol. 815, 3–13 (2012).
https://doi.org/10.1007/978-1-61779-424-7_1
149. Zhang, Y.: Interplay of I-TASSER and QUARK for template-based and ab initio protein
structure prediction in CASP10. Proteins 82(Suppl 2), 175–187 (2014). https://doi.org/10.
1002/prot.24341
150. Zheng, W.: Accurate flexible fitting of high-resolution protein structures into cryo-electron
microscopy maps using coarse-grained pseudo-energy minimization. Biophys. J. 100,
478–488 (2011). doi: S0006-3495(10)05186-6 [pii]
151. Zhou, H., Zhou, Y.: Distance-scaled, finite ideal-gas reference state improves structure-derived
potentials of mean force for structure selection and stability prediction. Protein Sci. 11,
2714–2726 (2002). https://doi.org/10.1110/ps.0217002
152. Zhou, R., et al.: Folding kinetics of WW domains with the united residue force field for
bridging microscopic motions and experimental measurements. Proc. Natl. Acad. Sci. U.S.A.
111, 18243–18248 (2014). https://doi.org/10.1073/pnas.1420914111
Protein Dynamics Simulations Using
Coarse-Grained Models

Sebastian Kmiecik, Jacek Wabik, Michal Kolinski, Maksim Kouza
and Andrzej Kolinski

Abstract Simulations of protein dynamics may work at different levels of
molecular detail. Simplification (coarse-graining) may concern different
aspects of a simulation, including protein representation, interaction schemes or
models of molecular motion. So-called coarse-grained (CG) models offer many
advantages that are unreachable by classical simulation tools, as demonstrated in
numerous studies of protein dynamics. Following a brief introduction, we present
example applications of CG models for efficient predictions of biophysical
mechanisms. We discuss the following topics: mechanisms of chaperonin action,
mechanical properties of proteins and their complexes, membrane proteins,
protein-protein interactions and intrinsically unfolded proteins. These areas
illustrate the opportunities for practical applications of CG simulations.

1 Introduction

The steady increase in computational power constantly sets new limits in
simulations of biomolecular dynamics [164]. Nevertheless, the majority of biologically
relevant protein dynamic processes remain beyond the reach of atomistic Molecular
Dynamics (MD), the classical simulation tool. In such cases, introducing
properly designed simplifications that capture the relevant physical features can be
the only way, or an incomparably cheaper way than atomistic MD, to better understand
macromolecular processes [64].
A variety of purely theoretical models for analyzing the dynamic properties of
proteins have been proposed [109, 171]. Nevertheless, they appeared to be rather
limited in their predictions. This is due to the complicated nature of proteins and
rules governing their structure. Compared to purely analytical methods, the
molecular simulation approach is better suited to handling protein complexity. Presently,
molecular simulations represent a powerful and the most widely used theoretical
approach for the understanding of protein dynamics [64, 99, 117].

S. Kmiecik (✉) · J. Wabik · M. Kouza · A. Kolinski
Faculty of Chemistry, Biological and Chemical Research Centre,
University of Warsaw, Warsaw, Poland
e-mail: sekmi@chem.uw.edu.pl

M. Kolinski
Bioinformatics Laboratory, Mossakowski Medical Research Centre
Polish Academy of Sciences, Warsaw, Poland

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_3

1.1 CG Simulation Models

The most direct computational approach to protein dynamics prediction is simulation
of the dynamic system of interest. A simplified simulation model of proteins is probably
the earliest example of a CG approach in structural biology, developed in the
mid-1970s [91]. Since that time the field has grown tremendously, branching out into many
variants of protein representation, models of interactions and sampling techniques
(Fig. 1). Interestingly, recent estimates indicate a noticeable increase in the number
of studies that rely on CG simulations [155]. This rise is perhaps related
to the growing number of experimentally solved structures of large biomolecules (or
their complexes) that are too big to be reasonably addressed by all-atom simulations.

Fig. 1 Conceptual components of CG protein simulation models and their variants: a protein
representation, b interaction schemes (Go-like potentials are protein specific, i.e., native interactions
are favored to assure the lowest energy for the native conformation, and are used individually or in
combination with non-protein-specific physics- or knowledge-based schemes), c sampling model.
This diagram applies to both continuous-space and discrete (lattice) models. For a detailed review of
these variants and coarse-graining levels, refer to [64]
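
The Go-like idea mentioned in the caption (native interactions are favored so that the native conformation has the lowest energy) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: one bead per residue, an 8 Å contact cutoff, a square-well reward `epsilon`, a minimum sequence separation of three residues and the toy hairpin coordinates. It is not the functional form of any particular published Go model.

```python
import math

def native_contacts(coords, cutoff=8.0, min_separation=3):
    """Collect residue pairs (i, j) that are in contact in the reference structure.

    The cutoff (angstroms) and sequence separation are illustrative assumptions.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts

def go_energy(coords, contacts, cutoff=8.0, epsilon=1.0):
    """Go-like energy: -epsilon for every native contact formed in `coords`.

    Non-native pairs contribute nothing, so the native structure is the
    lowest-energy state by construction.
    """
    formed = sum(1 for i, j in contacts
                 if math.dist(coords[i], coords[j]) <= cutoff)
    return -epsilon * formed

# Hypothetical "native" hairpin and a fully extended chain (one bead per residue).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
extended = [(3.8 * k, 0.0, 0.0) for k in range(6)]

contacts = native_contacts(native)
print(go_energy(native, contacts), go_energy(extended, contacts))
```

In this toy case the hairpin forms all of its native contacts while the extended chain forms none, which is the protein-specific bias that distinguishes Go-like schemes from physics- or knowledge-based ones.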

A number of medium-resolution CG models have been developed for protein structure
prediction [64, 71]. Some of them enable efficient simulation of dynamic
processes. A typical example is the CABS model [72], whose acronym stands for the
united atoms representing a single residue in a protein chain (CA: alpha carbon of
the main chain, B: beta carbon, and S: the center of the side group). Thus, in the CABS
model, a single amino acid is represented by 2–4 interaction centers (depending on
the side-chain size), one of which (C-alpha) is placed on a high-resolution lattice.
The interaction scheme is based on mean-force potentials derived by the statistical
analysis of known protein structures. Although the interaction scheme is
obtained only from known crystallographic structures, related to completely random
protein chains, the protein dynamics processes (folding, unfolding, folding upon
binding, diffusion, flexibility of folded structures etc.) simulated by the CABS method are
qualitatively correct [23, 51, 52, 63, 65, 66, 67, 85]. These qualitatively correct results,
which are not trivial to obtain, show that the interaction patterns of unfolded (or partially
unfolded) proteins are quite similar to the interactions seen in fully folded structures. The lattice
representation of CABS proteins significantly increases the speed of conformational
updates. Simulation processes are controlled by a Monte Carlo (MC) scheme: a
random series of local conformational transitions. This pseudo-random Monte Carlo
process does not accurately describe ultra-short-time motions or motions over ranges of a few
angstroms, although the dynamics at longer time (and length) scales is essentially identical to
the dynamics observed for continuous-space models [52, 53, 72]. The coarse-graining
of CABS enables very fast derivation of its low-resolution structures from
high-resolution atomistic coordinates and, more importantly, quite accurate all-atom
structures can be rapidly recomputed from CABS coordinates [63, 65, 86, 87].
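
A Monte Carlo scheme built from a random series of local conformational transitions can be sketched on a toy model. The example below is not the CABS move set or force field. It assumes a two-dimensional square lattice, one bead per residue, two hypothetical local moves (end rotation and corner flip) and a toy contact energy, and it applies the standard Metropolis acceptance criterion.

```python
import math
import random

def energy(chain, eps=-1.0):
    """Toy energy (assumption): eps for each non-bonded pair on adjacent sites."""
    e = 0.0
    for i in range(len(chain)):
        for j in range(i + 2, len(chain)):
            if abs(chain[i][0] - chain[j][0]) + abs(chain[i][1] - chain[j][1]) == 1:
                e += eps
    return e

def propose_local_move(chain, rng):
    """Move a single bead; return the trial chain, or None if the move is invalid."""
    n = len(chain)
    i = rng.randrange(n)
    if i == 0 or i == n - 1:
        # End rotation: place the terminal bead next to its bonded neighbour.
        anchor = chain[1] if i == 0 else chain[n - 2]
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        new = (anchor[0] + dx, anchor[1] + dy)
    else:
        # Corner flip: mirror the bead across the line joining its neighbours.
        # For a straight segment this maps the bead onto itself and is rejected.
        prev, nxt = chain[i - 1], chain[i + 1]
        new = (prev[0] + nxt[0] - chain[i][0], prev[1] + nxt[1] - chain[i][1])
    if new in set(chain):  # excluded volume: the target site must be free
        return None
    trial = list(chain)
    trial[i] = new
    return trial

def metropolis(chain, n_steps, temperature, seed=0):
    """Run n_steps Metropolis attempts; return the final chain and its energy."""
    rng = random.Random(seed)
    chain = list(chain)
    e = energy(chain)
    for _ in range(n_steps):
        trial = propose_local_move(chain, rng)
        if trial is None:
            continue
        delta = energy(trial) - e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            chain, e = trial, e + delta
    return chain, e

start = [(k, 0) for k in range(8)]  # fully extended 8-bead chain
final, final_energy = metropolis(start, n_steps=2000, temperature=1.0)
```

Because each attempt changes only one bead, a sequence of accepted moves resembles a coarse trajectory, which is why such schemes can be read as long-time dynamics while saying little about motions at the shortest time and length scales.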

1.2 From CG to All-Atom Structures: Multiscale Modeling

One of the major future tasks of CG dynamics studies is the design of methods
for reliable and efficient transitions between simplified and atomic resolution
levels [132], as an element of multiscale methodologies. The idea of multiscale
modeling is to compute efficiently on the CG scale and pass the results to a detailed
all-atom simulation, or vice versa [68]. Obviously, the CG model used in multiscale
methods must produce physically realistic coarse-grained protein structures. Even when
this condition is fulfilled, adding all-atom details to CG structures to
produce physically realistic all-atom counterparts is a non-trivial problem [63].
Applications to CG protein folding trajectories have demonstrated that reliable and
efficient transitions between CG and atomic resolution are feasible [46, 65]. Finally,
it is accepted that one of the most promising future directions is to develop
approaches that can minimize the difference between the simplified and atomic models [58].

2 Applications in Structural Biology

In this section, we discuss several recent examples of CG modeling, including our
own reports and other published literature. The section covers the following actively
studied areas of protein dynamics: mechanisms of chaperonin action, mechanical
properties of proteins, protein-protein interactions, membrane proteins and intrinsically
unfolded proteins. Most of our own examples of CG simulations described below
were obtained using the CABS CG modeling tool [72]. Numerous CABS applications
have also been reviewed elsewhere [64, 68].

2.1 Testing Mechanisms of Macromolecular Dynamics via Simple Models: Chaperonin Action

Complex macromolecular processes can be reduced to very simple concepts
and tested computationally on a very general level. This is the case for studies
of chaperonin action. A chaperonin with its protein substrate forms a very large
complex whose dynamic processes are far beyond the reach of classical dynamics
simulation models. Over the past 20 years a significant number of studies, both
experimental and theoretical, have been pursued to understand how chaperonins
(like GroEL) facilitate protein folding processes in the cell.
Many theoretical models have been proposed, focusing on either the passive
(aggregation prevention) or the active (folding promotion) possible roles of chaperonins
[55]. A number of CG simulation studies investigated the effect of confinement on
protein folding using very simplistic [178], simple lattice [11], off-lattice
C-alpha-based and Go-like [121, 156] or more realistic [5] models. Another aspect of
chaperonin action, namely the effect of interactions of the protein substrate with the surface
of the chaperonin cavity, has also been the subject of numerous CG simulation studies using
lattice [11] and off-lattice [54] models. For broad, recent reviews covering the use
of CG models in chaperonin action studies, see [55, 102].
Probably the most popular theoretical model that explains the active role of
chaperonins is the Iterative Annealing Model (IAM). In this model the
chaperonin promotes folding of the protein substrate by sequential unfolding of misfolding
traps through their hydrophobic interactions with the cage walls. Recently,
we attempted to test the IAM hypothesis using a de novo CABS modeling
approach employing a non-specific (without the Go-like approximation)
knowledge-based interaction scheme [67]. Importantly, most (if not all, as described by Lucent
et al. [102]) simulation studies testing various chaperonin models on real (i.e. not
overly simplified) protein substrates used a common simplified interaction model:
the Go-like model. In contrast to these earlier simulation studies, the
CABS model did not preclude transient conformers stabilized by non-native
interactions (Fig. 2).

Fig. 2 A simple chaperonin model used in protein folding studies with the CG CABS model [67]. The chaperonin cage was simulated as a sphere with a thick wall of variable hydrophobicity. In the basic state, which lasts for 9/10 of the simulation time, the walls are inert. Periodically (see the simulation timescale above in the figure), the walls become hydrophobic, attracting the encapsulated protein chain with a strength typical of hydrophobic interactions within folded proteins (according to the CABS force field)
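The periodic switching of the cage walls described in the caption can be sketched as a simple step schedule. This is only an illustration under stated assumptions: the function name, the cycle length in steps and the placement of the hydrophobic phase at the end of each cycle are hypothetical choices and are not taken from Ref. [67].

```python
def wall_attraction(step, period_steps, active_steps, strength):
    """Attraction strength between the cage wall and the chain at a given step.

    The wall is inert (0.0) for most of each period and hydrophobic
    (`strength`) during the last `active_steps` steps of the period.
    Placing the active phase at the end of the cycle is an assumption.
    """
    if period_steps <= 0 or not 0 <= active_steps <= period_steps:
        raise ValueError("need period_steps > 0 and 0 <= active_steps <= period_steps")
    phase = step % period_steps  # position within the current cycle
    return strength if phase >= period_steps - active_steps else 0.0


# Hypothetical schedule: 1000-step cycle, hydrophobic for the last 100 steps,
# i.e. inert for 9/10 of the simulation time as in Fig. 2.
schedule = [wall_attraction(s, 1000, 100, strength=1.0) for s in range(5000)]
inert_fraction = schedule.count(0.0) / len(schedule)
```

In an actual simulation, the returned value would scale the wall-residue interaction term of the force field during the corresponding steps.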

The CABS simulation results showed that periodic distortion of the simulated proteins by hydrophobic chaperonin interactions promotes rapid folding and leads to a decrease in folding temperature. According to the observed mechanism of folding promotion, the chaperonin prevents the formation of kinetically trapped conformations. This is contrary to the hitherto accepted interpretation of the IAM, which suggests not prevention but rather unfolding of already trapped conformations. Interestingly, the analysis of the folding trajectories reveals a general chaperonin-induced shift of the observed folding mechanism from nucleation–condensation to a more framework-like one. All these observations are in good agreement with the experimental data on chaperonin-bound protein substrates, which generally indicate an ensemble of compact and locally expanded states lacking stable tertiary interactions.
It is worth mentioning that theoretical studies of chaperonin-mediated folding may have important conceptual applications in other fields [102], e.g. in the development of structure-refinement software or in the construction of chaperonin-like molecules designed for specific biotechnological and medical applications. We have to emphasize that we are only beginning to understand how chaperonins work. As pointed out by Lucent et al. [102], most theoretical and experimental research so far has focused on GroEL, a specific prokaryotic chaperonin. Since chaperonins exhibit apparently different modes of action in prokaryotic and eukaryotic organisms, the investigation of these differences may be essential for a complete understanding of the underlying mechanisms and of protein folding itself. This challenging issue has already been addressed using a very simple lattice model [50].

2.2 Mechanical Unfolding and Refolding of Proteins and Their Complexes

One of the functional features of proteins is their response to a wide range of applied forces. Subjected to an applied load, proteins play key roles in cytoskeletal organization [33], mechanics [37], cellular transport [139], signaling [149] and protein degradation [44]. The external force required to unfold a protein is on the order of piconewtons. Since atomic force microscopy (AFM) and laser optical tweezers (LOT) techniques [105, 124, 145] detect forces in the piconewton range, they are useful tools for studying the mechanical unfolding of biomolecules. Two major strategies are used in such studies. In the first, the protein is pulled by a force ramped linearly with time, while the force (mechanical resistance) is monitored as a function of the end-to-end distance. The second strategy is based on the application of a constant force through force-clamp devices. In experiments at a constant pulling speed, the total force experienced by the protein is F = k(vt − x), where k, v, t and x are, respectively, the spring constant (stiffness) of the cantilever, the pulling speed, time, and the displacement of the pulled amino acid from its original position. Typically, in AFM experiments k and v are in the range of 10–1000 pN/nm and 10⁻¹¹–10⁻⁷ Å/ps, respectively. In LOT, the velocity range is similar to that of AFM, whereas typical values of the spring constant are k = 0.001–0.1 pN/nm. Stiffness defines the force resolution of the experiment. Thus, AFM can probe the unfolding of strong proteins with a required Fmax of about a few hundred pN (such as titin [124] or ubiquitin [16]), while LOT is precise enough for studying biomolecules with a mechanical resistance of a few tens of pN (weaker proteins as well as DNA and RNA molecules [98, 147]).
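As a minimal sketch of the relation F = k(vt − x), the following Python fragment evaluates the spring force at a given time. The function names and the numerical values are illustrative assumptions and are not taken from any particular experiment or simulation package. The fragment also converts a pulling speed from Å/ps to nm/s, which shows that the AFM speed range quoted above corresponds to roughly 1 nm/s to 10 μm/s.

```python
# 1 Å/ps = 0.1 nm / 1e-12 s = 1e11 nm/s
ANGSTROM_PER_PS_TO_NM_PER_S = 1.0e11


def speed_to_nm_per_s(v_angstrom_per_ps):
    """Convert a pulling speed from Å/ps to nm/s."""
    return v_angstrom_per_ps * ANGSTROM_PER_PS_TO_NM_PER_S


def pulling_force(k_pn_per_nm, v_nm_per_s, t_s, x_nm):
    """Spring force in pN for constant-speed pulling: F = k (v t - x)."""
    return k_pn_per_nm * (v_nm_per_s * t_s - x_nm)


# Hypothetical example: k = 100 pN/nm, v = 1e-8 Å/ps (about 1000 nm/s);
# after 10 ms the spring anchor has moved about 10 nm while the pulled
# residue has moved 9 nm, so the spring is stretched by about 1 nm.
v = speed_to_nm_per_s(1.0e-8)
force = pulling_force(100.0, v, 0.01, 9.0)  # roughly 100 pN
```

The same function applies to simulations if k, v, t and x are expressed in a consistent set of units.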
Figure 3a shows the force-extension profile obtained in constant-velocity stretching experiments for the Ig8 titin fragment. The peaks in Fig. 3a are associated with the breaking of hydrogen bonds (HBs) between strands A and G (Fig. 3d) in single titin domains of the multidomain construct. Apart from studies of molecular interactions, the AFM technique can also be used to investigate the mechanical stability of proteins, measured by Fmax in the force-extension profile (note that Fmax depends logarithmically on the pulling speed, Fmax ~ ln(v) [31]). By measuring the mechanical stability in different solutions, one can also probe the effect of the environment on hydrogen bonding [101]. Mechanical unfolding studies also provide insights into many other important issues, including the forces that drive biological processes, the binding affinity of ligands to proteins/receptors [34], force-induced intermediate states [35, 93, 135], and the mechanical unfolding free energy landscape (FEL) of proteins [14]. The problem of the FEL is considered in more detail below.
A major limitation of AFM experiments is that they cannot give a detailed, atomic-level characterization of conformational changes under the applied force. Computer simulations may be employed to complement experimental studies. Schulten’s group used all-atom models with explicit water to study the mechanical unfolding of the I27 protein [100]. They deciphered the unfolding pathway of I27 in great detail and demonstrated the existence of a hump due to the breaking of HBs between beta strands A and B (Fig. 3d) [101]. The mechanical unfolding of a number of proteins has also been probed by all-atom simulations with implicit solvent [115]. The major shortcoming of all-atom MD simulations is that the pulling speed is about 6 orders of magnitude higher than that used in AFM experiments. It is unclear whether in silico results obtained under such extreme conditions are meaningful for understanding experiments (strong forces may considerably disturb the FEL), although recent studies claimed that unfolding pathways are not sensitive to pulling forces and speeds [90, 97].

Fig. 3 a Force-extension profile obtained by stretching of the Ig8 titin fragment (adapted from Ref. [124]). Each peak corresponds to the unfolding of a single domain with the maximum resisting force to stretching, Fmax. Smooth curves are fits to the wormlike chain model. b Conceptual plot of the free energy landscape of protein unfolding without (red) and under (blue) an external force. An applied force lowers the unfolding barrier by Fxu, exponentially increasing the unfolding rate constant (ku) but exponentially decreasing the folding rate constant (kf). xu is the distance between the native and transition states and xf is the distance between the transition and denatured states. c Distance to the transition state, xf, in two different force regimes for the titin protein (PDB ID: 1tit). The crossover from the low- to the middle-force regime occurs at fswitch ~ 5 pN. d Cartoon representation of the native state conformation of the I27 domain (PDB ID: 1tit) with eight β-strands labeled: A (4–8), A′ (11–15), B (18–25), C (32–36), D (47–52), E (55–61), F (69–75), G (78–88). The importance of the HBs between the β-strands marked in red is described in the text
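Force-extension peaks such as those in Fig. 3a are commonly fitted with the Marko-Siggia interpolation formula for the wormlike chain. The sketch below evaluates this formula; the function name, the thermal energy at about 298 K and the example parameters are assumptions made for illustration and are not taken from Ref. [124].

```python
KBT_PN_NM = 4.11  # thermal energy kB*T near 298 K, in pN·nm


def wlc_force(extension_nm, contour_length_nm, persistence_length_nm, kbt=KBT_PN_NM):
    """Wormlike chain force (pN) from the Marko-Siggia interpolation formula:

    F = (kBT / p) * [ 1 / (4 (1 - x/L)^2) - 1/4 + x/L ]
    """
    if not 0.0 <= extension_nm < contour_length_nm:
        raise ValueError("extension must satisfy 0 <= x < contour length")
    rel = extension_nm / contour_length_nm  # relative extension x/L
    return (kbt / persistence_length_nm) * (0.25 / (1.0 - rel) ** 2 - 0.25 + rel)


# Illustrative parameters only: contour length 28 nm, persistence length 0.4 nm.
curve = [(x, wlc_force(x, 28.0, 0.4)) for x in (0.0, 7.0, 14.0, 21.0, 26.0)]
```

The force vanishes at zero extension and diverges as the extension approaches the contour length, which produces the characteristic rising flank of each peak.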
The time scale discrepancy (and the related discrepancy in the stretching forces required to induce unfolding) between AFM experiments and simulations can be reduced by using CG models. Nowadays, GPU computing allows CG Go models to reach experimental pulling speeds [176]. CG Go models have been successfully used by many groups to study the mechanical properties of proteins [2, 9, 24, 170]. Despite their simplicity, in many cases they correctly capture the unfolding pathways, FEL and mechanical stability of proteins. For example, a complete description of the mechanical unfolding pathways of single and multidomain ubiquitin at the level of secondary structure was obtained [95]. It was shown that the thermal and mechanical unfolding pathways of the fibronectin type III and I27 domains are different [115]. This is because thermal fluctuations have a more global effect on the entire protein and unfold its most unstable part, whereas the force propagates unwinding from the points at which it is applied. Using a Go model, the mechanical unfolding pathways of the protein DDFLN4 [94] and of two slipknotted proteins (PDB codes 1e2i and 1p6x) [150] were shown to depend on the pulling speed.
CG Go models may be suitable for deciphering the FEL (Fig. 3b). Considering the FEL as a function of the end-to-end distance, one can use the Bell approximation [7] to estimate the distance between the native state (NS) and the transition state (TS), xu, using either the dependence of the unfolding rates on the external force [7] or the dependence of the unfolding force on the pulling speed [31]. The value of xu (Fig. 3b) estimated by the C-alpha Go-like model [25] was in excellent agreement with experimental results [15, 76]. Furthermore, Li showed that xu is defined by the secondary structure content and depends approximately linearly on the contact order [83, 92]; thus, helical proteins have larger distances from the native state to the transition state than beta proteins. It should be noted that the phenomenological Bell theory is based on the assumption that xu does not change under stretching. Recently, applying Kramers theory [81] and assuming that the distance between the NS and the TS is force-dependent, Dudko et al. [29] went beyond the Bell assumption. With the help of the proposed non-linear kinetic theory [29], one can estimate not only the intrinsic rate coefficient, ku, and the distance between the NS and the TS, xu, but also the unfolding barrier, ΔG‡u (Fig. 3b).
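The two routes to xu within the Bell approximation can be illustrated with a short fitting sketch. It assumes the standard Bell form ku(f) = ku0 exp(f xu / kBT) for constant-force data and a linear dependence of Fmax on ln(v) with slope kBT/xu for constant-speed data. The function names and the synthetic data are hypothetical and serve only to show the procedure; they are not results from the cited studies.

```python
import math

KBT_PN_NM = 4.11  # kB*T near 298 K, in pN·nm


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bell_from_rates(forces_pn, rates_per_s, kbt=KBT_PN_NM):
    """Constant-force data: ln ku(f) = ln ku0 + f * xu / kBT.

    Returns (xu in nm, ku0 in 1/s).
    """
    slope, intercept = linear_fit(forces_pn, [math.log(k) for k in rates_per_s])
    return slope * kbt, math.exp(intercept)


def xu_from_fmax(speeds, fmax_pn, kbt=KBT_PN_NM):
    """Constant-speed data: Fmax = (kBT / xu) * ln(v) + const. Returns xu in nm."""
    slope, _ = linear_fit([math.log(v) for v in speeds], fmax_pn)
    return kbt / slope


# Synthetic, noise-free data generated from assumed values xu = 0.3 nm, ku0 = 1e-3 1/s.
true_xu, true_k0 = 0.3, 1.0e-3
forces = [50.0, 100.0, 150.0, 200.0]
rates = [true_k0 * math.exp(f * true_xu / KBT_PN_NM) for f in forces]
xu_est, k0_est = bell_from_rates(forces, rates)

speeds = [1.0e2, 1.0e3, 1.0e4, 1.0e5]
fmax = [KBT_PN_NM / true_xu * math.log(v) + 40.0 for v in speeds]
xu_from_speed = xu_from_fmax(speeds, fmax)
```

With real data the fits would be noisy, and curvature in these plots would indicate that the assumption of a force-independent xu no longer holds.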
One of the most successful application areas of CG Go models has been the estimation of the mechanical stability of proteins [13, 92, 144, 152]. It has been found that helical proteins are less stable than beta proteins and that the unfolding force Fmax may be expressed as a linear function of the contact order [119]. This is understandable because beta proteins have a larger fraction of long-range residue-residue contacts, leading to higher resistance to external perturbation [83]. Using Go models, Cieplak et al. computed Fmax for thousands of proteins [144, 152] and found that the mechanical clamp (the resistance-determining region of a protein) of the strongest proteins does not consist only of hydrogen-bonded β-strands that are sheared during pulling. Structures tied by disulfide bonds were found to provide significantly larger mechanical stability than shear-based mechanical clamps. Novel mechanical clamps were identified and classified [143, 144]. Later on, the high resistance to stretching of the top 13 proteins (cysteine slipknots) was confirmed by all-atom steered molecular dynamics (SMD) simulations [116] and observed experimentally [163]. Recently, a CG model was successfully applied even to proteins with non-trivial structures [150, 151], and the results were confirmed by experiment [45]. For a more detailed review of protein mechanostability, see Chap. 10 of this book, entitled “Mechanostability of proteins and virus capsids.”
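Since both xu and Fmax are discussed in terms of the contact order, a minimal sketch of how a relative contact order can be computed from C-alpha coordinates may be helpful. The contact definition used here (C-alpha distance within 8 Å, sequence separation of at least 3) is an assumption made for illustration; the cited studies may use different contact criteria.

```python
import math


def relative_contact_order(ca_coords, cutoff=8.0, min_separation=3):
    """Relative contact order: RCO = (1 / (L * N)) * sum over contacts of |i - j|.

    L is the chain length and N the number of contacts. A contact is any
    residue pair with sequence separation >= min_separation whose C-alpha
    atoms lie within `cutoff` (same length unit as the coordinates).
    Uses math.dist, available in Python 3.8 and later.
    """
    length = len(ca_coords)
    n_contacts = 0
    total_separation = 0
    for i in range(length):
        for j in range(i + min_separation, length):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                n_contacts += 1
                total_separation += j - i
    if n_contacts == 0:
        return 0.0
    return total_separation / (length * n_contacts)


# Toy six-residue "hairpin" (hypothetical coordinates in Å): two antiparallel
# three-residue strands separated by 5 Å.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
rco = relative_contact_order(hairpin)
```

Long-range contacts, typical of beta structures, raise the value, whereas the local contacts that dominate helices lower it.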
The success of CG Go models is possibly associated with the fact that pulling starts from the native state and that these models are based on the topology of the native state. However, in particular cases one has to be careful with predictions emerging from these simple models. In the case of the DDFLN4 protein, the Go model did not reproduce the peak in the force-extension curve observed in the experiment. It was shown that the occurrence of that peak is due to non-native interactions neglected in the Go model [77]. Thus, in certain cases non-native interactions are important because non-native contacts appear in an intermediate state during the unfolding process. To avoid possible artifacts associated with neglecting non-native interactions, CG models with more realistic potentials may be used. Using the CABS model [78], it was shown that non-native interactions lead to an additional intermediate state along the mechanical unfolding pathway, which was previously detected in AFM experiments [134] and in explicit-solvent all-atom simulations, but not in the CG Go model. Another example of such a case is the force-induced intermediate of ubiquitin, which was not observed in CG Go-model simulations [95] but was detected using the Lund force field [49].
Recently, steered molecular dynamics (SMD) simulations have become a powerful tool for assessing the strength of molecular interactions. The idea behind using SMD simulations is that the mechanical stability, or rupture force (measured as a peak in the force-extension profile), required to unbind a ligand from a receptor is related to the strength of the interaction between them [8, 26, 38, 39, 74, 75, 96, 128]. Over the last 5 years, the SMD method has been implemented in many CG protein simulation packages, including CABS [78], UNRES [142], AWSEM [41] and many others. Because simplified models can sample longer timescales than atomistic models, the application of CG models is a promising direction for studies of the mechanical stability of large biomolecular complexes.
SMD simulations have been used for a wide variety of applications in studies of biological processes and various biomolecules [90]. Going forward, SMD techniques can be used to study cell functions in which proteins are exposed to their native (crowded) environment [167]. One recent application of SMD is to understand the mechanism of virus binding to its host cell [141]. Another issue of great interest is the application of SMD to studying the response of proteins to periodic forces [154]. It is also worth mentioning some important problems for further studies. For instance, it remains unclear whether the distance between the native and transition states (the distance xu (Fig. 3c) that follows from the non-linear theory [29]) depends linearly on the contact order (as was obtained in the linear Bell approximation). Generally, the FEL is deciphered by projecting it onto a one-dimensional space, usually the end-to-end distance. However, such an approximate mapping is not always valid [10]; thus, this issue requires further investigation.
In addition to mechanical unfolding studies, CG models can be used to characterize the refolding kinetics of proteins in the presence of an external force [80, 131]. Many proteins in the human body are subjected to a wide range of mechanical forces and face challenges in reaching their native states. The question of how an external force affects protein refolding remains to be clarified. Single-molecule manipulation experiments have demonstrated that the refolding of a protein under a small force can be probed by the force-clamp technique [32]. If the quench force is smaller than the equilibrium critical force separating the folded and unfolded states, the protein refolds into the native state. Typical time scales for protein folding in the absence of an applied external force vary from microseconds to hours [82]. Note that the refolding process under force can occur on timescales that are a few orders of magnitude longer than those of conventional folding. This is because, in the presence of an external force f, the refolding times increase exponentially with f [7]. Thus, only CG models can be effectively used to study the refolding process under an external load. Using a CG Go model, Kouza and coworkers [80] studied the impact of external forces from 0 to 14 pN on the refolding pathways of several proteins. It was found that there are two force regimes for the refolding of titin, with different distances to the transition state, xf (Fig. 3b, c). In the first, low-force regime, the refolding pathways were in close agreement with the thermal ones. However, the simulation values of xf obtained in this force range did not agree with the experimental ones. The results obtained for xf in the second force regime are in good agreement with experiments (Fig. 3c) [80]. This implies that force-clamp experiments are carried out in the second force regime (Fig. 3c), where the pathways are not the same as the thermal ones. Only if the quench force is smaller than fswitch can the thermal folding pathway be probed by force-clamp experiments. This result calls for caution in interpreting the results of single-molecule manipulation experiments.
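The exponential slowdown of refolding under force can be written compactly in the spirit of the Bell approximation [7]. The expression below is a sketch under that assumption; the symbol \tau_f^{0} for the refolding time at zero force is introduced here only for illustration.

```latex
\tau_f(f) \approx \tau_f^{0}\, \exp\!\left(\frac{f\, x_f}{k_B T}\right),
\qquad
x_f \approx k_B T\, \frac{\mathrm{d} \ln \tau_f}{\mathrm{d} f}
```

Within this picture, the two force regimes correspond to two different slopes of ln τf plotted against f, one below and one above fswitch, and each slope yields its own estimate of xf.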

2.3 Dynamics of Protein-Protein Interactions

Simulating the dynamics of protein-protein interactions is extremely demanding in terms of computational power when classical atomistic modeling tools are used. As demonstrated in numerous works, CG models allow for efficient exploration of the thermodynamics and kinetics of protein complexes [36, 60, 62, 114, 130]. For example, Kim and Hummer [60] investigated the binding affinities of Vps27 complexes with ubiquitin attached to the membrane, where folded domains were rigid and the linkers between them were flexible. They used a C-alpha model with several variants of the potentials for interactions between domains, linker movement and the protein-membrane complex. The predicted binding affinities for the various modeled complexes were in good agreement with the experimental data. Furthermore, the conformations of some ubiquitin complexes were predicted with very good precision (DRMS < 2 Å).
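Assuming that DRMS denotes the distance root-mean-square deviation, i.e. the root-mean-square difference between corresponding intramolecular distances in two structures, it can be computed without any superposition. The sketch below illustrates the measure; the function name and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def drms(coords_a, coords_b):
    """Distance RMS between two conformations with the same number of sites.

    DRMS = sqrt( mean over pairs i < j of (d_ij^A - d_ij^B)^2 ),
    where d_ij is the distance between sites i and j within one structure.
    Uses math.dist, available in Python 3.8 and later.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("both structures need the same number (>= 2) of sites")
    squared_diffs = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Toy example (coordinates in Å): the last site is displaced by 1 Å along the chain.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
value = drms(reference, model)
```

Because only internal distances enter, the measure is invariant to rigid translations and rotations of either structure.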
Interestingly, accurate values of binding affinities could also be determined with an even simpler model [36]. In this case, the Brownian dynamics of the barnase-barstar complex was simulated with a model in which three amino acids were represented by one bead. The computed kinetic data of the association process corresponded well with the experiment.
The binding of a protein coactivator to an enzyme belongs to a class of extremely important processes that are usually difficult to study because of the relatively large protein rearrangements involved. The characterization of such processes has recently been the aim of CG simulation studies [28, 84]. In one of them, Kurcinski and Kolinski [84] applied the CABS model to describe the activation of the Retinoid X Receptor (RXR) by 9-cis retinoic acid and the TRAP220 coactivator. They focused on specific transition states. The results agreed well with the experimental data, and a two-stage sequential reaction mechanism could be suggested. Interestingly, the simulations were conducted with a fully flexible peptide coactivator (11 residues) and a moderately flexible receptor (238 residues) whose conformation was restrained to the vicinity of its experimental structure (see Fig. 4 for the scheme of the multistage procedure). The resulting extent of conformational sampling was far larger than that attainable with classical all-atom simulations.
Apart from using restraints from experimental structures to maintain the protein fold, one can also use an elastic network model (ENM), as was done by Hall and Sansom [43]. In this study, correct structures of the Cohesin (162 residues)-Dockerin (60 residues) complex were predicted with a CG molecular dynamics (CG-MD) model in which each amino acid was represented by four beads. About 80% of the interfacial residues were identified correctly, and two different modes of ligand binding were found, in good agreement with experimental data.
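The idea of maintaining a fold with an elastic network can be sketched as follows: harmonic springs connect all bead pairs that are close in the reference structure, and each spring penalizes deviations from its reference length. The cutoff, the force constant and the function names below are illustrative assumptions, not the parameters used in Ref. [43].

```python
import math
from itertools import combinations


def build_elastic_network(ref_coords, cutoff=7.0):
    """Return a list of springs (i, j, d0) for bead pairs within `cutoff`
    of each other in the reference structure (d0 is the reference distance)."""
    springs = []
    for i, j in combinations(range(len(ref_coords)), 2):
        d0 = math.dist(ref_coords[i], ref_coords[j])
        if d0 <= cutoff:
            springs.append((i, j, d0))
    return springs


def elastic_energy(coords, springs, force_constant=10.0):
    """Harmonic restraint energy: E = sum over springs of (k / 2) * (d_ij - d0_ij)^2."""
    return sum(
        0.5 * force_constant * (math.dist(coords[i], coords[j]) - d0) ** 2
        for i, j, d0 in springs
    )


# Toy three-bead chain (hypothetical coordinates in Å).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
network = build_elastic_network(reference)       # beads 0-1 and 1-2 are connected
distorted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
penalty = elastic_energy(distorted, network)     # one spring stretched by about 1 Å
```

In practice this term is added to the CG force field, so that local flexibility is retained while the overall fold stays close to the reference.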
With regard to the large-scale dynamics of protein systems, another promising and presently active field is the CG dynamics of actin filaments [12, 21]. Because of the scale of the system, it is extremely challenging to simulate myosin binding to actin filaments by all-atom MD. A multiscale model [157] enabled the observation of the myosin motor and provided insight into its action. In this case, three levels of coarse-graining were introduced: chains of secondary structure elements, domains and molecules. The movement of each component was simulated by Brownian dynamics. A more detailed, physicochemical view of the myosin-actin complex was recently obtained with a CG simulation model [114] in which each bead represented a single amino acid. In this case, the conclusions also concerned more general thermodynamic aspects of protein-protein association.

Fig. 4 Multiscale procedure for the description of binding between the Retinoid X Receptor (RXR) and the peptide (TRAP220) cofactor using CABS CG dynamics [84]. The procedure starts from the generation of input data for a receptor and a protein cofactor. In the next step, the receptor and the cofactor are put together in many random configurations, which are subsequently subjected to CABS CG simulation. Various types of data stored along the procedure are shown in bold frames, while the applied computational methods are shown in thin frames
Another popular and important issue in protein-protein dynamics, to which diverse levels of coarse-graining are applied, is protein aggregation. All-atom MD simulations in explicit solvent can provide insights into the early stages of the aggregation of short peptides derived from full-length amyloidogenic proteins [6, 73, 79, 111, 158]. Larger complexes and longer timescales can be accessed using CG models. In the simplest CG models, a single unit (a cuboid [175] or a tube [4]) represents the whole peptide, while in the most detailed models each amino acid is represented by a few pseudo-atoms [20, 103, 107, 110, 125, 168]. Many practical applications of CG models have been outlined in recent reviews [64, 108]. Dramatic progress has recently been achieved in the CG modeling of large polyprotein complexes (made up of many copies of the same or different proteins) [130]. In their review, Saunders and Voth present two general classes of CG methods: mapping methods, which transfer information from one level to another only during parameterization, and bridging methods, which connect different scales of representation during simulation.
The major challenge in modeling the dynamics of protein interactions seems to be the same as that outlined in reviews of the performance of protein docking techniques [22, 162, 174], namely the treatment of substantial conformational changes. CG simulation models offer perhaps the most promising means of modeling extensive backbone dynamics in the near future.

2.4 Dynamics of Membrane Proteins

Membrane proteins play an important role in cell biology. They are responsible for signaling, molecular transport across lipid bilayers, maintaining cell structural stability and the control of cell-cell interactions. Although 20 to 30% of all open reading frames (ORFs) are predicted to encode membrane proteins, membrane proteins account for less than 1% of all known 3D protein structures [112]. Moreover, these proteins are embedded in different types of lipid bilayers. The interaction with lipids is essential both for protein function (e.g. it can affect integral membrane protein activity [89]) and for membrane properties such as hydrophobic thickness or lipid composition [48]. The complex nature of membrane-protein systems makes CG molecular dynamics (CG-MD) simulations a valuable approach to the investigation of the dynamics, structure-function relationships and stability of membrane-protein systems [64]. One of the best performing, and probably the most recognized, CG-MD approaches is based on the MARTINI force field [104], which uses a four-to-one atom mapping. Only four main types of interaction sites are defined: polar (P), non-polar (N), apolar (C), and charged (Q). Each particle type has a number of subtypes, allowing accurate representation of solvent, protein and membrane structures. This approach enables the treatment of very large systems (corresponding to more than 500,000 atoms) and offers timescales above 100 μs, which are far beyond the scope of classical all-atom MD. The method was successfully applied by Sansom and co-workers to the prediction of protein positions within lipid bilayers [136]. Self-assembly CG-MD simulations, starting from a protein surrounded by randomly positioned water and lipid molecules, were conducted for 91 different protein systems. The resulting structures gave insights into direct protein-lipid interactions, membrane distortion around different proteins and the localization of proteins in lipid bilayers, in agreement with experimental data (see Fig. 5).
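The four-to-one mapping can be pictured as replacing each group of about four heavy atoms by a single interaction site. The sketch below places a bead at the center of mass of a user-defined atom group. It is a generic illustration: the function name, the bead label "B1" and the coordinates are hypothetical, and the actual assignment of atoms to beads and bead types in MARTINI follows the force field's own chemically motivated tables.

```python
def map_to_beads(atom_coords, atom_masses, bead_definitions):
    """Place each CG bead at the center of mass of its assigned atoms.

    bead_definitions maps a bead label to the list of atom indices it replaces.
    Returns a dict: bead label -> (x, y, z).
    """
    beads = {}
    for bead_name, atom_indices in bead_definitions.items():
        total_mass = sum(atom_masses[i] for i in atom_indices)
        beads[bead_name] = tuple(
            sum(atom_masses[i] * atom_coords[i][axis] for i in atom_indices) / total_mass
            for axis in range(3)
        )
    return beads


# Hypothetical example: four carbon atoms in a row (coordinates in Å)
# collapsed into one bead labeled "B1".
atom_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
atom_masses = [12.011] * 4
beads = map_to_beads(atom_coords, atom_masses, {"B1": [0, 1, 2, 3]})
```

Reducing the number of particles in this way, together with the smoother CG potentials, is what makes the large system sizes and long timescales mentioned above accessible.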
CG-MD simulations applying the MARTINI force field were also used to investigate helix association and the dimerization of membrane proteins. Sengupta and coworkers conducted a set of CG-MD simulations, each lasting 25 μs, to study the association mechanism of glycophorin A and two disruptive mutants, T87F and a triple mutant of the GxxxG motif (G79LG83LG86L), embedded in a DPPC lipid membrane model [138]. In each case, dimers formed within the first 5 μs. The wild-type dimer packed in a right-handed manner, and the structure was consistent with the native structures determined by NMR studies [146]. The analysis of free energy profiles revealed that the two dimers formed by the mutated peptides were less stable, by about 8–10 kJ mol⁻¹, as a result of the disruption of the lipid bilayer surrounding the protein and less efficient helix-helix packing [138]. The observed differences became apparent only after extensive sampling, which indicates the importance of long, microsecond simulation time scales.

Fig. 5 Final structures from self-assembly CG-MD simulations, starting from a protein surrounded by randomly positioned water and lipid molecules [129]. The figure presents the results of four simulations: A: cytochrome bc1 complex, B: putative metal-chelating ABC transporter, C: quinol-fumarate reductase and D: Mg²⁺ transporter. Water, ion and DPPC lipid tail particles are excluded for clarity. The backbone trace of the protein is shown in blue. The particle colors are: phosphate in DPPC lipid headgroups: red; glycerol linker in the lipid: yellow; choline in PC headgroups: blue. Picture created based on materials available in the CG Database [129]
A multiscale MD approach (combining CG-MD and all-atom MD simulations) was used by Kalli and coworkers [57] to explore the formation of an αIIb/β3 integrin transmembrane (TM) helix hetero-dimer in a DPPC membrane model. The CG-MD simulations were performed using a high-throughput methodology [42] that enabled automatic running of multiple self-assembly simulations and statistical analysis over an ensemble of approximately 100 structures. Dimer formation usually occurred within a few hundred nanoseconds of CG-MD. The resulting dimers were submitted to further assessment and refinement using all-atom MD simulation. Comparison of the final structure of the modeled dimer with the available αIIb/β3 integrin NMR structure (PDB ID: 2K9J [88]) yielded a Cα RMSD of 2.2 Å for the TM region, a similar crossing angle of 30 ± 3° and a helix-helix interface formed by the same residues. The results indicate that a purely computational approach may yield hetero-dimer models with an accuracy similar to that of the NMR method.
Recently, Periole and coworkers applied large-scale CG-MD simulations to study the energetics of the receptor-receptor dimer interface of the G protein-coupled receptor (GPCR) rhodopsin [118]. The procedure involved self-assembly simulations of multiple copies of rhodopsin embedded in a lipid membrane over time scales ranging from 10 to 100 μs. During the simulations, the potentials of mean force (PMFs) were computed for pairs of rhodopsin molecules along different interfaces. The resulting data pointed to the most stable rhodopsin-rhodopsin conformation, involving a symmetrical Helix1/Helix8 interface. The observed interface was also in agreement with recent cross-linking experiments [69] and EM density maps [126]. This approach, based on extensive CG-MD simulations, may also be used to investigate homo- and hetero-dimer interfaces of other members of the GPCR family.
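In its simplest form, a potential of mean force along a distance coordinate can be estimated by Boltzmann inversion of a histogram, W(r) = −kBT ln P(r). The sketch below illustrates only this basic relation and neglects the Jacobian correction for radial coordinates. It is not the protocol of Ref. [118], since production PMFs are usually obtained with enhanced sampling schemes. The function name, the bin width and the sample values are hypothetical.

```python
import math

KBT_KJ_PER_MOL = 2.494  # kB*T at 300 K, in kJ/mol


def pmf_from_samples(distances, bin_width, kbt=KBT_KJ_PER_MOL):
    """Boltzmann inversion of a distance histogram.

    Returns a dict: bin center -> W(r) = -kBT ln P(r), shifted so that the
    lowest value is zero. Bins without samples are omitted.
    """
    counts = {}
    for r in distances:
        index = int(r // bin_width)
        counts[index] = counts.get(index, 0) + 1
    total = len(distances)
    raw = {
        (index + 0.5) * bin_width: -kbt * math.log(n / total)
        for index, n in counts.items()
    }
    lowest = min(raw.values())
    return {center: raw[center] - lowest for center in sorted(raw)}


# Hypothetical samples (nm): the pair spends 80% of the time near 1.1 nm.
samples = [1.1] * 8 + [2.1] * 2
pmf = pmf_from_samples(samples, bin_width=1.0)
```

In this toy case the less populated bin lies kBT ln 4 above the more populated one, which shows how population ratios translate into free energy differences.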
The above examples illustrate some of the CG-MD methods that use the MARTINI force field in studies of membrane-protein systems. A wide variety of other CG methods currently exist, and new force fields are being developed [3, 59, 61, 120, 148, 179]. Recently, an extension of the CABS coarse-grained model to modeling the effect of the membrane environment (CABS-membrane [120]) has been applied to ab initio folding simulations of 10 short helical membrane proteins. The CABS-membrane simulations started from random protein conformations situated outside the membrane environment and allowed for full flexibility of the modeled proteins during their spontaneous insertion into the membrane. In the obtained trajectories, models close to the experimental membrane structures were found (see Fig. 6). Another class of approaches is based on the combination of CG models with the Gaussian network model (GNM) and/or the anisotropic network model (ANM). These methods were used to investigate the mechanism of the L-arginine (Arg)/agmatine (Agm) antiporter (AdiC) [18] and to predict the functional motions of the outer membrane transporter and signal transducer FecA [137].

Fig. 6 Membrane insertion and folding of the 1A91 protein observed in CABS-membrane ab initio simulations [120]. a Example simulation snapshots illustrating the insertion and folding mechanism, b evolution of the RMSD values (reflecting similarity to the experimental structure) vs simulation time, c comparison of the highest accuracy model obtained in the simulations (RMSD = 2.2 Å) with the experimental structure (colored in green)
Despite the limitations of CG models, such as the united-atom representation and the simplified energy function, there is a growing need for improved CG computational methods for studying the function and dynamics of large and complex protein-membrane systems. CG-based methods are rapidly advancing and may become invaluable tools for the exploration of some fundamental events that are otherwise still not reachable by biochemical experiments.

2.5 Intrinsically Unfolded Proteins

For decades, the thermodynamically stable conformation of a protein was usually treated as the state responsible for its biological function. Nevertheless, at the end of the 20th century the research community realized that intrinsically disordered proteins (IDPs) and proteins with intrinsically disordered regions (IDRs) are ubiquitous in nature and that they can retain their functionality [40, 106, 160, 161, 172]. Conformational studies of these proteins are experimentally extremely challenging [30], particularly because of their large structural heterogeneity and aggregation tendency. With the boom in IDP studies, computer simulation models have emerged as useful tools for the description of IDP conformational ensembles [17, 122, 123]. Since effective searching of the conformational space is the major advantage of CG models, they can be the methods of choice for the broadest possible sampling of conformational disorder.
Owing to their flexibility, disordered proteins have an increased tendency to form protein-protein complexes. During binding, they can form a far larger number of interaction contacts than folded structures. This idea is called the “fly-casting mechanism”, and it was illustrated by Shoemaker et al. [140], who investigated the kinetics of IDP binding to a receptor using a free energy functional based on a simplified scheme of amino acid contacts.
Nevertheless, CG simulations of pKID-KIX complexes [47] indicated that the
increased binding affinity cannot be attributed solely to the greater capture radius of
IDPs. The kinetic analysis of this process was based on simulations using a CG Go
model with a continuum C-alpha chain representation, and the results were compared
with available experimental data for various ordered and disordered complexes.
Interestingly, it was found that the coupling of folding with binding of IDPs leads to
a significant reduction in the binding free-energy barrier. This work also discusses
the roles of other structural factors important for this particular association.
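
To make the idea of a Go-type C-alpha model more concrete, the following is a minimal sketch under stated assumptions, not the model used in [47]. It assumes a 12-10 attractive term for residue pairs that are in contact in the native structure and a purely repulsive term for all other pairs. The function names (native_contacts, go_energy), the 8 Å contact cutoff, and the parameters eps, sigma and min_sep are illustrative choices made for this example only.

```python
import numpy as np

def native_contacts(native, cutoff=8.0, min_sep=3):
    """List (i, j, r0) for C-alpha pairs closer than `cutoff` in the native structure."""
    native = np.asarray(native, dtype=float)
    contacts = []
    for i in range(len(native)):
        for j in range(i + min_sep, len(native)):
            r0 = np.linalg.norm(native[i] - native[j])
            if r0 < cutoff:
                contacts.append((i, j, r0))
    return contacts

def go_energy(coords, contacts, eps=1.0, sigma=4.0, min_sep=3):
    """Go-like energy: native pairs attract (12-10 form), all other pairs only repel."""
    coords = np.asarray(coords, dtype=float)
    native_r0 = {(i, j): r0 for i, j, r0 in contacts}
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            r = np.linalg.norm(coords[i] - coords[j])
            r0 = native_r0.get((i, j))
            if r0 is not None:
                x = r0 / r
                # 12-10 term with its minimum of -eps exactly at r = r0
                energy += eps * (5.0 * x**12 - 6.0 * x**10)
            else:
                # non-native pairs contribute excluded-volume repulsion only
                energy += eps * (sigma / r) ** 12
    return energy
```

Because only native contacts are attractive, the native structure is the energy minimum by construction. This is why such models are inexpensive enough for the repeated binding and unbinding events needed in a kinetic analysis, and also why they cannot capture effects of non-native interactions without additional terms.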
Abeln and Frenkel analyzed other aspects of how intrinsically disordered regions
(IDRs) can influence the protein association process, using Monte Carlo (MC)
simulations on a cubic lattice with a C-alpha representation [1]. The simulation
results provided intriguing insights into the effect of IDRs on protein structure.
The authors indicated that proteins with hydrophobic binding motifs lacking
neighboring IDRs tend to aggregate and consequently form amyloids.
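
The general scheme of a lattice Monte Carlo simulation can be sketched as follows. This is a generic illustration, not the model or move set of [1]: it assumes a two-letter (hydrophobic/polar) contact energy, uses only chain-end moves, and applies the standard Metropolis acceptance rule. All names (contact_energy, end_move, mc_step) and the value of eps are hypothetical. Production lattice models typically add further local moves, such as corner flips and crankshaft moves, to sample conformations efficiently.

```python
import math
import random

# The six nearest-neighbor directions on a simple cubic lattice
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def contact_energy(chain, hydrophobic, eps=-1.0):
    """Add eps for every non-bonded hydrophobic pair on adjacent lattice sites."""
    index_at = {site: k for k, site in enumerate(chain)}
    energy = 0.0
    for i, (x, y, z) in enumerate(chain):
        if not hydrophobic[i]:
            continue
        for dx, dy, dz in NEIGHBORS:
            j = index_at.get((x + dx, y + dy, z + dz))
            # j > i + 1 skips bonded neighbors and counts each pair once
            if j is not None and j > i + 1 and hydrophobic[j]:
                energy += eps
    return energy

def metropolis_accept(delta_e, temperature, rng=random):
    """Metropolis rule: always accept downhill moves, uphill with Boltzmann probability."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def end_move(chain, rng=random):
    """Propose moving one chain end to a lattice site next to its bonded neighbor."""
    trial = list(chain)
    move_head = rng.random() < 0.5
    anchor = trial[1] if move_head else trial[-2]
    dx, dy, dz = rng.choice(NEIGHBORS)
    site = (anchor[0] + dx, anchor[1] + dy, anchor[2] + dz)
    if site in trial:
        return None  # excluded volume: the site is already occupied
    trial[0 if move_head else -1] = site
    return trial

def mc_step(chain, hydrophobic, temperature, rng=random):
    """One Monte Carlo step: propose an end move and accept or reject it."""
    trial = end_move(chain, rng)
    if trial is None:
        return chain
    delta_e = contact_energy(trial, hydrophobic) - contact_energy(chain, hydrophobic)
    return trial if metropolis_accept(delta_e, temperature, rng) else chain
```

Calling mc_step repeatedly on a list of integer (x, y, z) tuples generates a trajectory of self-avoiding conformations. Temperature is expressed here in units of the contact energy, which is a simplification adopted for this sketch.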
The ability of some IDPs to fold upon binding has been extensively studied using
CG simulation models [27, 159, 165, 166, 169]. A multiscale model was used to
generate the pathway of IDP folding induced by binding to its receptor [169]. The
method included a step of CG simulation with C-alpha representation and optimal
path calculation at an atomic level. The binding process was simulated as fully
flexible and the role of non-native interactions was stressed. In other studies [165,
166] the authors characterized an ensemble of transition states of p27Kip1 protein
binding to a rigid structure of a cyclin A–Cdk2 complex. In this case a knowledge-based
potential was utilized to investigate some aspects of the folding mechanism of
this protein. Intrinsically disordered proteins frequently serve as flexible linkers of
protein domains. CG modeling of such systems was reviewed by Zhou [177].
Similarly to protein structure prediction, IDP modeling approaches can be divided
into de novo methods (relying on the predictive power of the method alone) and those
utilizing sparse experimental data. The CG C-alpha model of Norgaard et al. [113] was
designed to simulate disordered proteins and was parametrized using data from nuclear
magnetic resonance spin-labeling experiments on the Δ131Δ fragment of Staphylococcal
nuclease. Importantly, such an approach can also be applied using data from
MD trajectories or other experiments.
Interestingly, 2D lattice models have recently been used to explain why
sequence-based disorder prediction methods perform worse for smaller proteins (or
segments) than for larger ones. Such a simple simulation model enabled a novel
insight into the basic determinants of protein disorder: amino acid composition and
chain length [153].
As shown above, CG models, even very simplistic ones, have provided many important
insights into IDP and IDR dynamics. However, the potential of CG
modeling does not seem to be sufficiently exploited in the field [64], perhaps because
of the relatively recent interest in the area.

3 Conclusions and Perspectives

An obvious advantage of CG protein simulations is that larger protein systems can
be studied and longer timescales can be accessed than is possible using
atomic-resolution MD [64, 117]. Apart from expanding these limits, the speed-up offered
by CG models brings many new opportunities for the design of extensive ‘in
silico experiments’ [155], such as: comparative dynamics for a large set of proteins
[156], comprehensive mutation analysis [70], scanning the parameters of a simulation
model to see how they affect simulation results [173], or construction of databases by
high-throughput simulation protocols [19].
CG protein modeling already has a history of a few decades. The last decade
showed a dramatic increase in CG modeling studies of large biomolecules [64, 155].
We can expect that this trend will continue in the foreseeable future, since
atomic-resolution MD is far too slow for studies of many practical problems. The current
need for computer-enhanced studies of large biomolecules is mostly due to the recent
growth of experimental structural biology data, which require rapid interpretation and
validation of emerging hypotheses [56, 117].
In this chapter, we described recent applications of CG simulations to some
representative and important topics of protein dynamics. The work demonstrates the utility
of CG modeling in understanding real biological problems. As shown, there are many
variants of CG simulation tools and many successful strategies in which CG models
are an important component. Future developments are expected to include CG models
in unified/integrative structure modeling procedures utilizing a wide range of
experimental and computational techniques [64, 127, 133]. Consequently, the integration
of protein CG models with CG models of other molecules (lipids, nucleic
acids, carbohydrates), as well as the combination of CG models with atomic MD (the
so-called multiscale approach), should be the focus of further research.

Acknowledgements We thank Dr. Joanna Sulkowska for critical reading of the section “Mechan-
ical Unfolding and Refolding of Proteins and their Complexes” of the manuscript. We acknowl-
edge partial support from: Foundation for Polish Science TEAM project (TEAM/2011-7/6) co-
financed by the European Regional Development Fund operated within the Innovative Economy
Operational Program; Polish National Science Center (NCN) on the basis of a decision DEC-
2011/01/D/NZ2/05314; Polish National Science Center (NCN) Grant No. NN301071140, Polish
Ministry of Science and Higher Education Grant No. IP2011024371, Polish National Science Center
(NCN) Grant (MAESTRO 2014/14/A/ST6/00088). M. Kouza acknowledges the Polish Ministry
of Science and Higher Education for financial support through “Mobilnosc Plus” Program No.
1287/MOB/IV/2015/0.

References

1. Abeln, S., Frenkel, D.: Disordered flanks prevent peptide aggregation. PLoS Comput. Biol.
4, e1000241 (2008). https://doi.org/10.1371/journal.pcbi.1000241
2. Arad-Haase, G., et al.: Mechanical unfolding of acylphosphatase studied by single-molecule
force spectroscopy and MD simulations. Biophys. J. 99, 238–247 (2010). https://doi.org/10.
1016/j.bpj.2010.04.004
3. Arkhipov, A., Freddolino, P.L., Schulten, K.: Stability and dynamics of virus capsids described
by coarse-grained modeling. Structure 14, 1767–1777 (2006). https://doi.org/10.1016/j.str.
2006.10.003
4. Auer, S., Meersman, F., Dobson, C.M., Vendruscolo, M.: A generic mechanism of emergence
of amyloid protofilaments from disordered oligomeric aggregates. PLoS Comput. Biol. 4,
e1000222 (2008). https://doi.org/10.1371/journal.pcbi.1000222
5. Baumketner, A., Jewett, A., Shea, J.E.: Effects of confinement in chaperonin assisted protein
folding: rate enhancement by decreasing the roughness of the folding energy landscape. J.
Mol. Biol. 332, 701–713 (2003). https://doi.org/10.1016/S0022-2836(03)00929-X
6. Baumketner, A., Shea, J.E.: The structure of the Alzheimer amyloid beta 10–35 peptide probed
through replica-exchange molecular dynamics simulations in explicit solvent. J. Mol. Biol.
366, 275–285 (2007)
7. Bell, G.I.: Models for the specific adhesion of cells to cells. Science 200, 618–627 (1978)
8. Bernetti, M., Cavalli, A., Mollica, L.: Protein-ligand (un)binding kinetics as a new paradigm
for drug discovery at the crossroad between experiments and modelling. Medchemcomm 8,
534–550 (2017). https://doi.org/10.1039/c6md00581k
9. Best, R.B., Hummer, G.: Protein folding kinetics under force from molecular simulation. J.
Am. Chem. Soc. 130, 3706–3707 (2008). https://doi.org/10.1021/ja0762691
10. Best, R.B., Paci, E., Hummer, G., Dudko, O.K.: Pulling direction as a reaction coordinate
for the mechanical unfolding of single molecules. J. Phys. Chem. B 112, 5968–5976 (2008).
https://doi.org/10.1021/Jp075955j
11. Betancourt, M.R., Thirumalai, D.: Exploring the kinetic requirements for enhancement of
protein folding rates in the GroEL cavity. J. Mol. Biol. 287, 627–644 (1999). https://doi.org/
10.1006/jmbi.1999.2591
12. Bindschadler, M.: Modeling actin dynamics. Wiley Interdisciplinary Rev. Syst. Biol. Med. 2,
481–488 (2010). https://doi.org/10.1002/wsbm.62
13. Brockwell, D.J., et al.: Pulling geometry defines the mechanical resistance of a beta-sheet
protein (vol 10, pg 731, 2003). Nat. Struct. Biol. 10, 872–872 (2003). https://doi.org/10.1038/
nsb1003-872b
14. Bustamante, C., Chemla, Y.R., Forde, N.R., Izhaky, D.: Mechanical processes in biochem-
istry. Annu. Rev. Biochem. 73, 705–748 (2004). https://doi.org/10.1146/annurev.biochem.72.
121801.161542
15. Caraglio, M., Imparato, A., Pelizzola, A.: Pathways of mechanical unfolding of FnIII(10): low
force intermediates. J. Chem. Phys. 133, 065101 (2010). https://doi.org/10.1063/1.3464476
16. Carrion-Vazquez, M., Li, H., Lu, H., Marszalek, P.E., Oberhauser, A.F., Fernandez, J.M.: The
mechanical stability of ubiquitin is linkage dependent. Nat. Struct. Biol. 10, 738–743 (2003).
https://doi.org/10.1038/nsb965
17. Chan, H.S., Zhang, Z., Wallin, S., Liu, Z.: Cooperativity, local-nonlocal coupling, and nonna-
tive interactions: principles of protein folding from coarse-grained models. Annu. Rev. Phys.
Chem. 62, 301–326 (2011). https://doi.org/10.1146/annurev-physchem-032210-103405
18. Chang, S., Hu, J.P., Lin, P.Y., Jiao, X., Tian, X.H.: Substrate recognition and transport behavior
analyses of amino acid antiporter with coarse-grained models. Mol. BioSyst. 6, 2430–2438
(2010). https://doi.org/10.1039/c005266c
19. Chetwynd, A.P., Scott, K.A., Mokrab, Y., Sansom, M.S.: CGDB: a database of membrane
protein/lipid interactions by coarse-grained molecular dynamics simulations. Mol. Membr.
Biol. 25, 662–669 (2008). https://doi.org/10.1080/09687680802446534
20. Chiricotto, M., Tran, T.T., Nguyen, P.H., Melchionna, S., Sterpone, F., Derreumaux, P.:
Coarse-grained and all-atom simulations towards the early and late steps of amyloid fibril
formation. Isr. J. Chem. 57, 564–573 (2017)
21. Chu, J.W., Voth, G.A.: Coarse-grained modeling of the actin filament derived from atomistic-
scale simulations. Biophys. J. 90, 1572–1582 (2006). https://doi.org/10.1529/biophysj.105.
073924
22. Ciemny, M.P., Kurcinski, M., Kamel, K., Kolinski, A., Nawsad, A., Schueler-Furman, O.,
Kmiecik, S.: Protein–peptide docking: opportunities and challenges. Drug Discov. Today. 23,
1530–1537 (2018)
23. Ciemny, M.P., Debinski, A., Paczkowska, M., Kolinski, A., Kurcinski, M., Kmiecik, S.:
Protein-peptide molecular docking with large-scale conformational changes: the p53-MDM2
interaction. Sci. Rep. 6, 37532 (2016). https://doi.org/10.1038/srep37532
24. Cieplak, M., Hoang, T.X., Robbins, M.O.: Folding and stretching in a go-like model of titin.
Proteins 49, 114–124 (2002). https://doi.org/10.1002/prot.10087
25. Clementi, C., Nymeyer, H., Onuchic, J.N.: Topological and energetic factors: what determines
the structural details of the transition state ensemble and “en-route” intermediates for protein
folding? An investigation for small globular proteins. J. Mol. Biol. 298, 937–953 (2000).
https://doi.org/10.1006/jmbi.2000.3693
26. Colizzi, F., Perozzo, R., Scapozza, L., Recanatini, M., Cavalli, A.: Single-molecule pulling
simulations can discern active from inactive enzyme inhibitors. J. Am. Chem. Soc. 132,
7361–7371 (2010). https://doi.org/10.1021/ja100259r
27. De Sancho, D., Best, R.B.: Modulation of an IDP binding mechanism and rates by helix
propensity and non-native interactions: association of HIF1alpha with CBP. Mol. BioSyst. 8,
256–267 (2012). https://doi.org/10.1039/c1mb05252g
28. Di Fenza, A., Rocchia, W., Tozzini, V.: Complexes of HIV-1 integrase with HAT proteins:
multiscale models, dynamics, and hypotheses on allosteric sites of inhibition. Proteins: Struct.
Funct. Bioinf. 76, 946–958 (2009). https://doi.org/10.1002/prot.22399
29. Dudko, O.K., Hummer, G., Szabo, A.: Intrinsic rates and activation free energies from single-
molecule pulling experiments. Phys. Rev. Lett. 96, 108101 (2006). https://doi.org/10.1103/
PhysRevLett.96.108101
30. Eliezer, D.: Biophysical characterization of intrinsically disordered proteins. Curr. Opin.
Struct. Biol. 19, 23–30 (2009). https://doi.org/10.1016/j.sbi.2008.12.004
31. Evans, E., Ritchie, K.: Dynamic strength of molecular adhesion bonds. Biophys. J. 72,
1541–1555 (1997). https://doi.org/10.1016/S0006-3495(97)78802-7
32. Fernandez, J.M., Li, H.B.: Force-clamp spectroscopy monitors the folding trajectory of a
single protein. Science 303, 1674–1678 (2004). https://doi.org/10.1126/science.1092497
33. Fletcher, D.A., Mullins, R.D.: Cell mechanics and the cytoskeleton. Nature 463, 485–492
(2010). https://doi.org/10.1038/nature08908
34. Florin, E.L., Moy, V.T., Gaub, H.E.: Adhesion forces between individual ligand-receptor pairs.
Science 264, 415–417 (1994). https://doi.org/10.1126/science.8153628
35. Fowler, S.B., et al.: Mechanical unfolding of a titin Ig domain: structure of unfolding
intermediate revealed by combining AFM, molecular dynamics simulations, NMR and
protein engineering. J. Mol. Biol. 322, 841–849 (2002). https://doi.org/10.1016/S0022-
2836(02)00805-7
36. Frembgen-Kesner, T., Elcock, A.H.: Absolute protein-protein association rate constants from
flexible, coarse-grained brownian dynamics simulations: the role of intermolecular hydrody-
namic interactions in Barnase-Barstar association. Biophys. J. 99, L75–L77 (2010). https://
doi.org/10.1016/j.bpj.2010.09.006
37. Granzier, H.L., Labeit, S.: The giant protein titin: a major player in myocardial mechan-
ics, signaling, and disease. Circ. Res. 94, 284–295 (2004). https://doi.org/10.1161/01.res.
0000117769.88862.f8
38. Grubmuller, H., Heymann, B., Tavan, P.: Ligand binding: Molecular mechanics calculation of
the streptavidin biotin rupture force. Science 271, 997–999 (1996). https://doi.org/10.1126/
science.271.5251.997
39. Gu, J.F., Li, H.X., Wang, X.C.: A self-adaptive steered molecular dynamics method based
on minimization of stretching force reveals the binding affinity of protein-ligand complexes.
Molecules 20, 19236–19251 (2015). https://doi.org/10.3390/molecules201019236
40. Habchi, J., Tompa, P., Longhi, S., Uversky, V.N.: Introducing protein intrinsic disorder. Chem.
Rev. 114, 6561–6588 (2014). https://doi.org/10.1021/cr400514h
41. Habibi, M., Rottler, J., Plotkin, S.S.: As simple as possible but not simpler: on the reliability
of protein coarse-grained models. Biophys. J. 112, 176a–176a (2017)
42. Hall, B.A., Chetwynd, A.P., Sansom, M.S.: Exploring peptide-membrane interactions with
coarse-grained MD simulations. Biophys. J. 100, 1940–1948 (2011). https://doi.org/10.1016/
j.bpj.2011.02.041
43. Hall, B.A., Sansom, M.S.P.: Coarse-grained MD simulations and protein–protein interactions:
the Cohesin–Dockerin system. J. Chem. Theory Comput. 5, 2465–2471 (2009). https://doi.
org/10.1021/ct900140w
44. Hanson, P.I., Whiteheart, S.W.: AAA + proteins: have engine, will work. Nat. Rev. Mol. Cell
Biol. 6, 519–529 (2005). https://doi.org/10.1038/nrm1684
45. He, C., Genchev, G.Z., Lu, H., Li, H.: Mechanically untying a protein slipknot: multiple
pathways revealed by force spectroscopy and steered molecular dynamics simulations. J.
Am. Chem. Soc. 134, 10428–10435 (2012). https://doi.org/10.1021/ja3003205
46. Heath, A.P., Kavraki, L.E., Clementi, C.: From coarse-grain to all-atom: toward multiscale
analysis of protein landscapes. Proteins 68, 646–661 (2007). https://doi.org/10.1002/prot.
21371
47. Huang, Y., Liu, Z.: Kinetic advantage of intrinsically disordered proteins in coupled folding-
binding process: a critical assessment of the “Fly-Casting” mechanism. J. Mol. Biol. 393,
1143–1159 (2009). https://doi.org/10.1016/j.jmb.2009.09.010
48. Hunte, C.: Specific protein-lipid interactions in membrane proteins. Biochem. Soc. Trans. 33,
938–942 (2005). https://doi.org/10.1042/BST20050938
49. Irback, A., Mitternacht, S., Mohanty, S.: Dissecting the mechanical unfolding of ubiquitin.
Proc. Natl. Acad. Sci. U S A 102, 13427–13432 (2005). https://doi.org/10.1073/pnas.0501581102
50. Jacob, E., Horovitz, A., Unger, R.: Different mechanistic requirements for prokaryotic and
eukaryotic chaperonins: a lattice study. Bioinformatics 23, i240–i248 (2007). https://doi.org/
10.1093/bioinformatics/btm180
51. Jamroz, M., Kolinski, A., Kmiecik, S.: CABS-flex: server for fast simulation of protein struc-
ture fluctuations. Nucleic Acids Res. 41, W427–W431 (2013). https://doi.org/10.1093/nar/
gkt332
52. Jamroz, M., Kolinski, A., Kmiecik, S.: CABS-flex predictions of protein flexibility com-
pared with NMR ensembles. Bioinformatics 30, 2150–2154 (2014). https://doi.org/10.1093/
bioinformatics/btu184
53. Jamroz, M., Orozco, M., Kolinski, A., Kmiecik, S.: Consistent view of protein fluctuations
from all-atom molecular dynamics and coarse-grained dynamics with knowledge-based force-
field. J. Chem. Theor. Comput. 9, 119–125 (2013). https://doi.org/10.1021/ct300854w
54. Jewett, A.I., Baumketner, A., Shea, J.E.: Accelerated folding in the weak hydrophobic
environment of a chaperonin cavity: creation of an alternate fast folding pathway. Proc. Natl.
Acad. Sci. U S A 101, 13192–13197 (2004). https://doi.org/10.1073/pnas.0400720101
55. Jewett, A.I., Shea, J.E.: Reconciling theories of chaperonin accelerated folding with experi-
mental evidence. Cell. Mol. Life Sci. 67, 255–276 (2009)
56. Jung, J., Mori, T., Kobayashi, C., Matsunaga, Y., Yoda, T., Feig, M., Sugita, Y.: GENESIS: a
hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algo-
rithms for biomolecular and cellular simulations. Wiley Interdisciplinary Rev. Comput. Mol.
Sci. 5, 310–323 (2015). https://doi.org/10.1002/wcms.1220
57. Kalli, A.C., Hall, B.A., Campbell, I.D., Sansom, M.S.: A helix heterodimer in a lipid bilayer:
prediction of the structure of an integrin transmembrane domain via multiscale simulations.
Structure 19, 1477–1484 (2011). https://doi.org/10.1016/j.str.2011.07.014
58. Kamerlin, S.C., Vicatos, S., Dryga, A., Warshel, A.: Coarse-grained (multiscale) simulations
in studies of biophysical and chemical systems. Annu. Rev. Phys. Chem. 62, 41–64 (2011).
https://doi.org/10.1146/annurev-physchem-032210-103335
59. Kar, P., Gopal, S.M., Cheng, Y.M., Panahi, A., Feig, M.: Transferring the PRIMO coarse-
grained force field to the membrane environment: simulations of membrane proteins and
helix-helix association. J. Chem. Theor. Comput. 10, 3459–3472 (2014). https://doi.org/10.
1021/ct500443v
60. Kim, Y.C., Hummer, G.: Coarse-grained models for simulations of multiprotein complexes:
application to ubiquitin binding. J. Mol. Biol. 375, 1416–1433 (2008). https://doi.org/10.
1016/j.jmb.2007.11.063
61. Kim, B.L., Schafer, N.P., Wolynes, P.G.: Predictive energy landscapes for folding alpha-helical
transmembrane proteins. Proc. Natl. Acad. Sci. U S A 111, 11031–11036 (2014).
https://doi.org/10.1073/pnas.1410529111
62. Kim, Y.C., Tang, C., Clore, G.M., Hummer, G.: Replica exchange simulations of tran-
sient encounter complexes in protein-protein association. Proc. Natl. Acad. Sci. U S A 105,
12855–12860 (2008). https://doi.org/10.1073/pnas.0802460105
63. Kmiecik, S., Gront, D., Kolinski, A.: Towards the high-resolution protein structure prediction.
Fast refinement of reduced models with all-atom force field. BMC Struct. Biol. 7, 43 (2007).
https://doi.org/10.1186/1472-6807-7-43
64. Kmiecik, S., Gront, D., Kolinski, M., Wieteska, L., Dawid, A.E., Kolinski, A.: Coarse-grained
protein models and their applications. Chem. Rev. 116, 7898–7936 (2016). https://doi.org/10.
1021/acs.chemrev.6b00163
65. Kmiecik, S., Gront, D., Kouza, M., Kolinski, A.: From coarse-grained to atomic-level char-
acterization of protein dynamics: transition state for the folding of B domain of protein A. J.
Phys. Chem. B 116, 7026–7032 (2012). https://doi.org/10.1021/jp301720w
66. Kmiecik, S., Kolinski, A.: Folding pathway of the b1 domain of protein G explored by multi-
scale modeling. Biophys. J. 94, 726–736 (2008). https://doi.org/10.1529/biophysj.107.116095
67. Kmiecik, S., Kolinski, A.: Simulation of chaperonin effect on protein folding: a shift from
nucleation-condensation to framework mechanism. J. Am. Chem. Soc. 133, 10283–10289
(2011). https://doi.org/10.1021/ja203275f
68. Kmiecik, S., Jamroz, M., Kolinski, A.: Multiscale approach to protein folding dynamics. In:
Kolinski, A. (ed.) Multiscale Approaches to Protein Modeling, pp. 281–293. Springer, New
York (2011). https://doi.org/10.1007/978-1-4419-6889-0_12
69. Knepp, A.M., Periole, X., Marrink, S.J., Sakmar, T.P., Huber, T.: Rhodopsin forms a dimer
with cytoplasmic helix 8 contacts in native membranes. Biochemistry 51, 1819–1821 (2012).
https://doi.org/10.1021/bi3001598
70. Koga, N., Takada, S.: Folding-based molecular simulations reveal mechanisms of the rotary
motor F1-ATPase. Proc. Natl. Acad. Sci. U S A 103, 5367–5372 (2006). https://doi.org/10.
1073/pnas.0509642103
71. Kolinski, A., Skolnick, J.: Reduced models of proteins and their applications. Polymer 45,
511–524 (2004). https://doi.org/10.1016/j.polymer.2003.10.064
72. Kolinski, A.: Protein modeling and structure prediction with a reduced representation. Acta
Biochim. Pol. 51, 349–371 (2004). doi: 035001349
73. Kouza, M., Banerji, A., Kolinski, A., Buhimschi, I.A., Kloczkowski, A.: Oligomerization of
FVFLM peptides and their ability to inhibit beta amyloid peptides aggregation: consideration
as a possible model. Phys. Chem. Chem. Phys. 19, 2990–2999 (2017)
74. Kouza, M., Banerji, A., Kolinski, A., Buhimschi, I.A., Kloczkowski, A.: Role of Resultant
Dipole Moment in Mechanical Dissociation of Biological Complexes. Molecules 23, 1995
(2018)
75. Kouza, M., Co, N.T., Li, M.S., Kmiecik, S., Kolinski, A., Kloczkowski, A., Buhimschi, I.A.:
Kinetics and mechanical stability of the fibril state control fibril formation time of polypeptide
chains: A computational study. J. Chem. Phys. 148, 215106 (2018)
76. Kouza, M., Hu, C.K., Li, M.S.: New force replica exchange method and protein folding
pathways probed by force-clamp technique. J. Chem. Phys. 128, 045103 (2008). https://doi.
org/10.1063/1.2822272
77. Kouza, M., Hu, C.K., Zung, H., Li, M.S.: Protein mechanical unfolding: importance of non-
native interactions. J. Chem. Phys. 131, 215103 (2009). https://doi.org/10.1063/1.3272275
78. Kouza, M., Jamroz, M., Gront, D., Kmiecik, S., Kolinski, A.: Mechanical unfolding of
DDFLN4 studied by coarse-grained knowledge-based CABS model. TASK Quarterly 18,
373–378 (2014)
79. Kouza, M., Co, N.T., Nguyen, P.H., Kolinski, A., Li, M.S.: Preformed template fluctuations
promote fibril formation: Insights from lattice and all-atom models. J. Chem. Phys. 142,
145104 (2015). https://doi.org/10.1063/1.4917073
80. Kouza, M., Lan, P.D., Gabovich, A.M., Kolinski, A., Li, M.S.: Switch from thermal to
force-driven pathways of protein refolding. J. Chem. Phys. 146, 135101 (2017).
https://doi.org/10.1063/1.4979201
81. Kramers, H.A.: Brownian motion in a field of force and the diffusion model of chemical
reactions. Physica 7(7), 284–303 (1940). https://doi.org/10.1016/S0031-8914(40)90098-2
82. Kubelka, J., Hofrichter, J., Eaton, W.A.: The protein folding ‘speed limit’. Curr. Opin. Struct.
Biol. 14, 76–88 (2004)
83. Kumar, S., Li, M.S.: Biomolecules under mechanical force. Phys. Rep.-Rev. Sect. Phys. Lett.
486, 1–74 (2010). https://doi.org/10.1016/j.physrep.2009.11.001
84. Kurcinski, M., Kolinski, A.: Theoretical study of molecular mechanism of binding TRAP220
coactivator to Retinoid X Receptor alpha, activated by 9-cis retinoic acid. J. Steroid Biochem.
Mol. Biol. 121, 124–129 (2010). https://doi.org/10.1016/j.jsbmb.2010.03.086
85. Kurcinski, M., Kolinski, A., Kmiecik, S.: Mechanism of folding and binding of an intrinsi-
cally disordered protein as revealed by ab initio simulations. J. Chem. Theor. Comput. 10,
2224–2231 (2014). https://doi.org/10.1021/ct500287c
86. Kurcinski, M., Oleniecki, T., Ciemny, M.P., Kuriata, A., Kolinski, A., Kmiecik, S.: CABS-flex
standalone: a simulation environment for fast modeling of protein flexibility. Bioinformatics,
bty685 (2018)
87. Kuriata, A., Gierut, A.M., Oleniecki, T., Ciemny, M.P., Kolinski, A., Kurcinski, M., Kmiecik,
S.: CABS-flex 2.0: a web server for fast simulations of flexibility of protein structures. Nucleic
Acids Res. 46(W1), W338–W343 (2018)
88. Lau, T.L., Kim, C., Ginsberg, M.H., Ulmer, T.S.: The structure of the integrin alphaIIb-
beta3 transmembrane complex explains integrin transmembrane signalling. EMBO J. 28,
1351–1361 (2009). https://doi.org/10.1038/emboj.2009.63
89. Lee, A.G.: How lipids affect the activities of integral membrane proteins. BBA-Biomembr.
1666, 62–87 (2004). https://doi.org/10.1016/j.bbamem.2004.05.012
90. Lee, E.H., Hsin, J., Sotomayor, M., Comellas, G., Schulten, K.: Discovery through the compu-
tational microscope. Structure 17, 1295–1306 (2009). https://doi.org/10.1016/j.str.2009.09.
001
91. Levitt, M., Warshel, A.: Computer simulation of protein folding. Nature 253, 694–698 (1975).
https://doi.org/10.1038/253694a0
92. Li, M.S.: Secondary structure, mechanical stability, and location of transition state of proteins.
Biophys. J. 93, 2644–2654 (2007). https://doi.org/10.1529/biophysj.107.106138
93. Li, L., Huang, H.H., Badilla, C.L., Fernandez, J.M.: Mechanical unfolding intermediates
observed by single-molecule force spectroscopy in a fibronectin type III module. J. Mol.
Biol. 345, 817–826 (2005). https://doi.org/10.1016/j.jmb.2004.11.021
94. Li, M.S., Kouza, M.: Dependence of protein mechanical unfolding pathways on pulling speeds.
J. Chem. Phys. 130, 145102 (2009). https://doi.org/10.1063/1.3106761
95. Li, M.S., Kouza, M., Hu, C.K.: Refolding upon force quench and pathways of mechanical
and thermal unfolding of ubiquitin. Biophys. J. 92, 547–561 (2007). https://doi.org/10.1529/
biophysj.106.087684
96. Li, M.S., Mai, B.K.: Steered molecular dynamics-a promising tool for drug design. Curr.
Bioinform. 7, 342–351 (2012)
97. Lichter, S., Rafferty, B., Flohr, Z., Martini, A.: Protein high-force pulling simulations yield
low-force results. PLoS ONE 7, e34781 (2012). https://doi.org/10.1371/journal.pone.0034781
98. Liphardt, J., Onoa, B., Smith, S.B., Tinoco Jr., I., Bustamante, C.: Reversible unfolding of
single RNA molecules by mechanical force. Science 292, 733–737 (2001). https://doi.org/10.
1126/science.1058498
99. Liu, X., Shi, D., Zhou, S., Liu, H., Yao, X.: Molecular dynamics simulations and novel drug
discovery. Expert Opin. Drug Discov. 13, 23–37 (2018). https://doi.org/10.1080/17460441.
2018.1403419
100. Lu, H., Isralewitz, B., Krammer, A., Vogel, V., Schulten, K.: Unfolding of titin immunoglob-
ulin domains by steered molecular dynamics simulation. Biophys. J. 75, 662–671 (1998)
101. Lu, H., Schulten, K.: The key event in force-induced unfolding of Titin’s immunoglobulin
domains. Biophys. J. 79, 51–65 (2000). https://doi.org/10.1016/S0006-3495(00)76273-4
102. Lucent, D., England, J., Pande, V.: Inside the chaperonin toolbox: theoretical and computa-
tional models for chaperonin mechanism. Phys. Biol. 6, 015003 (2009). https://doi.org/10.
1088/1478-3975/6/1/015003
103. Malolepsza, E., Boniecki, M., Kolinski, A., Piela, L.: Theoretical model of prion propagation:
a misfolded protein induces misfolding. Proc. Natl. Acad. Sci. USA 102, 7835–7840 (2005)
104. Marrink, S.J., Tieleman, D.P.: Perspective on the Martini model. Chem. Soc. Rev. 42,
6801–6822 (2013). https://doi.org/10.1039/c3cs60093a
105. Marszalek, P.E., Lu, H., Li, H., Carrion-Vazquez, M., Oberhauser, A.F., Schulten, K., Fernan-
dez, J.M.: Mechanical unfolding intermediates in titin modules. Nature 402, 100–103 (1999).
https://doi.org/10.1038/47083
106. Mittag, T., Kay, L.E., Forman-Kay, J.D.: Protein dynamics and conformational disorder in
molecular recognition. J. Mol. Recognit. 23, 105–116 (2010). https://doi.org/10.1002/jmr.961
107. Morriss-Andrews, A., Shea, J.E.: Simulations of protein aggregation: insights from atomistic
and coarse-grained models. J. Phys. Chem. Lett. 5, 1899–1908 (2014). https://doi.org/10.
1021/jz5006847
108. Morriss-Andrews, A., Shea, J.E.: Computational studies of protein aggregation: methods and
applications. Annu. Rev. Phys. Chem. 66, 643–666 (2015). https://doi.org/10.1146/annurev-
physchem-040513-103738
109. Munoz, V., Henry, E.R., Hofrichter, J., Eaton, W.A.: A statistical mechanical model for beta-
hairpin kinetics. Proc. Natl. Acad. Sci. U S A 95, 5872–5879 (1998). https://doi.org/10.1073/
pnas.95.11.5872
110. Nasica-Labouze, J., et al.: Amyloid beta protein and Alzheimer’s Disease: when computer
simulations complement experimental studies. Chem. Rev. 115, 3518–3563 (2015)
111. Nguyen, P.H., Li, M.S., Stock, G., Straub, J.E., Thirumalai, D.: Monomer adds to preformed
structured oligomers of A beta-peptides by a two-stage dock-lock mechanism. Proc. Natl.
Acad. Sci. U.S.A. 104, 111–116 (2007). https://doi.org/10.1073/Pnas.0607440104
112. Nilsson, J., Persson, B., von Heijne, G.: Comparative analysis of amino acid distributions in
integral membrane proteins from 107 genomes. Proteins 60, 606–616 (2005). https://doi.org/
10.1002/prot.20583
113. Norgaard, A.B., Ferkinghoff-Borg, J., Lindorff-Larsen, K.: Experimental parameterization of
an energy function for the simulation of unfolded proteins. Biophys. J. 94, 182–192 (2008).
https://doi.org/10.1529/biophysj.107.108241
114. Okazaki, K.-I., Sato, T., Takano, M.: Temperature-enhanced association of proteins due to
electrostatic interaction: a coarse-grained simulation of Actin-Myosin binding. J. Am. Chem.
Soc. 134, 8918–8925 (2012). https://doi.org/10.1021/ja301447j
115. Paci, E., Karplus, M.: Unfolding proteins by external forces and temperature: the importance
of topology and energetics. Proc. Natl. Acad. Sci. U S A 97, 6521–6526 (2000). https://doi.
org/10.1073/pnas.100124597
116. Peplowski, L., Sikora, M., Nowak, W., Cieplak, M.: Molecular jamming–the cystine slipknot
mechanical clamp in all-atom simulations. J. Chem. Phys. 134, 085102 (2011). https://doi.
org/10.1063/1.3553801
117. Perilla, J.R., et al.: Molecular dynamics simulations of large macromolecular complexes. Curr.
Opin. Struct. Biol. 31, 64–74 (2015). https://doi.org/10.1016/j.sbi.2015.03.007
118. Periole, X., Knepp, A.M., Sakmar, T.P., Marrink, S.J., Huber, T.: Structural determinants of
the supramolecular organization of G protein-coupled receptors in bilayers. J. Am. Chem.
Soc. 134, 10959–10965 (2012). https://doi.org/10.1021/ja303286e
119. Plaxco, K.W., Simons, K.T., Baker, D.: Contact order, transition state placement and the
refolding rates of single domain proteins. J. Mol. Biol. 277, 985–994 (1998). https://doi.org/
10.1006/jmbi.1998.1645
120. Pulawski, W., Jamroz, M., Kolinski, M., Kolinski, A., Kmiecik, S.: Coarse-Grained simula-
tions of membrane insertion and folding of small helical proteins using the CABS model. J.
Chem. Inf. Model. 56, 2207–2215 (2016). https://doi.org/10.1021/acs.jcim.6b00350
121. Rathore, N., Knotts, T.A.T., de Pablo, J.J.: Confinement effects on the thermodynamics of
protein folding: Monte Carlo simulations. Biophys. J. 90, 1767–1773 (2006). https://doi.org/
10.1529/biophysj.105.071076
122. Rauscher, S., Gapsys, V., Gajda, M.J., Zweckstetter, M., de Groot, B.L., Grubmuller, H.:
Structural ensembles of intrinsically disordered proteins depend strongly on force field: a
comparison to experiment. J. Chem. Theor. Comput. 11, 5513–5524 (2015). https://doi.org/
10.1021/acs.jctc.5b00736
123. Rauscher, S., Pomès, R.: Molecular simulations of protein disorder. Biochem. Cell Biol. 88,
269–290 (2010). https://doi.org/10.1139/o09-169
124. Rief, M., Gautel, M., Oesterhelt, F., Fernandez, J.M., Gaub, H.E.: Reversible unfolding of
individual titin immunoglobulin domains by AFM. Science 276, 1109–1112 (1997). https://
doi.org/10.1126/science.276.5315.1109
125. Rojas, A., Liwo, A., Browne, D., Scheraga, H.A.: Mechanism of fiber assembly: treatment of
a beta peptide aggregation with a coarse-grained united-residue force field. J. Mol. Biol. 404,
537–552 (2010)
126. Ruprecht, J.J., Mielke, T., Vogel, R., Villa, C., Schertler, G.F.: Electron crystallography reveals
the structure of metarhodopsin I. EMBO J. 23, 3609–3620 (2004). https://doi.org/10.1038/sj.
emboj.7600374
127. Russel, D., Lasker, K., Phillips, J., Schneidman-Duhovny, D., Velazquez-Muriel, J.A., Sali,
A.: The structural dynamics of macromolecular processes. Curr. Opin. Cell Biol. 21, 97–108
(2009). https://doi.org/10.1016/j.ceb.2009.01.022
128. Rydzewski, J., Nowak, W.: Ligand diffusion in proteins via enhanced sampling in molecular
dynamics. Phys. Life Rev. (2017). https://doi.org/10.1016/j.plrev.2017.03.003
129. Sansom, M.S., Scott, K.A., Bond, P.J.: Coarse-grained simulation: a high-throughput com-
putational approach to membrane proteins. Biochem. Soc. Trans. 36, 27–32 (2008). https://
doi.org/10.1042/BST0360027
130. Saunders, M.G., Voth, G.A.: Coarse-graining of multiprotein assemblies. Curr. Opin. Struct.
Biol. 22, 144–150 (2012). https://doi.org/10.1016/j.sbi.2012.01.003
131. Schafer, K., Oestereich, M., Gauss, J., Diezemann, G.: Force probe simulations using a hybrid
scheme with virtual sites. J. Chem. Phys. 147 (2017)
132. Scheraga, H.A., Khalili, M., Liwo, A.: Protein-folding dynamics: overview of molecular
simulation techniques. Annu. Rev. Phys. Chem. 58, 57–83 (2007). https://doi.org/10.1146/
annurev.physchem.58.032806.104614
133. Schlick, T., Collepardo-Guevara, R., Halvorsen, L.A., Jung, S., Xiao, X.: Biomolecular modeling
and simulation: a field coming of age. Q. Rev. Biophys. 44, 191–228 (2011).
https://doi.org/10.1017/S0033583510000284
134. Schwaiger, I., Kardinal, A., Schleicher, M., Noegel, A.A., Rief, M.: A mechanical unfolding
intermediate in an actin-crosslinking protein. Nat. Struct. Mol. Biol. 11, 81–85 (2004)
135. Schwaiger, I., Kardinal, A., Schleicher, M., Noegel, A.A., Rief, M.: A mechanical unfolding
intermediate in an actin-crosslinking protein. Nat. Struct. Mol. Biol. 11, 81–85 (2004). https://
doi.org/10.1038/nsmb705
136. Scott, K.A., Bond, P.J., Ivetac, A., Chetwynd, A.P., Khalid, S., Sansom, M.S.: Coarse-grained
MD simulations of membrane protein-bilayer self-assembly. Structure 16, 621–630 (2008).
https://doi.org/10.1016/j.str.2008.01.014
137. Sen, T.Z., Kloster, M., Jernigan, R.L., Kolinski, A., Bujnicki, J.M., Kloczkowski, A.: Pre-
dicting the complex structure and functional motions of the outer membrane transporter and
signal transducer FecA. Biophys. J. 94, 2482–2491 (2008). https://doi.org/10.1529/biophysj.
107.116046
138. Sengupta, D., Marrink, S.J.: Lipid-mediated interactions tune the association of glycophorin
A helix and its disruptive mutants in membranes. Phys. Chem. Chem. Phys. 12, 12987–12996
(2010). https://doi.org/10.1039/c0cp00101e
139. Serohijos, A.W., Chen, Y., Ding, F., Elston, T.C., Dokholyan, N.V.: A structural model reveals
energy transduction in dynein. Proc. Natl. Acad. Sci. U S A 103, 18540–18545 (2006). https://
doi.org/10.1073/pnas.0602867103
140. Shoemaker, B.A., Portman, J.J., Wolynes, P.G.: Speeding molecular recognition by using the
folding funnel: the fly-casting mechanism. Proc. Natl. Acad. Sci. 97, 8868–8873 (2000).
https://doi.org/10.1073/pnas.160259697
141. Sieben, C., et al.: Influenza virus binds its host cell using multiple dynamic interactions. Proc.
Natl. Acad. Sci. U S A 109, 13626–13631 (2012). https://doi.org/10.1073/pnas.1120265109
142. Sieradzan, A.K., Jakubowski, R.: Introduction of steered molecular dynamics into UNRES
coarse-grained simulations package. J. Comput. Chem. 38, 553–562 (2017)
143. Sikora, M., Cieplak, M.: Mechanical stability of multidomain proteins and novel mechanical
clamps. Proteins 79, 1786–1799 (2011). https://doi.org/10.1002/prot.23001
144. Sikora, M., Sulkowska, J.I., Witkowski, B.S., Cieplak, M.: BSDB: the biomolecule stretching
database. Nucleic Acids Res. 39, D443–D450 (2011). https://doi.org/10.1093/nar/gkq851
145. Simmons, R.M., Finer, J.T., Chu, S., Spudich, J.A.: Quantitative measurements of force and
displacement using an optical trap. Biophys. J. 70, 1813–1822 (1996). https://doi.org/10.1016/
S0006-3495(96)79746-1
146. Smith, S.O., et al.: Implications of threonine hydrogen bonding in the glycophorin A trans-
membrane helix dimer. Biophys. J. 82, 2476–2486 (2002). https://doi.org/10.1016/S0006-
3495(02)75590-2
147. Smith, S.B., Cui, Y., Bustamante, C.: Overstretching B-DNA: the elastic response of individual
double-stranded and single-stranded DNA molecules. Science 271, 795–799 (1996). https://
doi.org/10.1126/science.271.5250.795
148. Spijker, P., van Hoof, B., Debertrand, M., Markvoort, A.J., Vaidehi, N., Hilbers, P.A.: Coarse
grained molecular dynamics simulations of transmembrane protein-lipid systems. Int. J. Mol.
Sci. 11, 2393–2420 (2010). https://doi.org/10.3390/ijms11062393
149. Stossel, T.P., Condeelis, J., Cooley, L., Hartwig, J.H., Noegel, A., Schleicher, M., Shapiro,
S.S.: Filamins as integrators of cell mechanics and signalling. Nat. Rev. Mol. Cell Biol. 2,
138–145 (2001). https://doi.org/10.1038/35052082
150. Sulkowska, J.I., Sulkowski, P., Onuchic, J.N.: Jamming proteins with slipknots and their free
energy landscape. Phys. Rev. Lett. 103, 268103 (2009). https://doi.org/10.1103/PhysRevLett.
103.268103
151. Sulkowska, J.I., Sulkowski, P., Szymczak, P., Cieplak, M.: Untying knots in proteins. J. Am.
Chem. Soc. 132, 13954–13956 (2010). https://doi.org/10.1021/Ja102441z
152. Sulkowska, J.I., Cieplak, M.: Mechanical stretching of proteins - a theoretical survey of the
protein data bank. J. Phys.-Condens. Mat. 19 (2007). https://doi.org/10.1088/0953-8984/19/
28/283201
153. Szilagyi, A., Gyorffy, D., Zavodszky, P.: The twilight zone between protein order and disorder.
Biophys. J. 95, 1612–1626 (2008). https://doi.org/10.1529/biophysj.108.131151
154. Szymczak, P., Janovjak, H.: Periodic forces trigger a complex mechanical response in ubiq-
uitin. J. Mol. Biol. 390, 443–456 (2009). https://doi.org/10.1016/j.jmb.2009.04.071
155. Takada, S.: Coarse-grained molecular simulations of large biomolecules. Curr. Opin. Struct.
Biol. 22, 130–137 (2012). https://doi.org/10.1016/j.sbi.2012.01.010
156. Takagi, F., Koga, N., Takada, S.: How protein thermodynamics and folding mechanisms are
altered by the chaperonin cage: molecular simulations. Proc. Natl. Acad. Sci. U.S.A. 100,
11367–11372 (2003). https://doi.org/10.1073/pnas.1831920100
157. Taylor, W.R., Katsimitsoulia, Z.: A coarse-grained molecular model for actin-myosin simu-
lation. J. Mol. Graph. Model. 29, 266–279 (2010). https://doi.org/10.1016/j.jmgm.2010.06.
004
158. Thirumalai, D., Reddy, G., Straub, J.E.: Role of water in protein aggregation and amyloid
polymorphism. Acc. Chem. Res. 45, 83–92 (2012). https://doi.org/10.1021/ar2000869
159. Turjanski, A.G., Gutkind, J.S., Best, R.B., Hummer, G.: Binding-induced folding of a natively
unstructured transcription factor. PLoS Comput. Biol. 4, e1000060 (2008). https://doi.org/10.
1371/journal.pcbi.1000060
160. Uversky, V.N.: Introduction to intrinsically disordered proteins (IDPs). Chem. Rev. 114,
6557–6560 (2014). https://doi.org/10.1021/cr500288y
161. Uversky, V.N., Gillespie, J.R., Fink, A.L.: Why are “natively unfolded” proteins unstructured
under physiologic conditions? Proteins: Struct. Funct. Bioinf. 41, 415–427 (2000). https://
doi.org/10.1002/1097-0134(20001115)41:3%3c415:aid-prot130%3e3.0.co;2-7
162. Vajda, S., Kozakov, D.: Convergence and combination of methods in protein-protein docking.
Curr. Opin. Struct. Biol. 19, 164–170 (2009). https://doi.org/10.1016/j.sbi.2009.02.008
163. Valbuena, A., et al.: On the remarkable mechanostability of scaffoldins and the mechanical
clamp motif. Proc. Natl. Acad. Sci. U S A 106, 13791–13796 (2009). https://doi.org/10.1073/
pnas.0813093106
164. Vendruscolo, M., Dobson, C.M.: Protein dynamics: Moore’s law in molecular biology. Curr.
Biol. 21, R68–R70 (2011). https://doi.org/10.1016/j.cub.2010.11.062
165. Verkhivker, G.M.: Protein conformational transitions coupled to binding in molecular recogni-
tion of unstructured proteins: Deciphering the effect of intermolecular interactions on compu-
tational structure prediction of the p27Kip1 protein bound to the cyclin A–cyclin-dependent
kinase 2 complex. Proteins: Struct. Funct. Bioinf. 58, 706–716 (2005). https://doi.org/10.
1002/prot.20351
166. Verkhivker, G.M., Bouzida, D., Gehlhaar, D.K., Rejto, P.A., Freer, S.T., Rose, P.W.: Simulating
disorder–order transitions in molecular recognition of unstructured proteins: where folding
meets binding. Proc. Natl. Acad. Sci. 100, 5148–5153 (2003). https://doi.org/10.1073/pnas.
0531373100
167. Vogel, V., Sheetz, M.: Local force and geometry sensing regulate cell functions. Nat. Rev.
Mol. Cell Biol. 7, 265–275 (2006). https://doi.org/10.1038/nrm1890
168. Wang, Y.M., Latshaw, D.C., Hall, C.K.: Aggregation of A beta(17-36) in the presence of
naturally occurring phenolic inhibitors using coarse-grained simulations. J. Mol. Biol. 429,
3893–3908 (2017)
169. Wang, J., Wang, Y., Chu, X., Hagen, S.J., Han, W., Wang, E.: Multi-scaled explorations of
binding-induced folding of intrinsically disordered protein inhibitor IA3 to its target enzyme.
PLoS Comput. Biol. 7, e1001118 (2011). https://doi.org/10.1371/journal.pcbi.1001118
170. West, D.K., Olmsted, P.D., Paci, E.: Mechanical unfolding revisited through a simple but
realistic model. J. Chem. Phys. 124 (2006). https://doi.org/10.1063/1.2185100
171. Wolynes, P.G., Onuchic, J.N., Thirumalai, D.: Navigating the folding routes. Science 267,
1619–1620 (1995). https://doi.org/10.1126/science.7886447
172. Wright, P.E., Dyson, H.J.: Intrinsically unstructured proteins: re-assessing the protein
structure-function paradigm. J. Mol. Biol. 293, 321–331 (1999). https://doi.org/10.1006/jmbi.
1999.3110
173. Yao, X.Q., Kenzaki, H., Murakami, S., Takada, S.: Drug export and allosteric coupling in
a multidrug transporter revealed by molecular simulations. Nat. Commun. 1, 117 (2010).
https://doi.org/10.1038/ncomms1116
174. Zacharias, M.: Accounting for conformational changes during protein-protein docking. Curr.
Opin. Struct. Biol. 20, 180–186 (2010). https://doi.org/10.1016/j.sbi.2010.02.001
175. Zhang, J., Muthukumar, M.: Simulations of nucleation and elongation of amyloid fibrils. J.
Chem. Phys. 130, 035102 (2009). https://doi.org/10.1063/1.3050295
176. Zhmurov, A., Dima, R.I., Kholodov, Y., Barsegov, V.: Sop-GPU: accelerating biomolecular
simulations in the centisecond timescale using graphics processors. Proteins 78, 2984–2999
(2010). https://doi.org/10.1002/prot.22824
177. Zhou, H.-X.: Polymer models of protein stability, folding, and interactions. Biochemistry
43, 2141–2154 (2004). https://doi.org/10.1021/bi036269n
178. Zhou, H.X., Dill, K.A.: Stabilization of proteins in confined spaces. Biochemistry 40,
11289–11293 (2001). https://doi.org/10.3410/f.1002736.29765
179. Zhou, J., Thorpe, I.F., Izvekov, S., Voth, G.A.: Coarse-grained peptide modeling using a
systematic multiscale approach. Biophys. J. 92, 4289–4303 (2007). https://doi.org/10.1529/
biophysj.106.094425
Physics-Based Modeling of Side
Chain—Side Chain Interactions
in the UNRES Force Field

Mariusz Makowski

Abstract This chapter describes the development of a new model of side-chain–side-chain
interactions for amino-acid side chains, intended for use in the UNRES force field
and in other large-scale simulations. In the model, a polar or charged side chain
consists of two interaction sites, one nonpolar and one polar. General expressions for
the effective energy of interaction between amino acids are given for each kind of
interacting pair. Results of tests with the new UNRES force-field parameters are shown,
together with an extension of the force field to phosphorylated amino acids. Studies of
the influence of particle size on the free-energy profile of hydrophobic interactions,
and of the temperature dependence of the potential of mean force for
side-chain–side-chain interactions, are also presented.

1 Introduction

Proteins are crucial elements of the cell, and simulations of their structure and
dynamics are therefore of great significance in biochemistry, molecular biology,
genetics, medical sciences, and other sciences that investigate processes occurring in
living systems [1–26]. An important problem in protein research is the prediction of
spatial structure. This problem is of crucial importance because correct folding is
necessary for proper function. Knowledge of protein structure is also necessary for
drug design and for research on the interactions of specific antibodies, inhibitors,
enzymes, etc. Experimental methods such as X-ray crystallography and NMR spectroscopy
do not keep up with the need for solved protein structures. As of June 13, 2012,
536,489 sequences had been deposited in the Swiss-Prot database (http://us.expasy.org),
and only 82,522 of them had been solved experimentally; these structures are stored in
the Protein Data Bank (PDB) (http://www.rcsb.org/pdb). For comparison, 4831 folded
protein structures were deposited in the PDB in 2003, whereas 24,156 sequences were
deposited in the Swiss-Prot database from February 2003 to March 2004. Therefore, the
ratio of the number of solved structures to that of known sequences does not increase
substantially. That is why methods for predicting the spatial structures of proteins
from their amino-acid sequences have been developed intensively over the last 20 years
[1–26].

M. Makowski (B)
Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland
e-mail: mariusz.makowski@ug.edu.pl

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_4

In order to predict protein structures, one can use knowledge-based (i.e., based on
structural databases) or physics-based methods. Knowledge-based methods are, for
the time being, more effective than physics-based methods, provided that the database
contains structures similar to the one under consideration. This group includes the
very popular comparative modeling methods [20–22, 24, 25, 27–33], the threading
methods [10, 20, 34, 35], and the fragment methods [36, 37]. The second important
group comprises physics-based methods which, apart from predicting the
three-dimensional structure, enable us to investigate the mechanism of protein folding.
In principle, physics-based methods should be able to predict protein structures from
their amino-acid sequences alone. However, they are also partially based on data from
structural databases [5, 11, 12, 38]. The thermodynamic hypothesis of Anfinsen [39],
according to which the native structure of a protein corresponds to the minimum of its
free energy, is the starting point for these methods. CASP (Critical Assessment of
Techniques for Protein Structure Prediction) experiments have been organized every
other year since 1994. These experiments evaluate methods that predict protein
structures from amino-acid-sequence information alone. Essential information
regarding the CASP experiments can be found at http://predictioncenter.org.
Research on protein–protein interactions, on the interactions of proteins with
other molecules, on the process of folding and conformational changes and, above all,
on functionally important motions is very important. The results of such structural
research very often make it possible to introduce new methods and provide new
directions for the treatment of diseases. In comparison with experimental research,
theoretical calculations have some advantages, such as lower cost, a higher level of
safety, and shorter time. Unfortunately, the effectiveness and accuracy of such methods
for protein structure prediction are not high so far. Methods based on structural
databases are efficient for the prediction of static structures, but the dynamics of a
structure and the description and understanding of the nature of interactions are to a
substantial extent beyond their scope. Therefore, methods based on the physics of
interactions (physics-based methods) are better suited to this purpose. Moreover,
physics-based methods are, in principle, independent of databases or data from other
experimental measurements. Consequently, they enable us to examine structures of
more complicated proteins with degenerate native structures, e.g., the prion
proteins [40, 41], and also to simulate protein-folding pathways.
Because of the large number of atoms in biomolecules, an all-atom representation of a
system is not practical in protein-structure prediction or ab initio folding
simulations, except for small proteins. Millisecond simulations of small proteins in
the all-atom representation are possible [42, 43]; however, they are still too
expensive because of the long computation times required. The use of united-residue
models is, therefore, more practical. That is why there is a need to develop
coarse-grained physics-based interaction potentials for use in large-scale protein
simulations.
The structure of a given protein is unambiguously determined by its amino acid
sequence. The interactions between the amino acid side chains in proteins are very
specific and they are encoded in the protein sequence. On the other hand, coarse-
grained potentials of interactions between side chains, which take all kinds of inter-
actions into consideration, including hydrophobic and electrostatic interactions, as
well as the formation of hydrogen bonds with participating polar groups, have not
yet been developed.
This chapter presents the results of research on a new potential that describes the
interactions between amino-acid side chains in proteins. This new physics-based
potential can also be applied in other large-scale protein simulations. The motivation
for this research was to improve the UNRES (UNited RESidue) [38] force field. In the
present UNRES force field, the side-chain–side-chain interaction potentials are based
on the Gay-Berne functional form [44], which implies spheroidal symmetry of the
potentials. It should be kept in mind, though, that under physiological conditions
some side chains possess a charged or a polar group. It can, therefore, be assumed
that such side chains consist of two parts, namely a hydrophobic “tail” and a
charged/polar “head.” Because of its spheroidal symmetry, the Gay-Berne potential can
be used only to describe uniform interactions such as those between nonpolar side
chains; even in this case, however, it does not describe such interactions completely,
as it does not reproduce the desolvation maximum. Such an improper functional form of
the side-chain–side-chain interaction potential $U_{SC_iSC_j}$ for amphipolar side
chains is therefore probably one of the most important reasons for the imperfection of
the UNRES force field in predicting protein structures. That is why a new model for the
interactions of side chains is proposed. Every side chain is represented by two
centers, namely a nonpolar center, which is placed in the middle of the side chain
(represented by an ellipsoid of revolution), and a charged or polar center placed in
the headgroup of the side chain. The interaction energy is then computed as a sum of
components, which includes the electrostatic interactions of charged or polar centers,
the interactions between the charged or polar center and the nonpolar one, the
interactions between the nonpolar centers, and the energy terms that arise from the
solvent-accessible molecular surface area. New potentials describing the interactions
of side chains were sought on the basis of molecular dynamics simulations of
appropriate model systems in water. The main goal of this research was to develop a new
model and to parameterize it. The new model is physics-based and has been parameterized
by fitting analytical formulas that describe the respective free-energy surfaces to
potentials of mean force (PMF) of pairs of interacting side-chain models in water,
which were calculated by means of molecular dynamics with the AMBER force field [45].
Describing the potential for side-chain–side-chain interactions of biomolecules in the
UNRES force field requires an outline of this force field. The UNRES united-residue
model of polypeptide chains was developed in the groups of Professors Harold A.
Scheraga and Adam Liwo, and described in several papers [38, 46–52]. In the UNRES
model, the polypeptide chain is represented as a sequence of α-carbon atoms (Cα)
connected by virtual bonds whose equilibrium length is 3.8 Å, which corresponds to the
trans configuration of a peptide group. The Cα atoms are not interaction sites; they
only assist in the definition of the geometry of the chain. The side chains (SC) are
defined as single interaction sites attached to the main chain. The interaction sites
of the peptide groups (p) are placed in the middle of the virtual Cα–Cα bonds (Fig. 1).
The virtual-bond valence angles (θ) and virtual-bond dihedral angles (γ) define the
geometry of the polypeptide chain (the Cα backbone), whereas the angles $\alpha_{SC}$
and $\beta_{SC}$ define the position of the side chains relative to the backbone. The
energy function consists of terms accounting for the interactions between side chains
($U_{SC_iSC_j}$), between side chains and peptide groups ($U_{SC_ip_j}$), and between
peptide groups ($U_{p_ip_j}$), correlation (multibody) terms [$U_{corr}^{(i)}$,
i = 3, 4, 5, 6], and local terms that account for the energetics of rotation about the
virtual-bond Cα–Cα axis ($U_{tor}$ and $U_{tord}$), the energetics of bending of the
virtual valence angles ($U_b$), and the different rotameric states of side chains
($U_{rot}$), as shown in Eq. (1). Each term is assigned an appropriate weight; the
weights were determined by hierarchical optimization of the potential-energy function
[47, 48, 53, 54].
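
The geometric part of this description can be illustrated with a short sketch. The code below is not taken from the UNRES package; it is a minimal illustration, assuming a chain given only by Cα coordinates, of how the peptide-group sites (midpoints of the virtual bonds), the virtual-bond valence angles θ, and the virtual-bond dihedral angles γ could be computed. All function names are hypothetical, and the sign of γ follows one common dihedral convention, which may differ from the one used in UNRES.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def peptide_centers(ca):
    """Peptide-group sites p: midpoints of consecutive Ca-Ca virtual bonds."""
    return [tuple((x + y) / 2.0 for x, y in zip(ca[i], ca[i + 1]))
            for i in range(len(ca) - 1)]

def virtual_bond_lengths(ca):
    """Lengths of the Ca-Ca virtual bonds (about 3.8 A for trans peptide groups)."""
    return [_norm(_sub(ca[i + 1], ca[i])) for i in range(len(ca) - 1)]

def valence_angles(ca):
    """Virtual-bond valence angles theta (degrees) for three consecutive Ca atoms."""
    angles = []
    for i in range(len(ca) - 2):
        u = _sub(ca[i], ca[i + 1])
        v = _sub(ca[i + 2], ca[i + 1])
        c = _dot(u, v) / (_norm(u) * _norm(v))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, c)))))
    return angles

def dihedral_angles(ca):
    """Virtual-bond dihedral angles gamma (degrees) for four consecutive Ca atoms."""
    angles = []
    for i in range(len(ca) - 3):
        b1 = _sub(ca[i + 1], ca[i])
        b2 = _sub(ca[i + 2], ca[i + 1])
        b3 = _sub(ca[i + 3], ca[i + 2])
        n1, n2 = _cross(b1, b2), _cross(b2, b3)
        x = _dot(n1, n2)
        y = _dot(_cross(n1, n2), b2) / _norm(b2)
        angles.append(math.degrees(math.atan2(y, x)))
    return angles

# Toy chain with 3.8 A virtual bonds (illustrative coordinates, not a real protein).
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
```

For the toy chain, the valence angles and the single dihedral angle are all right angles, which makes the helper functions easy to check by hand.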

Fig. 1 The UNRES model of polypeptide chains. This figure was taken from Ref. [51]
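$$
\begin{aligned}
U ={}& w_{SC}\sum_{i<j} U_{SC_iSC_j} + w_{SCp}\sum_{i\neq j} U_{SC_ip_j} + w_{el}\sum_{i<j-1} U_{p_ip_j} \\
&+ w_{tor}\sum_{i} U_{tor}(\gamma_i) + w_{tord}\sum_{i} U_{tord}(\gamma_i,\gamma_{i+1}) + w_{b}\sum_{i} U_{b}(\theta_i) \\
&+ w_{rot}\sum_{i} U_{rot}(\alpha_i,\beta_i) + \sum_{i=3}^{6} w_{corr,bb}^{(i)}\,U_{corr,bb}^{(i)} \\
&+ w_{corr;sc}^{(2)}\,U_{corr;rot}^{(2)} + w_{corr;rot}^{(3)}\,U_{corr;rot}^{(3)}
\end{aligned}
\tag{1}
$$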

In the present version of the UNRES force field, the potential of side-chain–side-chain
interactions has the Gay-Berne functional form [44], which accounts for some of the
anisotropy of the interactions. The parameters of this potential were determined by
fitting Gay-Berne-type potential functions to the side-chain contact energies and
correlation functions determined from experimental protein structures selected from
the PDB [38]. The fitted parameters were subsequently refined by hierarchical
optimization of the UNRES energy function [53, 55].
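
To make the anisotropy of a Gay-Berne-type potential concrete, the sketch below implements the original Gay-Berne form for two identical uniaxial particles. It is given here only as an assumption-laden illustration: the variant and the parameters actually used in UNRES may differ, and the function name, argument names, and default values are hypothetical. Here `kappa` is the ratio of the end-to-end to the side-by-side contact distance, and `kappa_prime` is the ratio of the side-by-side to the end-to-end well depth.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gay_berne(r_vec, u1, u2, sigma0=1.0, eps0=1.0,
              kappa=1.0, kappa_prime=1.0, mu=2.0, nu=1.0):
    """Gay-Berne energy of two identical uniaxial sites.

    r_vec  : vector from site 1 to site 2
    u1, u2 : unit vectors along the long axes of the two sites
    """
    # Shape and energy anisotropy parameters.
    chi = (kappa ** 2 - 1.0) / (kappa ** 2 + 1.0)
    kp = kappa_prime ** (1.0 / mu)
    chi_p = (kp - 1.0) / (kp + 1.0)

    r = math.sqrt(_dot(r_vec, r_vec))
    r_hat = [x / r for x in r_vec]
    a, b, c = _dot(r_hat, u1), _dot(r_hat, u2), _dot(u1, u2)

    def angular(x):
        # Orientation-dependent factor shared by sigma and epsilon.
        return (a + b) ** 2 / (1.0 + x * c) + (a - b) ** 2 / (1.0 - x * c)

    # Orientation-dependent contact distance and well depth.
    sigma = sigma0 / math.sqrt(1.0 - 0.5 * chi * angular(chi))
    eps1 = 1.0 / math.sqrt(1.0 - (chi * c) ** 2)
    eps2 = 1.0 - 0.5 * chi_p * angular(chi_p)
    eps = eps0 * eps1 ** nu * eps2 ** mu

    # Shifted Lennard-Jones form in the reduced distance.
    q = sigma0 / (r - sigma + sigma0)
    return 4.0 * eps * (q ** 12 - q ** 6)
```

With `kappa = kappa_prime = 1` the expression reduces to the ordinary Lennard-Jones potential, which is a convenient sanity check. Because the function depends only on the distance and on the orientations of the two long axes, it cannot distinguish a charged "head" from a nonpolar "tail" of the same side chain.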
The following parts of this chapter present work that contributes substantially to the
modification of $U_{SC_iSC_j}$ in the UNRES force field.

2 Analytical Formulas for the Potentials of Mean Force of Interaction of Amino-Acid Side Chains in Water

2.1 Simple Spherical Systems

For versatility and to save computational time, it is advisable to express the
potentials of amino-acid side-chain–side-chain interactions $U_{SC_iSC_j}$ in proteins
and peptides, which are to be used in the UNRES force field, analytically. At the
beginning of the work on the $U_{SC_iSC_j}$ potentials [56], simple analytical formulas
for calculating the free energy of cavity creation ($\Delta F_{cav}$) were proposed for
pairs of spherical or spheroidal interacting sites. These equations were introduced on
the basis of an analysis of the change in the number of water molecules observed upon
the formation of a hydrophobic dimer. To describe this phenomenon analytically, a
Gaussian-overlap model was initially developed, which approximates the density of the
solvent in the solvation sphere by the derivative of a spherical Gaussian density with
respect to the standard deviation of the Gaussian. On the basis of this
differential-Gaussian expression for the density of water in the first hydration
sphere, an approximate analytical expression was developed for the free energy of
cavity creation for spherical particles [56]. This equation is convenient to use in
computer simulations. The proposed model reproduces the desolvation maximum in the
potential of mean force of interacting hydrophobic particles [56]. The desolvation
maximum had not been accounted for in the UNRES force field so far. The
Gaussian-differential model of hydrophobic association has been extended to sites with
spheroidal symmetry, and an approximate expression for the free energy of hydrophobic
association of spheroidal sites was developed [56].
The considerations presented above concern only the cavity potential, which is not
the total effective energy of interaction. Interacting sites can have charged or polar
groups. General expressions for the effective energy of all types of interaction in
these systems are given by Eqs. 2–7 (details can be found in [57–62]):

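$$W_{nn} = E_{vdW} + \Delta F_{cav} \tag{2}$$

$$W_{cc} = E_{vdW} + E_{el} + E_{pol} + E_{pol}^{GB} + \Delta F_{cav} \tag{3}$$

$$W_{cn} = E_{vdW} + E_{pol} + E_{pol}^{GB} + \Delta F_{cav} \tag{4}$$

$$W_{cp} = E_{vdW} + E_{pol}^{GB} + E_{pol} + \Delta F_{cav} + E_{cp} + E_{LJ} \tag{5}$$

$$W_{pn} = E_{vdW} + E_{pol}^{GB} + \Delta F_{cav} \tag{6}$$

$$W_{pp} = E_{vdW} + E_{pol}^{GB} + \Delta F_{cav} + E_{pp} \tag{7}$$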

Depending on the kind of interacting pair (Eqs. 2–7; the subscripts n, c, and p denote
nonpolar, charged, and polar sites, respectively), the potential function (W) contains
all or part of the following terms: Coulombic-energy terms ($E_{el}$),
solvent-polarization terms represented by the Generalized Born model ($E_{pol}^{GB}$),
solute-polarization terms ($E_{pol}$), van der Waals terms ($E_{vdW}$), and terms that
account for the free energy of cavity creation ($\Delta F_{cav}$). The isotropic
Lennard-Jones term ($E_{LJ}$) expresses the van der Waals interaction energy between
two amphiphilic headgroups, $E_{cp}$ is the interaction energy between charged and
polar sites, and $E_{pp}$ is the potential of interaction between two polar headgroups.
Both the Lennard-Jones potential and the Kihara potential, which is a modified version
of the Lennard-Jones potential, were tested to express the energy of van der Waals
interactions [57].
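
The bookkeeping implied by Eqs. 2–7 can be summarized as a lookup table from pair type to the energy terms that enter the effective potential. The sketch below is only an illustration of that structure, assuming that the individual term values have already been computed elsewhere; the dictionary keys, term labels, and function name are hypothetical and do not come from the UNRES code.

```python
# Energy terms entering each effective potential W (Eqs. 2-7).
# Pair labels: n = nonpolar, c = charged, p = polar site.
PAIR_TERMS = {
    "nn": ("E_vdW", "dF_cav"),
    "cc": ("E_vdW", "E_el", "E_pol", "E_GB_pol", "dF_cav"),
    "cn": ("E_vdW", "E_pol", "E_GB_pol", "dF_cav"),
    "cp": ("E_vdW", "E_GB_pol", "E_pol", "dF_cav", "E_cp", "E_LJ"),
    "pn": ("E_vdW", "E_GB_pol", "dF_cav"),
    "pp": ("E_vdW", "E_GB_pol", "dF_cav", "E_pp"),
}

def effective_energy(pair, term_values):
    """Sum the precomputed terms that contribute to W for the given pair type.

    pair        : two-letter label such as "cn" (the order of the letters is ignored)
    term_values : mapping from term label to its numerical value
    """
    key = pair if pair in PAIR_TERMS else pair[::-1]
    if key not in PAIR_TERMS:
        raise ValueError("unknown pair type: %r" % (pair,))
    return sum(term_values[name] for name in PAIR_TERMS[key])
```

For example, `effective_energy("nc", values)` and `effective_energy("cn", values)` return the same sum, since Eq. 4 describes a charged and a nonpolar site regardless of which is listed first.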
Further, the proposed expressions for the effective energy of hydrophobic association
were tested on selected model systems [57], with the aim of proposing new analytical
functional forms for the $U_{SC_iSC_j}$ potential in peptides and proteins. Simple
mathematical formulas have been proposed [57] for the description of all possible
interactions between pairs consisting of the following species: a hydrophobic molecule
(methane), a positively charged ion (ammonium cation), and a negatively charged ion
(chloride anion). For comparison, two models were used to estimate the free-energy
term that comes from hydrophobic association. One is a molecular-surface-area model
with equations given by Rank and Baker [57]; the other is a model based on the
differences between two overlapping Gaussian functions [56]. Both analytical
expressions fit the PMF plots very well. However, the function with the Kihara
potential and the molecular-surface-area component was rejected from further work for
the following two reasons [57]. First, the best fit to the PMF curves was obtained when
the Kihara potential consisted only of the repulsive term. Second, the expression for
the molecular surface area proposed by Rank and Baker can readily be expressed
analytically only for spherical sites. Even though the sample solutes considered in
[57] were spherical, real nonpolar amino-acid side chains can only be approximated by
spheroids. The Gaussian-differential-based cavity potential reproduces the desolvation
maximum in the PMF curve very well [56, 57]. Moreover, the values of the fitted
parameters [57] have physical meaning. The same Gaussian-differential-based expression
is used to represent the cavity-creation free energy of pairs of charged solutes, as
well as of pairs of charged and nonpolar solutes [57].

2.2 Identical and Different Hydrophobic Side Chains

This section presents the results of research on the $U_{SC_iSC_j}$ potentials of pairs
of identical [58] and different [59] hydrophobic molecules that model hydrophobic
amino-acid side chains. Each of these analytical potentials is the sum of a van der
Waals potential and an expression for the free energy of cavity creation [56]. The van
der Waals energy was expressed by the Gay-Berne potential [44], and the free energy of
cavity creation by the Gaussian-differential-based term for a pair of spheroidal
particles [58, 59]. Based on the definition of the model of interaction of two
hydrophobic particles (Fig. 2), the potentials of mean force for the side-chain pairs
studied were determined by umbrella-sampling molecular dynamics simulations in water,
followed by post-processing with the weighted histogram analysis method [59].
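
The post-processing step can be sketched in one dimension. The code below is a minimal illustration of the weighted histogram analysis method (WHAM) under stated assumptions: a single reaction coordinate discretized into bins, one histogram per umbrella window, and known bias energies at the bin centers. It is not the procedure or software used in the cited work, where the PMFs depend on both distance and orientation, and all function names are hypothetical.

```python
import math

def harmonic_bias(x, center, k):
    """Harmonic umbrella restraint, 0.5 * k * (x - center)**2."""
    return 0.5 * k * (x - center) ** 2

def exact_histograms(pmf, bias, n_samples, kT):
    """Noise-free histograms a window would give for a known PMF (for testing)."""
    hist = []
    for b, n in zip(bias, n_samples):
        w = [math.exp(-(f + u) / kT) for f, u in zip(pmf, b)]
        z = sum(w)
        hist.append([n * v / z for v in w])
    return hist

def wham(hist, n_samples, bias, kT, n_iter=5000, tol=1e-10):
    """Unbiased PMF (shifted so that its minimum is zero) from biased histograms.

    hist[i][k] : counts of window i in bin k
    n_samples  : total number of samples in each window
    bias[i][k] : bias energy of window i at the center of bin k
    """
    n_win, n_bins = len(hist), len(hist[0])
    f = [0.0] * n_win  # free-energy offsets of the windows
    for _ in range(n_iter):
        # Unbiased (unnormalized) probability of each bin for the current offsets.
        p = []
        for k in range(n_bins):
            num = sum(hist[i][k] for i in range(n_win))
            den = sum(n_samples[i] * math.exp((f[i] - bias[i][k]) / kT)
                      for i in range(n_win))
            p.append(num / den)
        # Updated offsets from the current probability estimate.
        new_f = [-kT * math.log(sum(p[k] * math.exp(-bias[i][k] / kT)
                                    for k in range(n_bins)))
                 for i in range(n_win)]
        change = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if change < tol:
            break
    total = sum(p)
    pmf = [-kT * math.log(v / total) for v in p]
    lowest = min(pmf)
    return [w - lowest for w in pmf]
```

Feeding the function noise-free histograms generated from a known PMF, as `exact_histograms` does, is a simple way to check that the iteration recovers that PMF before applying it to simulation data. Bins that no window visits would give a zero probability and must be excluded in practice.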
The determined PMFs are four-dimensional functions of the distance between the
geometric centers of the systems studied and of their orientation. The results of the
calculations described in [58, 59] show that the analytical functions fit the PMF
hypersurfaces determined by molecular dynamics simulations well, and that the fitted
parameters of the analytical potentials of side-chain–side-chain interactions have
physical meaning. Moreover, the contact free energies calculated from the PMF curves
correlate well with those determined from PDB data using the quasi-chemical
approximation [38], as shown in Eq. 8.

$$e_{ij}^{PDB} = 1.39(0.20)\,e_{ij} + 0.21(0.40), \qquad R = 0.812 \tag{8}$$

After removing outlier points, the correlation is expressed by Eq. 9.

$$e_{ij}^{PDB} = 1.84(0.17)\,e_{ij} + 1.05(0.33), \qquad R = 0.908 \tag{9}$$

2.3 Like-Charged Side-Chains

A charged side chain should be considered as a structure consisting of a charged
“head” and a nonpolar “tail” [60]. Depending on the mutual orientation of the side
chains, different kinds of interactions are observed: charged headgroup–charged
headgroup, charged headgroup–nonpolar site, and nonpolar site–nonpolar site.
A new model of interaction for charged side chains, which takes into consideration
the presence of a charged headgroup and a nonpolar “tail”, was defined. In the
first approximation, each charged side chain was modeled by two interaction sites.
Additionally, for oppositely charged side-chain pairs, qualitative reproduction of the
shape and location of the minima and maxima in the PMF curves (corresponding to
different orientations) required taking into account the mobility of the charged
headgroup within a given side chain, as well as some other modifications of the
model in order to provide a good fit to the PMF curves.
Fig. 2 Definition of variables describing the location of two spheroidal sites (i and j) with respect
to each other. The vector $\hat{u}^{(1)}_{ij}$ is the unit vector of the long axis of site i, $\hat{u}^{(2)}_{ij}$ is the unit vector of
the long axis of site j, $\hat{r}_{ij}$ is the unit vector pointing from site i to site j, $\theta^{(1)}_{ij}$ is the
angle between the vectors $\hat{r}_{ij}$ and $\hat{u}^{(1)}_{ij}$, $\theta^{(2)}_{ij}$ is the angle between the vectors $\hat{r}_{ij}$ and
$\hat{u}^{(2)}_{ij}$, and $\phi_{ij}$ is the angle of counterclockwise rotation of the vector $\hat{u}^{(2)}_{ij}$ about the vector $\hat{r}_{ij}$ from
the plane defined by the vectors $\hat{u}^{(1)}_{ij}$ and $\hat{r}_{ij}$ when looking from the center of site j toward the
center of site i. This is also Fig. 1 in Ref. [58]

Fig. 3 Illustration of the new model for the interactions of charged and polar side chains. A side
chain of this type consists of a nonpolar site (represented by an ellipsoid of revolution) and a
polar/charged site (represented by a shaded sphere). The center of the polar/charged site of side
chain i is at the distance $d^{(1)}_i$ from the geometric center of that side chain ($SC_i$), which is located
between the polar/charged and nonpolar centers and is represented by a small sphere in the figure,
and that of side chain j is at the distance $d^{(1)}_j$ from the side-chain center ($SC_j$), while the centers
of the nonpolar sites of side chains i and j are at distances $d^{(2)}_i$ and $d^{(2)}_j$, respectively, from their
geometric centers ($SC_i$ and $SC_j$, respectively). The vector $\hat{u}^{(1)}_{ij}$ is the unit vector of the long axis of
the nonpolar site of side chain i, $\hat{u}^{(2)}_{ij}$ is the unit vector of the long axis of the nonpolar site of side
chain j, $\hat{r}_{ij}$ is the unit vector pointing from the geometric center of the nonpolar site of side chain i
to that of side chain j, and $r_{ij}$ is the distance between these two centers. The figure also marks the
distance between the centers of the charged/polar sites of side chains i and j, the distance between
the center of the charged site of side chain i and the center of the nonpolar site of side chain j, and
the distance between the center of the charged site of side chain j and the center of the nonpolar
site of side chain i. This is also Fig. 2 in Ref. [60]

The new model of interactions for charged side chains is presented in Fig. 3.
Every side chain of this type consists of two parts: a nonpolar “tail”, represented
by an ellipsoid of revolution, and a polar/charged “head”, represented by a filled
sphere at one end of the side chain. The center of such a side chain is placed at its
geometric center. Additionally, in this new model for the interactions of charged
and polar side chains, two more distances were defined: those of the charged and
hydrophobic centers from the geometric center of the side chain.

$$W_{cc} = E_{GBerne} + E_{el} + E^{GB}_{pol} + E_{pol} + \Delta F_{cav} + \Delta F^{iso}_{cav} + E_{LJ} \tag{10}$$

In Eq. 10, the effective energy of interaction between two charged amino-acid
side chains is a sum of the following terms: the Gay-Berne potential ($E_{GBerne}$),
which accounts for the van der Waals interactions between nonpolar sites; the
energy of Coulombic interactions between charged sites ($E_{el}$); the polarization
energy, which comes from interactions between charged and nonpolar sites of side
chains ($E_{pol}$); the energy of solvent polarization by charged sites, calculated with
the generalized Born model ($E^{GB}_{pol}$); the free energy of cavity creation for the
charged parts of the side-chain model ($\Delta F^{iso}_{cav}$); the free energy of cavity creation
for the nonpolar parts of the side chains ($\Delta F_{cav}$); and the Lennard-Jones potential
describing the van der Waals interactions between the two charged parts of the side
chains ($E_{LJ}$). The isotropic terms acting between the charged parts of the side
chains were introduced into the $U_{SC_iSC_j}$ potential because, without them, the
“charged head–charged head” orientation cannot be distinguished from the
“nonpolar tail–nonpolar tail” orientation.
The $E_{GBerne}$ energy term is a Gay-Berne-type potential, expressed by Eq. 11.
Previously, Eq. 11 was used to express the complete side chain–side chain
interaction potential in the UNRES force field.

$$E_{GBerne} = 4\varepsilon_{ij}\left[\left(\frac{\sigma^{0}_{ij}}{r_{ij}-\sigma_{ij}+\sigma^{0}_{ij}}\right)^{12} - \left(\frac{\sigma^{0}_{ij}}{r_{ij}-\sigma_{ij}+\sigma^{0}_{ij}}\right)^{6}\right] \tag{11}$$

where $r_{ij}$ is the distance between the centers of the side chains, $\sigma_{ij}$ is the distance
corresponding to the zero value of $E_{GBerne}$ for an arbitrary orientation of the particles
($\sigma^{0}_{ij}$ is the corresponding distance for the side-to-side approach of the particles),
and $\varepsilon_{ij}$, which depends on the relative orientation of the particles, is the van der
Waals well depth. The dependence of $\varepsilon_{ij}$ and $\sigma_{ij}$ on the orientation of the particles
is given by Eqs. 12–14 and 15, respectively [60].

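$$\varepsilon_{ij} \equiv \varepsilon\left(\omega^{(1)}_{ij}, \omega^{(2)}_{ij}, \omega^{(12)}_{ij}\right) = \varepsilon^{0}_{ij}\,\varepsilon^{(1)}_{ij}\,\varepsilon^{(2)}_{ij} \tag{12}$$

$$\varepsilon^{(1)}_{ij} = \left[1 - \chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(12)2}_{ij}\right]^{-1/2} \tag{13}$$

$$\varepsilon^{(2)}_{ij} = \left[1 - \frac{\chi'^{(1)}_{ij}\omega^{(1)2}_{ij} + \chi'^{(2)}_{ij}\omega^{(2)2}_{ij} - 2\chi'^{(1)}_{ij}\chi'^{(2)}_{ij}\,\omega^{(1)}_{ij}\omega^{(2)}_{ij}\omega^{(12)}_{ij}}{1 - \chi'^{(1)}_{ij}\chi'^{(2)}_{ij}\,\omega^{(12)2}_{ij}}\right]^{2} \tag{14}$$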
 
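$$\sigma_{ij} = \sigma^{0}_{ij}\left[1 - \frac{\chi^{(1)}_{ij}\omega^{(1)2}_{ij} + \chi^{(2)}_{ij}\omega^{(2)2}_{ij} - 2\chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(1)}_{ij}\omega^{(2)}_{ij}\omega^{(12)}_{ij}}{1 - \chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(12)2}_{ij}}\right]^{-1/2} \tag{15}$$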

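with

$$\omega^{(1)}_{ij} = \hat{u}^{(1)}_{ij} \cdot \hat{r}_{ij} = \cos\theta^{(1)}_{ij} \tag{16}$$

$$\omega^{(2)}_{ij} = \hat{u}^{(2)}_{ij} \cdot \hat{r}_{ij} = \cos\theta^{(2)}_{ij} \tag{17}$$

$$\omega^{(12)}_{ij} = \hat{u}^{(1)}_{ij} \cdot \hat{u}^{(2)}_{ij} = \cos\theta^{(1)}_{ij}\cos\theta^{(2)}_{ij} + \sin\theta^{(1)}_{ij}\sin\theta^{(2)}_{ij}\cos\phi_{ij} \tag{18}$$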

where $\hat{u}^{(1)}_{ij}$ and $\hat{u}^{(2)}_{ij}$ are unit vectors along the principal axes of the interacting sites
(identified in this work with the Cα-SC axes), $\hat{r}_{ij}$ is the unit vector pointing from the
center of site i to that of site j, $r_{ij}$ is the distance between the side-chain centers (Figs. 2
and 3), the parameters $\chi^{(1)}_{ij}$ and $\chi^{(2)}_{ij}$ are the anisotropies of the van der Waals distance,
the parameters $\chi'^{(1)}_{ij}$ and $\chi'^{(2)}_{ij}$ are the anisotropies of the van der Waals well depth,
and the parameter $\varepsilon^{0}_{ij}$ is the well depth corresponding to the side-to-side orientation
of the interacting particles.
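
The orientation dependence in Eqs. 11–18 can be sketched in a few lines of Python. This is a minimal illustration, not the UNRES implementation: the function and argument names are hypothetical, the well-depth anisotropies are passed separately as `chi1p` and `chi2p`, and the exponent of -1/2 in the expression for σ follows the standard Gay-Berne form, which is an assumption here.

```python
def dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def gay_berne_energy(r, u1, u2, r_hat, eps0, sigma0, chi1, chi2, chi1p, chi2p):
    """Gay-Berne-type energy of Eq. 11 for two spheroidal sites.

    u1, u2 and r_hat are unit vectors; chi1/chi2 are the distance
    anisotropies and chi1p/chi2p the well-depth anisotropies
    (hypothetical argument names).
    """
    w1 = dot(u1, r_hat)   # Eq. 16
    w2 = dot(u2, r_hat)   # Eq. 17
    w12 = dot(u1, u2)     # Eq. 18

    def aniso(c1, c2):
        # Orientation-dependent fraction shared by Eqs. 14 and 15.
        num = c1 * w1**2 + c2 * w2**2 - 2.0 * c1 * c2 * w1 * w2 * w12
        return num / (1.0 - c1 * c2 * w12**2)

    eps1 = (1.0 - chi1 * chi2 * w12**2) ** -0.5          # Eq. 13
    eps2 = (1.0 - aniso(chi1p, chi2p)) ** 2              # Eq. 14
    eps = eps0 * eps1 * eps2                             # Eq. 12
    sigma = sigma0 * (1.0 - aniso(chi1, chi2)) ** -0.5   # Eq. 15 (assumed exponent)
    s = sigma0 / (r - sigma + sigma0)
    return 4.0 * eps * (s**12 - s**6)                    # Eq. 11
```

With all anisotropies set to zero the function reduces to an ordinary Lennard-Jones potential with well depth `eps0` and zero crossing at `sigma0`, which is a convenient sanity check before fitting.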
The Coulombic term is given by Eq. 19.

$$E_{el} = 332\,\frac{q_i q_j}{r_{ij}} \tag{19}$$

where qi and qj are the charges of sites i and j, respectively, ri j is the distance
between the centers of the charged sites of side chains SCi and SCj (see Fig. 3), and
the coefficient 332 is introduced to express the energies in kcal/mol.
The component $E^{GB}_{pol}$ corresponds to the “bulk dielectric” solvent polarization. For
the part involving a pair of charged particles, the expression from the generalized
Born model was adopted (Eq. 20).
 
$$E^{GB}_{pol} = 332\,q_i q_j\left(\frac{1}{\varepsilon_{in}} - \frac{1}{\varepsilon_{out}}\right)\frac{1}{f_{GB}(r_{ij})} \tag{20}$$

where $\varepsilon_{in}$ is the effective dielectric constant of the “inside” of the interacting particles,
$\varepsilon_{out}$ is the effective dielectric constant of the solvent, and $f_{GB}$ is expressed by Eq. 21.

$$f_{GB}(r_{ij}) = \sqrt{r^{2}_{ij} + a_i a_j \exp\left(-\frac{r^{2}_{ij}}{4 a_i a_j}\right)} \tag{21}$$

where ai and aj are the Born radii of sites i and j, respectively.


Equation 22 was proposed to express the polarization component involving the
interaction between charged and nonpolar particles (see Fig. 3).

$$E_{pol} = \alpha^{pol}_{ij}\left[\frac{1}{f_{GB}(r_{ij})}\right]^{4} + \alpha^{pol}_{ji}\left[\frac{1}{f_{GB}(r_{ji})}\right]^{4} \tag{22}$$

where $\alpha^{pol}_{ij}$ and $\alpha^{pol}_{ji}$ are related to the polarizability of the nonpolar parts of side chain
i and side chain j, respectively, and, in this equation, $r_{ij}$ and $r_{ji}$ denote the distances
between the charged site of one side chain and the nonpolar site of the other (Fig. 3).
At large distances, this contribution to the polarization energy varies as $1/r^{4}$. The
rationale for Eq. 22 is that a nonpolar particle replaces the solvent at distance r,
consequently removing a part of the polarization interaction with the solvent. The
polarization interaction energy is proportional to the square of the electric field; by
Coulomb's law the field varies as $1/r^{2}$, so its square varies as $1/r^{4}$.
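
As a rough numerical check of Eqs. 19, 21 and 22, the sketch below evaluates $f_{GB}$, the Coulombic term and the polarization term. The function names and the sample radii are illustrative assumptions, Eq. 21 is taken in the usual generalized Born form with a square root, and the units quoted for the factor 332 (charges in elementary charges, distances in ångströms) are assumed rather than stated in the text.

```python
import math


def f_gb(r, a_i, a_j):
    """Generalized Born distance function of Eq. 21 (Born radii a_i, a_j)."""
    return math.sqrt(r * r + a_i * a_j * math.exp(-r * r / (4.0 * a_i * a_j)))


def e_el(q_i, q_j, r):
    """Coulombic term of Eq. 19 in kcal/mol (assumed units: e and angstroms)."""
    return 332.0 * q_i * q_j / r


def e_pol(r_ij, r_ji, alpha_ij, alpha_ji, a_i, a_j):
    """Polarization term of Eq. 22 for the two charged-site/nonpolar-site distances."""
    return alpha_ij / f_gb(r_ij, a_i, a_j) ** 4 + alpha_ji / f_gb(r_ji, a_i, a_j) ** 4


# Illustrative values only (not fitted parameters).
# Doubling a large distance should reduce E_pol by a factor close to 2**4 = 16.
ratio = e_pol(40.0, 40.0, 1.0, 1.0, 2.0, 3.0) / e_pol(80.0, 80.0, 1.0, 1.0, 2.0, 3.0)
```

Note the two limits of `f_gb`: it tends to the geometric mean of the Born radii as the distance goes to zero and to the distance itself when the sites are far apart, which is what produces the $1/r^{4}$ decay of the polarization term.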
The expression for the isotropic cavity-creation (or solvent-restructuring) term
$\Delta F^{iso}_{cav}$ was derived previously [56] based on the Gaussian overlap model and is
expressed by Eq. 23. This term makes it possible to differentiate between the
head-to-head and tail-to-tail orientations of side chains in our analytical PMF curves.
In the original formulation of the generalized Born model, this term is proportional
to the molecular surface area, hence the complete name Generalized Born Surface
Area (GBSA); the surface-area term is, however, more difficult to compute and less
numerically stable than Eq. 23.
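$$\Delta F^{iso}_{cav} = \frac{\alpha^{iso(1)}_{ij}\left[x^{1/2} + \alpha^{iso(2)}_{ij}\,x - \alpha^{iso(3)}_{ij}\right]}{1 + \alpha^{iso(4)}_{ij}\,x^{12}} \tag{23}$$

with

$$x = \frac{r_{ij}}{\sqrt{\left(\sigma^{iso}_{i}\right)^{2} + \left(\sigma^{iso}_{j}\right)^{2}}} \tag{24}$$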

where $r_{ij}$ is the distance between the two charged parts (see Fig. 3) of particles i and j,
and $\sigma^{iso}_{i}$ and $\sigma^{iso}_{j}$ can be identified with the minimum distance between the center of
charge of particle i or j, respectively. The parameters $\alpha^{iso(1)}_{ij}$, $\alpha^{iso(2)}_{ij}$, $\alpha^{iso(3)}_{ij}$,
$\alpha^{iso(4)}_{ij}$, $\sigma^{iso}_{i}$, and $\sigma^{iso}_{j}$ are determined by least-squares fitting [60] of the analytical
expression for the free energy of two side chains interacting in water (Eq. 10) to the
potentials of mean force determined from MD simulations.
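
A minimal sketch of Eqs. 23 and 24 follows. The function name, the argument names `a1` to `a4` (standing for the four $\alpha^{iso}$ parameters) and the numbers used in the check are assumptions made for illustration; they are not fitted values.

```python
import math


def delta_f_cav_iso(r, sigma_i, sigma_j, a1, a2, a3, a4):
    """Isotropic cavity term of Eq. 23 with the reduced distance of Eq. 24."""
    x = r / math.sqrt(sigma_i**2 + sigma_j**2)                      # Eq. 24
    return a1 * (math.sqrt(x) + a2 * x - a3) / (1.0 + a4 * x**12)   # Eq. 23


# Illustrative parameters: at x = 1 the value is a1 * (1 + a2 - a3) / (1 + a4).
value_at_contact = delta_f_cav_iso(5.0, 3.0, 4.0, 2.0, 0.5, 0.25, 3.0)
value_far_away = delta_f_cav_iso(500.0, 3.0, 4.0, 2.0, 0.5, 0.25, 3.0)
```

The $x^{12}$ term in the denominator switches the contribution off quickly beyond contact, so the term vanishes at large separations.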
The expression for $\Delta F_{cav}$ of spheroidal particles was derived in Ref. [56], which
concerns the side chain–side chain interaction potential in the UNRES force field,
and is given by Eq. 25. This term accounts for the free-energy contribution due to
the restructuring of water molecules around a hydrophobic dimer.
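$$\Delta F_{cav} = \frac{\alpha^{(1)}_{ij}\left[(x \cdot \lambda)^{1/2} + \alpha^{(2)}_{ij}\,x \cdot \lambda - \alpha^{(3)}_{ij}\right]}{1 + \alpha^{(4)}_{ij}\,(x \cdot \lambda)^{12}} \tag{25}$$

with

$$x = \frac{r_{ij}}{\sqrt{\sigma^{2}_{i} + \sigma^{2}_{j}}} \tag{26}$$

$$\lambda = \left[1 - \frac{\chi^{(1)}_{ij}\omega^{(1)2}_{ij} + \chi^{(2)}_{ij}\omega^{(2)2}_{ij} - 2\chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(1)}_{ij}\omega^{(2)}_{ij}\omega^{(12)}_{ij}}{1 - \chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(12)2}_{ij}}\right]^{1/2} \tag{27}$$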

where the symbols $\omega^{(1)}_{ij}$, $\omega^{(2)}_{ij}$, and $\omega^{(12)}_{ij}$ are defined by Eqs. 16–18, respectively, $r_{ij}$
is the distance between the centers of the particles, $\chi^{(1)}_{ij}$ and $\chi^{(2)}_{ij}$ are the anisotropies
pertaining to $\Delta F_{cav}$, and $\sigma_{i}$ and $\sigma_{j}$ can be identified with the minimum distance
between the center of particle i or j, respectively. The parameters $\alpha^{(1)}_{ij}$, $\alpha^{(2)}_{ij}$, $\alpha^{(3)}_{ij}$,
and $\alpha^{(4)}_{ij}$, $\sigma_{i}$ and $\sigma_{j}$, and the anisotropies are determined by least-squares fitting of
the analytical expression for the free energy of two side chains interacting in water
(Eq. 10) to the potentials of mean force determined from MD simulations.
The isotropic Lennard-Jones potential ($E_{LJ}$) describing the van der Waals
interactions between two charged headgroups is expressed by Eq. 28:

$$E_{LJ} = 4\,\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] \tag{28}$$

where $r_{ij}$ is the distance between the centers of the charged headgroups, $\sigma_{ij}$ is the
distance corresponding to the zero value of $E_{LJ}$, and $\varepsilon_{ij}$ is the van der Waals well
depth.
The results of fitting of the analytical function to the PMF curves depending on
the distance and orientations of the propionate anion—propionate anion pair (which
models the interactions between two ionized aspartic-acid side chains) are shown in
Fig. 4.
It follows from Fig. 4 that the proposed analytical expression reproduces the PMF
curves very well. An analysis of these plots demonstrates that the energy expressions
with the fitted parameters are physically meaningful for like-charged models. In
particular, all energy components tend to zero at large distances [60].

Fig. 4 PMF curves for the propionate–propionate pair (a model of the Asp–Asp side-chain pair)
determined by the weighted histogram analysis method from molecular dynamics calculations in
water. The curves are coloured according to side-chain orientation and the colour codes are the
same as in [60]. Thinner lines correspond to PMF curves from MD simulations, whereas thicker
lines correspond to the fitted analytical approximation of the PMF, which is the sum of the
Gay-Berne potential (Eq. 2 in [60]), which describes van der Waals interactions between side chains;
the electrostatic (Eq. 10 in [60]) and generalized Born (Eq. 11 in [60]) potentials, which describe
electrostatic interactions between the charged parts of the side chains; the polarization energy term
(Eq. 13 in [60]), which expresses interactions between charged and nonpolar parts of side chains;
the isotropic cavity potential of the charged parts (Eq. 14 in [60]); the cavity potential of the
hydrophobic parts (Eq. 16 in [60]); and the isotropic Lennard-Jones potential (Eq. 19 in [60]).
This is also Fig. 5a in Ref. [60]

2.4 Oppositely Charged Side Chains

For pairs of oppositely charged side chains [61], the model had to be extended to
introduce multiple locations of a charged head relative to the nonpolar center. The
rationale is that assuming a fixed distance of the center of the charged site from the
side-chain center does not reproduce the shape and structure of the contact
(salt-bridge) minima in the PMF surface corresponding to head-to-head orientations
of the side chains; consequently, it was assumed that the charged site can exist in two
states, which differ from each other in the distance of the charged center from the
side-chain center. The introduction of the two-state model also makes it possible to
differentiate the head-to-head and side-to-side orientations from each other. The
two-state model of two interacting side chains is shown in Fig. 5.
The respective analytical expression for the potential is given by Eq. 29, in which
the energy is a sum of the Gay-Berne potential, which describes van der Waals
interactions between the nonpolar parts ($E_{GBerne}$); the cavity-creation term for the
nonpolar parts of the side chains ($\Delta F_{cav}$); and the free energy obtained by summing,
over the states of the head groups, the contributions from the Coulombic
interactions ($E_{q}$), the interactions of the head-group quadrupoles ($E_{qd}$), the
solute-polarization energy ($E_{pol}$), the solvent-polarization energy expressed by the
generalized Born model ($E^{GB}_{pol}$), the free energy of cavity creation due to the head
groups ($\Delta F^{iso}_{cav}$), and the van der Waals interaction energy between the head groups,
represented by the Lennard-Jones potential ($E_{LJ}$). For pairs including arginine, the
spread of the charge distribution must also be included by introducing a term
corresponding to averaged quadrupole-quadrupole interactions and, for all pairs, an
explicit Lennard-Jones term must be included between the charged centers.

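$$W_{cc} = E_{GBerne} + \Delta F_{cav} - RT\ln\left\{\sum_{i=1}^{N} w^{(i)}\exp\left[-\frac{E^{(i)}_{q} + E^{(i)}_{qd} + E^{(i)}_{LJ} + E^{GB(i)}_{pol} + E^{(i)}_{pol} + \Delta F^{iso(i)}_{cav}}{RT}\right]\right\} + RT\ln\sum_{i=1}^{N} w^{(i)} \tag{29}$$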

As was mentioned earlier, for pairs of oppositely-charged side chains [61], the
model had to be extended to introduce multiple locations of a charged head relative
to the nonpolar center. It was assumed that every charged side-chain head group can
exist in two states. These states differ in the distance of the charged center from the
center of the side chain. This two-state model of charged side chains is shown in
Fig. 5.

Fig. 5 Illustration of the new model for the interactions of charged and polar side chains. A side
chain of this type is assumed to consist of a charged (shaded) and a nonpolar (ellipsoidal) part. The
geometric centers of side chains i and j are denoted as $SC_i$ and $SC_j$, respectively, and are
represented by small black circles located between the centers of the charged and nonpolar sites.
The charged site of each side chain can exist in two possible states; hence two shaded spheres are
shown for each charged site. The spheres corresponding to the alternative positions of the charged
sites (farther away from the centers of side chains i and j, respectively) are bordered by dashed
lines and are transparent to indicate that each of them corresponds to the alternative state of a
single site and does not represent an additional site. The vector $\hat{u}^{(1)}_{ij}$ is the unit vector of the long
axis of side chain i, $\hat{u}^{(2)}_{ij}$ is the unit vector of the long axis of side chain j, $\hat{r}_{ij}$ is the unit vector
pointing from the geometric center of the nonpolar site of side chain i to that of side chain j, and
$r_{ij}$ is the distance between these two centers. The figure also marks the distance between the
charged/polar centers of the head groups of side chains i and j, and the distances between the
charged center of particle i and the center of particle j and between the charged center of particle j
and the center of particle i (for clarity, only the distances that involve the polar/charged center in
the first possible state are shown). $d^{(1,1)}_i$, $d^{(1,2)}_i$ and $d^{(1,1)}_j$, $d^{(1,2)}_j$ are the distances from the
geometric center of side chain i and j, respectively, to the center of charge of head group i and j,
respectively, and $d^{(2)}_i$ and $d^{(2)}_j$ are the distances from the geometric center of side chain i and j,
respectively, to the nonpolar center of particles i and j, respectively. This is also Fig. 1 in Ref. [61]

In Eq. 29, the superscript (i) is the index of the microstate, $w^{(i)}$ is the
weight of this microstate (also treated as an adjustable parameter in fitting Eq. 29 to
the PMF), N is the number of microstates (N = 4), R is the universal gas constant,
and T is the absolute temperature (T = 298 K). Each of the microstates corresponds to
a different combination of distances between the centers of the charged headgroups
and the side-chain centers (see Fig. 5). Two possible states were assumed for the
center of the charged headgroup of each side chain of a pair (in one state the
headgroup is closer to, and in the other farther from, the side-chain center), which
gives a total of four microstates.
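
The state-averaged part of Eq. 29 is a weighted Boltzmann average and can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, `energies` stands for the per-microstate sums of head-group terms, energies are taken to be in kcal/mol, and the shift by the lowest energy is a standard numerical-stability device that is not mentioned in the text.

```python
import math

R_KCAL = 1.987204e-3  # universal gas constant in kcal/(mol K)


def multistate_free_energy(energies, weights, temperature=298.0):
    """-RT ln[sum_i w_i exp(-E_i/RT)] + RT ln[sum_i w_i], as in Eq. 29."""
    rt = R_KCAL * temperature
    e_min = min(energies)  # shift so that the largest exponent is zero
    z = sum(w * math.exp(-(e - e_min) / rt) for e, w in zip(energies, weights))
    return e_min - rt * math.log(z) + rt * math.log(sum(weights))


# Four microstates with made-up energies (kcal/mol) and equal weights.
example = multistate_free_energy([-5.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
```

The second logarithm normalizes the weights, so that when all microstates have the same energy the function returns that energy regardless of the weights; otherwise the result lies between the lowest microstate energy and the weighted mean.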
Except for $E_{qd}$, the terms in Eq. 29 have been defined in the previous subsection.
The average energy of interaction of two point quadrupoles i and j ($E_{qd}$) is expressed
by Eq. 30 (see the Appendix of Ref. [61], Eqs. A1–A9, for more details and the
derivation):
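$$E_{qd} = \frac{A}{\left[f_{GB}(r_{ij})\right]^{5}}\left[5 + 3\left(\cos^{2}\alpha_{ij} - 1\right) - \frac{75}{2}\left(\cos^{2}\theta'^{(1)}_{ij} + \cos^{2}\theta'^{(2)}_{ij}\right) + \frac{315}{2}\cos^{2}\theta'^{(1)}_{ij}\cos^{2}\theta'^{(2)}_{ij} - 45\cos\alpha_{ij}\cos\theta'^{(1)}_{ij}\cos\theta'^{(2)}_{ij}\right] \tag{30}$$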

where A is a parameter to be determined by least-squares fitting, $r_{ij}$ is the distance
between the centers of the charged groups, $\theta'^{(1)}_{ij}$ and $\theta'^{(2)}_{ij}$ are the angles
between the vector linking the centers of the charged headgroups and the
Cα…SC axes of side chains i and j, respectively, $\alpha_{ij}$ is the angle between the
Cα…SC axes of the side chains (which are identified with the axes about which the
quadrupoles rotate), and $f_{GB}$ is expressed by Eq. 21. Introduction of the quadrupole
term makes sense only for systems with a nonspherical charge distribution, such as
the guanidine and carboxyl groups. Sample PMF plots for the propionate
anion–n-pentylammonium cation pair, together with the analytical fits, are shown in Fig. 6.

Fig. 6 PMF curves for the propionate anion–n-pentylammonium cation pair. Black, red, blue
and green dashed lines refer to the PMF curves determined by molecular dynamics calculations
for different orientations. Solid lines of the same colour refer to the analytical approximation of
the PMF (Eq. 8) with coefficients determined by means of the least-squares method (Eq. 21 in the
paper [60]). This is also Fig. 2a in Ref. [61]

The analytical approximations to the PMF curves are reasonable (Fig. 6). The
ability of the analytical function to reproduce salt bridges (red curves in Fig. 6, for
the charged headgroup–charged headgroup orientation) is a very important
feature of the new energy function [61]. Even though salt bridges do not occur very
often in proteins, they can be a factor that stabilizes protein structure at early folding
stages [63]. Although the new potential will certainly result in a somewhat longer
calculation time for oppositely charged side-chain pairs (due to the multiple charge
states), this increase will not be substantial because, among the twenty amino acids
of biological importance, only four possess charged side chains.

2.5 Charged—Hydrophobic/Polar
and Polar—Hydrophobic/Polar Side Chains

The parametrization approach described in Sects. 2.1–2.4 was also used for the
remaining pairs composed of charged and polar, polar and polar, charged and
hydrophobic, and polar and hydrophobic side chains [63]. Their general analytical
expressions were given by Eqs. 4–7.
The polarization component of the interactions between charged or polar and
nonpolar particles in Eqs. 4 and 5, respectively, is expressed by Eq. 31.
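$$E_{pol} = \alpha^{pol}_{ij}\left[\frac{1}{f_{GB}(r_{ij})}\right]^{4} + \alpha^{pol}_{ji}\left[\frac{1}{f_{GB}(r_{ji})}\right]^{4} \tag{31}$$

where $\alpha^{pol}_{ij}$ and $\alpha^{pol}_{ji}$ are related to the polarizability of the nonpolar parts of side
chain i and side chain j, respectively.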
The $E_{cp}$ interaction potential between charged and polar sites of Eq. 5 is given by
Eq. 32:

$$E_{cp} = w^{(1)}_{dip}\,\frac{q\cos\theta_{1}}{R^{2}_{ij}} - w^{(2)}_{dip}\,\frac{q^{2}\sin^{2}\theta_{1}}{R^{4}_{ij}} \tag{32}$$

where $w^{(1)}_{dip}$ and $w^{(2)}_{dip}$ are the parameters determined by least-squares fitting of the
analytical expressions to the potentials of mean force, q is the net charge of the
charged headgroup, and $R_{ij}$ is the distance between the centers of the amphiphilic
headgroups.
The average energy of the interaction between two polar-group dipoles ($E_{pp}$) of
Eq. 7 is expressed by Eq. 33:

$$E_{pp} = \frac{w_{p1}}{R^{3}_{ij}}\left[\cos\omega^{(12)}_{ij} - 3\cos\theta^{(1)}_{ij}\cos\theta^{(2)}_{ij}\right] - \frac{w_{p2}}{R^{6}_{ij}}\left[4 + \left(\cos\omega^{(12)}_{ij} - 3\cos\theta^{(1)}_{ij}\cos\theta^{(2)}_{ij}\right)^{2} - 3\left(\cos^{2}\theta^{(1)}_{ij} + \cos^{2}\theta^{(2)}_{ij}\right)\right] \tag{33}$$

where $w_{p1}$ and $w_{p2}$ are the parameters determined by least-squares fitting, and $R_{ij}$ is
the distance between the centers of the polar headgroups.
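
The dipole-dipole expression of Eq. 33 can be written as a short function. This sketch assumes that the three angular factors are supplied directly as cosines (the arguments `cos_omega12`, `cos_t1` and `cos_t2` are hypothetical names for the cosines appearing in Eq. 33), and the weights used in the example are made up.

```python
def e_pp(r, cos_omega12, cos_t1, cos_t2, w_p1, w_p2):
    """Average dipole-dipole energy of Eq. 33."""
    # Angular factor common to both terms.
    g = cos_omega12 - 3.0 * cos_t1 * cos_t2
    first = w_p1 / r**3 * g
    second = w_p2 / r**6 * (4.0 + g * g - 3.0 * (cos_t1**2 + cos_t2**2))
    return first - second


# Head-to-tail arrangement: both dipoles lie along the line joining the sites.
aligned = e_pp(2.0, 1.0, 1.0, 1.0, 8.0, 64.0)
```

For the head-to-tail arrangement the common angular factor equals -2, so with the illustrative weights above both terms contribute -2 and the function returns -4.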
A detailed discussion of the results is given in Ref. [63]. It was also
observed that the model reproduces all features of the interacting pairs well.
Based on these results [63], the UNRES force field with the new side chain–side
chain interaction potentials and model was tested preliminarily with two
small α-helical proteins: the N-terminal part of the B-domain of staphylococcal
protein A (PDB: 1BDD; a three-α-helix bundle) and the UPF0291 protein YnzC from
Bacillus subtilis (PDB: 2HEP; an α-helical hairpin). The results of these tests were
satisfactory [63]. However, it was observed that achieving better resolution
requires recalibrating the force field with a larger number of training proteins.

3 Influence of Particle Size on the Free-Energy Profile of Hydrophobic Interactions

Hydrophobic interactions play a very important role in chemical and biological
structures. They are often responsible for the formation and stabilization of systems
in aqueous environments, such as proteins, biological membranes, and
macromolecular complexes. Entropy is the driving force of hydrophobic interactions,
which are mediated by the solvent. Hydrophobic particles avoid water molecules,
which leads to indirect interactions between them. This avoidance causes specific
packing of water molecules in the vicinity of nonpolar particles: the water molecules
that are in close contact with the hydrophobic surface are ordered, and this ordering
diminishes the entropy of the system. The entropy loss is greater when the
hydrophobic surface is larger, and the tendency to reduce it leads to hydrophobic
association. The result of hydrophobic association is a solvent-accessible surface
area smaller than the total area of the separated hydrophobic particles. Because of
the specific character of hydrophobic interactions and our inability to determine
their structural details experimentally, mainly theoretical methods are used to study
this phenomenon. The potential of mean force expressed, e.g., as a function of the
distance between the centres of hydrophobic particles is a good quantitative measure
of the dependence of hydrophobic interactions on the geometry of the system.
The results of research concerning the influence of the size of hydrophobic
particles on the shape of the PMF curves were presented in [60]. The authors
performed their calculations for five pairs of hydrophobic particles: methane, ethane,
propane, isobutane, and neopentane. For each of the studied systems, PMF curves
were determined both in water and in vacuo.
The solvent contributions to the potentials of mean force were calculated as the
difference between the PMFs determined in water and the respective PMFs
determined in vacuo [64]. Based on the analysis of the results [64], it can be
concluded that the depth of the contact minimum increases with increasing size of
the interacting nonpolar particles, both in water and in the gas phase. Additionally,
the changes in the height and location of the desolvation maximum (which comes
from the solvent contribution to the PMF) can be well described by the molecular
surface area [64].
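
The bookkeeping described above is simple to express in code. The sketch below assumes that both PMFs have been sampled on the same distance grid; the function names are hypothetical and the numbers are invented solely to exercise the functions, not taken from Ref. [64].

```python
def solvent_contribution(pmf_water, pmf_vacuum):
    """Pointwise difference between the PMF in water and the PMF in vacuo."""
    if len(pmf_water) != len(pmf_vacuum):
        raise ValueError("both PMFs must be sampled on the same distance grid")
    return [w - v for w, v in zip(pmf_water, pmf_vacuum)]


def local_extrema(values):
    """Return (minima, maxima) as lists of indices of interior local extrema."""
    minima, maxima = [], []
    for k in range(1, len(values) - 1):
        if values[k] < values[k - 1] and values[k] < values[k + 1]:
            minima.append(k)
        elif values[k] > values[k - 1] and values[k] > values[k + 1]:
            maxima.append(k)
    return minima, maxima


# Made-up PMF values (kcal/mol) on a coarse grid, for illustration only.
toy_pmf = [2.0, -0.8, 0.3, -0.2, 0.0]
toy_minima, toy_maxima = local_extrema(toy_pmf)
```

On a curve of the typical shape, the first minimum found is the contact minimum, the maximum that follows is the desolvation maximum, and the next minimum is the solvent-separated one.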
An analysis of the density distribution of water [64] shows that the density
increases in the first and second hydration spheres. The highest density of water is
observed in the contact region of the hydrophobic particles. However, the ordering of
water molecules in the first hydration sphere is weak. The average number of
hydrogen bonds is smaller in the first solvation sphere than in bulk water [64], and
the average number observed close to the interacting hydrophobic particles is smaller
for neopentane than for methane. However, the hydrogen bonds that do form are
stronger than those in bulk water. This observation is in qualitative agreement with
the results of previous research [64]. The smaller number of hydrogen bonds in the
first hydration sphere can be explained by the fact that water molecules in contact
with nonpolar particles have a smaller chance of forming hydrogen bonds with each
other than those that are farther from the nonpolar molecules. The traditional
explanation of the hydrophobic effect, which emphasizes the ordering of water
molecules in the first hydration sphere of interacting nonpolar molecules, leading to
low entropy, is therefore insufficient. Cavity formation and the small size of water
molecules are also important for explaining this phenomenon.
Later, the results of the research on hydrophobic association of the larger non-
polar molecules [65]: bicyclooctane, adamantane, and fullerene (C60) were com-
pared with those obtained for neopentane [64]. For the purpose of data analysis,
it was assumed that the average shape of the molecules is spherical. Additionally,
calculations for the sphere with van der Waals radius equal to the mean radius of
adamantane were also carried out. The shape of the determined PMF curves in water
[65] is characteristic of hydrophobic interactions. Each of these curves possesses a
contact minimum, a desolvation maximum, and a solvent-separated minimum [65].
Based on the results of simulations, it can be concluded that the relative contribu-
tion from nonbonded interactions between the hydrophobic molecules to the PMF
increases with the increase of molecular size. For smaller molecules [64], the minima
of the PMF curves determined in water had more negative free energies compared to
those determined in vacuo. Conversely, the minima of the PMF curves determined
in vacuo are deeper than those of the PMF determined in water for bicyclooctane,
adamantane, and fullerene [65]. Therefore, the solvent contribution to the PMF is
positive for the three large hydrophobic molecules [65]. The solvent contribution to
the PMF curves for small hydrophobic particles is negative, or tends to zero for
isobutane and neopentane [64]. To explain this difference, the density of water
molecules around neopentane (a smaller molecule) and adamantane (a large
molecule) was analyzed (Fig. 7).

Fig. 7 Normalized distribution functions of the water-molecule density in the vicinity of the
adamantane dimer (panels a–c) and neopentane (panels d–f) at monomer-separation distances
h of (a) 6.8 Å, (b) 8.8 Å, (c) 10.2 Å, (d) 5.8 Å, (e) 7.85 Å and (f) 9.2 Å, which correspond to the
contact minima (a and d), the desolvation barriers (b and e), and the solvent-separated minima
(c and f), respectively. The color scale is shown above the panels, and the bulk water density is
displayed in white. The solute is in grey, the space between the solute and the first hydration layer
is in violet, and the first hydration layer is in blue plus light blue (a–c) and green, red and yellow
(d–f). This is also Fig. 7 in Ref. [66]

It was found that the positive solvent contribution to the PMF of hydrophobic
molecules larger than neopentane can be caused by the larger space between the two
interacting hydrophobic molecules, in which water molecules become trapped.
Water molecules trapped in the direct neighborhood of two hydrophobic particles
have limited mobility (their entropy decreases) and a limited ability to form
hydrogen bonds with neighboring water molecules. Therefore, the favorable effect of
decreasing the molecular surface upon formation of a contact between nonpolar
molecules is outweighed by the unfavorable ordering of water molecules in the
region of space where the solvation spheres of the hydrophobic particles overlap.
Based on the results of this research [65], it can also be concluded that large
hydrophobic molecules such as bicyclooctane, adamantane, and fullerene cannot be
regarded as classical, small hydrophobic particles. On the other hand, these
molecules are not large enough to be treated as macroscopic hydrophobic surfaces.

4 Temperature Dependence of the Potential of Mean Force

This part of the chapter concerns research on the temperature dependence of the
effective potential for the interaction of amino-acid side chains in water [66, 67].
Most contemporary coarse-grained force fields used for protein structure prediction
do not depend on temperature; this is inconsistent with the physical sense of most
coarse-grained force fields, which stem from potentials of mean force. Temperature
dependence was introduced for the first time in the UNRES force field in 2007
[68], via multibody-interaction terms. However, the $U_{SC_iSC_j}$ potential of the UNRES
force field remains temperature-independent. The main components of hydrophobic
interactions are the free energy of cavity creation and the change in entropy due to
the reorganisation of water molecules; these free energies depend on temperature.
Consequently, the $U_{SC_iSC_j}$ potentials should depend on temperature.
In paper [66], the first attempt at the derivation of temperature-dependent U SCi SC j
potentials was presented. Only a pair of interacting methane molecules at different
temperatures at constant volume or constant pressure in water was considered. The
PMFs and dimensionless PMFs (W/RT , where W is the potential of mean force, R
is the universal gas constant, and T is the absolute temperature) at different sim-
ulation temperatures and constant volume or pressure were plotted and analyzed
against the methane-methane distance. An important finding from this study is that
the dimensionless PMFs at constant volume nearly overlap. Therefore, to obtain
temperature-dependent potentials for the interactions of nonpolar side chains, one
has to multiply the dimensionless potential by absolute temperature. The results
of research [66] are in accordance with literature data [69]. Moreover, it was also
observed that the depths of the contact minimum (the first minimum on the plot, at
the shortest distance) in the PMF plots depend on the temperature very strongly and
increase with the temperature [66]. This means that the entropy of association is posi-
tive because S  −∂ F/∂ T (where: S denotes entropy, F denotes the free energy, and
T denotes the absolute temperature). The dimensionless PMFs [66] weakly depend
on the temperature in the contact minimum, which means that the energy of associa-
tion is small. A very strong dependence of the desolvation maximum (first maximum
at the shortest distance on the PMF curve) on the temperature is observed in the PMF
plots determined at constant pressure [66]. The height of the desolvation maximum
decreases with increasing temperature in both cases [66].
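The scaling rule and the entropy argument above can be illustrated numerically. The following is only a minimal sketch, not the procedure of Ref. [66]: it assumes that the dimensionless PMF $W/RT$ is exactly independent of temperature, and it uses a made-up 12-6 curve (`reference_pmf`) in place of a simulated methane-methane PMF. All function names and parameter values are hypothetical.

```python
import numpy as np

R = 1.987204e-3  # universal gas constant in kcal/(mol K)

def reference_pmf(r, depth=0.8, r_min=3.9):
    """Hypothetical reference PMF (kcal/mol) with a contact minimum at r_min (angstrom).
    The 12-6 form is for illustration only; it is not a fitted methane PMF."""
    x = r_min / r
    return depth * (x**12 - 2.0 * x**6)

def scale_pmf(w_ref, t_ref, t):
    """Rescale a PMF from t_ref to t, assuming W/(RT) does not depend on temperature."""
    w_dimensionless = w_ref / (R * t_ref)
    return w_dimensionless * R * t

def association_entropy(w_ref, t_ref, t, dt=0.5):
    """Entropy S = -dW/dT from a central finite difference, in kcal/(mol K)."""
    w_plus = scale_pmf(w_ref, t_ref, t + dt)
    w_minus = scale_pmf(w_ref, t_ref, t - dt)
    return -(w_plus - w_minus) / (2.0 * dt)

r = np.linspace(3.4, 8.0, 200)            # solute-solute distances (angstrom)
w_298 = reference_pmf(r)                  # assumed reference PMF at 298 K
w_348 = scale_pmf(w_298, 298.0, 348.0)    # temperature-scaled PMF at 348 K

i_min = np.argmin(w_298)                  # grid index of the contact minimum
s_min = association_entropy(w_298, 298.0, 298.0)[i_min]
e_min = w_298[i_min] + 298.0 * s_min      # energy part, E = W + T*S

print(f"contact-minimum depth at 298 K: {-w_298[i_min]:.3f} kcal/mol")
print(f"contact-minimum depth at 348 K: {-w_348[i_min]:.3f} kcal/mol")
print(f"entropy at the contact minimum: {s_min:.3e} kcal/(mol K)")
print(f"energy at the contact minimum:  {e_min:.3e} kcal/mol")
```

Under this idealized assumption the contact minimum deepens in proportion to $T$, the entropy of association is positive, and the energy of association vanishes, which mirrors the qualitative statements made above for the constant-volume PMFs.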
One important conclusion from the work discussed above [65] is that the assump-
tion, that the depth of the minimum of the side chain—side chain interaction compo-
nent of coarse-grained potentials is independent of temperature, neglects its actual
increase with increasing temperature. The neglect of this increase is equivalent to
assuming that the strength of the simulated interactions between nonpolar side chains
decreases with temperature at contact distance, which does not agree with the experimental
finding that increasing temperature leads to hydrophobic association, or early
protein-folding stages. Consequently, the absence of temperature dependence of the
potentials of side-chain interactions can introduce artifacts into the results of simu-
lations and lead to false conclusions.
From the above considerations it follows that any force field that uses an implicit
water model, i.e., one in which the part representing solvent-mediated interactions has
the sense of a potential of mean force, should incorporate the temperature dependence
of the solvent-dependent part to reproduce the thermodynamics of protein folding correctly. In
most coarse-grained force fields, including the UNRES force field, the solvent-related
part is included in the effective potentials for side chain—side chain interactions.
The results of the study reported in Ref. [67], as well as those of previous work
[66], strongly suggest that the dimensionless potentials of mean force of pairs of
nonpolar solutes are almost independent of temperature in the region of the contact
minimum, which is consistent with hydrophobic association being an entropy-driven
phenomenon. The PMF does not vary appreciably with temperature in the region of
the desolvation maximum and of the solvent-separated minimum. Additionally, it
was found that the model of the potentially salt-bridge-forming Lys-Asp or Lys-Glu
pair follows the same temperature-dependence pattern [67]. At the other extreme are
the models of the Arg-Asp or Arg-Glu pairs and of the pairs composed of a hydrophobic
particle and Asp or Glu, for which the PMF is virtually independent of temperature and
the depth of the contact minimum of the dimensionless PMF decreases with temperature,
which means that association is driven by energy [67]. The reason for this is probably
the branched head groups of both solute molecules, one of which is a donor and the
other is an acceptor in hydrogen bonding; consequently, the energy of hydrogen-bond
formation overcomes that of solvation of isolated solute molecules.
Based on the results of these studies [67], it was concluded that, as far as
coarse-grained force fields are concerned, there is a need to introduce temperature
dependence of the potentials of interaction of (a) pairs of nonpolar side chains, (b) Lys-Asp
and Lys-Glu pairs, (c) pairs of positively-charged side chains, and (d) pairs com-
posed of negatively-charged and nonpolar side chains. As proposed previously [66],
the temperature-dependent potentials of categories (a) and (b) could be derived by
scaling the reference potentials by temperature or, more rigorously, by combining
the observations that the depth of the minimum of the dimensionless PMF dependent
on distance and orientation, $\tilde{W}(r, \theta^{(1)}, \theta^{(2)}, \phi; T)$, and the height of the desolvation
maximum of the PMF, $W(r, \theta^{(1)}, \theta^{(2)}, \phi; T)$, are independent of temperature. Then
PMF fitting could be carried out under the conditions given by Eqs. 34 and 35.

$$\frac{\partial \tilde{W}(r, \theta^{(1)}, \theta^{(2)}, \phi; T)}{\partial T} \equiv 0 \quad \text{at the contact minimum} \tag{34}$$

$$\frac{\partial W(r, \theta^{(1)}, \theta^{(2)}, \phi; T)}{\partial T} \equiv 0 \quad \text{at the desolvation maximum} \tag{35}$$
The potentials of categories (c) and (d) require more consideration because the
dependence on temperature is coupled with the dependence on orientation [67].
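For categories (a) and (b), the consequences of the two fitting conditions can be written out explicitly. The short derivation below is a sketch that assumes the dimensionless PMF is defined as $\tilde{W} = W/RT$, as earlier in this section; the symbol $\Omega$ is a shorthand introduced here for the geometric variables and does not appear in the original equations.

```latex
\begin{align*}
W(\Omega; T) &= RT\,\tilde{W}(\Omega; T), \qquad
\Omega \equiv \bigl(r, \theta^{(1)}, \theta^{(2)}, \phi\bigr), \\
\frac{\partial W}{\partial T} &= R\,\tilde{W} + RT\,\frac{\partial \tilde{W}}{\partial T}. \\
\intertext{At the contact minimum, Eq. 34 removes the second term, so that}
S_{\min} &= -\left.\frac{\partial W}{\partial T}\right|_{\min} = -R\,\tilde{W}_{\min} > 0
\quad (\tilde{W}_{\min} < 0), \qquad
E_{\min} = W_{\min} + T S_{\min} = 0. \\
\intertext{At the desolvation maximum, Eq. 35 sets the left-hand side to zero, so that}
\left.\frac{\partial \tilde{W}}{\partial T}\right|_{\max} &= -\frac{\tilde{W}_{\max}}{T}.
\end{align*}
```

In this idealized limit, association at the contact minimum is purely entropic, while the dimensionless height of the desolvation maximum must decrease as $1/T$. This suggests why simple scaling of the whole reference potential by temperature is only the less rigorous of the two options mentioned above.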

5 Interaction Between O-Phosphorylated and Standard Amino-Acid Side-Chain Models

The last part of this chapter concerns preliminary research on the interaction between
O-phosphorylated and standard amino-acid side chains in water [70]. Phosphorylation
of hydroxylated amino-acid side chains such as serine (Ser), threonine (Thr),
and tyrosine (Tyr) by protein kinases can activate numerous enzymes and plays a
very important role in several cellular processes. It is known that more than one
third of proteins in eukaryotes are subject to phosphorylation. It can regulate, for
example, metabolic pathways, gene translation and transcription, membrane transport,
hormonal response, and many other processes. It should be noted that
how phosphorylation alters the structure and function of proteins is still not
well understood, because of the specific physicochemical properties of the phosphorylated
group, i.e., the −2 charge at physiological pH, which could perturb the local electrostatic
potential in proteins [70].
Similarly to the work described in the chapter “Protein Structure Prediction
Using Coarse-Grained Models”, the PMFs dependent on distance and orientation for
the interactions of pairs of phosphorylated and natural amino-acid side-chain
models in water were calculated with MD simulations and then discussed.
The positions and depths of the contact minima and the positions and heights of the
desolvation maxima, including their mutual orientation, depend on the character of
the interacting pairs [70]. The same effect was observed in our previous work [56–62,
65, 66]. The coarse-grained model of the interactions of the
O-phosphorylated amino-acid side chains with natural amino-acid side chains and
the respective potentials are now being introduced into the UNRES force
field, which will make it possible to simulate proteins containing O-phosphorylated
amino-acid residues and to test the predictive power of the model [70].

6 Summary

The results of the research on the development of the new side chain–side chain
interaction potentials to be used in the coarse-grained physics-based UNRES force
field for protein simulations strongly suggest that further work is needed.
The parameters of these potentials were determined by fitting analytical functions
to the PMF curves obtained from the MD calculations. The new $U_{SC_iSC_j}$ potentials
have been implemented in UNRES and were tested on two small α-proteins. These
preliminary tests showed that more tests are needed
and that the parameters should presumably be re-calibrated to
significantly improve the predictive power of UNRES. Replacement of the old
potentials of side chain–side chain interactions with the new ones eliminated the last
knowledge-based component of UNRES.
Introducing temperature-dependent $U_{SC_iSC_j}$ potentials and potentials for
phosphorylated side chains requires more tests. These are very interesting scientific
problems, and work on them will be continued. Additionally, the treatment of ionic
strength and pH in the $U_{SC_iSC_j}$ potentials, as well as the potentials for the
interactions of side chains with nucleic acids, are being developed.

Acknowledgements This research was conducted by using the resources of (a) our 818-processor
Beowulf cluster at the Baker Laboratory of Chemistry and Chemical Biology, Cornell University,
(b) the National Science Foundation Terascale Computing System at the Pittsburgh Supercomputer
Center, (c) 45-processor Beowulf cluster at the Faculty of Chemistry, University of Gdańsk, (d) the
Informatics Center of the Metropolitan Academic Network (IC MAN) in Gdańsk. This work was
supported by grants from the U.S. National Institutes of Health (GM-14312), the U.S. National Sci-
ence Foundation (MCB05-41633), the Polish Ministry of Science and Education (N N204 152836),
and the Polish National Science Centre (UMO-2013/10/E/ST4/00755).

References

1. Lee, J., Scheraga, H.A., Rackovsky, S.: Conformational analysis of the 20-residue membrane-
bound portion of melittin by conformational space annealing. Biopolymers 46, 103–115 (1998)
2. Lee, J., Liwo, A., Scheraga, H.A.: Energy-based de novo protein folding by conformational
space annealing and an off-lattice united-residue force field: application to the 10-55 fragment
of staphylococcal protein A and to apo calbindin D9K. Proc. Natl. Acad. Sci. U.S.A. 96,
2025–2030 (1999)
3. Pillardy, J., Czaplewski, C., Wedemeyer, W.J., Scheraga, H.A.: Conformation-Family Monte
Carlo (CFMC): an efficient computational method for identifying the low-energy states of a
macromolecule. Helv. Chim. Acta 83, 2214–2230 (2000)
4. Levitt, M.: Simplified representation of protein conformations for rapid simulation of protein
folding. J. Mol. Biol. 104, 59–107 (1976)
5. Crippen, G.M., Ponnuswamy, P.K.: Determination of an empirical energy function for protein
conformational-analysis by energy embedding. J. Comput. Chem. 8, 972–981 (1987)
6. Scheraga, H.A.: Calculations of stable conformations of polypeptides, proteins, and protein
complexes. Chem. Scr. 29A, 3–13 (1989)
7. Dill, K.A.: Dominant forces in protein folding. Biochemistry 29, 7133–7155 (1990)
8. Scheraga, H.A.: Some approaches to the multiple-minima problem structures. Int. J. Quant.
Chem. 42, 1529–1536 (1992)
9. Scheraga, H.A.: Predicting three-dimensional Structures of Oligopeptides. In: Lipkowitz, K.,
Boyd, D.B. (eds.) Reviews of Computational Chemistry, vol. 3, pp. 73–142. VCH Publ, New
York (1992)
10. Seetharamulu, P., Crippen, G.M.: A potential function for protein folding. J. Math. Chem. 6,
91–110 (1991)
11. Godzik, A., Koliński, A., Skolnick, J.: De-novo and inverse folding predictions of protein-
structure and dynamics. J. Comput. Aided Mol. Des. 7, 397–438 (1993)
12. Koliński, A., Godzik, A., Skolnick, J.: A general-method for the prediction of the 3-dimensional
structure and folding pathway of globular-proteins—application to designed helical proteins.
J. Chem. Phys. 98, 7420–7433 (1993)
13. Sippl, M.J.: Boltzmann principle knowledge-based mean fields and protein-folding—an
approach to the computational determination of protein structures. J. Comput. Aided Mol.
Des. 7, 473–501 (1993)
14. Koliński, A., Skolnick, J.: Monte-Carlo simulations of protein-folding. Lattice model and
interaction scheme. Proteins 18, 338–352 (1994)
15. Koliński, A., Skolnick, J.: Monte-Carlo simulations of protein-folding. 2. Application to
protein-A, ROP, and crambin. Proteins 18, 353–366 (1994)
16. Vasquez, M., Nemethy, G., Scheraga, H.A.: Chem. Rev. 94, 2183–2239 (1994)
17. Skolnick, J., Koliński, A., Ortiz, A.R.: MONSSTER: a method for folding globular proteins
with a small number of distance restraints. J. Mol. Biol. 265, 217–241 (1997)
18. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-
structure motifs. J. Mol. Biol. 281, 565–577 (1998)
19. Moult, J.: Predicting protein three-dimensional structure. Curr. Opin. Biotechnol. 10, 583–588
(1999)
20. Scheraga, H.A., Lee, J., Pillardy, J., Ye, Y.-J., Liwo, A., Ripoll, D.R.: Surmounting the multiple-
minima problem in protein folding. J. Glob. Optim. 15, 235–260 (1999)
21. Samudrala, R., Xia, Y., Huang, E., Levitt, M.: Ab initio protein structure prediction using a
combined hierarchical approach. Proteins 37(suppl 3), 194–198 (1999)
22. Simons, K.T., Bonneau, R., Ruczinski, I., Baker, D.: Ab initio protein structure prediction of
CASP III targets using ROSETTA. Proteins 37(suppl 3), 171–176 (1999)
23. Skolnick, J., Fetrow, J., Ortiz, A.R., Koliński, A.: New methods for the prediction of protein
structure and function from sequence. FASEB J. Suppl. S 13, A1584–A1584 (1999)
24. Lazaridis, T., Karplus, M.: Effective energy functions for protein structure prediction. Curr.
Opin. Struct. Biol. 10, 139–145 (2000)
25. Osguthorpe, D.J.: Ab initio protein folding. Curr. Opin. Struct. Biol. 10, 146–152 (2000)
26. Murzin, A.G.: Progress in protein structure prediction. Nat. Struct. Biol. 8, 110–112 (2001)
27. Warme, P.K., Momany, F.A., Rumball, S.V., Tuttle, R.W., Scheraga, H.A.: Computation
of structures of homologous proteins—alpha-lactalbumin from lysozyme. Biochemistry 13,
768–782 (1974)
28. Clark, D.A., Shirazi, J., Rawlings, C.J.: Protein topology prediction through constraint-based
search and the evaluation of topological folding rules. Protein Eng. 7, 751–760 (1991)
29. Rooman, M.J., Wodak, S.J.: Extracting information on folding from the amino-acid-
sequence—consensus regions with preferred conformation in homologous proteins. Biochem-
istry 31, 10239–10249 (1992)
30. Jones, T.A., Thirup, S.: Using known substructures in protein model-building and crystallog-
raphy. EMBO J. 5, 819–822 (1986)
31. Ginalski, K., Elofsson, A., Fischer, D., Rychlewski, L.: 3D-Jury: a simple approach to improve
protein structure predictions. Bioinformatics 19, 1015–1018 (2003)
32. Bujnicki, J.M., Elofsson, A., Fischer, D., Rychlewski, L.: LiveBench-1: continuous bench-
marking of protein structure prediction servers. Protein Sci. 10, 352–361 (2001)
33. Kosiński, J., Cymerman, I.A., Feder, M., Kurowski, M.A., Sasin, J.M., Bujnicki, J.M.: A
“Frankenstein’s monster” approach to comparative modeling: merging the finest fragments
of fold-recognition models and iterative model refinement aided by 3D structure evaluation.
Proteins 53, 369–379 (2003)
34. Johnson, M.S., Overington, J.P., Blundell, T.L.: Alignment and searching for common protein
folds using a Data-Bank of structural templates. J. Mol. Biol. 231, 735–752 (1993)
35. Fischer, D., Rice, D., Bowie, J.U., Eisenberg, D.: Assigning amino acid sequences to 3-
dimensional protein folds. FASEB J 10, 126–136 (1996)
36. Simons, K.T., Kooperberg, C., Huang, E., Baker, D.: Assembly of protein tertiary structures
from fragments with similar local sequences using simulated annealing and Bayesian scoring
functions. J. Mol. Biol. 268, 209–225 (1997)
37. Rohl, C.A., Strauss, C.E., Misura, K.M., Baker, D.: Protein structure prediction using rosetta.
Methods Enzymol. 383, 66–93 (2004)
38. Liwo, A., Ołdziej, S., Pincus, M.R., Wawak, R.J., Rackovsky, S., Scheraga, H.A.: A united-
residue force field for off-lattice protein-structure simulations. 1. Functional forms and parame-
ters of long-range side-chain interaction potentials from protein crystal data. J. Comput. Chem.
18, 849–873 (1997)
39. Anfinsen, C.B.: Principles that govern folding of protein chains. Science 181, 223–230 (1973)
40. Jansen, K., Schafer, O., Birkmann, E., Post, K., Serban, H., Prusiner, S.B., Riesner, D.: Struc-
tural intermediates in the putative pathway from the cellular prion protein to the pathogenic
form. Biol. Chem. 382, 683–691 (2001)
41. Morillas, M., Vanik, D.L., Surewicz, W.K.: On the mechanism of alpha-helix to beta-sheet
transition in the recombinant prion protein. Biochemistry 40, 6982–6987 (2001)
42. Shaw, D.E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R.O., Eastwood, M.P., Bank,
J.A., Jumper, J.M., Salmon, J.K., Shan, Y., Wriggers, W.: Atomic-level characterization of
the structural dynamics of proteins. Science 330, 341–346 (2010)
43. Lindorff-Larsen, K., Piana, S., Dror, R.O., Shaw, D.E.: How fast-folding proteins fold. Science
334, 517–520 (2011)
44. Gay, J.G., Berne, B.J.: Modification of the overlap potential to mimic a linear site-site potential.
J. Chem. Phys. 74, 3316–3319 (1981)
45. Pearlman, D.A., Case, D.A., Caldwell, J.W., Ross, W.S., Cheatham III, T.E., DeBolt, S., Fer-
guson, D., Seibel, G., Kollman, P.A.: AMBER, a package of computer-programs for applying
molecular mechanics, normal-mode analysis, molecular-dynamics and free-energy calcula-
tions to simulate the structural and energetic properties of molecules. Comput. Phys. Commun.
91, 1–41 (1995)
46. Liwo, A., Lee, J., Ripoll, D.R., Pillardy, J., Scheraga, H.A.: Protein structure prediction by
global optimization of a potential energy function. Proc. Natl. Acad. Sci. U.S.A. 96, 5482–5485
(1999)
47. Lee, J., Liwo, A., Ripoll, D.R., Pillardy, J., Scheraga, H.A.: Calculation of protein conformation
by global optimization of a potential energy function. Proteins Struct. Funct. Genet. 37(Suppl.
3), 204–208 (1999)
48. Lee, J., Liwo, A., Ripoll, D.R., Pillardy, J., Saunders, J.A., Gibson, K.D., Scheraga, H.A.:
Hierarchical energy-based approach to protein-structure prediction: blind-test evaluation with
CASP3 targets. Int. J. Quantum Chem. 71, 90–117 (2000)
49. Liwo, A., Pincus, M.R., Wawak, R.J., Rackovsky, S., Scheraga, H.A.: Calculation of protein
backbone geometry from beta-carbon coordinates based on peptide-group dipole alignment.
Protein Sci. 2, 1697–1714 (1993)
50. Ołdziej, S., Kozłowska, U., Liwo, A., Scheraga, H.A.: Determination of the potentials of mean
force for rotation about C-alpha-C-alpha virtual bonds in polypeptides from the ab initio energy
surfaces of terminally blocked glycine, alanine, and proline. J. Phys. Chem. A 107, 8035–8046
(2003)
51. Liwo, A., Ołdziej, S., Czaplewski, C., Kozłowska, U., Scheraga, H.A.: Parametrization of
backbone-electrostatic and multibody contributions to the UNRES force field for protein-
structure prediction from ab initio energy surfaces of model systems. J. Phys. Chem. B 108,
9421–9438 (2004)
52. Czaplewski, C., Liwo, A., Ołdziej, S., Scheraga, H.A.: Improved conformational space anneal-
ing method to treat beta-structure with the UNRES force-field and to enhance scalability of
parallel implementation. Polymer 45, 677–686 (2004)
53. Liwo, A., Arłukowicz, P., Czaplewski, C., Ołdziej, S., Pillardy, J., Scheraga, H.A.: A method
for optimizing potential-energy functions by a hierarchical design of the potential-energy land-
scape: application to the UNRES force field. Proc. Natl. Acad. Sci. U.S.A. 99, 1937–1942
(2002)
54. Liwo, A., Arłukowicz, P., Ołdziej, S., Czaplewski, C., Makowski, M., Scheraga, H.A.: Opti-
mization of the UNRES force field by hierarchical design of the potential-energy landscape. 1.
Tests of the approach using simple lattice protein models. J. Phys. Chem. B 108, 16918–16933
(2004)
55. Ołdziej, S., Liwo, A., Czaplewski, C., Pillardy, J., Scheraga, H.A.: Optimization of the UNRES
force field by hierarchical design of the potential-energy landscape. 2. Off-lattice tests of the
method with single proteins. J. Phys. Chem. B 108, 16934–16949 (2004)
56. Makowski, M., Liwo, A., Scheraga, H.A.: Simple physics-based analytical formulas for the
potentials of mean force for the interaction of amino-acid side chains in water. 1. Approximate
expression for the free energy of hydrophobic association based on a Gaussian-overlap model.
J. Phys. Chem. B 111, 2910–2916 (2007). Erratum: J. Phys. Chem. B 114, 1226 (2010)
57. Makowski, M., Liwo, A., Maksimiak, K., Makowska, J., Scheraga, H.A.: Simple physics-based
analytical formulas for the potentials of mean force for the interaction of amino-acid side chains
in water. 2. Tests with simple spherical systems. J. Phys. Chem. B 111, 2917–2924 (2007)
58. Makowski, M., Sobolewski, E., Czaplewski, C., Liwo, A., Ołdziej, S., No, J.H., Scheraga, H.A.:
Simple physics-based analytical formulas for the potentials of mean force for the interaction of
amino-acid side chains in water. 3. Calculation and parameterization of the potentials of mean
force of pairs of identical hydrophobic side chains. J. Phys. Chem. B 111, 2925–2931 (2007)
59. Makowski, M., Sobolewski, E., Czaplewski, C., Ołdziej, S., Liwo, A., Scheraga, H.A.: Simple
physics-based analytical formulas for the potentials of mean force for the interaction of amino-
acid side chains in water. IV. Pairs of different hydrophobic side chains. J. Phys. Chem. B 112,
11385–11395 (2008)
60. Makowski, M., Liwo, A., Sobolewski, E., Scheraga, H.A.: Simple physics-based analytical
formulas for the potentials of mean force of the interaction of amino-acid side chains in water.
V. Like-charged side chains. J. Phys. Chem. B 115, 6119–6129 (2011)
61. Makowski, M., Liwo, A., Scheraga, H.A.: Simple physics-based analytical formulas for the
potentials of mean force of the interaction of amino-acid side chains in water. VI. Oppositely-
charged side chains. J. Phys. Chem. B 115, 6130–6137 (2011)
62. Makowski, M., Liwo, A., Scheraga, H.A.: Simple physics-based analytical formulas for the
potentials of mean force of the interaction of amino-acid side chains in water. VII. Charged—hy-
drophobic/polar and polar—hydrophobic/polar side chains. J. Phys. Chem. B 121, 379–390
(2017)
63. Lewandowska, A., Ołdziej, S., Liwo, A., Scheraga, H.A.: Beta-hairpin-forming peptides; mod-
els of early stages of protein folding. Biophys. Chem. 151, 1–9 (2010)
64. Sobolewski, E., Makowski, M., Czaplewski, C., Liwo, A., Ołdziej, S., Scheraga, H.A.: Potential
of mean force of hydrophobic association: dependence on solute size. J. Phys. Chem. B 111,
10765–10774 (2007)
65. Makowski, M., Czaplewski, C., Liwo, A., Scheraga, H.A.: Potential of mean force of large
hydrophobic particles: towards nanoscale limit. J. Phys. Chem. B 114, 993–1003 (2010)
66. Sobolewski, E., Makowski, M., Ołdziej, S., Czaplewski, C., Liwo, A., Scheraga, H.A.:
Towards temperature-dependent coarse-grained potentials of side-chain interactions. I. Molecular
dynamics study of a pair of methane molecules in water at various temperatures. Protein Des.
Eng. Sel. (PEDS) 22, 547–552 (2009)
67. Sobolewski, E., Ołdziej, S., Wiśniewska, M., Liwo, A., Makowski, M.: Toward temperature-
dependent coarse-grained potentials of side-chain interactions for protein folding simulations.
II. Molecular dynamics study of pairs of different types of interactions in water at various
temperatures. J. Phys. Chem. B 116, 6844–6853 (2012)
68. Liwo, A., Khalili, M., Czaplewski, C., Kalinowski, S., Ołdziej, S., Wachucik, K., Scheraga,
H.A.: Modification and optimization of the united-residue (UNRES) potential energy function
for canonical simulations. I. Temperature dependence of the effective energy function and tests
of the optimization method with single training proteins. J. Phys. Chem. B 111, 260–285 (2007)
69. Paschek, D.: Temperature dependence of the hydrophobic hydration and interaction of simple
solutes: an examination of five popular water models. J. Chem. Phys. 120, 6674–6690 (2004)
70. Wiśniewska, M., Sobolewski, E., Ołdziej, S., Liwo, A., Scheraga, H.A., Makowski, M.: The-
oretical studies of interactions between O-phosphorylated and standard amino-acid side-chain
models in water. J. Phys. Chem. B 119, 8526–8534 (2015)

Modeling Nucleic Acids at the Residue–Level Resolution

Filip Leonarski and Joanna Trylska

Abstract Coarse–grained models and force fields have become useful in studies
of the dynamics and physicochemical properties of nucleic acids. Reduced representations
of DNA or RNA lower the computational cost by a few orders of magnitude
in comparison with full–atomistic simulations. In this chapter we describe
a few selected coarse–grained models of nucleic acids in which one nucleotide is
represented as either one, two or three beads. We present examples of models
designed to investigate the internal dynamics and temperature-dependent denaturation
of nucleic acids, as well as models created to predict the tertiary structure of RNA
or used for large ribonucleoprotein complexes. We describe how the purpose of the model
affects the design of the potential energy function and the choice of the simulation
method. We also address the limitations of these models.

1 Introduction

Genomes of many species, including human, have already been mapped [50, 140]
and are publicly available [34]. Their analyses give critical information on the cell
components. However, in numerous cases, looking solely at the nucleotide sequence
is not enough to explain how the processes in the cell are controlled. This happens
because these sequences give rise to three-dimensional molecules, immersed in the
environment of the cell, which undergo thermal fluctuations and “precisely” interact.
Therefore, the knowledge of the sequence, even though crucial, is only the first
step in analyzing the spatial and temporal pattern of biomolecular interactions. To
understand these interactions one needs to capture both the structural properties
and time-dependent dynamics of single molecules and macromolecular complexes.

F. Leonarski
Faculty of Chemistry, Centre of New Technologies, University of Warsaw, Warsaw, Poland
e-mail: F.Leonarski@cent.uw.edu.pl
J. Trylska (B)
Centre of New Technologies, University of Warsaw, Warsaw, Poland
e-mail: joanna@cent.uw.edu.pl

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_5
Below, we give a few examples where the dynamics is indispensable for biological
function.
Cells use multiple strategies to pack and protect long strands of deoxyribonucleic
acid (DNA) to provide for both DNA compaction and DNA accessibility for
transcription, replication and repair. For example, in bacteria DNA is supercoiled, i.e.,
a torsional stress is applied to a circular DNA duplex (plasmid) [141, 145]. The
changes in supercoiling result in a bacterial response to hostile conditions such as
starvation or thermal shock [90, 91]. The unwinding of supercoiled DNA is also the
first step of transcription and replication [20]. In eukaryotes, proteins help
pack DNA in the nucleus into the form of chromatin. The simplest building block
is the nucleosome, which is composed of histone proteins around which a DNA duplex
about 140 base pairs long is wrapped [114]. Multiple successive nucleosomes
are separated by DNA linkers and resemble “beads on a string” under an electron
microscope. Such organization allows the cell to control access to nucleosomal DNA,
which is possible only when DNA unwraps from the histone core. Therefore,
understanding the dynamics of this mechanism is crucial for controlling gene expression
or designing ways to deliver genetic material into cells.
Another important aspect of the stability of DNA is related to the flexibility and
dynamics of its double–helical structure. Topological stress, temperature or force–
pulling might break bonds between the complementary bases and destroy the helix.
DNA denaturation is easily tracked by UV–monitored changes of absorbance upon
raising the temperature. This process depends on the sequence and length of DNA,
and solution conditions such as ionic strength and pH [87]. Even though in living
cells a complete denaturation is not desirable, the local opening of a double–helix
is important for gene regulation. A small “bubble” of denatured DNA forms in the
areas where the transcription is initiated and/or regulated [20].
Even though ribonucleic acid (RNA) differs from DNA by just one hydroxyl
group in the sugar ring, this difference has important implications for the RNA
architecture leading to a plethora of RNA structures with diverse roles. Messenger
RNAs (mRNAs) serve as templates to transfer genetic information from DNA to
ribosomes. The RNAs that do not carry genetic information form a large group of
non–coding RNAs [83]. Transfer RNAs supply the ribosome with amino acids.
The ribosome itself contains ribosomal RNA which serves not only as a structural
skeleton but also as a catalytic center. There is also a myriad of regulatory RNAs
such as micro RNAs, small interfering RNAs, and small nucleolar RNAs.
Functional differences between DNA and RNA arise from the structural ones.
DNA predominantly forms an ordered double–helical structure, with adenine–thymine
(A–T) and guanine–cytosine (G–C) complementary canonical base pairs.
RNA is predominantly single–stranded with nucleotides bound by both complemen-
tary and non–complementary hydrogen bonds. Complementary ones, in the Watson–
Crick sense, represent the secondary structure. They are formed first, in microsec-
ond to millisecond time scales [2]. Bonds formed according to other schemes are
responsible for the RNA 3D folds and the entire tertiary structure [12, 66]. Formation of
the tertiary structure may take as long as seconds. The network of interactions in RNA
leads to double helical regions, intertwined with loops and junctions. Evolutionary
conservation analyses show a strong link between the tertiary structure of RNA and
its function, whereas the secondary structure and sequence are less conserved [15].
The functionality of RNAs is related to their flexibility and ability to change
folds [10]. Some RNAs adopt multiple functional conformations in response to external
conditions. Examples are riboswitches [137], which respond to ligand or metal
ion concentrations, and RNA thermometers [93], which respond to temperature shifts.
These are mRNA fragments that typically include the Shine–Dalgarno sequence [6]
responsible for binding of mRNA to the ribosome and initiation of translation.
The Shine–Dalgarno sequence either forms a hairpin loop, which is not exposed to
interact with the ribosome, or it switches the fold and the sequence becomes accessible
to the ribosome. The accessibility of the Shine–Dalgarno sequence depends on the
environment and can be modulated by external conditions.
Full understanding of the above processes requires the knowledge of how the
structure of nucleic acids changes and fluctuates in time and how this dynamics is
related to function. The methods that gain information solely from sequence are
of great value, e.g., the thermodynamic nearest–neighbor model has been successful in
predicting denaturation temperatures of various DNA or RNA duplexes [54, 71, 139].
Also, the secondary structure can in most cases be reliably predicted based on the
sequence alone [82]. However, the sequence-based methods fail for more complicated tasks
such as predictions of RNA 3D structure [117] and, more importantly, dynamics. The
dynamics, which is typically simulated based on the 3D model, helps in understanding
the functional roles of various nucleic acid architectures.
Also, in comparison with the number of available sequences of functional nucleic
acids, the experimentally determined 3D structural data lag behind. As of December
2017, 3738 structures containing RNA molecules had been deposited in the Protein
Data Bank [8]. In 2016, only 311 new RNA structures were
resolved. For proteins, the corresponding numbers are 132768 deposited structures
and 10270 resolved in 2016. Efficient ways to predict the RNA 3D structure
will help fill the gap left by the still low number of RNA structures in the crystallographic
database. The dynamical data for RNA are even more sparse, in part because the dynamics
is difficult to monitor experimentally at the atomic level and on fast time scales.
Modeling methods that add the fourth dimension, time dependence, are therefore
beneficial for understanding the complexity of interactions in the cell.
To characterize the dynamical processes occurring in nucleic acid molecules, multiple
techniques have been used. The three main conformational sampling techniques
are molecular dynamics simulations, normal mode analysis and Monte Carlo algorithms
[61]. All typically require, as a starting point, a set of initial coordinates of
the molecule describing its 3D structure. The Monte Carlo (MC) algorithms are
probabilistic methods that help to stochastically explore the conformational space of
molecules. In an MC simulation (e.g. [57, 129]) small modifications of the molecule’s
coordinates are randomly introduced and are either accepted or rejected based on
the potential energy of the system. If the modification lowers the potential energy, it
is always accepted. Otherwise, its acceptance is probabilistic and is more likely
the smaller the increase in energy. A wide range of possible conformations
can be probed using this method. Conversely, if one is not interested in a
120 F. Leonarski and J. Trylska

wide search of conformational space but in the low-frequency dynamics of a known
native state, normal mode analysis can be applied [73]. Normal mode analysis predicts
the system’s motions at equilibrium by decomposing them into independent
vibrational modes. This method looks for the vibrational normal modes with the lowest
frequencies, which are usually connected with the molecule’s function. Molecular dynamics
(MD) [85, 124] is the tool most often used to analyze the time-dependent dynamic
behavior of biomolecules. By integrating Newton’s equations of motion one can
calculate the positions and velocities of atoms or residues at small subsequent time
steps (the trajectory). However, to solve these equations one has to provide initial
positions and velocities. The latter are usually assigned according to the
Boltzmann–Maxwell distribution at a requested temperature.
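The acceptance rule described above for MC moves can be sketched in a few lines. This is a minimal illustration that assumes the standard Metropolis criterion (an uphill move is accepted with probability exp(−ΔE/k_B T)), which the text does not name explicitly. The function names, the toy one-dimensional harmonic energy and all parameter values are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def metropolis_accept(delta_e, temperature, rng=random):
    """Return True if a move with energy change delta_e (kcal/mol) is accepted."""
    if delta_e <= 0.0:
        return True  # moves that lower the energy are always accepted
    # Uphill moves: the acceptance probability decays with the energy increase.
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


def mc_sample(energy, x0, n_steps, step_size, temperature, seed=0):
    """Toy MC walk over a single coordinate x using random displacements."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    trajectory = [x]
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)  # small random modification
        e_new = energy(x_new)
        if metropolis_accept(e_new - e, temperature, rng):
            x, e = x_new, e_new
        trajectory.append(x)  # a rejected move repeats the old state
    return trajectory


def harmonic(x):
    """Hypothetical energy surface with a single minimum at x = 1."""
    return 10.0 * (x - 1.0) ** 2


traj = mc_sample(harmonic, x0=3.0, n_steps=5000, step_size=0.2, temperature=300.0)
equilibrated = traj[1000:]  # discard the initial relaxation towards the minimum
mean_x = sum(equilibrated) / len(equilibrated)
```

In a real application the single coordinate would be replaced by the full set of bead or atom coordinates and `harmonic` by the force field energy.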
All these methods require a mathematical formula with a set of parameters (force
field, FF) to calculate the potential energy of the system. Well known examples of
such FFs are AMBER [16, 17] or CHARMM [13, 77], which provide sets of param-
eters to simulate proteins, nucleic acids, lipids and other molecules. They employ
a full–atomistic representation of a molecule, i.e., consider each atom separately in
integrating equations of motion. To provide a good description of the environment
one has to include solvent effects. This can be achieved by explicitly adding water
(or other solvent) molecules to a system. State-of-the-art examples of MD simulations
in explicit solvent include a millisecond simulation of the 58 amino-acid bovine
pancreatic trypsin inhibitor protein, performed on a computer built exclusively and
purposely for MD simulations by the D. E. Shaw group [127, 128], and a 13.3 µs MD
simulation of folding of a 162 amino-acid human pin1 WW domain by the K. Schulten
group [36]. Also, a 100 million atom model of bacterial cytoplasm (containing proteins,
nucleic acids, metabolites, ions, and water molecules in an atomistic description)
in a 100 nm box has recently been simulated on a time scale of tens of nanoseconds
[149]. However, traditional full–atomistic FFs have their limitations primarily
because they were parameterized based on experiments and quantum-mechanical
calculations for small molecules. There are also doubts about the quality of the
microsecond scale simulations since the FF parameterization was not performed
with such long time scales in mind. Unfortunately, the microsecond time scale is still
too short to model global conformational changes in RNA, to fully grasp how the
RNA tertiary structure is formed or to predict unliganded states of riboswitches.
Time saving by at least an order of magnitude can be achieved by performing a
simulation with the solvent modeled implicitly. To do this one can modify the FF to
include hydrophobic effects by adding a term involving the solvent accessible surface
area [43]. One can also modify the equations of motion to include random collisions
with water, as in Langevin-type dynamics [35]. But simplifying a system can
go further than just removing the solvent degrees of freedom. One can reduce the
system’s representation to achieve the necessary reduction in complexity. In such
simulations chemical groups or even whole residues can be represented as single
interacting centers (beads). Then the gain in performance is two-fold. The more
obvious one is the decrease of the number of interactions in the calculations of the
potential energies or forces. Additionally, the highest-frequency vibrations are removed
from the system, smoothing the potential energy surface, and allowing one to use a
Modeling Nucleic Acids at the Residue–Level Resolution 121

larger simulation time step. Therefore, such a coarse–graining (CG) procedure should
be appropriate to simulate the nucleic acid dynamical problems introduced above, which
occur on time scales from nanoseconds to seconds.
In this chapter we present the CG FFs for nucleic acids that use between one and
three beads per nucleotide. We believe that such models give a reasonable balance
between the quality of the results and time efficiency of the calculations. However,
there are models that use a higher number of beads and represent the structural details
of bases (e.g., [11, 21, 23, 74, 75, 106, 108, 119, 148]). At the other end of the spectrum
there are coarser models in which the building blocks are formed of helices and
single–stranded loops (not single nucleotides) [3, 4, 14, 51, 98]. Here, we describe
only the models that use spherical beads, but some authors implement interaction
centers as ellipsoids [92] or disks [86]. We will also not cover two models which
are historically important: a one–bead model published in the 1970s by Olson [95–97],
which was the first attempt at coarse-grained DNA modeling, and a three-bead per
nucleotide model by Vorobjev [144] from 1990, since the latter model was not used
in actual simulations, despite its strong theoretical background. Also, the models
that we review here belong to the class of the off–lattice models for which the bead
coordinates in the simulations are not limited to a certain set of positions such as
nodes of a cubic grid. Our selection of the models is arbitrary and far from complete
because our aim was to give informative examples of how the CG models for nucleic
acids are constructed and to which biological problems they can be applied.

2 Coarse Grained Force Field Parameterization

Coarse–graining is not a problem–free procedure. One of the challenges of the
CG models is their parameterization. For full–atomistic models there are well–established
protocols where FF parameters are determined and benchmarked based
on quantum chemistry models and experimental measurements of thermodynamic
parameters for small molecules. CG models are hard to fit directly to the quantum
mechanical data, although there are examples, such as the UNRES FF developed by
Liwo et al. [45, 69, 70, 76] (and described in this book), where tedious derivations
led to a usable CG potential.
Most CG potentials presented in this chapter can be classified as statistical or
knowledge–based. The parameters of such FFs were found based on the average
properties derived from large sets of reference data characterizing the molecules of
interest. In some cases the data sets include all nucleic acids of a certain class found
in one of the crystallographic databases [7, 8]. In other cases, the data sets are gath-
ered from full–atomistic simulations. However, in both cases the parameterization
procedure is the same. Using the Boltzmann inversion procedure one can infer the
potential energy from distributions of certain observables (e.g., distances, angles,
dihedrals) acquired from one of the mentioned sources [112]. A distribution $d(r)$ of
an observable $r$ is linked with the potential energy using the equation

$$V(r) = -k_B T \ln \frac{d(r)}{d_0(r)}, \tag{1}$$

where $d_0(r)$ is a reference distribution of such an observable, $k_B$ is the Boltzmann
constant and $T$ the temperature. To use this method one has to assume that $V(r)$ is not
correlated with other observables, which, if not satisfied, can lead to errors in the
potential energy approximation. However, there are also other problems associated
with the Boltzmann inversion approach. The structural data set might be biased
(e.g., the PDB database contains rather short RNA molecules and long RNAs are
underrepresented) or affected by uncertainties in the structure determination due
to low resolution of electron density maps. Also, the structures of biomolecules
from X–ray crystallography are derived at temperatures much lower than 310 K.
Moreover, crystallization of biomolecules typically occurs under unphysiological
high-salt conditions and induces crystal packing forces. Last, but not least, finding the
proper reference distribution, $d_0(r)$, is difficult because it should take into account
the specifics of nucleic acid structures (such as the linearity).
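The Boltzmann inversion of Eq. (1) can be illustrated numerically. The sketch below is not taken from any of the cited packages. It assumes a flat reference distribution $d_0(r) = 1$ unless one is supplied, and it uses synthetic pseudo-bond lengths drawn from a known harmonic well so that the recovered potential can be checked. All names and parameter values are hypothetical.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boltzmann_inversion(samples, bin_edges, temperature, reference=None):
    """Return bin centers and V(r) = -k_B T ln[d(r)/d0(r)], shifted so min(V) = 0."""
    d, edges = np.histogram(samples, bins=bin_edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Assumption: a flat reference distribution unless one is supplied.
    d0 = np.ones_like(d) if reference is None else np.asarray(reference, dtype=float)
    populated = (d > 0) & (d0 > 0)  # empty bins carry no information
    v = np.full(d.shape, np.nan)
    v[populated] = -K_B * temperature * np.log(d[populated] / d0[populated])
    return centers, v - np.nanmin(v)


# Synthetic "observed" distances sampled from V(r) = k_true (r - r0_true)^2.
k_true, r0_true, temperature = 5.0, 6.0, 300.0  # kcal/(mol A^2), A, K
sigma = np.sqrt(K_B * temperature / (2.0 * k_true))
rng = np.random.default_rng(0)
samples = rng.normal(r0_true, sigma, size=200_000)

edges = np.linspace(r0_true - 2 * sigma, r0_true + 2 * sigma, 41)
centers, v = boltzmann_inversion(samples, edges, temperature)

# Fit V = a r^2 + b r + c to read off the force constant and equilibrium distance.
ok = np.isfinite(v)
a, b, c = np.polyfit(centers[ok], v[ok], 2)
k_fit, r0_fit = a, -b / (2.0 * a)
```

With real structural data the choice of `reference` matters, for the reasons listed in the paragraph above, and correlated observables can make the inverted potential inaccurate.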
Therefore, some authors fix certain parameters to the values that are experimen-
tally known, e.g., from the thermodynamic measurements. Next, this procedure is
followed by a trial–and–error optimization of other parameters in order to correct for
the drawbacks of the Boltzmann inversion method. Typically, one performs tests of a
CG FF on a known system and systematically modifies the parameters until a reliable
set meeting the assigned criteria is found. This last step can be performed in a sys-
tematic way using local or global optimization methods [47, 63, 64, 111, 122]. Such
methods were implemented by us in a software package called RedMDStream [65],
aimed at easy optimization of residue-level resolution CG potentials. The package
simplifies the workflow, from the assignment of a particular CG representation and
potential to running thousands of test molecular dynamics simulations in order to
find one model that best fits the user defined criteria.

3 Force Field Description

The basic criterion that we apply to divide the CG models into classes is the number
of beads used to represent one nucleotide. As stated in the introduction we will cover
only a small spectrum of possible representations—one to three beads per nucleotide.
However, even in this bead range one observes differences between the design and
applicability of the coarser– and finer–grained models.
There are also other measures to compare FFs apart from the number of interact-
ing centers. These are the mapping (where the centers of beads are positioned), the
definition of potential energy function, and the range of applicability (or transfer-
ability). Unfortunately, for the CG methods this applicability range is usually very
narrow. To provide reliable and predictive results CG methods have to be fine–tuned
for a particular process and/or group of molecules. Overall, CG FFs lack the general
transferability of all-atom ones. For example, in the presented set of FFs there is not
even one that can be, out of the box, applied to both RNA and DNA systems. Apart
from the target molecule, we have selected the following main classes of problems
that the FF can be applied to:
• Long timescale dynamics—a model provides reliable information about the time
evolution of a molecular structure on at least nanosecond scale.
• Tertiary structure prediction—a model finds a 3D structure or a set of 3D structures
that are closest to the native state. The focus is only on the final structure, not on
the way it is achieved (in contrast to the folding simulations).
• Temperature denaturation—a model correctly predicts the effects of the tempera-
ture increase on nucleic acid stability.
• Supercoiling—a model predicts the effects associated with supercoiling (e.g.,
unwinding mechanics).
• Large molecule mechanics—a model is designed to simulate the dynamics of large
molecular complexes (>1000 residues) such as the ribosome or nucleosome.
• Interaction with non–nucleic acid molecules—a model is able to predict interac-
tions with ligands, proteins or nanomaterials (ions and solvent are not included in
this category).
The FF applicability results from its implementation details such as the definition
of the potential energy function with respect to the chosen degrees of freedom and
connectivity. For the residue-resolution CG FFs the potential energy function, $V_{\mathrm{total}}$,
is usually expressed in the following, general way:

$$V_{\mathrm{total}} = V_{\mathrm{intrastrand}} + V_{\mathrm{interstrand}} + V_{\mathrm{nb}}. \tag{2}$$

The intrastrand term covers the interactions of beads connected by covalent bonds
which extend up to the third neighbor. This term is composed of a pseudo–bond
($V_{\mathrm{bond}}$), pseudo–angle ($V_{\mathrm{angle}}$), and pseudo–dihedral ($V_{\mathrm{dihedral}}$) parts (see Fig. 1):

$$V_{\mathrm{intrastrand}} = V_{\mathrm{bond}} + V_{\mathrm{angle}} + V_{\mathrm{dihedral}}. \tag{3}$$

Typically, these bonds are not allowed to break in a simulation, so they are represented
with harmonic potentials (see Fig. 1a, b, c):

$$V_{\mathrm{bond}}(r) = k_r (r - r_0)^2, \tag{4}$$

$$V_{\mathrm{angle}}(\theta) = k_\theta (\theta - \theta_0)^2, \tag{5}$$

$$V_{\mathrm{dihedral}}(\phi) = k_\phi (\phi - \phi_0)^2, \tag{6}$$

[Figure 1: three panels. (a) V(r) [kcal/mol] versus r [Å], with r0 marked; (b) V(θ) [kcal/mol] versus θ [deg], with θ0 marked; (c) V(φ) [kcal/mol] versus φ [deg], with −π + φ0, φ0 and π + φ0 marked on the angle axis and 2Kφ on the energy axis]

Fig. 1 Intrastrand potentials used in the presented CG FFs. a The pseudo–bond harmonic (solid
line, see (4)), cubic (long–dashed line) and quartic (short–dashed line, see (33)) potential. b The
pseudo–angle potential (see (5)). c The pseudo–dihedral potential implemented using a cosine
function (long-dashed line, see (7)) or harmonic potential (solid line, see (6))

where $k_r$ (see footnote 1), $k_\theta$ and $k_\phi$ are the force constants, $r_0$ is the equilibrium
distance, and $\phi_0$ and $\theta_0$ are the equilibrium angles. The drawback of the above
$V_{\mathrm{dihedral}}$ is that it is not periodic, so to account for full rotation of the
pseudo–dihedral angle, a formula with a cosine is used (see also Fig. 1c):

$$V_{\mathrm{dihedral}}(\phi) = k_\phi \left[1 - \cos(\phi - \phi_0)\right], \tag{7}$$

with the same definition of $k_\phi$ and $\phi_0$. Beads positioned in the same strand can
form complementary bonds, which is especially important for RNA that is usually
composed of only one folded strand. However, as these are usually residues separated
by more than three bases, for the purpose of CG FFs such bonds are not considered
to be “intrastrand” and are accounted for in the interstrand part.
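The difference between the harmonic form (6) and the cosine form (7) of the pseudo–dihedral term can be checked numerically. The sketch below is only an illustration; the parameter values are hypothetical and angles are taken in radians.

```python
import math


def v_dihedral_harmonic(phi, k_phi, phi0):
    """Eq. (6): harmonic pseudo-dihedral term (angles in radians)."""
    return k_phi * (phi - phi0) ** 2


def v_dihedral_cosine(phi, k_phi, phi0):
    """Eq. (7): periodic pseudo-dihedral term (angles in radians)."""
    return k_phi * (1.0 - math.cos(phi - phi0))


k_phi, phi0 = 2.0, math.radians(30.0)  # hypothetical parameters
phi = math.radians(100.0)
full_turn = 2.0 * math.pi

# A full rotation leaves the cosine form unchanged but not the harmonic one.
cos_shift = (v_dihedral_cosine(phi + full_turn, k_phi, phi0)
             - v_dihedral_cosine(phi, k_phi, phi0))
har_shift = (v_dihedral_harmonic(phi + full_turn, k_phi, phi0)
             - v_dihedral_harmonic(phi, k_phi, phi0))

# The cosine form peaks at phi0 +/- pi with a barrier of 2 k_phi (see Fig. 1c).
barrier = v_dihedral_cosine(phi0 + math.pi, k_phi, phi0)

# Close to phi0 the cosine form behaves like (k_phi / 2) (phi - phi0)^2.
delta = 0.01
small_angle_ratio = (v_dihedral_cosine(phi0 + delta, k_phi, phi0)
                     / v_dihedral_harmonic(phi0 + delta, k_phi, phi0))
```

The last ratio shows that, for the same numerical value of $k_\phi$, the cosine form is half as stiff as the harmonic form near the minimum, so force constants are not directly interchangeable between the two expressions.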
The interstrand term describes the interaction of complementary strands. This
term models hydrogen bonds, which in nature can be broken by raising the temperature
or adding denaturing agents or enzymes. Breakable bonds are usually implemented
using the Lennard–Jones potential (see Fig. 2a):

$$V_{LJ}(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] \tag{8}$$

or in an alternative form ($r_0 = 2^{1/6}\sigma$):

$$V_{LJ}(r) = \varepsilon \left[ \left(\frac{r_0}{r}\right)^{12} - 2\left(\frac{r_0}{r}\right)^{6} \right], \tag{9}$$

or the Morse (see Fig. 2b) potential:

$$V_{\mathrm{Morse}}(r) = V_0 \left( \exp[-\alpha(r - r_0)] - 1 \right)^2 - V_0. \tag{10}$$

Equations (8) and (9) are two forms of the same equation. $\varepsilon$ describes the depth of
the potential energy well, $\sigma$ is the distance where the potential energy is equal to zero
and $r_0$ is the distance where the potential energy has a minimum. For the Morse
potential of Eq. 10, $V_0$ is also the depth of the energy well and $\alpha$ describes the width
of the potential well. The Lennard–Jones potential might be modified (for example
softened) by changing the powers in the equation. However, not all FF models permit
such actions because this requires a more complex potential energy formulation. It is
not always necessary to allow for the interstrand bond breaking because a particular
CG model may be designed only for non–denaturing conditions. In such a case a
simple harmonic potential (4) may suffice. The CG models also differ in the way the
interstrand bond network is set. Simpler models have a predefined network which
is based on the secondary structure prediction and the pairing is not altered during
a simulation, so even after denaturation the molecule will always return to the same
conformational setting as in native conditions. This is beneficial for RNA structure
prediction, when we are interested in the folds that correspond only to one particular
secondary structure. In the case of more elaborate CG FF models interstrand bonds
can be formed dynamically when the two complementary bases are close and their
topology permits bonding.

1 In this chapter we ignore the 1/2 factor because the harmonic potentials in CG FFs are presented
differently (either with or without the 1/2 factor). Including this factor affects only the numerical
value of a force constant but does not change the general form of the potential.
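The relation between the two Lennard–Jones forms (8) and (9), and the role of the Morse parameters, can be verified with a short script. The Morse function below assumes the standard form $V_0(\exp[-\alpha(r - r_0)] - 1)^2 - V_0$, which has a minimum of depth $V_0$ at $r_0$ and tends to zero at large distances, as drawn in Fig. 2b. All numerical values are hypothetical.

```python
import math


def v_lj_sigma(r, eps, sigma):
    """Eq. (8): Lennard-Jones potential written with the zero-crossing distance sigma."""
    return 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)


def v_lj_r0(r, eps, r0):
    """Eq. (9): the same potential written with the minimum position r0."""
    return eps * ((r0 / r) ** 12 - 2.0 * (r0 / r) ** 6)


def v_morse(r, v0, alpha, r0):
    """Morse potential in the assumed standard form (well depth v0 at r0)."""
    return v0 * (math.exp(-alpha * (r - r0)) - 1.0) ** 2 - v0


eps, sigma = 1.0, 5.0                  # kcal/mol, Angstrom (hypothetical)
r0 = 2.0 ** (1.0 / 6.0) * sigma        # position of the Lennard-Jones minimum

# Largest difference between the two forms over a range of distances.
distances = [4.5, 5.0, 5.6, 7.0, 10.0]
max_diff = max(abs(v_lj_sigma(r, eps, sigma) - v_lj_r0(r, eps, r0)) for r in distances)

lj_at_sigma = v_lj_sigma(sigma, eps, sigma)   # zero by construction
lj_at_r0 = v_lj_r0(r0, eps, r0)               # equals -eps
morse_at_r0 = v_morse(r0, 1.5, 1.0, r0)       # equals -v0
morse_far = v_morse(r0 + 50.0, 1.5, 1.0, r0)  # tends to zero
```

A larger `alpha` narrows the Morse well without changing its depth, which is the behavior shown for α = 1.0 and α = 2.0 in Fig. 2b.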
The last category of terms comprises the nonbonded ones (nb). They account for the
interactions of residues that are not connected explicitly by intrastrand and interstrand
terms. Their basic function is to introduce a short–range repulsion to avoid
overlapping of non–interacting beads; however, they also account for long-range
electrostatic interactions and solvent or other environmental conditions. The implementation
of these terms varies among FFs depending on their applications. Some
FFs use Lennard–Jones or Morse terms as in (8) or (10) that describe both the repulsion
at short distances and the attraction at longer distances. However, for highly charged
molecules, such as nucleic acids, one could also use the Coulomb electrostatic potential
to describe a repulsive–only interaction, with or without shielding (see Fig. 2c):

$$V_{\mathrm{Coulomb}}(r) = \frac{q_i q_j}{4\pi \varepsilon_0 \varepsilon_w r}, \tag{11}$$

$$V_{\mathrm{ShCoulomb}}(r) = \frac{q_i q_j}{4\pi \varepsilon_0 \varepsilon_w r} \exp(-r/k_D), \tag{12}$$

where $q_i$ and $q_j$ are the charges of the interacting beads, $\varepsilon_0$ is the vacuum and $\varepsilon_w$ the
solvent permittivity. The Debye length,

$$k_D = \left( \frac{\varepsilon_0 \varepsilon_w k_B T}{2 N_A e^2 I} \right)^{0.5}, \tag{13}$$

depends on the temperature $T$ and ionic strength $I$ of the solution. $k_B$ is the Boltzmann
constant, $N_A$ is the Avogadro number and $e$ is the elementary charge [56].
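To give a feeling for the magnitude of the Debye length in Eq. (13), the sketch below evaluates it in SI units. It assumes a constant relative permittivity of water of 78 (in the cited models the permittivity may depend on temperature and salt concentration) and an ionic strength given in mol/L. The function names are hypothetical.

```python
import math

EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
N_A = 6.02214076e23          # Avogadro number, 1/mol
E_CHARGE = 1.602176634e-19   # elementary charge, C


def debye_length(temperature, ionic_strength_molar, eps_w=78.0):
    """Eq. (13) in metres; the ionic strength is converted from mol/L to mol/m^3."""
    i_si = ionic_strength_molar * 1000.0
    return math.sqrt(EPS0 * eps_w * K_B * temperature
                     / (2.0 * N_A * E_CHARGE ** 2 * i_si))


def coulomb(r, qi, qj, eps_w=78.0):
    """Eq. (11) in joules; r in metres, charges in units of the elementary charge."""
    return qi * qj * E_CHARGE ** 2 / (4.0 * math.pi * EPS0 * eps_w * r)


def screened_coulomb(r, qi, qj, k_d, eps_w=78.0):
    """Eq. (12) in joules; k_d is the Debye length in metres."""
    return coulomb(r, qi, qj, eps_w) * math.exp(-r / k_d)


lambda_100mM = debye_length(300.0, 0.1)   # close to 1 nm
lambda_400mM = debye_length(300.0, 0.4)   # half of the value above

r = 1.0e-9  # two phosphate beads 1 nm apart
v_plain = coulomb(r, -1, -1)
v_screened = screened_coulomb(r, -1, -1, lambda_100mM)
```

At 0.1 mol/L the Debye length is about 1 nm, so the repulsion between two phosphate beads separated by 1 nm is reduced to roughly 1/e of its unscreened value.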
For intrastrand and interstrand interactions we introduce the following notation
shown in Fig. 3: i:i+n denotes the interaction between a nucleotide i and its n-th
successor on the same strand; i:j+n or i:j−n denotes the interaction between a
nucleotide i and the n-th successor (or predecessor) of its complementary nucleotide j
on the complementary strand.
A graphical representation of CG FFs described in this chapter can be found in
Figs. 4 and 5. In Table 1 we compare the features of the described models. In the
following sections we present the models in the descending order of complexity—
from three to one bead per nucleotide.
[Figure 2: six panels, each plotting V(r) [kcal/mol] against r [Å]. a. Lennard−Jones potential; b. Morse potential; c. Screened Coulomb repulsion; d. Discrete energy function; e. Morse with barrier potential; f. Restraint potential]

Fig. 2 Interstrand and nonbonded potentials used in the presented CG FFs. a Lennard–Jones
potential. b Morse potential with α = 1.0 (solid line) and α = 2.0 (long–dashed line). c Coulomb
potential without screening, $k_D = \infty$ (solid line), and Coulomb potential with two example Debye
lengths $k_{D1} < k_{D2}$ (short– and long–dashed line, respectively). d Discrete potential taken from
the model of Ding et al. [29, 30]. e Morse potential with a barrier used in Trovato et al. [135]:
Morse potential (solid line), switch function (short–dashed line), final potential (long–dashed line).
f Restraint potential from the model of Malhotra et al. [22, 78, 79]

Fig. 3 DNA helix showing the nucleotide numbering according to the i:i+n and i:j+n convention,
with a single nucleotide pair (darker) in the middle as a reference. This helix is shown in a
one–bead representation, with interaction centers placed on phosphorus atoms as in the FFs
by Trovato and Tozzini [135] and Trylska et al. [136, 143]

4 Three-Bead DNA Model for Dynamics and Melting

The first example that we describe of a three–bead per nucleotide model is the one
of Knotts et al. [56] designed for DNA. In this model the beads that mimic the sugar
and phosphate are placed at the centers of mass of these groups. The adenine and
guanine base beads are placed in the position of their N1 atoms and the thymine
and cytosine beads in the position of their N3 atoms (see Fig. 5a). The authors argue
that representing the DNA backbone with two beads is necessary to properly model
the deformation of grooves which are important for protein–DNA interactions. The
choice of a three–bead representation also helps in later transformation from a CG
representation to a full–atomistic one. The intrastrand part of the potential energy
function contains one additional term, $V_{\mathrm{stack}}$, in comparison with Eq. 3:

$$V_{\mathrm{intrastrand}} = V_{\mathrm{bond}} + V_{\mathrm{angle}} + V_{\mathrm{dihedral}} + V_{\mathrm{stack}}. \tag{14}$$


Fig. 4 Left: RNA hairpin loop (PDB:1ATO [58]); Right: yeast phenylalanine tRNA
(PDB:6TNA [131]): a full–atomistic representation, b three–bead per nucleotide representation
as in the work of Ding et al. [30], c one–bead per nucleotide representation as in the work of
Jonikas et al. [53]. For the RNA hairpin loop (left) we show the bead placement with
non–breakable bonds and for tRNA (right) we show only the bead placement

Fig. 5 Guanine–cytosine nucleotide pair shown in different CG representations: a three–bead
model as in Knotts et al. [56] (a similar model is described in the work of Ding et al. [30],
however, there the base bead is placed in the center of the six–membered ring), b a two–bead
model with pseudo–atoms placed on the backbone and base as in Drukker et al. [32], c one bead
centered on the phosphorus atom as in Trovato et al. [135] and Trylska et al. [136, 143], d one
bead placed in the nucleotide geometric center as in Savelyev et al. [122], e one bead centered
on the C3 atom as in Jonikas et al. [53], f one bead placed on the phosphorus atom and a special
“dummy” bead in the middle of a complementary pair as in Malhotra et al. [22, 78, 79]
Table 1 Comparison of features of CG FFs presented in this chapter
Feature | Knotts et al. [25, 37, 56, 109] | Ding et al. [30] | Hyeon et al. [48] | Ouldridge et al. [100, 102, 104] | Drukker et al. [32] | Trovato et al. [135] | Savelyev et al. [122, 123] | Jonikas et al. [53] | Trylska et al. [136, 143] | Malhotra et al. [22, 78, 79]
Number of beads/nt | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 1 | 1 | 1.5
Nucleic acid
DNA      
RNA     
Applicability
Long timescale dynamics         
Tertiary structure prediction  
Temperature-dependent denaturation     
Force–pulling denaturation 
Supercoiling  
Large macromolecule mechanics  


Interaction with non–nucleic acids  
Potential energy formulation properties
Bias towards reference positions   
Bias towards reference secondary       
structure
Base specificity     
Breakable hydrogen bonds      
Explicit ions  
The pseudo–bond $V_{\mathrm{bond}}$ and pseudo–angle $V_{\mathrm{angle}}$ potentials are implemented using
harmonic potentials (see (4) and (5) and Fig. 1a, b). The pseudo–dihedral potential
$V_{\mathrm{dihedral}}$ is implemented using a cosine potential (see (7) and Fig. 1c). The $V_{\mathrm{stack}}$ term
is modeled with the Lennard–Jones potential (see (8) and Fig. 2a).
The first three terms in (14) are standard but $V_{\mathrm{stack}}$ is an additional Go–type
potential introduced to account for the stacking interactions [46]. This interaction is
modeled only between the base beads that belong to one strand and in the reference
(“native”) structure are positioned within a 9 Å cut–off distance. Therefore, this
potential accounts for both the i:i+1 and i:i+2 interactions.
In the interstrand term the complementary base pairs are connected using a
Lennard-Jones-like potential (see Fig. 2a), but with the 12–10 powers instead of
12–6 as in (8):

$$V_{\mathrm{interstrand}}(r_{ij}) = 4\varepsilon_{\mathrm{bp},ij} \left[ 5\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - 6\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{10} \right], \tag{15}$$

which is summed over all G-C and A-T base pairs that are not already considered
in $V_{\mathrm{stack}}$.
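The effect of replacing the 12–6 powers of (8) with the 12–10 powers of (15) can be seen by locating the minimum of each potential. The brute-force scan below is only a numerical check with hypothetical parameter values. For the 12–10 form the minimum lies at $r = \sigma$ with a depth of $4\varepsilon$, whereas for the 12–6 form it lies at $2^{1/6}\sigma$ with a depth of $\varepsilon$.

```python
def v_12_6(r, eps, sigma):
    """Eq. (8): standard 12-6 Lennard-Jones potential."""
    return 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)


def v_12_10(r, eps, sigma):
    """Eq. (15): 12-10 form used for base pairing."""
    return 4.0 * eps * (5.0 * (sigma / r) ** 12 - 6.0 * (sigma / r) ** 10)


def grid_minimum(potential, r_lo, r_hi, n_points=20001):
    """Return (r_min, v_min) found by scanning a uniform grid of distances."""
    step = (r_hi - r_lo) / (n_points - 1)
    best_r, best_v = r_lo, potential(r_lo)
    for i in range(1, n_points):
        r = r_lo + i * step
        v = potential(r)
        if v < best_v:
            best_r, best_v = r, v
    return best_r, best_v


eps, sigma = 1.0, 5.0  # kcal/mol, Angstrom (hypothetical)
r_min_10, v_min_10 = grid_minimum(lambda r: v_12_10(r, eps, sigma), 0.8 * sigma, 3.0 * sigma)
r_min_6, v_min_6 = grid_minimum(lambda r: v_12_6(r, eps, sigma), 0.8 * sigma, 3.0 * sigma)
```

The 12–10 well is also narrower than the 12–6 one, which makes the pairing interaction more sensitive to the distance between the base beads.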
The nonbonded potential in the original paper [56] is composed of an excluded volume
term $V_{\mathrm{ex}}$, implemented using the Lennard–Jones potential (see (8) and Fig. 2a),
and a shielded electrostatic term $V_{\mathrm{ShCoulomb}}$ (see (12) and Fig. 2c):

$$V_{\mathrm{nb}} = V_{\mathrm{ex}} + V_{\mathrm{ShCoulomb}}, \tag{16}$$

where the $V_{\mathrm{ex}}$ term is only calculated when the distance $r_{ij}$ between beads is smaller
than a predefined cut–off. The $V_{\mathrm{ShCoulomb}}$ term defines the electrostatic repulsion of only
phosphorus atoms (with the charges $q_i = q_j = -1$).
This model was parameterized in an iterative way. The first guess of parameters
was taken from the geometry of an ideal B–DNA helix. Second, a 14 base–pair DNA
duplex was simulated with the CG model using replica-exchange MD [133]. Eight
replicas (or system copies) were simulated in parallel and assigned temperatures in
the range 260–400 K. Temperatures were swapped between two replicas with a prob-
ability related to their potential energy difference. Each replica was equilibrated and
10 ns production runs were performed. The advantage of replica-exchange MD over
constant-temperature MD was that it allowed the authors to determine the melting
curves of the duplex and provided distance distributions at eight different temperatures.
Also, the effect of parameters on the potential of mean force with varying
temperature was analyzed using a weighted histogram analysis method [59] and the
parameters were improved for the next iteration step.
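The temperature swaps used in replica-exchange MD can be sketched as follows. The text only states that the swap probability is related to the potential energy difference, so the code assumes the standard replica-exchange acceptance rule, min(1, exp[(β_i − β_j)(E_i − E_j)]) with β = 1/(k_B T). The geometric spacing of the eight temperatures between 260 and 400 K is also an assumption, as are all function names and energies.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def temperature_ladder(t_min, t_max, n_replicas):
    """Geometrically spaced temperatures (an assumed, commonly used spacing)."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio ** k for k in range(n_replicas)]


def swap_probability(e_i, e_j, t_i, t_j):
    """Probability of exchanging temperatures between replicas i and j.

    e_i and e_j are the current potential energies (kcal/mol) of the replicas
    simulated at temperatures t_i and t_j (K).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


ladder = temperature_ladder(260.0, 400.0, 8)

# The colder replica has the higher energy: the swap is always accepted.
p_favourable = swap_probability(-95.0, -100.0, ladder[0], ladder[1])
# The colder replica has the lower energy: the swap is accepted only sometimes.
p_unfavourable = swap_probability(-100.0, -95.0, ladder[0], ladder[1])
```

Because each replica visits many temperatures, a single run provides data across the whole temperature range, which is what makes melting curves accessible from one set of simulations.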
Next, to validate the model, the obtained FF parameter set was evaluated by per-
forming CG replica–exchange MD simulations and comparing them with the DNA
thermal denaturation experiments. In the simulation the melting and the formation of
the denaturation bubble were observed in accord with the reference data for varying
salt concentrations. Knotts et al. [56] show that with their FF they were able to predict
the melting temperatures of three DNA duplexes with an error lower than 5%. To
validate the mechanical properties of the model, a traditional CG MD simulation was
performed at 300 K. The persistence length for four different fragments of λ-phage plasmids
(one of them was 1489 base pairs and 0.5 µm long) was calculated. Their model
overestimated the persistence length by 2.3 but the authors claim that this is much
less than in other CG models. Based on their parameterization Knotts et al. suggest
that the dihedral force constant ($k_\phi$) and the potential energy well depths for base–pairing
($\varepsilon_{\mathrm{bp},ij}$), stacking and excluded volume ($E_{\mathrm{ex}}$) are the most important parameters to
tune.
The presented model was further improved. Sambriski et al. [121] added entropic
effects to the potential energy to allow for rehybridization of the DNA strands, as
the original model of Knotts et al. [56] was unable to model strands’ renaturation.
DeMille et al. [26] added explicit solvation with water as well as monovalent ions.
This modification provides a good cylindrical distribution of ions around DNA but
it over–estimates the DNA melting temperatures. Next, Freeman et al. [37] added to
the model terms for the interactions of DNA with both mono– and di–valent ions.
This model is one of the most comprehensive CG FFs among those presented in
this chapter. It can be used to estimate both DNA melting curves and DNA mechan-
ical properties. The subsequent modifications of this model add better treatment
of solvation and electrostatics. Nevertheless, there is still room for improvement,
especially to correct for high errors of the calculated persistence lengths.

5 RNA Folding with a Three–Bead Model

The model by Ding et al. [30] was designed to predict the tertiary structure of RNA
but may be also used to study the mechanism of RNA folding. This model is based
on discrete MD previously successfully applied to protein folding [18, 29]. In this
method, the interaction between beads is described using pairwise, discontinuous
functions (see Fig. 2d):

$$V_{\mathrm{bond}}(r) = \begin{cases} \infty, & r < r_1 \\ V_1, & r_1 < r < r_2 \\ V_2, & r_2 < r < r_3 \\ \;\vdots & \\ \infty, & r > r_{\max} \end{cases} \tag{17}$$

$$V_{\mathrm{nb}}(r) = \begin{cases} \infty, & r < r_1 \\ V_1, & r_1 < r < r_2 \\ V_2, & r_2 < r < r_3 \\ \;\vdots & \\ 0, & r > r_{\max} \end{cases} \tag{18}$$

Multiple–step distances $r_1, r_2, r_3, \ldots$ between beads are defined. If the distance
between two beads is between $r_1$ and $r_2$ their pair potential interaction energy has a
value of $V_1$; if this distance is between $r_2$ and $r_3$ the potential is assigned a different
value, $V_2$, etc. If the distance is smaller than a minimal distance, then an infinite
value of the potential is assigned to avoid overlapping. However, if the distance is
larger than some maximal value, there are two possibilities: the potential energy is
equal to 0 (if the interaction is considered “breakable”) or infinity (if the interaction
is considered “unbreakable”). The functions described by (17) and (18) cannot be
used in traditional MD because of their discontinuity, so Ding et al. [29, 30] have
chosen a different approach. In principle, the bead velocities are constant during the
dynamics and are changed only by colliding with other interacting centers. If the bead
kinetic energy is larger than the difference between the two energy steps $V_i - V_{i-1}$
and the distance is smaller than $r_i$, a collision can occur and velocities are updated.
Otherwise, a hard reflection occurs without any change in the potential energy. The
advantage of using this discrete MD method is its higher efficiency in comparison
to standard MD. In the latter each MD step requires recalculating the forces acting
on all atoms in the system and then solving the equations of motion. In the discrete
method, in the case of no collisions, one needs to update only the positions of the
beads, not velocities.
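The step potentials of (17) and (18) and the crossing rule described above can be written down compactly. The sketch below is not the implementation of Ding et al.; it only illustrates the lookup of a piecewise-constant energy and a simplified test of whether a pair of beads has enough kinetic energy to cross a step. The step positions, energies and function names are hypothetical.

```python
import bisect
import math


def step_potential(r, edges, values, breakable=True):
    """Piecewise-constant pair potential.

    edges  = [r1, r2, ..., rmax] in increasing order,
    values = [V1, V2, ...] with one energy per interval between edges.
    Beyond rmax the energy is 0 for a breakable interaction (Eq. 18)
    and infinite for an unbreakable one (Eq. 17).
    """
    if r < edges[0]:
        return math.inf  # hard-core overlap is forbidden
    if r >= edges[-1]:
        return 0.0 if breakable else math.inf
    return values[bisect.bisect_right(edges, r) - 1]


def crosses_step(radial_kinetic_energy, v_from, v_to):
    """True if the pair can move from energy level v_from to v_to.

    If this returns False the pair undergoes a hard reflection and the
    potential energy stays at v_from.
    """
    return radial_kinetic_energy > (v_to - v_from)


edges = [3.0, 4.0, 5.0, 6.0]   # Angstrom (hypothetical)
values = [-1.0, -0.5, -0.2]    # kcal/mol (hypothetical)

v_inner = step_potential(3.5, edges, values)
v_outer = step_potential(5.9, edges, values)
v_broken = step_potential(7.0, edges, values, breakable=True)
v_bonded = step_potential(7.0, edges, values, breakable=False)
v_overlap = step_potential(2.0, edges, values)
```

Between such events the beads move with constant velocities, so the simulation only has to find the next time at which some pair reaches one of its step distances.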
In this model single beads are assigned to a phosphate group (P), sugar (S) and
base (B) (see Fig. 4b). As in the model of Knotts et al. [56] for DNA, the sugar and
phosphate beads are placed in the centers of mass of these groups, and the base bead
is placed in the center of the six–membered ring. The intrastrand interactions contain
only the Vbond distance–dependent term and are a combination of unbreakable bonds
between the P, S and B beads. Since there are no explicit pseudo–angle and pseudo–
dihedral terms as in (3), additional bonds between the beads of two neighboring
nucleotides are added (e.g., a bond between the S bead of an (i − 1)th nucleotide
and a cytosine B bead of an ith nucleotide). The stacking interaction between the
bases is also implemented as a breakable bond (18) and designed in a way to provide
a correct angle between the three bases in one line.
The interstrand terms are composed of breakable bonds between complementary
nucleotides (also including the wobble pair G-U). A complementary pair is repre-
sented as three bonds: base–base and two sugar–base bonds. Such bonds are assigned
only if a correct (in the Watson-Crick sense) distance and orientation between the
sugar and base beads of both nucleotides is achieved. For loops, however, the reduction of
the degrees of freedom underestimates the entropy, so loop formation may be modeled
in an unphysical fashion. To represent loops better, loop formation
free energies are first calculated according to the nearest–neighbor model [81],
and for loops the interstrand bond is formed only with a probability based on this
free energy value.
The nonbonded interactions are implemented as follows. The phosphate-placed
beads repel each other by a discretized screened Coulomb potential (see Fig. 2c for
the Coulomb potential and Fig. 2d for the general discrete potential). The base-placed
beads are connected with an attractive force due to the hydrophobic nature of the
nucleotides. The attraction between bases may result in overpacking of the bases, so
there is an additional term which penalizes the bases with too many contacts in the
defined cut–off region.
The model of Ding et al. [30] was parameterized based on the thermodynamic
data from the nearest–neighbor model by Mathews et al. [81] and on distributions
calculated from known 3D RNA structures. It was next evaluated on 153 known
RNA structures of the lengths between 10 and 100 nucleotides. Their sequences
were used to create linear RNA molecules, which were simulated with the discrete
MD method [29] and their folding was analyzed. The so called Q-values, defined as
a fraction of native base pairs present in a given RNA conformation, were assessed.
The average Q-value for all the tested structures was 94%, which is 3% higher
than Mfold [153], a secondary structure prediction software (especially in the case
of pseudo–knots). 84% of RNA structures had a root mean square deviation from
the final reference structure lower than 4 Å, which is a good score. RNA folding
with this potential can be performed using the iFoldRNA web server [126]. The
performance of the model was assessed in the RNA Puzzle competition [89] in which
the participants are provided with a sequence and secondary structure of an RNA
whose crystal structure was solved but not yet released. According to the published
ranking, the pipeline involving the Ding et al. model provided the best solution for
one of the puzzles, i.e. the ydaO riboswitch structure (puzzle 12) [113].
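
The Q-value defined above can be illustrated with a short sketch. It assumes that base pairs are already available as pairs of nucleotide indices; how Ding et al. detect a formed base pair in a conformation is not specified here, and the function name and example pairs are hypothetical.

```python
def q_value(native_pairs, model_pairs):
    """Fraction of native base pairs that are present in a model conformation."""
    # A base pair (i, j) is the same contact as (j, i), so compare unordered pairs.
    native = {frozenset(pair) for pair in native_pairs}
    model = {frozenset(pair) for pair in model_pairs}
    if not native:
        raise ValueError("the native structure has no base pairs")
    return len(native & model) / len(native)


# Hypothetical hairpin: four native pairs, three of them recovered by the model.
native = [(1, 20), (2, 19), (3, 18), (4, 17)]
model = [(19, 2), (3, 18), (4, 17), (5, 16)]
print(q_value(native, model))  # 0.75
```

Non-native pairs in the model, such as (5, 16) above, do not lower the Q-value; they are simply ignored by this measure.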

6 RNA Thermal Unfolding and Stretching with a Three–Bead Model

Another example of a three–bead per nucleotide FF is given by Hyeon et al. [48, 49]
and is an extension of a model that was previously designed for protein folding [55].
This RNA FF was created to model mechanical unfolding of a particular
22-nucleotide long RNA hairpin (P5GA hairpin) with a known NMR structure [118].
This hairpin is structurally similar to another P5ab hairpin of group I intron in the
Tetrahymena thermophila ribozyme for which the force unfolding studies were per-
formed using optical tweezers [67, 68]. Hyeon et al. [48] compare their simulation
results of the P5GA hairpin to the ones from the above mentioned experiments. Their
CG model assigns beads to phosphate, sugar and base groups and places the beads in
the geometrical centers of these groups. To create the topology the authors used the
concept of the Go model [39] in which the interactions present in the native structure
are attractive and all the others are repulsive.
The intrastrand potential, similar to the one used by Hyeon et al. [55] for proteins,
is composed of three potential terms, as in (3), where the Vbond and Vangle terms
use a harmonic function (see (4) and Fig. 1a, and (5) and Fig. 1b) and Vdihedral is
implemented using the cosine potential (see (7) and Fig. 1c).
The interstrand potential is composed of a stacking term:

\[ V_{\mathrm{stack}} = \Delta G_i(T)\, F_{\mathrm{or}} , \tag{19} \]

where ΔG_i(T) are Turner’s parameters of the nearest–neighbor model [81] and F_or
is an orientation term that includes both the i : j and i + 1 : j−1 distances and the sugar and
base bead angles involving the i, i + 1, j and j−1 nucleotides (according to the i:j notation
shown in Fig. 3).
The nonbonded term is described using the Lennard–Jones potential (see (8) and
Fig. 2a), with separate formulas: V_nb^native for the interaction of beads forming the native
contacts (closer than 7 Å in the reference structure) and V_nb^non-native for the interactions
of non–native beads. The Debye–Hückel potential VPP describes the repulsion of phosphorus
beads (see (12) and Fig. 2c):

\[ V_{\mathrm{nb}} = V_{\mathrm{nb}}^{\mathrm{native}} + V_{\mathrm{nb}}^{\text{non-native}} + V_{\mathrm{PP}} . \tag{20} \]
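
The split into native and non–native bead pairs relies on a contact list built from the reference structure. A minimal sketch of such a list is given below, using the 7 Å cutoff quoted above. The function name and the `min_separation` argument (used here to skip directly bonded neighbors) are assumptions made for illustration and are not taken from the original work.

```python
import numpy as np


def native_contacts(coords, cutoff=7.0, min_separation=2):
    """Return index pairs (i, j), i < j, of beads closer than `cutoff` (in Å).

    `min_separation` is a hypothetical parameter: pairs with j - i below this
    value (e.g., directly bonded beads) are not considered.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(coords), k=min_separation)
    mask = dist[i, j] < cutoff
    return list(zip(i[mask].tolist(), j[mask].tolist()))


# Four beads on a line; only beads 0 and 2 are non-neighbors closer than 7 Å.
reference = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [6.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
print(native_contacts(reference))  # [(0, 2)]
```

Pairs returned by such a function would receive the attractive native term, and all remaining pairs the non–native one.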

The FF was first tested by performing MD simulations of the unfolded P5GA
hairpin structure, without force steering, to see if the structure converges towards
the NMR resolved one. By slow cooling, simulated annealing, and steepest–descent
minimizations, the RNA hairpin converged to the experimentally folded structure
with the root mean square deviation of 0.1 Å. Next, the dynamics of stretching of the
RNA P5GA loop was studied to calculate the phase diagrams of denaturation arising
from external force and temperature. Finally, in the simulation, the hairpin was pulled
and later refolded from an extended conformation using a force quench. These MD
simulations gave insight into the mechanism of force unfolding and refolding of the
P5GA loop.
This model was further used to investigate the folding of RNA pseudo–knots. In
the work by Cho et al. [19] the simulations of folding of three pseudo–knots (MMTV
and SRV-1 from viral genomes and hTR from human telomerase) were performed
and the folding mechanisms were consistent with experimental data. However, the
authors emphasized that even though these pseudo–knots are structurally similar,
their folding occurred through different scenarios. In the work by Biyun et al. [9]
further analysis of the hTR pseudo–knot folding was performed—the effects of the
ion concentration jumps and temperature decrease on folding were investigated giv-
ing a better understanding of the transient states and folding pathways.
The model, with some improvements [27], was further used in an ambitious study
on how Mg2+ ions stabilize the group I intron RNA [28]. The role of ions in folding
and function of catalytic RNAs is crucial but yet unsolved. The study involved mul-
tiple folding simulations of large RNAs surrounded by K+ , Mg2+ and Ca2+ ions so
it gives insight into the processes that cannot be easily accessed experimentally. Yet
this study also needs to be critically evaluated. To prove correctness of the approach,
the authors have shown that the simulation properly reproduced twelve binding sites
found in the reference X-ray structure of Azoarcus group I intron (PDB ID:1U6B) [1].
However, experimental crystal structures are not free of errors [24], and none of
the sites found in the 1U6B structure passed validation, e.g., with the CheckMyMetal
tool [151], suggesting that these might be misassigned monovalent
ions or water molecules [62]. This example further confirms that researchers should not take
experimental data for granted but should properly understand such data and know
their limitations.

7 DNA Nanodevices with a Three Collinear Bead Model

The purpose of this three-bead model designed by Ouldridge and coworkers was to
simulate the dynamics of DNA nanodevices [100, 102, 104]. The interactions in such
DNA nanostructures are based on selective binding of complementary nucleotide
pairs. DNA strands can be designed to form two dimensional lattices [80], poly-
hedra [40, 125] or other regular structures [116]. There are also DNA structures
in which the complementary hydrogen bonds are dynamically formed and broken.
Overall, one can design a set of interacting DNA strands with a particular purpose
in mind. A cycle based on single– to double–stranded DNA and reverse transitions
may be used to create DNA tweezers [150] or DNA walkers that perform a direc-
tional movement on a DNA track [5, 42, 99]. To simulate such devices a CG model
needs to correctly predict the complementary bond breaking and forming events.
To satisfy this crucial requirement Ouldridge and coworkers have chosen a top–
down methodology. In contrast to other models presented in this chapter, which are
designed by mapping the full–atomistic structure on a CG set of positions, this model
was designed in order to fit with the DNA hybridization and thermodynamic data. It
might appear strange that the model ignores such basic measures as different sizes
of DNA grooves. Its efficiency, however, is measured by the correspondence with
hybridization enthalpies and entropies. And as long as there is an agreement between
thermodynamic predictions and the 3D model, the model is considered acceptable
for a particular task it was designed for.

Fig. 6 Four base pair part of a helix in the Ouldridge et al. model [100]. Large beads represent the backbone sites. Small black beads represent the stacking sites and small white beads the base repulsion/hydrogen–bonding sites. In contrast to other presented FFs, this model does not provide a mapping function that links full–atomistic and coarse–grained structures, so the full–atom structure is not shown in the background

In this FF a nucleotide is modeled as three collinear beads (see Fig. 6). A single
bead mimics the position of the backbone and two beads represent a base: the
first one is responsible for stacking and the second one is responsible for hydrogen–bonding
and excluded volume interactions (see footnote 2). The distances between the backbone
bead and base sites and between two consecutive backbone sites were chosen to be
consistent with the geometry of the B–DNA helix. Since these three beads are always
collinear and their distances are kept constant, based on the number of degrees of
freedom we classify this model as a two bead one. The top–down methodology pre-
cludes direct transformation of a full–atomistic structure into the CG representation.
However, such a relationship is unnecessary because the model was not designed to
reproduce the results from more detailed methods. The (re)mapping is not required
for applying this CG model as long as one is interested solely in the dynamics of
DNA hybridization. Here, fidelity to the 3D structure is replaced by
adherence to the 2D hydrogen bond topology. Such a bonding network may be created
by a user or taken from the cadnano program [31], which facilitates the design of DNA
origami.
The presented CG FF is consistent with the general form given in (2). The potential
can be used in both Langevin MD and Virtual Move MC simulations [146]
(a variant of MC simulation by Whitelam et al. that models the system dynamics in time).
For efficient simulation with the latter method all interactions have to be pairwise, so the
authors included in the model only interactions between two nucleotides (treated as
rigid bodies).
The intrastrand interactions are modeled using three terms:

Vintrastrand = Vbond + Vstack + Vex , (21)

where the Vbond term, responsible for the interaction of two backbone beads, uses a
finitely extensible nonlinear elastic spring:

\[ V_{\mathrm{bond}} = -\frac{\varepsilon}{2} \ln\left( 1 - \frac{(r - r_0)^2}{\Delta^2} \right) , \tag{22} \]

where r0 is the equilibrium distance, Δ defines the range of acceptable deviations
from the equilibrium (for r < r0 − Δ or r > r0 + Δ the potential is infinite) and
ε sets the energy scale and controls the steepness of the potential, which diverges
as r approaches r0 − Δ or r0 + Δ. The stacking term, Vstack , is controlled by the
Morse potential (see Fig. 2b and (10)) and connects the stacking sites of the base. This
term is multiplied by numerous orientation terms that depend on mutual arrangement
of bases (see Fig. 7), e.g., preventing the formation of a left–handed helix (see [100,
104] for full equations). Finally, the excluded volume term Vex is responsible for the
interactions between the base repulsive site and the neighboring base backbone site
and is described by the repulsive part of the Lennard–Jones potential (see Fig. 2a for
r < σ and (8)).
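
To make the behavior of the finitely extensible nonlinear elastic spring (22) concrete, a minimal sketch is shown below. The parameter values in the example are hypothetical and are not the ones used by Ouldridge et al.

```python
import math


def fene_bond(r, r0, delta, eps):
    """Finitely extensible nonlinear elastic bond energy, as in (22)."""
    x = (r - r0) / delta
    if abs(x) >= 1.0:
        # Outside the allowed window r0 - delta < r < r0 + delta.
        return math.inf
    return -0.5 * eps * math.log(1.0 - x * x)


# Hypothetical parameters in reduced units.
r0, delta, eps = 0.75, 0.25, 2.0
for r in (0.75, 0.80, 0.95, 0.99, 1.00):
    print(r, fene_bond(r, r0, delta, eps))
```

Close to r0 the energy is approximately harmonic, (ε/2)(r − r0)²/Δ², and it grows without bound as the deviation approaches Δ, which prevents the backbone bond from breaking.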

Fig. 7 Topology of interactions presented in the Ouldridge et al. model. The upper part presents the stacking and non–bonded interactions. Middle left, middle right, and bottom left pictures show the angles that modulate the hydrogen bonding and stacking terms. The bottom right figure shows the topology of the excluded volume terms (Figure was taken from Ref. [100] and used with permission)

2 There is an earlier version of the model [104] in a four collinear beads variant, with separate beads for the base repulsion site and the base hydrogen–bonding site.

The interstrand potential is composed of two terms:

\[ V_{\mathrm{interstrand}} = V_{\mathrm{HB}} + V_{\text{cross-stacking}} , \tag{23} \]

where VHB is the hydrogen bonding and Vcross-stacking the cross–stacking potential
term. These interactions are calculated between all A and T bases and C and G bases
in the system (no secondary structure is supplied); therefore, cutoffs are applied. The
VHB term accounts for the interactions of hydrogen bonding sites of two complementary
bases and is implemented with the Morse potential (see Fig. 2b and (10)) with
orientation terms as in Vstack (see [100, 104] for full equations). The Vcross-stacking
term connects the stacking sites of a base and the neighbors of its complementary
counterpart (i.e., i : j + 1 and i : j−1 interactions). It is implemented with a harmonic potential
(see Fig. 1a and (4)) multiplied by additional orientation terms [100, 104].
Finally, the non-bonded term, Vnb , is an excluded volume potential, which is
implemented using the repulsive part of the Lennard–Jones potential (see Fig. 2a
and (8) for r < σ ). Vnb describes the interactions of the backbone site with the base
repulsion site, between the base repulsion sites, and between the backbone sites (but
not between the i:i+1 neighbors).
This FF was applied to simulate the dynamics of DNA tweezers [150]—a DNA
system with two arms that can acquire an open or a closed state, like real tweezers.
The transition between the two states is done by adding two short complementary
DNA fragments. These oligomers take part in a sequence of events—hybridization
and strand–breaking, but finally are removed from the system, with the tweezers
state altered. The model of Ouldridge et al. [102] was the first CG model applied to
DNA tweezers. CG Virtual Move MC simulations [146] helped to understand the
free energy changes related to the transition between an open and closed state, caused
among others by unfavorable opening up of a second single–stranded region when the
displacement begins. This CG model was also applied to simulate a DNA walker [5]
in which a short single–stranded DNA fragment moves over a longer strand—a track.
The CG Langevin MD simulations pointed to possible problems in this nanostructure,
e.g., the authors predicted that in some cases a backward movement of the walker
might occur. The CG simulations also suggested how to avoid this backward movement,
e.g., by applying a mechanical tension to the track. The Ouldridge et al. [102]
model was also used to simulate kissing loops [115] and Holliday junctions [101],
which are well known nucleic acid motifs.
The model of Ouldridge et al. [102] is an interesting approach to modeling nucleic
acids—its biggest advantage is a top–down design that sets thermodynamics above
structural fidelity. Although the model seems perfect for nanotechnological applica-
tions, in the current version it cannot be applied to biological problems. The structural
details that are not so important in the nanotechnology field, such as the major and
minor groove sizes (which in this model are of the same size), are fundamental for
DNA–protein interactions. Also, in the case of RNA, it would be problematic that
a starting structure for the CG simulation cannot be supplied from an external file
containing the coordinates of an already folded 3D RNA model and that there is no
treatment of non–canonical base pairs.

8 Thermal Denaturation of DNA with a Two–Bead Model

The two-bead per nucleotide model by Drukker et al. [32, 33] was designed to
describe thermal denaturation of DNA. The model was applied in nanomaterial sci-
ence and used to model DNA translocation in nanopores [110] and carbon nan-
otubes [152]. The CG beads are placed in the geometrical center of a backbone
group (sugar and phosphate) and a base (see Fig. 5b).
This CG FF uses the standard intrastrand potential scheme (3), where the pseudo–
bond Vbond , pseudo–angle Vangle and pseudo–dihedral Vdihedral potentials are harmonic
(see (4), (5), (6) and Fig. 1). In addition, i:i+2 bonds between the backbone beads
are added to the intrastrand potential to stabilize the backbone
helical conformation, since the Vangle and Vdihedral terms alone were insufficient.
In the interstrand potential, this model accounts for the chemical details of hydro-
gen bonding. The A–T pair is connected by two and the C–G pair by three bonds.
Each base can serve as a donor and an acceptor of a hydrogen bond. A and T are both
an acceptor and a donor of one bond. G is a donor of two bonds and an acceptor of
one. C is a donor of one and an acceptor of two. To assure that interstrand interactions
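are considered only between the correctly oriented beads, an angle θ_ij^HB is introduced.
This is the angle between the donor backbone, donor base and acceptor base beads.

\[ V_{\mathrm{interstrand}}(r, \theta^{\mathrm{HB}}) = V_{\mathrm{Morse}}(r) - V_{H2}(r)\, f(\theta^{\mathrm{HB}}) , \tag{24} \]

\[ V_{H2}(r) = \frac{1}{4} \left( \tanh[\lambda (r - r_2)] - 1 \right) , \tag{25} \]

\[ f(\theta_{ij}^{\mathrm{HB}}) = \begin{cases} \frac{1}{2} \left( \cos(\gamma\, \theta_{ij}^{\mathrm{HB}}) + 1 \right) & \theta_{\min} < \theta_{ij}^{\mathrm{HB}} < \theta_{\max} \\ 0 & \text{otherwise} \end{cases} . \tag{26} \]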

There are three parts of the potential. VMorse is a Morse potential (see (10) and Fig. 2b)
that stabilizes a bond between two complementary residues. VH2 mimics the solvent
effects, which stabilize the denatured state, and is a switch function (see Fig. 2e
for an example of a switch function), with the λ parameter controlling the steepness
at the switching distance r2 . The function f (θ_ij^HB) describes the effect of θ_ij^HB on the
total potential (it is nonzero only if θ_ij^HB is in the range θmin – θmax ). The interstrand potential
can be applied between any two complementary bases, so it does not depend on
an input secondary structure. The nonbonded potential is implemented using the
Lennard–Jones potential (see (8) and Fig. 2a).
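
The two auxiliary functions (25) and (26) can be evaluated with a few lines of code. The sketch below covers only these two functions; the Morse term (10) is not reproduced because its exact form is given elsewhere in the chapter. All parameter values in the example are hypothetical.

```python
import math


def solvent_switch(r, lam, r2):
    """Switch function V_H2(r) = (tanh[lam * (r - r2)] - 1) / 4, as in (25)."""
    return 0.25 * (math.tanh(lam * (r - r2)) - 1.0)


def angular_factor(theta, gamma, theta_min, theta_max):
    """Orientation factor f(theta) from (26); zero outside the allowed window."""
    if theta_min < theta < theta_max:
        return 0.5 * (math.cos(gamma * theta) + 1.0)
    return 0.0


# Hypothetical parameters: the switch is centered at r2 = 5 (distance units).
for r in (3.0, 5.0, 7.0):
    print(r, solvent_switch(r, lam=2.0, r2=5.0))
# Hypothetical angular window of +/- 1 rad around the ideal orientation.
for theta in (0.0, 0.5, 2.0):
    print(theta, angular_factor(theta, gamma=1.0, theta_min=-1.0, theta_max=1.0))
```

The switch changes smoothly from −1/2 at short distances to 0 at long distances and equals −1/4 exactly at r2, while the angular factor equals 1 for the ideal orientation and vanishes outside the allowed range.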
This FF was used in 75 ns-long MD simulations and correctly predicted the
melting temperatures of 10 base-pair DNA duplexes containing either A–T or C–G
pairs [32]. For the A–T and G–C duplexes, the calculated melting temperature error
was on average 6.5 and 18.5 K, respectively. The model was also shown to give correct
melting temperatures of these duplexes containing single mismatches. Introducing a
single G–G mismatch to the G–C duplex decreased the melting temperature by 21 K.
Such a decrease is consistent with the predictions from the thermodynamic models, but
no quantitative conclusions can be drawn because the thermodynamic models
give temperature shifts from 12 to 38 K.
This two–bead FF is useful in simulations in which the complementary bonds
need to be broken, such as in DNA melting. With fewer interaction sites it gives a
higher efficiency than three–bead models [56]. Unlike one–bead models [135], this CG FF
does not depend on a provided secondary structure. A two–bead model is the
minimal one in which base–base orientation terms, as in (24), can be introduced, and this
is a necessary condition to determine the presence of a hydrogen bond.

9 One–Bead Model for Linear and Circular DNA

Trovato and Tozzini [135] designed a one–bead model for MD simulations of linear
and circular DNA duplexes and parameterized it to account also for temperature
effects. The nucleotide bead is placed in the position of the phosphorus atom (see
Figs. 3 and 5c). This model was also modified for RNA helices using an automatic
parameterization method based on an evolutionary algorithm [63, 64].
The sum of standard terms as in (3) forms the intrastrand potential. The pseudo–
bond Vbond , pseudo–angle Vangle , and pseudo–dihedral Vdihedral potentials have the
harmonic functional form (see (4), (5), (6) and Fig. 1).
The interstrand potential is added based on the information about the secondary
structure. This potential has a specific topology (see Fig. 8). For a complementary pair
i:j the following pseudo–bonds are created: i:j, i:j+1, i+1:j+1. The term is composed
of Morse potentials with a barrier function (for the graph of the potential function see Fig. 2e):

\[ V_{\mathrm{interstrand}} = V_{i:j} + V_{i:j+1} + V_{i+1:j+1} , \tag{27} \]

\[ V_{i:j}(r) = V_0^{i:j} \left( \left[ 1 - \exp\left( -\alpha_{i:j} (r_{kl} - r_0^{i:j}) \right) \right]^2 - c_{i:j} \right) sw_{i:j}(r_{i:j}) , \tag{28} \]

\[ sw_{i:j}(r) = \frac{1}{2} V_1^{i:j} \left[ 1 - \tanh\left( \lambda_{i:j} (r - r_1^{i:j}) \right) \right] , \tag{29} \]

where V_0^{i:j}, r_0^{i:j} and α_{i:j} control the shape of the original Morse potential, c_{i:j} affects
the energy difference between the energy minimum and the unbound state, λ_{i:j} controls
the slope of the switch function, V_1^{i:j} controls the switch function energy difference,
and r_1^{i:j} the position of the switch. Equations (28) and (29) are identical for i:j+1 and
i+1:j+1. Even though the formula seems complicated, it is advantageous: the Morse
function (28) enables accounting for the breaking of hydrogen bonds and the switch
function (29) adds a barrier for long–range electrostatic repulsion.

Fig. 8 One–bead representation of DNA. The interstrand interaction topology for a single complementary pair is shown according to the model of Trovato and Tozzini [135]
A similar formula is used for the nonbonded potential:

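\[ V_{\mathrm{nb}}(r) = V_0^{\mathrm{nb}} \left( \left[ 1 - \exp\left( -\alpha_{\mathrm{nb}} (r_{kl} - r_0^{\mathrm{nb}}) \right) \right]^2 - c_{\mathrm{nb}} \right) \left( 1 + 2\, sw_2^{\mathrm{nb}}(r_{kl}) \right) sw_1^{\mathrm{nb}}(r_{kl}) + 2 A_{\mathrm{nb}}\, sw_2^{\mathrm{nb}}(r_{kl}) , \tag{30} \]

\[ sw_q^{\mathrm{nb}}(r) = \frac{1}{2} V_q^{\mathrm{nb}} \left[ 1 - \tanh\left( \lambda_q^{\mathrm{nb}} (r - r_q^{\mathrm{nb}}) \right) \right] , \tag{31} \]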
where the Anb parameter controls the addition of a second switch function and thus
affects the slope of the “unbound” side of the barrier. Other labels are consistent with
(28) and (29), but since two switch functions are used in (30), the index q in (31)
denotes the first and the second switch. The authors found that this formula provides
for the stabilization of DNA grooves. Since both interstrand and nonbonded potential
formulas are computationally expensive, the energy (and force) can be precomputed
for a range of distances. Next, their value at a given bead distance, which is between
two precomputed points, is interpolated. This procedure saves a lot of time in contrast
to calculating exponential and hyperbolic tangents in each simulation step for each
pair of beads connected by interstrand or nonbonded interactions.
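
The precompute-and-interpolate procedure described above can be sketched as follows. The example tabulates a Morse potential with a barrier of the form (28)–(29), written as a function of a single distance r, on a uniform grid and then uses linear interpolation. The parameter values, the grid range and spacing, and the choice of linear interpolation are assumptions made for illustration; they are not taken from Trovato and Tozzini.

```python
import numpy as np


def morse_with_barrier(r, v0, alpha, r0, c, v1, lam, r1):
    """Morse term multiplied by a tanh switch, following the form of (28)-(29)."""
    morse = v0 * ((1.0 - np.exp(-alpha * (r - r0))) ** 2 - c)
    switch = 0.5 * v1 * (1.0 - np.tanh(lam * (r - r1)))
    return morse * switch


# Hypothetical parameters (arbitrary energy units, distances in Å).
params = dict(v0=1.0, alpha=1.5, r0=6.0, c=1.0, v1=1.0, lam=2.0, r1=9.0)

# Precompute the energy once on a uniform grid with 0.005 Å spacing.
r_grid = np.linspace(5.0, 15.0, 2001)
e_table = morse_with_barrier(r_grid, **params)


def tabulated_energy(r):
    """Energy at distance(s) r obtained by linear interpolation in the table."""
    return np.interp(r, r_grid, e_table)


r_test = np.array([5.75, 6.0, 7.3, 9.1, 12.0])
print(np.abs(tabulated_energy(r_test) - morse_with_barrier(r_test, **params)).max())
```

During a simulation only `tabulated_energy` would be called, so no exponential or hyperbolic tangent is evaluated per pair and per step. The forces can be tabulated in the same way, and the grid spacing sets the interpolation error.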
First, the potential was parameterized based on the potential of mean force, cal-
culated from the experimentally derived 3D structures containing DNA helices. Sec-
ond, the potential was tuned to match the experimental melting temperatures. The
authors validated their CG FF by performing MD simulations of 92 base-pair DNA
nano–circles with different twist angles. The effects of the initial twist angle on the
nano–circle topology were in agreement with full–atomistic simulations [44, 60].
Next, the authors showed the results of CG MD simulations of a DNA plasmid composed
of 861 base pairs (approx. 0.3 µm circumference length) on a microsecond time
scale. These MD simulations show that modification of a torsional stress affects the
stability of the plasmid and allows forming a denaturation “bubble” [135].
The potential was further extended by us to RNA molecules [63–65]. For an RNA
helix we have shown that, if thermal melting of helices is not of interest, the potential
performs equally well with a harmonic potential for Vi: j (r ), Vi: j+1 (r ) and Vi+1: j+1 (r ),
while the nonbonded Vnb (r ) can be simply substituted with Coulomb electrostatics.
While such simpler potentials are less precise in describing the physics of RNA, they
are more practical: finding robust parameters is considerably easier. In a later
study on an RNA hairpin [65] we have shown that making the nucleotide parameters
dependent on the secondary structure improves the fidelity of an RNA simulation.

10 One–Bead DNA Model Derived with the Renormalization Group Method

This model of Savelyev et al. [122, 123] is more a parameterization method than a
model to study the dynamics of DNA. The authors use a renormalization group
optimization method, developed by Swendsen [132] and further improved by Lyubartsev
and Laaksonen [72], to find the best parameters of a DNA one–bead FF. For
the renormalization group method, categorized as a local optimization method, the
potential energy function V has to be a linear combination of terms, V = Σ_{i}^{N} k_i V_i ,
with a set of linear combination parameters ki . In addition, a set of observables Sj
that characterize a CG FF has to be defined. These observables must depend on the
selected ki parameters in the potential energy expansion. The aim of the optimization
is to find a set of ki that results in Sj values which best resemble the reference data. The
observables used by Savelyev et al. were distance distributions, with reference values
taken from full–atomistic simulations. In the parameterization procedure one creates
a set of ki parameters and calculates the “susceptibility” of the observables to each
parameter. This susceptibility is expressed as a partial derivative of
an Sj observable with respect to a ki parameter. Next, these derivatives are used to calculate
the corrections to the parameter set. This method allows for an objective and effective
parameterization; however, it is only applicable to parameters that enter the potential
linearly. This means that if the methodology was applied to a harmonic potential ki (r − r0 )2 , it
could find an optimal value of the ki force constant but not the equilibrium distance
r0 .
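
The correction step described above can be sketched on a toy problem. The sketch assumes that each observable Sj is the ensemble average of the term Vj conjugate to the parameter kj; under this assumption the susceptibility follows from the Boltzmann average as ∂⟨Sj⟩/∂ki = −β(⟨Sj Si⟩ − ⟨Sj⟩⟨Si⟩). The example is a one–dimensional spring with V = k x², not a DNA model, and all function names and numerical settings are hypothetical.

```python
import numpy as np


def parameter_correction(s_samples, s_target, beta=1.0):
    """One linear-response update of the parameters k_i.

    s_samples: array (n_frames, n_terms) with the values of the terms V_i
               sampled in a CG simulation run with the current k_i.
    s_target:  reference averages of the same terms.
    """
    s_samples = np.asarray(s_samples, dtype=float)
    if s_samples.ndim == 1:
        s_samples = s_samples[:, None]
    s_mean = s_samples.mean(axis=0)
    centered = s_samples - s_mean
    covariance = centered.T @ centered / len(s_samples)
    susceptibility = -beta * covariance  # d<S_j>/dk_i
    delta_s = np.asarray(s_target, dtype=float).reshape(-1) - s_mean
    return np.linalg.solve(susceptibility, delta_s)


def fit_toy_spring(target_mean_x2, k_start=1.0, n_iter=10, n_samples=200_000, seed=0):
    """Iterate the update for V = k * x**2 at beta = 1 (toy stand-in for a CG run)."""
    rng = np.random.default_rng(seed)
    k = k_start
    for _ in range(n_iter):
        # Boltzmann sampling of exp(-k x^2): a Gaussian with variance 1 / (2 k).
        x = rng.normal(scale=np.sqrt(1.0 / (2.0 * k)), size=n_samples)
        k += parameter_correction(x ** 2, [target_mean_x2])[0]
    return k


# The target <x^2> = 0.25 corresponds to k = 2 for this toy potential.
print(fit_toy_spring(0.25))
```

In a real application the Gaussian sampling would be replaced by a CG MD run, and the update is often damped to keep the iterations stable. Only the force constant can be adjusted in this way, which illustrates the limitation to linear parameters.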
To show the applicability of the renormalization group optimization, Savelyev et
al. [122] validate it on a one-bead CG FF of DNA. In the model the pseudo–atoms
are placed in the geometrical center of a nucleotide (see Fig. 5d). The FF uses only
pseudo–bond and pseudo–angle terms, omitting the pseudo–dihedral term. Each of these
terms is a sum of harmonic, cubic and quartic contributions that account for the
anharmonicity of bonds (see Fig. 1a):

\[ V_{\mathrm{intrastrand}} = V_{\mathrm{bond}} + V_{\mathrm{angle}} , \tag{32} \]

\[ V_{\mathrm{bond}}(r) = k_{r2} (r - r_0)^2 + k_{r3} (r - r_0)^3 + k_{r4} (r - r_0)^4 , \tag{33} \]

\[ V_{\mathrm{angle}}(\theta) = k_{\theta 2} (\theta - \theta_0)^2 + k_{\theta 3} (\theta - \theta_0)^3 + k_{\theta 4} (\theta - \theta_0)^4 . \tag{34} \]

The interstrand terms are implemented using the so called “fan” interactions.
The name originates from their topology (see Fig. 9) because they explicitly connect
a nucleotide bead with eleven beads on the opposite strand. Fan interactions are
thus the i:j−5 to i:j+5 interactions in the previously introduced notation (see Fig. 3).
These interactions are implemented in the same way as the Vbond interactions, i.e., as a
combination of harmonic, cubic and quartic terms (see Fig. 1a):

\[ V_{\mathrm{fan}} = \sum_{-6<m<6} \left( k_2 (r_{i:j+m} - r_0)^2 + k_3 (r_{i:j+m} - r_0)^3 + k_4 (r_{i:j+m} - r_0)^4 \right) . \tag{35} \]

Fig. 9 A DNA helix in a one–bead representation with the beads placed in the geometrical center of each base. The cartoon representation in the background shows the positions of the phosphate backbone (ribbon) and bases. The bonds between the beads represent the “fan” interactions, as defined in the Savelyev et al. [122] model. These interactions connect the nucleotide corresponding to bead i with eleven nucleotide beads from j−5 to j+5 on the complementary strand
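
The fan sum (35) for a single bead can be sketched as below. The sketch assumes a separate parameter set (r0, k2, k3, k4) for every offset m and skips partners beyond the end of the strand; both choices, as well as the geometry used in the example, are assumptions made for illustration.

```python
import numpy as np


def anharmonic(r, r0, k2, k3, k4):
    """Sum of harmonic, cubic and quartic terms, as in (33) and (35)."""
    d = r - r0
    return k2 * d ** 2 + k3 * d ** 3 + k4 * d ** 4


def fan_energy(strand_a, strand_b, i, j, params):
    """Fan energy of bead i on strand A with beads j-5 ... j+5 on strand B.

    params maps the offset m (-5 ... 5) to a hypothetical tuple (r0, k2, k3, k4).
    """
    total = 0.0
    for m in range(-5, 6):
        partner = j + m
        if 0 <= partner < len(strand_b):  # skip partners beyond the strand ends
            r = np.linalg.norm(strand_a[i] - strand_b[partner])
            total += anharmonic(r, *params[m])
    return total


# Toy geometry: one bead facing a straight row of eleven beads 3 Å away.
strand_a = np.array([[0.0, 0.0, 0.0]])
strand_b = np.array([[float(m), 0.0, 3.0] for m in range(-5, 6)])
params = {m: (float(np.hypot(m, 3.0)) - 1.0, 1.0, 0.0, 0.0) for m in range(-5, 6)}
print(fan_energy(strand_a, strand_b, 0, 5, params))  # close to 11: each term has r - r0 = 1
```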

For both the intra– and interstrand potentials, the CG equilibrium distances and
angles, as well as the starting values of the force constants, are found by matching
the reference distance distributions. The reference distributions are obtained from
60 ns full–atomistic MD simulations of a 16 base-pair DNA duplex performed with
the AMBER parmbsc0 FF [107]. The CG force constants are optimized based on
the difference between the distance distributions from 20 ns CG MD simulations (preceded by
5 ns of heating and 10 ns of equilibration) and from the reference full–atomistic MD
simulation.
In their first paper Savelyev et al. [122] model the nonbonded interactions using
the following equation to match the potential of mean force:

\[ V_{\mathrm{el}}(r) = A \frac{\exp(-r/k_D)}{r^4} + \frac{q_i q_j \exp(-(r - r_{\mathrm{bead}})/k_D)}{4\pi \varepsilon_0 \varepsilon_w r (1 + r_{\mathrm{bead}}/k_D)} . \tag{36} \]

In the next publication [123] of this group, another expression for this potential was
found, which better describes the interaction of DNA beads with ions:

\[ V_{\mathrm{el}}(r) = \frac{A}{r^{12}} + \sum_{k=1}^{3} B_k \exp(-C_k (r - R_k)) + \frac{q_i q_j}{4\pi \varepsilon_0 \varepsilon_w r} , \tag{37} \]

where A, Bk , Ck and Rk are adjustable parameters.


According to the authors, the optimized potential results in a radial distribution
function that is consistent with the full–atomistic simulation. Even though the calculated
DNA persistence length is overestimated nearly two times, it is in agreement
with the results from the full–atomistic model used for parameterization [84]. If the force
constants are rescaled by a factor of 0.7, the DNA persistence length differs by less
than 10% from the experiments. This error is much lower than the one obtained by
Knotts et al. [56] for a three–bead model (an error higher than 200% was considered
good enough in that work). However, Knotts et al. modeled a much longer 1489
base-pair DNA and Savelyev et al. tested fewer than 100 base pairs. Also, Knotts et
al. tested a larger variety of DNA chain lengths and sequences. The difference in
the persistence length in the model of Savelyev et al. is attributed to the widening of the
“fan” interactions, which allows for larger internal fluctuations. Savelyev et al. also estimated
the effects of NaCl salt concentration on supercoiling of a 90–base pair DNA
plasmid but made only qualitative conclusions.
This work is a success of automatic optimization methods for CG FFs. By performing
a systematic procedure, the authors obtained a reasonable one–bead FF for DNA.
However, in comparison with the other DNA FFs (e.g., Trovato and Tozzini [135]),
the model of Savelyev et al. [122] contains many more potential terms and parameters.
It requires 10 quartic, cubic and harmonic terms per nucleotide
pair, whereas the model of Trovato and Tozzini [135] requires only three Morse with a
barrier terms. Since the Savelyev et al. [122] algorithm cannot cope with non–linear
parameters, the natural strategy to improve the DNA model would be
to add new terms. However, this would result in a much slower performance of the
method.

11 One–Bead Model for RNA Structure Prediction from Tertiary Contacts

The Nucleic Acid Simulation Tool (NAST) by Jonikas et al. [53] was designed to
generate candidate tertiary structures of RNAs in order to solve the RNA structure
prediction problem. This is the only FF presented in this chapter that is not intended to
analyze the internal motions of nucleic acids. MD is only used as a means of sampling
the conformational space. The generated trajectories do not serve to understand the
RNA folding process. Only the final 3D RNA structure is of value. Also, NAST was
not designed to perform the tertiary structure prediction from scratch, like the model
by Ding et al. [30]. In advance, one has to provide the RNA secondary structure (from
a secondary structure prediction software [82]) and information about a few tertiary
contacts (from chemical or spectroscopic methods [38, 88, 120, 138]). Even though
NAST imposes these a priori requirements, it is useful. For example, for RNAs which
are difficult to crystallize but were preliminarily studied with some chemical methods,
NAST can quickly perform a wide search of possible conformations and generate
multiple candidate structures for further studies. The quality of these structures is
assessed using pairwise distance distributions from small angle
X–ray scattering experiments, solvent accessibility data, and the NAST energy tool.
The RNA structures that score best can be later tested with full–atomistic models.
In this model each RNA nucleotide is represented by a single bead, centered on the C3’ atom
of the sugar group (see Figs. 4c and 5e). The total potential energy of RNA is a sum
of four terms:
Vtotal = Vintrastrand + Vinterstrand + Vtertiary + Vnb (38)

where Vintrastrand , Vinterstrand and Vnb are consistent with (2) and Vtertiary is a restraint
for tertiary RNA contacts.
The intrastrand potential is a sum of three terms previously shown in (3). The
pseudo–bond Vbond and pseudo–angle Vangle potentials use harmonic functions
(see (4), (5) and Fig. 1a, b), but the Vangle term uses two different force constants, kθ ,
for single– and double–stranded regions. The pseudo–dihedral potential Vdihedral is
implemented using a cosine potential (see (7) and Fig. 1c).
The interstrand potential is composed in a similar manner as the intrastrand one,
of bonded (Vbond , (4)), angle (Vangle , (5)) and dihedral (Vdihedral , (7)) terms. The
pseudo–bonds connect only complementary base pairs, the pseudo–angle is between
a complementary bond and the next neighbor on the first strand, and pseudo–dihedrals
are the following: j−1:j:i:i+1 and j+1:i−1:i:i+1 (for details of the i, j notation see
Fig. 3).
The nonbonded interactions, Vnb , are implemented using a repulsive–only potential

$$V_{rep}(r) = 4V_0 \left(\frac{\sigma}{r}\right)^{12}. \tag{39}$$
The user-supplied tertiary interactions (Vtertiary ) are implemented as pseudo–bonds
with the assumption that the nucleotides are close to each other in the final structure
(see (4)). If a particular tertiary contact is uncertain, the authors recommend using a
much smaller force constant for that contact.
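As a rough illustration of the two terms just described, the following Python sketch evaluates the repulsive potential of (39) and a set of user-supplied tertiary restraints. It is not the NAST implementation: the function names, the harmonic prefactor convention k(r − r0)², the target distance r0 and the force constants are assumptions chosen for illustration only.

```python
import math

def v_rep(r, v0, sigma):
    """Repulsive-only nonbonded term of Eq. (39): 4 * V0 * (sigma / r)**12."""
    return 4.0 * v0 * (sigma / r) ** 12

def v_harmonic(r, k, r0):
    """Harmonic pseudo-bond; the k * (r - r0)**2 prefactor convention is assumed."""
    return k * (r - r0) ** 2

def tertiary_energy(coords, contacts, r0, k_confident, k_uncertain):
    """Sum harmonic restraints over user-supplied tertiary contacts.

    contacts holds (i, j, is_confident) tuples; uncertain contacts get the
    much smaller force constant, as recommended in the text.
    """
    total = 0.0
    for i, j, is_confident in contacts:
        k = k_confident if is_confident else k_uncertain
        total += v_harmonic(math.dist(coords[i], coords[j]), k, r0)
    return total

# Hypothetical bead positions (in Angstrom) and one contact between beads 0 and 1.
coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
e_confident = tertiary_energy(coords, [(0, 1, True)], r0=8.0,
                              k_confident=2.0, k_uncertain=0.2)
e_uncertain = tertiary_energy(coords, [(0, 1, False)], r0=8.0,
                              k_confident=2.0, k_uncertain=0.2)
```

With these illustrative numbers the same violated contact costs ten times less energy when it is flagged as uncertain, so it steers the search without dominating it.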
NAST is a knowledge–based model. It was parameterized by fitting the presented
potential functions to the Boltzmann inversion of distance distributions from three
high resolution ribosome structures. The authors tested 3D structure predictions for
tRNA and a medium–sized P4–P6 RNA, obtaining root mean square deviations from
the crystal structures of 8 and 16.3 Å, respectively. Another measure, the GDT–TS
score (the average percentage of residues that are within 1, 2, 4 and 8 Å of their
reference position), was equal to 0.2 and 0.06. These deviations are larger than the ones
obtained by Ding et al. [30], where the root mean square deviation was on average
less than 4 Å. This difference is attributed to the smaller number of details that can
be captured in the one–bead NAST FF in comparison with the three–bead FF of Ding
et al. [30].
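The two quality measures quoted above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used by the authors: it assumes that the model and reference beads are already superimposed and listed in the same order, whereas a full GDT–TS calculation searches for the best superposition at each cut–off. The function names and coordinates are hypothetical.

```python
import math

def rmsd(model, reference):
    """Root mean square deviation between two pre-superimposed bead lists."""
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(reference))

def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average fraction of beads within each cut-off of the reference position."""
    distances = [math.dist(a, b) for a, b in zip(model, reference)]
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return sum(fractions) / len(cutoffs)

# Hypothetical four-bead example: each model bead is displaced along y.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
offsets = [0.5, 1.5, 3.0, 10.0]
model = [(x, y + dy, z) for (x, y, z), dy in zip(reference, offsets)]

example_rmsd = rmsd(model, reference)
example_gdt = gdt_ts(model, reference)
```

In this toy case a single badly placed bead dominates the root mean square deviation, while the cut–off based score still credits the three beads that are close to their reference positions.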
NAST relies on the provided secondary structure and tertiary contacts. The output
3D prediction is the one that best satisfies the applied restraints. In the more
complex model of Ding et al. [30], in which the mutual base orientation is present,
the tertiary interactions can be predicted. However, the predictions of the NAST
one–bead model are useful because they still give a lower deviation from the crys-
tal structures than a random structure of the same sequence and a similar radius of
gyration. This model could be useful, e.g., for rebuilding missing loops.
NAST also provides a supplementary tool, C2A, to rebuild a full–atomistic struc-
ture from a CG model [52]. The remapping is performed by finding short fragments
of a matching sequence in the 3D structure of the ribosome and then building and
optimizing the user’s RNA structure. However, the remapping is only as good as the given CG
model. Two measures were used to assess the quality of remapping: the root mean square
deviation and the interaction network fidelity (INF) score, which measures the number
of correctly found base pairs and stacking interactions [105]. If a tRNA [131] PDB structure
is reduced to a CG representation and then used as a template for C2A rebuilding,
a root mean square deviation of 2.81 Å and an INF score of 69% are achieved.
However, if the best NAST-predicted tRNA structure is used, the root mean square
deviation becomes 8.30 Å and the INF score drops to 46% (35% if only base pairing
is taken into account in the INF score).
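The INF score is described here only qualitatively. A minimal Python sketch follows, assuming the commonly used definition of INF as the geometric mean of precision and sensitivity over the sets of interactions; this formula is an assumption, since it is not spelled out in this chapter. The function name and the example pair lists are hypothetical.

```python
import math

def inf_score(reference_pairs, model_pairs):
    """Geometric mean of precision and sensitivity over interaction sets.

    Each interaction is an (i, j) tuple of residue indices; order is ignored.
    """
    reference = {frozenset(pair) for pair in reference_pairs}
    model = {frozenset(pair) for pair in model_pairs}
    true_positives = len(reference & model)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(model)
    sensitivity = true_positives / len(reference)
    return math.sqrt(precision * sensitivity)

# Hypothetical base-pair lists for a short hairpin.
reference_pairs = [(1, 10), (2, 9), (3, 8), (4, 7)]
model_pairs = [(1, 10), (9, 2), (3, 7)]
example_inf = inf_score(reference_pairs, model_pairs)
```

Restricting the input to base pairs only, or adding stacking interactions to both lists, gives the two variants of the score quoted above.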

12 One–Bead Model for Large RNAs

Contrary to the previous NAST model, this one was designed for large RNAs and
ribonucleoprotein particles, namely the 30S ribosomal subunit, which contains an RNA
chain over 1500 nucleotides long (16S RNA). The model is based on previous ones
for supercoiled DNA [134] and ribosomal RNA [22, 78, 79] and can be classified as
either a one–bead or a one–and–a–half–bead per residue model. The residues are represented
as beads centered on a P atom (for nucleic acids) or Cα atom (for proteins). In order
to achieve the correct helical conformation, additional space–filling dummy beads
(“X-atoms”) are placed in the geometrical center of complementary base pairs (apart
from the last base pair in the helix, see Figs. 5f and 10).
The intrastrand potential is a sum of standard terms shown in Eq. 3 with Vbond ,
Vangle , and Vdihedral calculated with harmonic functions ((4), (5), (6), and Fig. 1a, b, c).
The pseudo–bond and pseudo–angle force constants are higher in helical regions than
in non–helical ones.
The interstrand potential is also composed of pseudo–bonded, pseudo–angle and
pseudo–dihedral terms with the same definitions of potentials as in (4), (5) and
(6). Here, the pseudo–bonds connect the nucleotide beads with the corresponding
X-atoms and complementary bases (i:j+1 and i:j−1 configurations). The pseudo–
angle interactions are present in the P–X–P configuration along a complementary
bond. There is also a dihedral angle connecting i−1:i:j:j+1 (see Fig. 10). Proteins are
modeled as simple elastic networks where the harmonic pseudo–bonds are created
between all protein beads that are closer than 8 Å to each other.
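The elastic network used for the proteins can be sketched as follows. This is only an illustration of the construction rule stated above (harmonic pseudo–bonds between all beads closer than 8 Å); the function names, the force constant and the k(r − r0)² prefactor convention are assumptions.

```python
import math
from itertools import combinations

def elastic_network(coords, cutoff=8.0):
    """List (i, j, d0) for every bead pair closer than cutoff in the reference."""
    bonds = []
    for i, j in combinations(range(len(coords)), 2):
        d0 = math.dist(coords[i], coords[j])
        if d0 < cutoff:
            bonds.append((i, j, d0))
    return bonds

def network_energy(coords, bonds, k):
    """Harmonic energy of a conformation relative to the reference distances."""
    return sum(k * (math.dist(coords[i], coords[j]) - d0) ** 2
               for i, j, d0 in bonds)

# Hypothetical three-bead protein fragment (coordinates in Angstrom).
reference = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
bonds = elastic_network(reference)

# Displace the middle bead by 1 Angstrom and evaluate the energy penalty.
deformed = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
penalty = network_energy(deformed, bonds, k=1.5)
```

Because the equilibrium distances are taken from the reference coordinates, the energy is zero in the reference structure and grows with any deformation of the network.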
The nonbonded potential consists of restraints on the helix–helix and protein–RNA
distances (Vrestr ) and a volume exclusion term (Vexcl ) among P-, X- and C-atoms

Vnb = Vrestr + Vexcl , (40)


Fig. 10 Topology of interstrand bonds in the CG model of Malhotra et al. [22, 78, 79] showing the
placement of beads on the phosphorus P atoms. The central nucleotide pair is represented with two
pseudo–atoms: in the position of the P atom and the dummy X atom in the middle. Two neighboring
base pairs are presented only using P atoms. There are 6 pseudo–bonds associated with this i:j pair
(two P–X in the middle and four P–P bonds) and a pseudo–angle P–X–P



$$V_{restr}(r) = \begin{cases} k_2 (r - r_2)^2 & r < r_2 \\ 0 & r_2 < r < r_3 \\ k_3 (r - r_3)^2 & r_3 < r < r_4 \\ k_3 \left( a + \dfrac{b}{r - r_3} \right) & r > r_4 \end{cases} \tag{41}$$

$$V_{excl}(r) = K_{excl} (r - d_0)^2, \quad r < d_0. \tag{42}$$

For graphical representation of distances r2 , r3 , r4 see Fig. 2f. The constants are
a = 3(r4 − r3)² and b = −2(r4 − r3)³. k3 is a constant describing the steepness of the restraint potential
for r > r3 . The Vrestr restraints, presented in (41), are applied to all P–Cα pairs
that lie within 10 Å in the reference structure. For helices, these restraints are also
applied to non–canonical base pairs (i.e., nucleotides hydrogen bonded in a way other than
Watson–Crick pairing). However, the ones explicitly enumerated by Wimberly
et al. [147], in the paper describing the crystal structure of the Thermus thermophilus 30S
ribosomal subunit, are considered in the interstrand potential on the same basis as
the Watson–Crick ones. The others, i.e., all P–P pairs within a 6 Å cut–off distance that
are not already connected, are included in the restraints term. This term gives some
freedom of movement between the r2 and r3 distances (which are set independently
for each type of atom pair); however, movement outside
of this range is penalized (see Fig. 2f). Therefore, this restraint term generates a bias toward a
starting structure. The space exclusion term, Vexcl , prohibits two nucleotide beads
from getting closer to each other than d0 .
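The piecewise restraint and the exclusion term can be sketched in Python as below. The sketch assumes that the large–distance branch of (41) has the form k3 (a + b/(r − r3)); with the stated a and b this is the reading that keeps both the energy and its slope continuous at r4, but it should be checked against the original publications. All numerical values are hypothetical.

```python
def v_restr(r, r2, r3, r4, k2, k3):
    """Flat-bottomed restraint of Eq. (41) with a softened tail beyond r4."""
    if r < r2:
        return k2 * (r - r2) ** 2
    if r <= r3:
        return 0.0
    if r <= r4:
        return k3 * (r - r3) ** 2
    a = 3.0 * (r4 - r3) ** 2
    b = -2.0 * (r4 - r3) ** 3
    # Assumed form of the last branch; it levels off at k3 * a for large r.
    return k3 * (a + b / (r - r3))

def v_excl(r, d0, k_excl):
    """Volume exclusion of Eq. (42): penalizes only distances below d0."""
    return k_excl * (r - d0) ** 2 if r < d0 else 0.0

# Hypothetical parameters (distances in Angstrom, arbitrary energy units).
params = dict(r2=4.0, r3=6.0, r4=8.0, k2=1.0, k3=0.5)
inside_well = v_restr(5.0, **params)
at_r4 = v_restr(8.0, **params)
just_beyond_r4 = v_restr(8.0 + 1e-9, **params)
far_away = v_restr(1.0e9, **params)
```

Under this assumption the penalty is zero between r2 and r3, quadratic up to r4, and then approaches the constant k3·a, so distant pairs are pulled together without producing very large forces.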
This is also a knowledge–based potential. The crucial parameters of the model,
i.e., the parameters of the protein–RNA distance restraints, are taken from the high resolution
Thermus thermophilus 30S ribosomal subunit structure. Other parameters are
taken from lower resolution ribosome models and/or older models [78]. The force
constants k2 and k3 are optimized to maintain the crystal structure of the 30S subunit
at room temperature, while allowing for flexibility of the free 16S ribosomal RNA.
This model was designed and applied to study the assembly of proteins to 16S
RNA of the small ribosomal subunit. Stagg et al. [130] explored one of the assembly
paths using the MC simulated annealing technique. The starting model of 16S RNA
contained only the information on its secondary structure. The restraints of (41)
guided the ribosomal proteins from the initial random positions to their appropriate
binding sites on 16S RNA. The authors examined the changes in the fluctuations of
16S RNA upon binding of proteins and predicted the contributions of each protein to
the organization of its binding site. Cui et al. [22] also used this model to investigate
the assembly of ribosomal proteins but applied MD simulations and additionally
studied the flexibility of 16S RNA while adding the proteins in various orders. The
experimental assembly paths were reproduced even with such a simple CG model.

13 One–Bead Model for Protein-RNA Complexes

This model was developed to perform MD simulations of macromolecular complexes
of proteins and RNA on microsecond time scales. In the original publication, it was
applied to investigate the flexibility of the whole ribosome [136]. In this model a
single bead represents a nucleotide (centered on a phosphorus atom, see Fig. 5c) or
an amino acid (centered on a Cα atom).
The residues of the backbone are connected with the intrastrand harmonic potential
which is a sum of the pseudo–bond (4), pseudo–angle (5) and pseudo–dihedral (6)
terms as in Eq. 3. The classical Morse potential was also tested for the intrastrand
terms, but since these terms connect residues that are no more than four CG beads
apart, the authors found harmonic functions to be sufficient.
The interstrand Vinterstrand energy term is based on an externally provided sec-
ondary structure for RNA and uses a harmonic function (see (4) and Fig. 1a). This
potential accounts for the canonical hydrogen bonds that appear in the RNA motifs.
The nonbonded potential is implemented using Morse functions and its general
form is:
$$V_{nb}(r_{ij}) = A_{P,C\alpha}(r_0^{ij}) \left[ 1 - \exp\left(-\alpha (r_{ij} - r_0^{ij})\right) \right]^2. \tag{43}$$
The strength of this potential is adjusted by the function $A_{P,C\alpha}(r_0^{ij}) = a \exp(-r_0^{ij}/b)$.
The constants a and b are based on the interacting bead types (different for P and
Cα ). For local short-range interactions (within a predefined cut–off of 12 Å for Cα
and 20 Å for P pairs), the $r_0^{ij}$ equilibrium values are taken from the starting structure.
For all the other long-range nonbonded interactions beyond the short-range cut–off
(but within a certain limit), $r_0^{ij}$ assumes three different values for P–P, Cα –Cα and
Cα –P pairs and does not depend on the starting conformation. Therefore, the model is
only locally biased toward the starting structure even though breaking of short-range
nonbonded contacts is also possible.
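A short Python sketch of the Morse term in (43) makes the difference from a harmonic contact explicit. The parameter values a, b and α below are hypothetical, and the harmonic comparison function is added only for illustration; it is not part of the model.

```python
import math

def morse_strength(r0, a, b):
    """Distance-dependent well depth A(r0) = a * exp(-r0 / b)."""
    return a * math.exp(-r0 / b)

def v_morse(r, r0, alpha, a, b):
    """Nonbonded Morse term of Eq. (43)."""
    depth = morse_strength(r0, a, b)
    return depth * (1.0 - math.exp(-alpha * (r - r0))) ** 2

def v_harmonic_equivalent(r, r0, alpha, a, b):
    """Harmonic potential with the same curvature at r0 (for comparison only)."""
    return morse_strength(r0, a, b) * alpha ** 2 * (r - r0) ** 2

# Hypothetical parameters: contact with equilibrium distance 10 Angstrom.
r0, alpha, a, b = 10.0, 0.7, 2.0, 5.0
stretched_morse = v_morse(20.0, r0, alpha, a, b)
stretched_harmonic = v_harmonic_equivalent(20.0, r0, alpha, a, b)
dissociated = v_morse(1000.0, r0, alpha, a, b)
```

The Morse energy saturates at A(r0) when the contact is stretched, whereas the harmonic energy keeps growing; this is why contacts can break and larger fluctuations are possible. Contacts with a larger equilibrium distance also get a smaller well depth through the exponential factor.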
Overall, the model is an extension of an elastic network model but since the
nonbonded interactions are represented with the Morse potential it allows for larger
fluctuations from the initial conformation than the harmonic potential. The model
was parameterized based on the Boltzmann inversion procedure with the distribution
functions taken from a single ribosome structure so it is not immediately transferable
to other systems. This CG FF was used to perform half a microsecond MD simulations
of the ribosome and determine global collective motions of the ribosome fragments,
as well as their correlations. The movement of the distant ribosomal stalks, positioned
at the opposite sides of the tRNA path, appeared to be coupled with the ratchet-like
motion of the subunits.

14 One–Bead Model for Protein-DNA Complexes

Later a similar anharmonic elastic network methodology was applied in MD simulations
of the nucleosome [143]. The nucleosome is a basic unit of chromatin and is
composed of double-stranded DNA wrapped around histone proteins.
The interstrand and nonbonded functional terms are similar to those in the model of
Trylska et al. [136]. However, in order to account for the helicity of the histone
proteins and DNA, the nucleosome model required a slightly different formulation of
the intrastrand potential:

Vintrastrand = V1−2 + V1−3 + V1−4 + V1−5 , (44)

where V1−n terms are implemented using a harmonic potential (see (4) and Fig. 1a).
For the α–helical regions of the proteins, all terms in (44) are included. However,
in unstructured regions or loops only V1−2 and V1−3 are included, whereas V1−4 and
V1−5 are modeled as nonbonded interactions. For DNA beads, V1−5 is not required.
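The rule for choosing which V1−n terms of (44) are bonded can be sketched as a small pair–list builder. The region labels, the function name and the treatment of pairs that span two region types (the most restrictive label wins) are assumptions made for this illustration.

```python
# Largest bonded separation n - 1 allowed in each region type, following the
# text: helices use 1-2 ... 1-5, loops only 1-2 and 1-3, DNA omits 1-5.
MAX_SEPARATION = {"helix": 4, "loop": 2, "dna": 3}

def intrastrand_pairs(labels):
    """Return bonded (i, j) pairs for one chain given a region label per bead."""
    pairs = []
    n_beads = len(labels)
    for i in range(n_beads):
        for separation in range(1, 5):
            j = i + separation
            if j >= n_beads:
                break
            # Assumption: every bead spanned by the pair must allow this term.
            limit = min(MAX_SEPARATION[label] for label in labels[i:j + 1])
            if separation <= limit:
                pairs.append((i, j))
    return pairs

helix_pairs = intrastrand_pairs(["helix"] * 6)
loop_pairs = intrastrand_pairs(["loop"] * 4)
dna_pairs = intrastrand_pairs(["dna"] * 5)
```

Pairs that are left out of this list, such as 1–4 and 1–5 neighbors in loops, would then be handled by the nonbonded term.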
The model was parameterized with the Boltzmann inversion procedure based
on short 50 ns full-atomistic MD simulations of the nucleosome [143]. Next, it was
applied to perform multiple 10 microsecond scale MD simulations of the nucleosome
complex [142]. In these simulations a biologically relevant partial unwrapping of the
DNA from the nucleosome core was observed. Further remapping to an all-atom model
provided a better insight into the interactions that are formed by histone tails after
the DNA detachment from the nucleosome core. One of the histone tails (H3) was
seen to stabilize the nucleosome in the open state by interacting with the nucleosome
core. The removal of this H3 tail in the simulations precluded the formation of such
a long-lived detachment of the DNA terminal segment from the nucleosome protein
core. This suggests an active role of this tail not only in the detachment of the DNA
end from the nucleosome core but also in preventing the nucleosomal DNA from
rewrapping.

15 Conclusions

Residue resolution FFs may be applied to solve various kinds of problems in the
nucleic acid field, ranging from RNA structure prediction to global motions of large
ribonucleoprotein complexes. We have described a limited set of CG FFs, with the
number of beads ranging from one to three per nucleotide. Even in this bead range the
design and applicability of the FFs differ. In one–bead models the interaction network
is based on an externally supplied secondary structure or native contacts from a
reference structure. Adding a second bead allows for the secondary structure to be
dynamically modified because the orientation of an interstrand bond with regard to
the backbone can be measured. Overall, increasing the number of beads corresponds
to removing the bias from the system. On the other hand, if one accepts the limitations
of one–bead models, problems on much larger spatial and temporal scales may be
investigated. For example, the Jonikas et al. [53] one-bead model was easily applied
to a 158–nucleotide structure but the three-bead Ding et al. [30] model only to RNA
chains shorter than 100 nucleotides. Also, the CG FFs used for large macromolecular
complexes, such as the nucleosome or ribosome, are one–bead FFs.
There are two other crucial things to consider when choosing among one- to three-bead
models. First, with one–bead models it is problematic to achieve a correct helical
twist. Creating bonds only between complementary pairs, which is easily applied in
two- or three-bead models, is not sufficient to keep the helicity in one-bead models.
The remedy is to create dummy atoms in the middle of a helix (as in the model
of Cui et al. [22]), provide multiple pseudo–bonds per single complementary pair
[122, 135] or use multi–body terms, i.e., angles and dihedrals over the interstrand bonds
[53]. Such tricks were not required in the model of Trylska et al. [136] because, to
stabilize the helical structure, the equilibrium distances were taken from the native
structure. Adding the terms that ensure the correct helicity may give reasonable
dynamics but requires more computational time. The second thing to consider is that
none of the one-bead models applies interaction terms that are nucleotide-specific.3
Even if such interactions were implemented, they would be ineffective since there
is no information about the relative orientation of bases. The two- and three-bead
models easily incorporate the base specificity.

3 Some of the one–bead models, e.g., Trovato et al. [135], assign a mass consistent with the base type
in MD simulations but it has a limited effect on the interactions.

There are also residue-resolution nucleic acid models with more than three beads
per nucleotide, so one may ask if it is worth going beyond the FFs presented in this
chapter. The four- or more bead per nucleotide models include more details such
as base dipole moments [75] or non–canonical hydrogen bonding schemes [106].
Niewieczerzał et al. [94] compared three CG models with different numbers of beads
per nucleotide: two, three [56], and four/five (depending on the nucleotide type). All
three models were applied to a problem of mechanical stretching and twisting of the
DNA duplexes. The authors showed that the number of beads does not affect the
mechanical properties of DNA at low and moderate temperatures, but may become
an issue at room temperature.
When comparing the three-bead CG models we also have to consider their applicability
to tasks other than the ones they were designed for. Typically, their target is
narrow and CG FFs are not transferable to other problems or systems. For example,
the Hyeon et al. [48] potential was created to answer a specific question about a par-
ticular RNA hairpin. A desirable CG FF would be the one that could be easily applied
to different sets of problems, i.e. a FF with a clear parameterization procedure and
universal formulation of the potential energy function. A good example is the model
of Knotts et al. [56] since this model can be easily implemented and modified. This
task would be more difficult with the model of Ding et al. [30]. Despite promising
results for RNA structure prediction, its applicability and possibilities for modifications
are limited because its formulation using a non–standard engine, discrete
MD, makes this potential much harder to re-implement. There are multiple codes
available that provide classical MD or MC procedures, whereas to use the model of Ding
et al. [30] one would have to rely on the authors’ code or one’s own in-house code. The
model of Hyeon et al. [48] was tuned for a particular molecule; however, the authors
show the parameterization so it should be possible to re-implement the model for a
different task. Another good example of an extendable model is the one designed
by Trylska et al. [136]. It was originally created and parameterized for a particular
complex, the ribosome. However, other studies applied this model to
a large system involving long chains of DNA rather than RNA, the nucleosome [142, 143].
The model is also implemented in the freely available software package RedMD [41]
(http://bionano.cent.uw.edu.pl/Software).
The transferability of the present CG models is insufficient and new models will
certainly be needed for particular applications. However, future efforts must
also address methodological problems. Two such problems are
the definition of the reference state in the Boltzmann inversion procedure and the generalization
of simulation results obtained for isolated, small systems to larger volumes.
In the first problem we go back to (1), where a function d0 (r ) has been introduced as
a reference state. The FF parameters depend on this function and its choice is often
arbitrary. The second problem, mentioned by Ouldridge et al. [100, 103], refers to
the fact that typically CG simulations are performed in small volumes with only a
single set of interacting molecules. Simulating the formation of a single DNA duplex
may give a different melting temperature than simulating many duplexes in a larger
volume. Methods for extrapolating the results of a small-size simulation to a larger
one have been proposed [100, 103].
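The dependence on the reference state can be illustrated with a minimal Boltzmann inversion sketch. It assumes the usual form V(r) = −kB T ln[P(r)/Pref(r)]; the histogram values, the two reference choices and the function name are hypothetical and are not taken from any of the models discussed in this chapter.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def boltzmann_invert(observed, reference, temperature):
    """Return V = -kT * ln(observed / reference) per bin (None for empty bins)."""
    kt = K_B * temperature
    potential = []
    for p_obs, p_ref in zip(observed, reference):
        if p_obs > 0 and p_ref > 0:
            potential.append(-kt * math.log(p_obs / p_ref))
        else:
            potential.append(None)
    return potential

# Hypothetical histogram of bead-bead distances (bin centres in Angstrom).
centres = [4.0, 5.0, 6.0, 7.0]
observed = [10.0, 100.0, 40.0, 0.0]

uniform_ref = [1.0 for _ in centres]
shell_ref = [r * r for r in centres]  # r**2 reference, an assumed alternative

v_uniform = boltzmann_invert(observed, uniform_ref, temperature=300.0)
v_shell = boltzmann_invert(observed, shell_ref, temperature=300.0)
```

The same observed distribution yields different energy differences between bins for the two references, which is the arbitrariness referred to above. Bins with no observations are left undefined rather than assigned an infinite energy.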
There is still room for improvement in the field of low–resolution nucleic acid
models. For example, creating an unbiased CG model of the ribosome is still an
open problem and it would provide better insight into the mechanics of this system
in comparison with the model based on the concept of native contacts. There is
also a need to create more formal protocols for the parameterization of CG FFs
and assessment of the quality of parameters. Unfortunately, most authors are vague
about the parameterization details. In some parameterizations there is no account
of how well the chosen potential was fitted to experimental data (for example, by
means of the R² regression coefficient). The correctness of the model is demonstrated only
by simulations of selected test cases, but more details on the parameterization would
give better confidence in these models. Another issue is that most authors do not give
hard evidence why a certain potential energy functional term was used. Test cases
that would justify the use of a particular potential form would be of great value.
A good remedy for the parameterization problems might be the use of automated
procedures to derive the parameters, like the one mentioned by Savelyev et al. [122],
which uses a renormalization group approach, or the ones developed by us [63–65],
which implement an evolutionary algorithm and particle swarm optimization.
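As an example of the kind of fit–quality report suggested above, the coefficient of determination R² between target energies and a fitted functional form can be computed as follows. The target values, the harmonic form k(r − r0)² and its parameters are hypothetical.

```python
def r_squared(target, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

# Hypothetical target energies (e.g., from Boltzmann inversion) at distances r,
# and a candidate harmonic fit k * (r - r0)**2 with assumed k and r0.
r_values = [5.0, 5.5, 6.0, 6.5, 7.0]
target = [2.1, 0.4, 0.0, 0.6, 1.9]
k, r0 = 2.0, 6.0
fitted = [k * (r - r0) ** 2 for r in r_values]
quality = r_squared(target, fitted)
```

Reporting such a number for each fitted term, together with the functional form that was tested, would let readers judge whether a simple harmonic function is adequate or whether a different potential form is needed.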

Acknowledgements The authors acknowledge support from the Interdisciplinary Centre for
Mathematical and Computational Modelling, University of Warsaw (G31-4, GA65-16, GA65-
17, GB65-28 to JT), National Science Centre, Poland (2011/03/N/NZ2/02482 to FL, DEC-
2014/12/W/ST5/00589 Symfonia to JT, 2016/23/B/NZ1/03198 Opus to JT).

References

1. Adams, P.L., Stahley, M.R., Kosek, A.B., Wang, J., Strobel, S.A.: Crystal structure of a self-
splicing group I intron with both exons. Nature 430, 45–50 (2004)
2. Al-Hashimi, H.M., Walter, N.G.: RNA dynamics: it is about time. Curr. Opin. Struct. Biol. 18,
321–329 (2008)
3. Allison, S.A., McCammon, J.A.: Multistep Brownian dynamics: application to short wormlike
chains. Biopolymers 23, 363–375 (1984)
4. Arya, G., Zhang, Q., Schlick, T.: Flexible histone tails in a new mesoscopic oligonucleosome
model. Biophys. J. 91, 133–150 (2006)
5. Bath, J., Green, S.J., Allen, K.E., Turberfield, A.J.: Mechanism for a directional, processive,
and reversible DNA motor. Small 5, 1513–1516 (2009)
6. Berg, J.M., Tymoczko, J.L., Stryer, L.: Biochemistry, 7th edn. W. H. Freeman (2010)
7. Berman, H.M., Olson, W.K., Beveridge, D.L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh,
S.H., Srinivasan, A.R., Schneider, B.: The nucleic acid database. A comprehensive relational
database of three-dimensional structures of nucleic acids. Biophys. J. 63, 751–759 (1992)
8. Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.F., Jr., Brice, M.D., Rodgers, J.R.,
Kennard, O., Shimanouchi, T., Tasumi, M.: The protein data bank: A computer-based archival
file for macromolecular structures. Arch. Biochem. Biophys. 185, 584–591 (1978)
9. Biyun, S., Cho, S.S., Thirumalai, D.: Folding of human telomerase RNA pseudoknot using
ion-jump and temperature-quench simulations. J. Am. Chem. Soc. 133, 20634–20643 (2011)
10. Bloomfield, V.A., Crothers, D.M., Tinoco, I.J.: Nucleic acids : structures, properties and func-
tions, 1st edn. University Science Books (2000)
11. Boniecki, M.J., Lach, G., Dawson, W.K., Tomala, K., Lukasz, P., Soltysinski, T., Rother, K.M.,
Bujnicki, J.M.: SimRNA: a coarse-grained method for RNA folding simulations and 3D struc-
ture prediction. Nucleic Acids Res. 44, e63 (2016)
12. Brion, P., Westhof, E.: Hierarchy and dynamics of RNA folding. Annu. Rev. Biophys. Biomol.
Struct. 26, 113–137 (1997)
13. Brooks, B.R., Brooks III, C., MacKerell Jr., A., Nilsson, L., Petrella, R., Roux, B., Won, Y.,
Archontis, G., Bartels, C., Boresch, S., Caflisch, A., Caves, L., Cui, Q., Dinner, A., Feig, M.,
Fischer, S., Gao, J., Hodoscek, M., Im, W., Kuczera, K., Lazaridis, T., Ma, J., Ovchinnikov, V.,
Paci, E., Pastor, R., Post, C., Pu, J., Schaefer, M., Tidor, B., Venable, R.M., Woodcock, H.L.,
Wu, X., Yang, W., York, D., Karplus, M.: CHARMM: the biomolecular simulation program.
J. Comput. Chem. 30, 1545–1614 (2009)
14. Bruant, N., Flatters, D., Lavery, R., Genest, D.: From atomic to mesoscopic descriptions of the
internal dynamics of DNA. Biophys. J. 77, 2366–2376 (1999)
15. Capriotti, E., Renom, M.M.: Quantifying the relationship between sequence and three-
dimensional structure conservation in RNA. BMC Bioinformatics 11, 322 (2010)
16. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Sim-
merling, C., Wang, B., Woods, R.J.: The Amber biomolecular simulation programs. J. Comput.
Chem. 26, 1668–1688 (2005)
17. Cheatham, T.E., Young, M.A.: Molecular dynamics simulation of nucleic acids: successes,
limitations, and promise. Biopolymers 56, 232–256 (2000)
18. Chen, Y., Ding, F., Nie, H., Serohijos, A.W., Sharma, S., Wilcox, K., Yin, S., Dokholyan, N.V.:
Protein folding: then and now. Arch. Biochem. Biophys. 469, 4–19 (2008)
19. Cho, S.S., Pincus, D.L., Thirumalai, D.: Assembly mechanisms of RNA pseudoknots are deter-
mined by the stabilities of constituent secondary structures. Proc. Natl. Acad. Sci. USA 106,
17349–17354 (2009)
20. Choi, C.H., Kalosakas, G., Rasmussen, K.O., Hiromura, M., Bishop, A.R., Usheva, A.: DNA
dynamically directs its own transcription initiation. Nucleic Acids Res. 32, 1584–90 (2004)
21. Cieplak, M., Sułkowska, J.I.: Structure-based models of biomolecules: stretching of proteins,
dynamics of knots, hydrodynamic effects, and indentation of virus capsids. In: A. Koliński (ed.)
Multiscale approaches to protein modeling: structure prediction, dynamics, thermodynamics
and macromolecular assemblies., chap. 8, pp. 179–208. Springer (2010)
22. Cui, Q., Tan, R.K.Z., Harvey, S.C., Case, D.A.: Low-Resolution Molecular Dynamics Simu-
lations of the 30S Ribosomal Subunit. Multiscale Model. Simul. 5, 1248–1263 (2006)
23. Dans, P.D., Zeida, A., Machado, M.R., Pantano, S.: A Coarse Grained Model for Atomic-
Detailed DNA Simulations with Explicit Electrostatics. J. Chem. Theory Comp. 6, 1711–1725
(2010)
24. Dauter, Z., Wlodawer, A., Minor, W., Jaskolski, M., Rupp, B.: Avoidable errors in deposited
macromolecular structures. IUCrJ 1, 179–193 (2014)
25. DeMille, R.C., Cheatham, T.E., Molinero, V.: A coarse-grained model of DNA with explicit
solvation by water and ions. J. Phys. Chem. B 115, 132–142 (2011)
26. DeMille, R.C., Molinero, V.: Coarse-grained ions without charges: reproducing the solvation
structure of NaCl in water using short-ranged potentials. J. Chem. Phys. 131, 034107 (2009)
27. Denesyuk, N., Thirumalai, D.: Coarse-grained model for predicting RNA folding thermodynam-
ics. J. Phys. Chem. B 117, 4901–4911 (2013)
28. Denesyuk, N., Thirumalai, D.: How do metal ions direct ribozyme folding? Nat. Chem. 7,
793–801 (2015)
29. Ding, F., Dokholyan, N.V.: Simple but predictive protein models. Trends Biotechnol. 23, 450–
455 (2005)
30. Ding, F., Sharma, S., Chalasani, P., Demidov, V.V., Broude, N.E., Dokholyan, N.V.: Ab initio
RNA folding by discrete molecular dynamics: from structure prediction to folding mechanisms.
RNA 14, 1164–1173 (2008)
31. Douglas, S.M., Marblestone, A.H., Teerapittayanon, S., Vazquez, A., Church, G.M., Shih,
W.M.: Rapid prototyping of 3D DNA-origami shapes with caDNAno. Nucleic Acids Res. 37,
5001–5006 (2009)
32. Drukker, K., Schatz, G.C.: A Model for Simulating Dynamics of DNA Denaturation. J. Phys.
Chem. B 104, 6108–6111 (2000)
33. Drukker, K., Wu, G., Schatz, G.C.: Model simulations of DNA denaturation dynamics. J. Chem.
Phys. 114, 579 (2001)
34. Flicek, P., et al.: Ensembl 2011. Nucleic Acids Res. 39, D800–6 (2011)
35. Forrey, C., Muthukumar, M.: Langevin dynamics simulations of genome packing in bacterio-
phage. Biophys. J. 91, 25–41 (2006)
36. Freddolino, P.L., Liu, F., Gruebele, M., Schulten, K.: Ten-microsecond molecular dynamics
simulation of a fast-folding WW domain. Biophys. J. 94, L75–7 (2008)
37. Freeman, G.S., Hinckley, D.M., De Pablo, J.J.: A coarse-grain three-site-per-nucleotide model
for DNA with explicit ions. J. Chem. Phys. 135, 165104 (2011)
38. Galas, D.J., Schmitz, A.: DNAse footprinting: a simple method for the detection of protein-
DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978)
39. Go, N.: Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12, 183–210 (1983)
40. Goodman, R.P., Schaap, I.A.T., Tardin, C.F., Erben, C.M., Berry, R.M., Schmidt, C.F., Turber-
field, A.J.: Rapid chiral assembly of rigid DNA building blocks for molecular nanofabrication.
Science 310, 1661–1665 (2005)
41. Górecki, A., Szypowski, M., Długosz, M., Trylska, J.: RedMD – Reduced Molecular Dynamics
Package. J. Comput. Chem. 30, 2364–2373 (2009)
42. Green, S.J., Bath, J., Turberfield, A.J.: Coordinated chemomechanical cycles: A mechanism
for autonomous molecular motion. Phys. Rev. Lett. 101, 238101 (2008)
43. Guvench, O., Brooks, C.L.: Efficient approximate all-atom solvent accessible surface area
method parameterized for folded and denatured protein conformations. J. Comput. Chem. 25,
1005–1014 (2004)
44. Harris, S.A., Laughton, C.A., Liverpool, T.B.: Mapping the phase diagram of the writhe of
DNA nanocircles using atomistic molecular dynamics simulations. Nucleic Acids Res. 36,
21–29 (2008)
45. He, Y., Maciejczyk, M., Oldziej, S., Scheraga, H.A., Liwo, A.: Mean-field interactions between
nucleic-acid-base dipoles can drive the formation of the double helix. Phys. Rev. Lett. 110,
098101 (2013)
46. Hoang, T.X., Cieplak, M.: Molecular dynamics of folding of secondary structures in Go-type
models of proteins. J. Chem. Phys. 112, 6851 (2000)
47. Hülsmann, M., Köddermann, T., Vrabec, J., Reith, D.: GROW: A gradient-based optimization
workflow for the automated development of molecular models. Comput. Phys. Commun. 181,
499–513 (2010)
48. Hyeon, C., Thirumalai, D.: Mechanical unfolding of RNA hairpins. Proc. Natl. Acad. Sci. USA
102, 6789–6794 (2005)
49. Hyeon, C., Thirumalai, D.: Capturing the essence of folding and functions of biomolecules
using coarse-grained models. Nat. Comm. 2, 487 (2011)
50. International Human Genome Sequencing Consortium: Initial sequencing and analysis of the
human genome. Nature 409, 860–921 (2001)
51. Jian, H., Schlick, T., Vologodskii, A.: Internal motion of supercoiled DNA: brownian dynamics
simulations of site juxtaposition. J. Mol. Biol. 284, 287–296 (1998)
52. Jonikas, M.A., Radmer, R.J., Altman, R.B.: Knowledge-based instantiation of full atomic detail
into coarse-grain RNA 3D structural models. Bioinformatics 25, 3259–3266 (2009)
53. Jonikas, M.A., Radmer, R.J., Laederach, A., Das, R., Pearlman, S., Herschlag, D., Altman,
R.B.: Coarse-grained modeling of large RNA molecules with knowledge-based potentials and
structural filters. RNA 15, 189–199 (2009)
54. Kibbe, W.A.: OligoCalc: an online oligonucleotide properties calculator. Nucleic Acids Res.
35, W43–W46 (2007)
55. Klimov, D.K., Thirumalai, D.: Native topology determines force-induced unfolding pathways
in globular proteins. Proc. Natl. Acad. Sci. USA 97, 7254–7259 (2000)
56. Knotts, T.A., Rathore, N., Schwartz, D.C., De Pablo, J.J.: A coarse grain model for DNA. J.
Chem. Phys. 126, 084901 (2007)
57. Koliński, A., Skolnick, J.: Monte Carlo simulations of protein folding. I. Lattice model and
interaction scheme. Proteins 18, 338–352 (1994)
58. Kolk, M.H., Heus, H.A., Hilbers, C.W.: The structure of the isolated, central hairpin of the HDV
antigenomic ribozyme: novel structural features and similarity of the loop in the ribozyme and
free in solution. EMBO J. 16, 3685–92 (1997)
59. Kumar, S., Bouzida, D., Swendsen, R.H., Kollman, P.A., Rosenberg, J.M.: The weighted histogram
analysis method for free-energy calculations on biomolecules. I. the method. J. Comput. Chem.
13, 1011–1021 (1992)
60. Lankas, F., Lavery, R., Maddocks, J.H.: Kinking occurs during molecular dynamics simulations
of small DNA minicircles. Structure 14, 1527–1534 (2006)
61. Leach, A.: Molecular Modelling: Principles and Applications (2nd Edition). Prentice Hall
(2001)
62. Leonarski, F., D’Ascenzo, L., Auffinger, P.: Mg2+ ions: do they bind to nucleobase nitrogens?
Nucleic Acids Res. 45, 987–1004 (2017)
63. Leonarski, F., Trovato, F., Tozzini, V., Leś, A., Trylska, J.: Evolutionary algorithm in the
optimization of a coarse-grained force field. J. Chem. Theory Comput. 9, 4874–4889 (2013)
64. Leonarski, F., Trovato, F., Tozzini, V., Trylska, J.: Genetic algorithm optimization of force field
parameters: application to a coarse-grained model of RNA. In: Proceedings of the 9th European
conference on Evolutionary computation, machine learning and data mining in bioinformatics,
EvoBIO’11, pp. 147–152. Springer-Verlag, Berlin, Heidelberg (2011)
65. Leonarski, F., Trylska, J.: RedMDStream: Parameterization and simulation toolbox for coarse-
grained molecular dynamics models. Biophys. J. 108, 1843–1847 (2015)
66. Leontis, N.B., Westhof, E.: Analysis of RNA motifs. Curr. Opin. Struct. Biol. 13, 300–308
(2003)
67. Liphardt, J., Dumont, S., Smith, S.B., Tinoco, I., Bustamante, C.: Equilibrium information
from nonequilibrium measurements in an experimental test of Jarzynski’s equality. Science
296, 1832–1835 (2002)
68. Liphardt, J., Onoa, B., Smith, S.B., Tinoco, I., Bustamante, C.: Reversible unfolding of single
RNA molecules by mechanical force. Science 292, 733–737 (2001)
69. Liwo, A., Czaplewski, C., Oldziej, S., Rojas, A., Kazmierkiewicz, R., Makowski, M., Murarka,
R., Scheraga, H.: Simulation of protein structure and dynamics with the coarse-grained unres
force field. In: G. Voth (ed.) Coarse-Graining of Condensed Phase and Biomolecular Systems.,
chap. 8, pp. 107–122. Taylor & Francis (2008)
70. Liwo, A., He, Y., Scheraga, H.A.: Coarse-grained force field: general folding theory. Phys.
Chem. Chem. Phys. 13(16), 890–901 (2011)
71. Lu, Z.J., Turner, D.H., Mathews, D.H.: A set of nearest neighbor parameters for predicting
the enthalpy change of rna secondary structure formation. Nucleic Acids Res. 34, 4912–4924
(2006)
72. Lyubartsev, A.P., Laaksonen, A.: Calculation of effective interaction potentials from radial
distribution functions: A reverse Monte Carlo approach. Phys. Rev. E 52, 3730–3737 (1995)
73. Ma, J.: Usefulness and limitations of normal mode analysis in modeling dynamics of biomolec-
ular complexes. Structure 13, 373–380 (2005)
74. Maciejczyk, M., Rudnicki, W.R., Lesyng, B.: A mezoscopic model of nucleic acids. Part 2. An
effective potential energy function for DNA. J. Biomol. Struct. Dyn. 17, 1109–1115 (2000)
75. Maciejczyk, M., Spasic, A., Liwo, A., Scheraga, H.A.: Coarse-grained model of nucleic acid
bases. J. Comp. Chem. 31, 1644–1655 (2010)
76. Maciejczyk, M., Spasic, A., Liwo, A., Scheraga, H.A.: DNA duplex formation with a coarse-
grained model. J. Chem. Theory Comput. 10, 5020–5035 (2014)
77. MacKerell, A.D., Banavali, N., Foloppe, N.: Development and current status of the CHARMM
force field for nucleic acids. Biopolymers 56, 257–265 (2000)
78. Malhotra, A., Harvey, S.C.: A quantitative model of the Escherichia coli 16 S RNA in the 30
S ribosomal subunit. J. Mol. Biol. 240, 308–340 (1994)
79. Malhotra, A., Tan, R.K., Harvey, S.C.: Modeling large RNAs and ribonucleoprotein particles
using molecular mechanics techniques. Biophys. J. 66, 1777–1795 (1994)
158 F. Leonarski and J. Trylska

80. Malo, J., Mitchell, J.C., Venien-Bryan, C., Harris, J.R., Wille, H., Sherratt, D.J., Turberfield,
A.J.: Engineering a 2D protein DNA crystal. Angew. Chem. Int. Ed. 44, 3057–3061 (2005)
81. Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H.: Expanded sequence dependence of ther-
modynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288,
911–940 (1999)
82. Mathews, D.H., Turner, D.H.: Prediction of RNA secondary structure by free energy mini-
mization. Curr. Opin. Struct. Biol. 16, 270–278 (2006)
83. Mattick, J.S., Makunin, I.V.: Non-coding RNA. Human Mol. Gen. 15 Spec No, R17–29 (2006)
84. Mazur, A.K.: Evaluation of elastic properties of atomistic DNA models. Biophys. J. 91, 4507–
4518 (2006)
85. McCammon, J.A., Gelin, B.R., Karplus, M.: Dynamics of folded proteins. Nature 267, 585–590
(1977)
86. Mergell, B., Ejtehadi, M.R., Everaers, R.: Modeling DNA structure, elasticity, and deformations
at the base-pair level. Phys Rev E Stat Nonlin Soft Matter Phys 68, 15 (2003)
87. Mergny, J.L., Lacroix, L.: Analysis of thermal melting curves. Oligonucleotides 13, 515–537
(2003)
88. Merino, E.J., Wilkinson, K.A., Coughlan, J.L., Weeks, K.M.: RNA structure analysis at single
nucleotide resolution by selective 2’-hydroxyl acylation and primer extension (SHAPE). J.
Am. Chem. Soc. 127, 4223–4231 (2005)
89. Miao, Z., Adamiak, R.W., Antczak, M., Batey, R.T., Becka, A.J., Biesiada, M., Boniecki, M.J.,
Bujnicki, J.M., Chen, S.J., Cheng, C.Y., Chou, F.C., Ferre-D’Amare, A.R., Das, R., Dawson,
W.K., Ding, F., Dokholyan, N.V., Dunin-Horkawicz, S., Geniesse, C., Kappel, K., Kladwang,
W., Krokhotin, A., Lach, G.E., Major, F., Mann, T.H., Magnus, M., Pachulska-Wieczorek,
K., Patel, D.J., Piccirilli, J.A., Popenda, M., Purzycka, K.J., Ren, A., Rice, G.M., Santalucia,
J., Sarzynska, J., Szachniuk, M., Tandon, A., Trausch, J.J., Tian, S., Wang, J., Weeks, K.M.,
Williams, B., Xiao, Y., Xu, X., Zhang, D., Zok, T., Westhof, E.: RNA-Puzzles round III: 3D
RNA structure prediction of five riboswitches and one ribozyme. RNA 23, 655–672 (2017)
90. Mizushima, T., Kataoka, K., Ogata, Y., Inoue, R.i., Sekimizu, K.: Increase in negative supercoiling
of plasmid DNA in Escherichia coli exposed to cold shock. Mol. Microbiol. 23, 381–386
(1997)
91. Mizushima, T., Natori, S., Sekimizu, K.: Relaxation of supercoiled DNA associated with induc-
tion of heat shock proteins in Escherichia coli. Mol. Gen. Genet. 238, 1–5 (1993)
92. Morriss-Andrews, A., Rottler, J., Plotkin, S.S.: A systematically coarse-grained model for DNA
and its predictions for persistence length, stacking, twist, and chirality. J. Chem. Phys. 132, 30
(2010)
93. Narberhaus, F., Waldminghaus, T., Chowdhury, S.: RNA thermometers. FEMS Microbiol. Rev.
30, 3–16 (2006)
94. Niewieczerzał, S., Cieplak, M.: Stretching and twisting of the DNA duplexes in coarse-grained
dynamical models. J. Phys. Condens. Matter 21, 474221 (2009)
95. Olson, W.K.: Configurational statistics of polynucleotide chains. a single virtual bond treatment.
Macromolecules 8, 272–275 (1975)
96. Olson, W.K.: Flexible dna double helix.1. average dimensions and distribution functions.
Biopolymers 18, 1213–1233 (1979)
97. Olson, W.K., Manning, G.S.: A configurational interpretation of the axial phosphate spacing
in polynucleotide helices and random coils. Biopolymers 15, 859–878 (1976)
98. Olson, W.K., Zhurkin, V.B.: Modeling DNA deformations. Curr. Opin. Struct. Biol. 10, 286–
297 (2000)
99. Omabegho, T., Sha, R., Seeman, N.C.: A bipedal DNA brownian motor with coordinated legs.
Science 324, 67–71 (2009)
100. Ouldridge, T. (ed.): Coarse-Grained Modelling of DNA and DNA Self-Assembly. Springer,
Berlin Heidelberg, Oxford, UK (2012)
101. Ouldridge, T.E., Johnston, I.G., Louis, A.A., Doye, J.P.K.: The self-assembly of DNA Holliday
junctions studied with a minimal model. J. Chem. Phys. 130, 065101 (2009)
102. Ouldridge, T.E., Louis, A.A., Doye, J.P.K.: DNA nanotweezers studied with a coarse-grained
model of DNA. Phys. Rev. Lett. 104, 4 (2009)
103. Ouldridge, T.E., Louis, A.A., Doye, J.P.K.: Extracting bulk properties of self-assembling
systems from small simulations. J. Phys. Condens. Matter 22, 104102 (2010)
104. Ouldridge, T.E., Louis, A.A., Doye, J.P.K.: Structural, mechanical, and thermodynamic properties
of a coarse-grained DNA model. J. Chem. Phys. 134, 085101 (2010)
105. Parisien, M., Cruz, J.A., Westhof, E., Major, F.: New metrics for comparing and assessing
discrepancies between rna 3d structures and models. RNA 15, 1875–1885 (2009)
106. Pasquali, S., Derreumaux, P.: HiRE-RNA: a high resolution coarse-grained energy model for
RNA. J. Phys. Chem. B 114, 11957–11966 (2010)
107. Pérez, A., Marchán, I., Svozil, D., Sponer, J., Cheatham, T.E., Laughton, C.A., Orozco,
M.: Refinement of the AMBER force field for nucleic acids: improving the description of
alpha/gamma conformers. Biophys. J. 92, 3817–3829 (2007)
108. Poulain, P., Saladin, A., Hartmann, B., Prévost, C.: Insights on protein-DNA recognition by
coarse grain modelling. J. Comp. Chem. 29, 2582–2592 (2008)
109. Prytkova, T.R., Eryazici, I., Stepp, B., Nguyen, S.B., Schatz, G.C.: DNA melting in small-
molecule-DNA-hybrid dimer structures: experimental characterization and coarse-grained
molecular dynamics simulations. J. Phys. Chem. B 114, 2627–2634 (2010)
110. Ramachandran, A., Guo, Q., Iqbal, S.M., Liu, Y.: Coarse-grained molecular dynamics simula-
tion of DNA translocation in chemically modified nanopores. J. Phys. Chem. B 115, 6138–6148
(2011)
111. Reith, D.: CG-OPT: A software package for automatic force field design. Comput. Phys.
Commun. 148, 299–313 (2002)
112. Reith, D., Pütz, M., Müller-Plathe, F.: Deriving effective mesoscale potentials from atomistic
simulations. J. Comput. Chem. 24, 1624–1636 (2003)
113. Ren, A., Patel, D.J.: c-di-AMP binds the ydaO riboswitch in two pseudo-symmetry-related
pockets. Nat. Chem. Biol. 10, 780–786 (2014)
114. Richmond, T.J., Davey, C.A.: The structure of DNA in the nucleosome core. Nature 423,
145–150 (2003)
115. Romano, F., Hudson, A., Doye, J.P.K., Ouldridge, T.E., Louis, A.A.: The effect of topology
on the structure and free energy landscape of DNA kissing complexes. J. Chem. Phys. 136,
215102 (2012)
116. Rothemund, P.: Folding DNA to create nanoscale shapes and patterns. Nature 440, 297–302
(2006)
117. Rother, K., Rother, M., Boniecki, M., Puton, T., Bujnicki, J.M.: RNA and protein 3D structure
modeling: similarities and differences. J. Mol. Model. pp. 2325–2336 (2011)
118. Rüdisser, S., Tinoco, I.: Solution structure of Cobalt(III)hexammine complexed to the GAAA
tetraloop, and metal-ion binding to G.A mismatches. J. Mol. Biol. 295, 1211–1223 (2000)
119. Rudnicki, W.R., Bakalarski, G., Lesyng, B.: A mezoscopic model of nucleic acids. Part 1.
Lagrangian and quaternion molecular dynamics. J. Biomol. Struct. Dyn. 17, 1097–1108 (2000)
120. Russell, R., Millett, I.S., Doniach, S., Herschlag, D.: Small angle X-ray scattering reveals a
compact intermediate in RNA folding. Nat. Struct. Biol. 7, 367–370 (2000)
121. Sambriski, E.J., Schwartz, D.C., De Pablo, J.J.: A mesoscale model of DNA and its renatura-
tion. Biophys. J. 96, 1675–1690 (2009)
122. Savelyev, A., Papoian, G.A.: Molecular Renormalization Group Coarse-Graining of Polymer
Chains: Application to Double-Stranded DNA. Biophys. J. 96, 4044–4052 (2009)
123. Savelyev, A., Papoian, G.A.: Chemically accurate coarse graining of double-stranded DNA.
Proc. Natl. Acad. Sci. USA 107, 20340–20345 (2010)
124. Schlick, T.: Molecular Modeling and Simulation: An Interdisciplinary Guide (Interdisciplinary
Applied Mathematics), 2nd edn. Springer (2010)
125. Seeman, N.C.: DNA in a material world. Nature 421, 427–431 (2003)
126. Sharma, S., Ding, F., Dokholyan, N.V.: iFoldRNA: three-dimensional RNA structure predic-
tion and folding. Bioinformatics 24, 1951–1952 (2008)
127. Shaw, D.E., Dror, R.O., Salmon, J.K., et al.: Millisecond-scale molecular dynamics simula-
tions on anton. In: Proceedings of the Conference on High Performance Computing Network-
ing, Storage and Analysis, SC ’09, pp. 39:1–39:11. ACM, New York, NY, USA (2009)
128. Shaw, D.E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R.O., et al.: Atomic-level
characterization of the structural dynamics of proteins. Science 330, 341–346 (2010)
129. Skolnick, J., Koliński, A.: Simulations of the folding of a globular protein. Science 250,
1121–1125 (1990)
130. Stagg, S.M., Mears, J.A., Harvey, S.C.: A Structural Model for the Assembly of the 30S
Subunit of the Ribosome. J. Mol. Biol. 328, 49–61 (2003)
131. Sussman, J.L., Holbrook, S.R., Warrant, R.W., Church, G.M., Kim, S.H.: Crystal structure of
yeast phenylalanine transfer RNA. I. Crystallographic refinement. J. Mol. Biol. 123, 607–30
(1978)
132. Swendsen, R.H.: Monte Carlo renormalization group. Phys. Rev. Lett. 42, 859–861 (1979)
133. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett.
57, 2607–2609 (1986)
134. Tan, R.K.Z., Harvey, S.C.: Molecular Mechanics Model of Supercoiled DNA. J. Mol. Biol.
205, 573–591 (1989)
135. Trovato, F., Tozzini, V.: Supercoiling and local denaturation of plasmids with a minimalist
DNA model. J. Phys. Chem. B 112, 13197–13200 (2008)
136. Trylska, J., Tozzini, V., McCammon, J.A.: Exploring global motions and correlations in the
ribosome. Biophys. J. 89, 1455–1463 (2005)
137. Tucker, B.J., Breaker, R.R.: Riboswitches as versatile gene control elements. Curr. Opin.
Struct. Biol. 15, 342–8 (2005)
138. Tullius, T.D.: DNA footprinting with hydroxyl radical. Nature 332, 663–664 (1988)
139. Turner, D.H., Mathews, D.H.: NNDB: the nearest neighbor parameter database for predicting
stability of nucleic acid secondary structure. Nucleic Acids Res. 38, D280–282 (2010)
140. Venter, J.C., et al.: The sequence of the human genome. Science 291, 1304–51 (2001)
141. Vinograd, J., Lebowitz, J., Radloff, R., Watson, R., Laipis, P.: The twisted circular form of
polyoma viral DNA. Proc. Natl. Acad. Sci. USA 53, 1104–1111 (1965)
142. Voltz, K., Trylska, J., Calimet, N., Smith, J.C., Langowski, J.: Unwrapping of nucleosomal
DNA ends: a multiscale molecular dynamics study. Biophys. J. 102, 849–858 (2012)
143. Voltz, K., Trylska, J., Tozzini, V., Kurkal-Siebert, V., Langowski, J., Smith, J.: Coarse-grained
force field for the nucleosome from self-consistent multiscaling. J. Comput. Chem. 29, 1429–
1439 (2008)
144. Vorobjev, Y.N.: Block-units method for conformational calculations of large nucleic acid
chains. i. block-units approximation of atomic structure and conformational energy of polynu-
cleotides. Biopolymers 29, 1503–1518 (1990)
145. Wang, J., Peck, L., Becherer, K.: DNA Supercoiling and Its Effects on DNA Structure and
Function. Cold Spring Harbor Symposia on Quantitative Biology 47, 85–91 (1983)
146. Whitelam, S., Feng, E.H., Hagan, M.F., Geissler, P.L.: The role of collective motion in exam-
ples of coarsening and self-assembly. Soft Matter 5, 1251–1262 (2009)
147. Wimberly, B.T., Brodersen, D.E., Clemons, W.M., Morgan-Warren, R.J., Carter, A.P., Vonrhein,
C., Hartsch, T., Ramakrishnan, V.: Structure of the 30S ribosomal subunit. Nature 407,
327–339 (2000)
148. Xia, Z., Gardner, D.P., Gutell, R.R., Ren, P.: Coarse-grained model for simulation of RNA
three-dimensional structures. J. Phys. Chem. B 114, 13497–13506 (2010)
149. Yu, I., Mori, T., Ando, T., Harada, R., Jung, J., Sugita, Y., Feig, M.: Biomolecular interactions
modulate macromolecular structure and dynamics in atomistic model of a bacterial cytoplasm.
eLife 5, e19274 (2016)
150. Yurke, B., Turberfield, A.J., Mills Jr, A.P., Simmel, F.C., Neumann, J.L.: A DNA-fuelled
molecular machine made of DNA. Nature pp. 605–608 (2000)
151. Zheng, H., Chordia, M.D., Cooper, D.R., Chruszcz, M., Mueller, P., Sheldrick, G.M., Minor,
W.: Validation of metal-binding sites in macromolecular structures with the CheckMyMetal
web server. Nat. Protoc. 9, 156–170 (2014)
152. Zou, J., Liang, W., Zhang, S.: Coarse-grained molecular dynamics modeling of DNA-carbon
nanotube complexes. Int. J. Numer. Meth. Eng. 0600661, 968–985 (2010)
153. Zuker, M.: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic
Acids Res. 31, 3406–3415 (2003)
Modeling of Electrostatic Effects
in Macromolecules

Yury N. Vorobjev

Abstract Electrostatic energies and forces are among the primary factors that define
macromolecular interactions and self-organization in aqueous solution. The
distinctive property of electrostatic forces is their long-range character. Accurate
modeling of long-range electrostatic interactions and of the related energy of a
macromolecule in an aqueous solvent at a given temperature, salt concentration and
hydrogen ion concentration is therefore a long-standing problem. One of the most
advanced approaches to macromolecular electrostatics, a single-molecule approach with
an implicit solvent electrostatic model for macromolecular simulations in a water-proton
bath, is considered here. The fundamental quantity that implicit electrostatic models
approximate is the solute potential of mean force, which is obtained by averaging over
the solvent degrees of freedom. Implicit solvent models offer practical ways to
calculate free energies of macromolecular conformations, taking into account equilibrium
interactions with the water solvent and the proton bath, whereas the explicit solvent
approach is unable to do so because of the need to account for a large number of solvent
degrees of freedom and the long-range nature of the electrostatic interactions. The most
advanced realizations of implicit continuum electrostatic models by different
research groups are discussed, their accuracy is examined, and some applications
of implicit solvent electrostatic models to macromolecular modeling are highlighted,
such as protein free energy calculations, protein folding, ionization equilibria and
pKa's of ionizable groups, and constant-pH molecular dynamics.

1 Introduction

Computer simulations with explicit solvent molecules represent one of the most
detailed approaches to modeling the structure and energy of biomolecules [21]. However,
an accurate description of the aqueous environment in realistic simulations, e.g. with
the method of molecular dynamics (MD), requires a large number of solvent molecules
to be placed around the biomolecule [71, 94]. A large fraction of computer time
is thereby spent calculating detailed trajectories of the solvent molecules, while it is
the solute behavior that is primarily of interest. Despite their cost, computer simulations
with explicit solvent molecules still rely on approximations; for example, difficulties
arise in calculations involving polar or charged atomic groups when long-range electrostatic
interactions are truncated or summed over a periodic array of simulation boxes using
Ewald techniques [58]. While free energy perturbation methods, based on microscopic
simulation of a macromolecule with explicit solvent, may in principle be
suitable for free energy calculations [76, 114], in practice they meet with tremendous
difficulties due to the large molecular size and the need to sample adequately over a
large number of solvent and solute conformations and to evaluate long-range
electrostatic interactions properly [23, 77]. An accurate calculation of the free energy of a
macromolecule in an aqueous solution requires sampling over the whole volume of
accessible phase space, which is a difficult task for a solvent. Modeling salt and pH
effects in explicit solvent models tremendously increases the demands on adequate
sampling. Because of these difficulties, approximate schemes that treat the solvent
implicitly have been developed in recent decades; some of them are reviewed in
[12, 27, 107, 122, 152]. The elaboration of adequate implicit models of the water-proton
bath as a solvent medium is an important task for the reliable simulation of electrostatic
effects in proteins with many titratable groups at a given solvent pH.

Y. N. Vorobjev (B)
Institute of Chemical Biology and Fundamental Medicine, Siberian Branch of the Russian
Academy of Science, Lavrentiev Ave. 8, Novosibirsk 630090, Russia
e-mail: ynvorob@niboch.nsc.ru

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_6

2 Formulation of General Model for Calculating
Electrostatic Energy of Macromolecule in Water Solvent

2.1 Basic Model of Macromolecule Charge Distribution

The real continuous charge density distribution of a macromolecule in a conformation
x = (x1, …, xN), where xi is the coordinate of atom i, i = 1, …, N, is approximated by
point charges qi centered at the atoms. The set of atomic gas-phase charges q0 = (q01,
…, q0N) is calculated by the RESP (restrained electrostatic potential) method [25], which
optimally approximates the rigorous quantum mechanical electrostatic potential of the
macromolecule in vacuum. The atomic charges q0 are subject to the electro-neutrality
condition and are considered independent of the conformation x of the macromolecule.
This commonly used model of the charge density distribution is the simplest monopole
approximation, which ignores coupling between the conformation x and the charge
distribution q0. Accordingly, the electrostatic energy of the macromolecule in the gas
phase is defined by the classical Coulomb electrostatic energy Eel

$$E_{el} = \frac{1}{2} \sum_{i \neq j}^{N} \frac{q_i q_j}{r_{ij}} \qquad (1)$$

Commonly used modern molecular mechanical force fields employ monopole
atomic charges and ignore internal molecular polarization effects [22, 25]. The
simplicity of the charge distribution adopted for macromolecular modeling can be
explained in part by the computational efficiency of the simulations.
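The pairwise sum of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the implementation used in any particular force field: the function name, the choice of units (charges in elementary charges, distances in Å, energy in kcal/mol) and the approximate conversion constant are assumptions made for the example.

```python
import math

# Assumed unit convention: charges in e, distances in Å, energy in kcal/mol.
# 332.06 is an approximate value of e^2/(4*pi*eps0) in kcal*Å/mol.
COULOMB_KCAL = 332.06


def coulomb_energy(coords, charges):
    """Gas-phase energy of Eq. (1): (1/2) * sum over i != j of q_i*q_j/r_ij."""
    n = len(charges)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:  # the i == j self term is excluded from the sum
                r_ij = math.dist(coords[i], coords[j])
                total += charges[i] * charges[j] / r_ij
    # The double loop visits every pair twice, hence the factor 1/2.
    return 0.5 * COULOMB_KCAL * total


if __name__ == "__main__":
    # Two opposite unit charges 1 Å apart (hypothetical test geometry).
    print(coulomb_energy([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, -1.0]))
```

The quadratic double loop mirrors the formula directly; production codes typically replace it with cutoffs, Ewald sums or multipole methods.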

2.2 Transport of a Protein from Gas Phase into Water-Proton Bath

The process of dissolving a gas-phase protein in water in the presence of hydrogen
ions can be modeled as a four-stage thermodynamic process [54, 119, 150] (Fig. 1): (stage
1) creation of a solute-sized cavity in water; (stage 2) insertion of the zero-charged
protein (with all atoms having zero partial charges) into the cavity in water; (stage
3) charging of the protein to the gas-phase partial atomic charges q0 = (q01, …, q0N),
with all ionizable groups kept neutral; and (stage 4) equilibrium titration of the
protein at a given pH. The first three stages of this partition describe the solvation
free energy of a protein with fixed gas-phase partial charges q0 on all atoms

$$W(x, q^0) = G_{cav}(x) + G_{vdw}(x) + G_{pol}(x, q^0) \qquad (2)$$

where Gcav(x) is the free energy of creating the molecular cavity in water (stage
1), Gvdw(x) is the free energy of van der Waals interactions between the solute and
the water solvent (stage 2), and Gpol(x, q0) is the free energy of polarization of the
water solvent by the protein with gas-phase partial charges on all atoms (stage 3).
The fourth stage contributes Ginz(x, pH), the free energy of equilibrium titration of
the protein for a given pH and conformation x. Titration changes the protein partial
atomic charges from the gas-phase values q0 of the neutral ionization microstate
z0 = (z01, …, z0ζ), where ζ is the total number of titratable protons (or groups), to
new values qinz for the equilibrium ionization state <z>, which is coupled to the
conformation x and the pH value. The thermodynamic process defines the free energy
Gt(x, pH) of transport of a single protein molecule into water at a given pH in an
instantaneous microscopic conformation x:

$$G_t(x, \mathrm{pH}) = W(x, q^0) + G_{inz}(x, \mathrm{pH}) \qquad (3)$$

It should be noted that the transport of a neutral protein molecule from the gas phase
into the water solvent at a given pH is not accompanied by the transfer of a net charge.
The protein molecule becomes charged in the water-proton bath through equilibrium
proton binding or release, i.e. by means of an equilibrium redistribution of protons
between the solvent and the solute in a given conformation x. The total free energy
of the protein for a given conformation x in the solvent at a given pH is equal to

$$G(x, \mathrm{pH}) = U_m(x, q^0) + W(x, q^0) + G_{inz}(x, \mathrm{pH}) \qquad (4)$$

where Um(x, q0) is the intra-molecular conformational potential energy of the protein
computed in the gas phase with the gas-phase atomic charges q0.

Fig. 1 Thermodynamic process of transport of a protein from the gas phase into the water-proton
bath; q0 are the atomic charges in the gas phase with all ionizable groups neutral; (stage 1)
creation of a solute-sized cavity in water; (stage 2) insertion of the zero-charged protein (with
all atoms having zero partial charge) into the cavity in water; (stage 3) charging of the protein
to the gas-phase partial atomic charges q0 = (q01, …, q0N); and (stage 4) equilibrium titration of
the protein at a given pH value
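The bookkeeping of Eqs. (2)–(4) can be made explicit with a small container that keeps the four stage contributions separate. This is only a sketch: the class and method names are hypothetical, and the numerical values in the usage lines are placeholders rather than results from the chapter.

```python
from dataclasses import dataclass


@dataclass
class TransportTerms:
    """Free energy contributions of the four-stage process (same units throughout)."""
    g_cav: float  # stage 1: cavity creation
    g_vdw: float  # stage 2: solute-solvent van der Waals interactions
    g_pol: float  # stage 3: solvent polarization by the charges q0
    g_inz: float  # stage 4: equilibrium titration at the given pH

    def solvation(self) -> float:
        """W(x, q0) of Eq. (2)."""
        return self.g_cav + self.g_vdw + self.g_pol

    def transport(self) -> float:
        """G_t(x, pH) of Eq. (3)."""
        return self.solvation() + self.g_inz

    def total(self, u_m: float) -> float:
        """G(x, pH) of Eq. (4); u_m is the gas-phase intra-molecular energy."""
        return u_m + self.transport()


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the arithmetic.
    terms = TransportTerms(g_cav=10.0, g_vdw=-4.0, g_pol=-20.0, g_inz=-1.5)
    print(terms.solvation(), terms.transport(), terms.total(u_m=-100.0))
```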

2.3 Free Energy of Molecule in a Solvent

The partition function Z of a solute molecule (atomic coordinates x) in a solvent
(coordinates y) is the ratio of the partition functions of the solution and of the pure
solvent systems with identical numbers of solvent molecules

$$Z = \frac{\int dx \int dy \, \exp\{-\beta [U_m(x) + U_{ms}(x, y) + U_{ss}(y)]\}}{\int dy \, \exp\{-\beta U_{ss}(y)\}} \qquad (5)$$

Here Um(x) is the intra-molecular potential energy, Ums(x, y) is the potential
energy of the solute-solvent interactions and Uss(y) is the potential energy of the
solvent-solvent interactions. The partition function can be rewritten in terms of
solvent-mediated interactions

$$Z = \int dx \, \exp\{-\beta [U_m(x) + W(x)]\} \qquad (6)$$

where W(x) is the free energy of solvation of the solute molecule

$$\exp[-\beta W(x)] = \frac{\int dy \, \exp\{-\beta [U_{ms}(x, y) + U_{ss}(y)]\}}{\int dy \, \exp\{-\beta U_{ss}(y)\}} \qquad (7)$$

The solvation free energy W(x) can be written in the framework of the free energy
perturbation method

$$W(x) = \int_0^1 d\lambda \, \frac{\int dy \, U_{ms}(x, y) \exp\{-\beta [\lambda U_{ms}(x, y) + U_{ss}(y)]\}}{\int dy \, \exp\{-\beta [\lambda U_{ms}(x, y) + U_{ss}(y)]\}} \qquad (8)$$
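As an illustration of how Eq. (8) is used, the following sketch evaluates the coupling-parameter integral for a deliberately simple toy model that is not taken from the chapter: a single solvent coordinate y with a harmonic solvent energy and a linear solute-solvent coupling at fixed solute conformation. The constants BETA, K and C and all function names are hypothetical. For this model the result can be checked against the direct ratio of Eq. (7), which gives W = −C²/(2K) analytically.

```python
import math

BETA = 1.0  # 1/(kT) in arbitrary units (assumed)
K = 2.0     # toy solvent-solvent energy U_ss(y) = K*y^2/2
C = 1.5     # toy solute-solvent energy U_ms(x, y) = C*y at fixed x


def u_ss(y):
    return 0.5 * K * y * y


def u_ms(y):
    return C * y


def grid(lo=-12.0, hi=12.0, n=4001):
    """Uniform grid for the solvent coordinate; wide enough for the Gaussian tails."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def mean_u_ms(lam, ys):
    """Ensemble average of U_ms with the coupling scaled by lam (integrand of Eq. 8)."""
    weights = [math.exp(-BETA * (lam * u_ms(y) + u_ss(y))) for y in ys]
    return sum(u_ms(y) * w for y, w in zip(ys, weights)) / sum(weights)


def w_coupling_integral(n_lambda=21):
    """W from Eq. (8): trapezoidal integration of <U_ms>_lambda over lambda in [0, 1]."""
    ys = grid()
    h = 1.0 / (n_lambda - 1)
    vals = [mean_u_ms(i * h, ys) for i in range(n_lambda)]
    return h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])


def w_direct():
    """W from the ratio of configuration integrals in Eq. (7)."""
    ys = grid()
    num = sum(math.exp(-BETA * (u_ms(y) + u_ss(y))) for y in ys)
    den = sum(math.exp(-BETA * u_ss(y)) for y in ys)
    return -math.log(num / den) / BETA


if __name__ == "__main__":
    print(w_coupling_integral(), w_direct(), -C * C / (2.0 * K))
```

In a real simulation the grid sums are replaced by averages over sampled solvent configurations at each value of λ, which is where the sampling cost discussed in the Introduction arises.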

Considering a multi-step sequential ‘turning on’ of different types of solute-solvent
interactions in Eq. (7), one can see that the process of dissolving a gas-phase
protein in water in the presence of hydrogen ions can be modeled as a four-stage
thermodynamic process.
Considering the whole phase space of a solute molecule as a sum of sub-spaces A, B, …,
each of which describes a distinct macroscopic solute conformation, it follows from
Eq. (6) that the free energy GA of a solute molecule in a macroscopic conformation
A can generally be expressed in terms of the average configurational energy and the
entropy over the molecular degrees of freedom

$$G_A = \langle U_m(x, q^0) \rangle_A + \langle W(x, q^0) + G_{inz}(x, \mathrm{pH}) \rangle_A - T S_{conf,A} \qquad (9)$$

where < >A denotes an average over micro-configurations of the conformation A and
Sconf,A is the configurational entropy of the conformation A, which can be estimated
from an MD trajectory in the quasi-harmonic approximation [77, 133, 147].

3 Continuum Solvent Models

While all intra- and inter-molecular interactions are electrostatic in nature at
the quantum mechanical level, they are treated as a sum of electrostatic and
non-electrostatic terms in molecular mechanical force fields [22, 25]. The total free
energy of solvation of a macromolecule consists of two parts: the free energy
of non-electrostatic interactions, given by the first two terms of Eq. (2), which is
largely independent of the atomic charges, and the free energy of electrostatic
interactions, which is a function of the atomic charges q and vanishes for the zero
charge distribution q = 0. For completeness, we consider both the non-electrostatic and
the electrostatic parts of the free energy of solvation.

3.1 Free Energy of Non-polar Interactions

The sum of the free energy of solvent cavity formation and the free energy of
solute-solvent van der Waals interactions is the free energy of nonpolar solvation Gnp

$$G_{np} = G_{cav} + G_{vdw} \qquad (10)$$

Nonpolar solvation has a complex physical nature, and the associated energy
has a smaller amplitude than its electrostatic counterpart; nevertheless, hydrophobic
association is one of the principal interactions that determine biomolecular structures
[127]. Nonpolar solvation includes two terms, the free energy Gcav of solvent
cavity formation and the solute-solvent van der Waals free energy Gvdw. These
two terms depend differently on the structure and conformation of the interacting
chemical groups [26, 27, 147].

3.2 Free Energy of a Solvent Cavity

Experimental data [15, 16, 50, 62], microscopic simulations on small systems [56,
57, 157, 158] and scaled particle theory [111, 112] show that the cavity free energy
changes linearly with the area S of the surface of the solvent-excluded cavity

$$G_{cav} \approx \gamma_{micro} S \qquad (11)$$

where the cavity surface is defined as a smooth molecular surface (MS) enclosing the
molecular solvent-excluded volume (SEV) [30, 146] or, in some applications, as a
solvent-accessible surface (SAS) [29, 117]. The SAS is traced by the center of a water
probe molecule, modeled as a rigid sphere of radius Rw = 1.4 Å, as it rolls over the
external van der Waals (VDW) surface of the protein atoms, each represented
by a sphere of atomic van der Waals radius Rvdw,i. It is a common approximation
that the atomic van der Waals radii are independent of the atomic charges. The
proportionality factor γmicro between surface area and cavity free energy is a
microscopic surface tension; its optimum value depends on the type of surface chosen,
the MS or the SAS. Simulations with an explicit water model show that the free energy
of creating a small uncharged gas bubble in an aqueous solution is proportional to the
macroscopic surface area of the cavity, with an interfacial surface tension γmacro
similar to the experimental gas-solvent surface tension [62]. The value of the
microscopic surface free energy γmicro used to compute Gcav is smaller because, on a
molecular scale, the microscopic surface of an interface is much more irregular and
larger than the corresponding macroscopic surface by an average factor of ~1.5
[147, 148]. Correspondingly, the microscopic surface free energy should be smaller than
the macroscopic surface tension of water by the same factor. With the experimental
γmacro equal to 102 cal/(mol Å²), this gives a value of 67 cal/(mol Å²) for γmicro, in
good agreement with the estimate of 70 cal/(mol Å²) that has been found to optimize the
correlation between experimental protein stability data and protein-protein binding
constants of mutant proteins [63, 104].

3.3 The Solute-Solvent van der Waals Interactions

The first hydration shell contributes up to 85% of the energy Gvdw because of the
short-range nature of the van der Waals interactions with the solvent. The energy Gvdw
can therefore be approximated by an expression linear in the area S of the molecular
surface,

$$G_{vdw}(x) = -\gamma_{vdw} S \qquad (12)$$

The average proportionality factor γvdw = −30 (±17) cal/(mol Å²) has been found
from MD simulations of the solute-solvent van der Waals energy for a set of medium-size
proteins in explicit SPC water [147]. The agreement between the implicit solvent PMF
of non-polar interactions between two methane molecules in water, as a function of
their separation r [148], and the PMF calculated by microscopic Monte Carlo and
molecular dynamics simulations demonstrates the self-consistency of the cavity term
and the solute-solvent van der Waals energy defined by Eqs. (11)–(12).
A recent computational study [131] showed that the MS area in Eqs. (11)–(12)
provides a reasonable description of the hydrophobic association of hydrocarbons and
reproduces the desolvation maximum of the rigorous PMF calculated by free energy
simulation in explicit water solvent. The total non-polar hydration free energy of
Eq. (10) is nevertheless still modeled with the SAS area [33, 41, 67, 80, 110, 154],
which does not reproduce the desolvation maximum of the PMF of hydrophobic association.
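The surface-area model of Eqs. (10)–(12) reduces to a few multiplications once the molecular surface area S is known. The sketch below uses the value of γmicro quoted above; for the van der Waals term it assumes that the contribution is attractive, with a magnitude of 30 cal/(mol Å²), which is one reading of the sign convention of Eq. (12). The function names and the example area are hypothetical, and computing S itself is outside the scope of the example.

```python
GAMMA_MICRO = 67.0    # cal/(mol Å^2), microscopic surface tension quoted in the text
GAMMA_VDW_ABS = 30.0  # cal/(mol Å^2), magnitude quoted in the text (sign assumed below)


def g_cav(area):
    """Cavity free energy of Eq. (11) for a molecular surface area in Å^2."""
    return GAMMA_MICRO * area


def g_vdw(area):
    """Solute-solvent van der Waals free energy, assumed attractive (negative)."""
    return -GAMMA_VDW_ABS * area


def g_nonpolar(area):
    """Nonpolar solvation free energy of Eq. (10), in cal/mol."""
    return g_cav(area) + g_vdw(area)


if __name__ == "__main__":
    s = 100.0  # hypothetical molecular surface area in Å^2
    print(g_cav(s), g_vdw(s), g_nonpolar(s))
```

Under these assumptions the two terms partly cancel, so the net nonpolar cost per unit area is smaller than the cavity term alone.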
The cavity formation free energy term Gcav is presented as a sum over partial
atomic SAS areas si with atom-dependent scaling factors γi [42–44, 85]

$$G_{cav} = \sum_i \gamma_i s_i \qquad (13)$$

The atomic factors γi are adjusted empirically on a training set of small
molecules, which gives a uniform value γi = 117 cal/(mol Å²), independent of atom type.
The solvent-accessible atomic surfaces si are calculated as the VDW surfaces of atoms
with increased atomic radii Ri = σi/2 + 0.5 Å, where σi is the OPLS force field van der
Waals parameter [66]. The improved implicit solvation model AGBNP2 [45] describes the
cavity formation free energy by Eq. (13) with atom-specific γi, which are obtained by
fitting Eq. (13) to the hydration energies of alkane cavities. The atomic parameters γi
are in the range of 117–129 cal/(mol Å²). The AGBNP model describes the solute-solvent
van der Waals free energy as solute-solvent interactions over the solvent
volume modeled as a uniform continuum [85]

$$G_{vdw} = \sum_i \alpha_i \frac{a_i}{(B_i + R_w)^3} \qquad (14)$$

where

$$a_i = -\frac{16}{3} \pi \rho_w \varepsilon_{iw} \sigma_{iw}^6 \qquad (15)$$

where ρw = 0.033428 Å⁻³, σiw and εiw are the OPLS force field parameters
[65] for the van der Waals potential between atom i and the water oxygen, Bi is the Born
radius of atom i in the molecule in the given conformation, and Rw = 1.4 Å is the radius
of a water molecule. The values of the parameters αi (which are on average ~1) have been
set so as to reproduce as closely as possible the solute-solvent van der Waals energies
of individual atoms of a large set of proteins and small molecules obtained from
explicit solvent simulations with TIP4P3 [45, 65, 85]. The description
of the nonpolar hydration via Eqs. (13)–(14) with atomic scaling factors αi and γi
empirically accounts for the dependence of the atomic van der Waals radii Ri, which define
the SAS, on atomic charges.
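
As a minimal sketch of how Eqs. (13)–(15) combine, the following Python fragment evaluates the cavity and van der Waals terms for a list of atoms. The function names and the three-atom input are hypothetical illustrations, not part of AGBNP; the units (γi in cal/(mol Å²), εiw in kcal/mol, lengths in Å) are assumptions made for the example.

```python
import math

RHO_W = 0.033428  # water number density in 1/Å^3, value quoted in the text
R_W = 1.4         # radius of a water molecule in Å, value quoted in the text


def cavity_energy(gammas, areas):
    """Eq. (13): sum over atoms of gamma_i * s_i (cal/mol for the assumed units)."""
    return sum(g * s for g, s in zip(gammas, areas))


def vdw_coefficient(eps_iw, sigma_iw, rho_w=RHO_W):
    """Eq. (15): a_i = -(16/3) * pi * rho_w * eps_iw * sigma_iw**6."""
    return -(16.0 / 3.0) * math.pi * rho_w * eps_iw * sigma_iw ** 6


def vdw_energy(alphas, eps, sigmas, born_radii, r_w=R_W):
    """Eq. (14): sum over atoms of alpha_i * a_i / (B_i + R_w)**3."""
    return sum(
        alpha * vdw_coefficient(e, s) / (b + r_w) ** 3
        for alpha, e, s, b in zip(alphas, eps, sigmas, born_radii)
    )


# Hypothetical three-atom input, for illustration only.
areas = [12.0, 8.5, 20.0]        # partial SAS areas s_i, Å^2
gammas = [117.0] * len(areas)    # uniform gamma_i, cal/(mol Å^2)
eps = [0.15, 0.12, 0.16]         # assumed epsilon_iw, kcal/mol
sigmas = [3.3, 3.2, 3.4]         # assumed sigma_iw, Å
born = [1.8, 2.5, 4.0]           # assumed Born radii B_i, Å

print("G_cav (cal/mol):", cavity_energy(gammas, areas))
print("G_vdw (kcal/mol):", vdw_energy([1.0] * 3, eps, sigmas, born))
```

Because ai is negative and falls off as (Bi + Rw)⁻³, deeply buried atoms (large Bi) contribute little to the attractive solute-solvent term in this sketch.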

3.4 Free Energy of Solvent Polarization

The atomic charges q for protein conformation x induce in the solvent a polarization
charge density ⟨ρpol(r)⟩, which produces the reaction field electrostatic potential
⟨Vpol(xi)⟩ at the protein's atoms i,

$$\langle V_{\mathrm{pol}}(\mathbf{x}_i)\rangle = \int \frac{\langle \rho_{\mathrm{pol}}(\mathbf{r})\rangle}{|\mathbf{r} - \mathbf{x}_i|}\, d\mathbf{r} \qquad (16)$$

The polarization free energy is the work done in a charging process in which the
charges of the protein are gradually "turned on" by a factor λ

$$G_{\mathrm{pol}} = \int_0^1 d\lambda \sum_i q_i \langle V_{\mathrm{pol}}(\mathbf{x}_i)\rangle_{\lambda} \qquad (17)$$

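With the linear response approximation for solvent polarization, Vpol and ρpol
are both proportional to λ, and this gives

$$G_{\mathrm{pol}} = \frac{1}{2} \sum_i q_i \int \frac{\langle \rho_{\mathrm{pol}}(\mathbf{r})\rangle}{|\mathbf{r} - \mathbf{x}_i|}\, d\mathbf{r} \qquad (18)$$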

In a simulation with explicit solvent, ρpol is identical with the distribution of the
average charges of the solvent atoms, and a common approach is to use Eq. (17) to
compute Gpol with the thermodynamic perturbation method [76]. The validity of the linear
response approximation, Eq. (18), for the reaction potential of an aqueous
solvent has been tested by direct simulation of its dependence on λ in molecular
dynamics free energy simulations [3, 56, 64, 84, 118, 121, 148]. In the majority of simulations
of charged and polar molecules, a nearly linear response has been observed
for moderately charged solutes.
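
The step from Eq. (17) to Eq. (18) can be checked numerically. The sketch below, with hypothetical charges and reaction potentials, integrates the charging work over λ by the midpoint rule and compares it with the linear-response factor 1/2; a deliberately non-linear (quadratic) response is included only to show that the factor 1/2 is specific to the linear case.

```python
def charging_free_energy(charges, potential_at_lambda, n_steps=100):
    """Eq. (17): midpoint-rule integral over lambda of sum_i q_i <V_pol(x_i)>_lambda."""
    h = 1.0 / n_steps
    total = 0.0
    for k in range(n_steps):
        lam = (k + 0.5) * h
        potentials = potential_at_lambda(lam)
        total += sum(q * v for q, v in zip(charges, potentials)) * h
    return total


def linear_response_free_energy(charges, full_potentials):
    """Eq. (18) expressed through the reaction potential at full charge: 0.5 * sum_i q_i V_i."""
    return 0.5 * sum(q * v for q, v in zip(charges, full_potentials))


# Hypothetical two-site solute: charges and reaction potentials at lambda = 1.
charges = [1.0, -0.5]
v_full = [-2.0, 0.8]

linear = charging_free_energy(charges, lambda lam: [lam * v for v in v_full])
quadratic = charging_free_energy(charges, lambda lam: [lam ** 2 * v for v in v_full])

print("Eq. (17), linear response:   ", linear)
print("Eq. (18), factor 1/2:        ", linear_response_free_energy(charges, v_full))
print("Eq. (17), quadratic response:", quadratic)
```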

3.5 Continuum Electrostatic Poisson Model

The validity of the linear response approximation suggests that the calculation of the
average induced polarization charge density ⟨ρpol(r)⟩ can be done in the framework
of macroscopic electrostatics, i.e. with an implicit continuum solvent description. The
average electrostatic potential Φ(r) contains contributions from the fixed charges q
of the protein and the induced polarization charges in the solvent, according to the
Poisson equation,

$$\nabla^2 \Phi(\mathbf{r}) = -4\pi \sum_i q_i\, \delta(\mathbf{r} - \mathbf{x}_i) - 4\pi \langle \rho_{\mathrm{pol}}(\mathbf{r})\rangle \qquad (19)$$

and with the use of standard relations connecting the average induced charge density
⟨ρpol(r)⟩ with the average polarization, and the polarization with the electric field
E(r) [61, 79], one obtains the Poisson equation with a position-dependent dielectric
constant D(r)

$$\nabla \cdot \left[ D(\mathbf{r}) \nabla \Phi(\mathbf{r}) \right] = -4\pi \sum_i q_i\, \delta(\mathbf{r} - \mathbf{x}_i) \qquad (20)$$

If the position-dependent dielectric constant D(r) is known, Eqs. (19)–(20) define
the distribution of ⟨ρpol(r)⟩ for a given conformation of the protein,

$$\langle \rho_{\mathrm{pol}}(\mathbf{r})\rangle = -\nabla \cdot \left[ \frac{D(\mathbf{r}) - 1}{4\pi}\, \mathbf{E}(\mathbf{r}) \right] \qquad (21)$$

so that Gpol can be calculated with Eq. (18).


A fundamental question is how to model the distribution of the dielectric constant
D(r). Inside the solvent-excluded volume of the protein molecule the dielectric constant is
DI = 1, because the solvation free energy has to be calculated for fixed internal degrees
of freedom and a nonpolarizable charge distribution, in a single conformation [122].
In the solvent space it is common practice to use the bulk water dielectric
constant D0 = 80. Near the water-solute interface, the density of water drops sharply,
over a distance of about 0.5 Å, from the bulk density to zero, as has been shown by
extensive MD simulation of the solvent density around proteins [87]. Therefore a model
with a sharp step-wise approximation to the solvent density is reasonable. Using the
integral equation theory of liquids [13, 14] it was shown that the position-dependent
dielectric constant D(r) can be modeled by the equation

$$D(\mathbf{r}) = D_I + \theta(\mathbf{r})\,(D_0 - D_I) \qquad (22)$$

where θ(r) is a sharp switching function equal to zero inside the solvent excluded
volume and to one in the solvent. The exact choice of where to locate the solute-solvent
dielectric boundary is empirical and can compensate for deviations of the actual dependence
of the dielectric constant from the assumed step function near the protein surface. An optimal
set of atomic radii defining the dielectric interface MS has been calculated by fitting the
polarization free energy of the implicit model to a set of experimental data [130] and to data
obtained by calculations with explicit solvent for a training set of small molecules and for the
20 standard amino acids [100, 101, 150]. The obtained sets of atomic radii allow one
to reproduce the polarization free energies of the 20 standard amino acids to within
1–2% of the values from free energy simulation by the thermodynamic perturbation method
with explicit water.

3.6 A Smooth Solute-Solvent Dielectric Surface Interface

The dielectric surface interface defining the border between the solvent and the molecular
interior in Eq. (22) is a smooth molecular surface enclosing the molecular solvent
excluded volume [30–32, 146]. It has been shown that the smooth MS is a good approximation
of the dielectric border between the high-dielectric polar solvent
and the low-dielectric interior of the solute molecule in the continuum dielectric method based
on a numerical solution of the Poisson equation, Eq. (20) [145, 146]. Calculation of
molecular properties on the MS and integration of a function over the MS require a
numerical representation of the MS as a manifold S(si, ni, si) of boundary elements
(BEs), where si, ni, si are the coordinates, the outward normal vector and the area
of a small surface element. Owing to its complexity, the formally defined Connolly MS of
a protein may contain hundreds of unphysical regions with singularities (discontinuities)
in the direction of the normal vector. Singularities called cusps and holes appear
when the probe can almost, but not quite, pass through a group of two or three atoms
of the protein [32, 146, 164]. It has been shown [145–147] that an accurate solution
of the Poisson equation via the boundary element method needs an MS with smoothed
singularities. None of the programs MSROLL [32], MSEED [109], MS [142] and MSMS
[123] was specifically designed for boundary element applications, and the dot MS they
provide proved to be of poor quality for use with the BE method when tested by Vorobjev
and Hermans [146]. Connolly's method of MS calculation [30–32] has therefore been
revised, and a new method generating a Smooth Invariant Molecular Surface (SIMS)
[146] has been developed. The SIMS method (i) produces a near-homogeneous dot
distribution, (ii) is invariant to molecular rotation and translation, and (iii) recognizes
all types of singularities of the MS and smooths them with a specified minimal radius
of curvature. An optimal practical choice of the radius of the smoothing sphere is
~0.4 Å. The SIMS method generates a dot MS of good numerical quality, which
can be used in a variety of implicit continuum models for calculating solvation free
energy and for molecular electrostatics with the Poisson equation. The influence of the
choice and composition of boundary elements on the convergence of the numerical solution
of the Poisson equation has been investigated in detail using
Connolly's MSROLL [32] and SIMS programs to generate BEs on the solute-solvent
dielectric surface [70]. It has been found that the SIMS program generates BEs
of better quality and achieves convergence faster, using a number of surface
elements smaller by a factor of ~1.5–2.0 than the MSROLL program, in a test on a set of
35 medium size proteins. A complete description of the SIMS method can be found
elsewhere [146]. The CPU time of the SIMS method scales linearly with the number of atoms
in the molecule [147]. The SIMS program is available from the authors on request
(ynvorob@niboch.nsc.ru).

3.7 Numerical Solution of Poisson Equation

The finite difference (FD) method solves the Poisson (or Poisson-Boltzmann) equation
in differential form, Eq. (20), using multigrid volume elements in a rectangular box
which includes the solute and a volume of solvent around it [51–54, 93, 120, 129,
130]. The alternative is the boundary element (BE) method, which numerically solves
an integral equation over the dielectric boundary, to which the original
Poisson Eq. (19) can be analytically converted [18]. The BE method finds a solution
in terms of the induced solvent polarization charge density or the electrostatic potential on
boundary elements tessellating the solute-solvent dielectric surface [18, 68, 88, 89,
144, 145, 147, 150, 163]. The boundary element method is invariant to
rotation and translation of the solute molecule. A comparison of numerical results of the
multigrid BE and FD methods shows that the BE method exhibits a higher degree of
consistency [18, 145]. Improved methods of solving the Poisson equation for inhomogeneous
dielectric media using multigrid and multilevel finite-difference techniques
have been developed [35, 46, 51, 52, 95, 120, 167, 168]. Multilevel and multi-sized
BE techniques have been applied to the iterative BE method [115, 116, 144, 165].
Several new efficient implementations of the BE method have been developed
recently [88, 89, 150]. The BE integral equation, to which the Poisson Eq. (19)
is analytically converted [18], is solved by the Fast Adaptive Multigrid Boundary
Element (FAMBE) method [145, 150] for the induced surface polarization charge
density σ(t)

$$\sigma(\mathbf{t}) = f \int_S \frac{\sigma(\mathbf{s})\,(\mathbf{t} - \mathbf{s})\cdot \mathbf{n}(\mathbf{t})}{|\mathbf{t} - \mathbf{s}|^3}\, ds + \frac{f}{D_I} \sum_i \mathbf{n}(\mathbf{t})\cdot \mathbf{E}_i(\mathbf{t}) \qquad (23)$$

where f = (1/2π)(DI − D0)/(DI + D0), n(t) is the outward normal vector to the
molecular surface at point t, and Ei(t) is the electrostatic field generated by charge i at the
surface point t. The induced charge density σ(t) approximates the average solvent
induced charge density in Eq. (16). The solvent polarization free energy G_pol^FM of the
FAMBE method can be found with Eq. (18), replacing the volume integral and volume
charge density with the surface integral and surface charge density σ(s)

$$\begin{aligned}
G^{FM}_{\mathrm{pol}}(x) &= \frac{1}{2} \sum_i q_i \int_S \frac{\sigma(\mathbf{s})}{|\mathbf{s} - \mathbf{x}_i|}\, ds \\
&= \frac{1}{2} \sum_i q_i \int_S \frac{\sigma_i(\mathbf{s})\, ds}{|\mathbf{x}_i - \mathbf{s}|} + \frac{1}{2} \sum_{i \neq j} q_i \int_S \frac{\sigma_j(\mathbf{s})\, ds}{|\mathbf{x}_i - \mathbf{s}|} \\
&= \sum_i g^{FM}_i(x) + \frac{1}{2} \sum_{i \neq j} w^{FM}_{ij}(x)
\end{aligned} \qquad (24)$$

where g_i^FM(x) is the energy of solvent polarization by atom i, i.e. the energy of
self-polarization, and w_ij^FM is the pair PMF of interaction of atoms i, j due to the
solvent polarization. FAMBE is an efficient method to calculate a set of partial
atomic polarization densities σi(s), the polarization energy and the atomic forces for a given
protein conformation x. To calculate the induced surface polarization charge density,
the FAMBE method splits σ(t) given by Eq. (23) into a sum of
terms σi(t), each of which represents the induced polarization charge density
generated by a single group of charges qi; this is possible because the term Ei(t) is linear
in the charges qi. Equation (23) is thereby split into a set of independent minor BE equations,
one for the induced polarization charge density generated by each single charge
(or small compact group of charges)

$$\sigma_i(\mathbf{t}) = f \int_S \frac{\sigma_i(\mathbf{s})\,(\mathbf{t} - \mathbf{s})\cdot \mathbf{n}(\mathbf{t})}{|\mathbf{t} - \mathbf{s}|^3}\, ds + \frac{f}{D_I}\, \mathbf{n}(\mathbf{t})\cdot \mathbf{E}_i(\mathbf{t}), \quad i = 1, 2, \ldots \qquad (25)$$

The total surface charge σ(t) is the sum of the components σi(t). The reason for such a
decomposition is that the integral equation, Eq. (25), for each component σi(t) can
be converted into a discrete linear equation of low dimensionality with a matrix Mi
over the set i of adaptive multi-sized boundary elements

$$\boldsymbol{\sigma}_i = M_i \boldsymbol{\sigma}_i + \mathbf{E}_i \qquad (26)$$
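
Equation (26) has the fixed-point form σ = Mσ + E. As a minimal sketch of what solving such a system involves, the fragment below uses plain fixed-point (Jacobi-like) iteration on a hypothetical 2×2 matrix. This is a stand-in chosen for brevity, not the preconditioned bi-conjugate gradient solver that the chapter cites for FAMBE, and it converges only when the spectral radius of M is below one.

```python
def solve_fixed_point(matrix, source, tol=1e-5, max_iter=200):
    """Iterate sigma = M sigma + E until the relative change drops to `tol`.

    Returns the solution vector and the number of iterations used.
    """
    n = len(source)
    sigma = list(source)  # start from the source term E
    for iteration in range(1, max_iter + 1):
        updated = [
            sum(matrix[r][c] * sigma[c] for c in range(n)) + source[r]
            for r in range(n)
        ]
        change = max(abs(a - b) for a, b in zip(updated, sigma))
        scale = max(abs(a) for a in updated) or 1.0
        sigma = updated
        if tol * scale >= change:
            return sigma, iteration
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical 2x2 system, for illustration only.
M = [[0.0, 0.1],
     [0.2, 0.0]]
E = [1.0, 2.0]

sigma, n_iter = solve_fixed_point(M, E)
print("sigma =", sigma, "after", n_iter, "iterations")
```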

For each charge qi, the size of the boundary elements steadily increases with the
distance R from the source of the molecular electrostatic field. The MS is thereby
tessellated by a unique set of multi-sized BEs, so that for any given single charge
qi the dimensions of the vector of surface charge densities σi and of the matrix Mi are
significantly lower than the total number of surface elements that would be used if
the surface were tessellated by the finest uniform boundary elements in Eq. (25).
The number of multi-sized boundary elements NMBE, i.e. the size of the matrix Mi for any
single charge qi, which tessellate an MS with area AS scales as

$$N_{MBE} \approx n_{loc} \ln(A_S / A_{loc}) \qquad (27)$$

where nloc and Aloc are the average number of boundary elements and the size of the
local area with the finest tessellation. Each minor matrix equation, Eq. (26), is solved by the
preconditioned bi-conjugate gradient method [113]. A few iterations (5 or 6) are needed
to find a solution of the linear Eq. (26) with a relative accuracy of 10⁻⁴–10⁻⁵. The
computational complexity of the FAMBE method scales as

$$\mathrm{complexity} \approx N_z \left[ n_{loc} \ln(A_S / A_{loc}) \right]^2 \qquad (28)$$

where nloc is the average number of boundary elements in the local area, AS is the MS area,
Aloc is the size of the local area with the finest tessellation, and Nz is the number of
charges (or charged groups) in the solute molecule. Test calculations for several proteins show
that the CPU time of the FAMBE method scales approximately linearly with the
number of atoms of the molecule. The FAMBE method [150] shows a high degree
of internal self-consistency, accuracy and speed of calculation in comparison with
one of the latest realizations of the BE method by other authors [88, 89]. The free energy
of solvent polarization calculated with the FAMBE method includes the dependence on
salt effects implicitly [150]. Its good numerical quality and high speed recommend
the FAMBE method as a good tool for post-processing of molecular dynamics trajectories
for free energy estimation via Eq. (9), with important applications to systems
undergoing large conformational changes. The FAMBE program is available from
the authors on request (ynvorob@niboch.nsc.ru).
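
To make the scaling estimates of Eqs. (27)–(28) concrete, the sketch below evaluates them for hypothetical values of nloc, AS, Aloc and Nz, and compares the number of multi-sized elements with the count nloc·AS/Aloc that a uniform finest tessellation would need (the uniform count is an assumption of this example, not a formula from the text).

```python
import math


def n_multisized_elements(n_loc, area_total, area_local):
    """Eq. (27): N_MBE ~ n_loc * ln(A_S / A_loc)."""
    return n_loc * math.log(area_total / area_local)


def fambe_cost(n_charges, n_loc, area_total, area_local):
    """Eq. (28): complexity ~ N_z * [n_loc * ln(A_S / A_loc)]**2 (arbitrary units)."""
    return n_charges * n_multisized_elements(n_loc, area_total, area_local) ** 2


# Hypothetical numbers, for illustration only.
n_loc, area_total, area_local, n_charges = 50, 10000.0, 100.0, 300

adaptive = n_multisized_elements(n_loc, area_total, area_local)
uniform = n_loc * area_total / area_local  # assumed count for a uniform finest tessellation

print("multi-sized elements per charge:", round(adaptive))
print("uniform finest elements:        ", round(uniform))
print("estimated FAMBE cost:           ", fambe_cost(n_charges, n_loc, area_total, area_local))
```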

3.8 Generalized Born Model

A solution of the Poisson equation by the fastest available methods for a medium size
protein takes 10–30 s of CPU time on a single processor unit. This CPU time
is too large for the Poisson equation to be used for calculating solvation energies and atomic
forces on the fly in the MD method. Therefore faster simplified approaches such as
the generalized Born (GB) method have received attention [12, 135]. The GB model
defines the free energy of solvent polarization by protein charges analytically

$$G_{\mathrm{pol}} = -\frac{1}{2} \left( \frac{1}{D_I} - \frac{1}{D_0} \right) \sum_{i,j} \frac{q_i q_j}{f_{GB}(r_{ij})} \qquad (29)$$

where fGB(r) is a function that interpolates between the effective Born radius Bij
of atoms i, j when the distance rij between the atoms is short, and rij itself at large
distances rij [135]

$$f_{GB}(r_{ij}) = \left[ r_{ij}^2 + B_i B_j \exp\left( -r_{ij}^2 / 4 B_i B_j \right) \right]^{1/2} \qquad (30)$$

where Bi, Bj are the effective Born radii of atoms i and j. The basic idea of the GB
approach can be viewed as an interpolation formula between the analytical solutions
for a single sphere and for widely separated spheres. The total energy of solvent
polarization in the GB method is a sum of the atomic self-polarization energies, g_i^GB,
and the energies of polarization interaction, w_ij^GB, of pairs of atoms i, j, similar to
Eq. (24)

$$\begin{aligned}
G^{GB}_{\mathrm{pol}}(r) &= \frac{1}{2} \left( \frac{1}{D_0} - \frac{1}{D_I} \right) \sum_i \frac{q_i^2}{B_i} + \frac{1}{2} \sum_{i \neq j} \frac{q_i q_j}{f_{GB}(r_{ij}, B_i, B_j)} \left( \frac{1}{D_0} - \frac{1}{D_I} \right) \\
&= \sum_i g^{GB}_i + \frac{1}{2} \sum_{i \neq j} w^{GB}_{ij}
\end{aligned} \qquad (31)$$

Equation (31) defines the self-polarization energy g_i^GB as

$$g^{GB}_i = -\frac{q_i^2}{2 B_i} \left( \frac{1}{D_I} - \frac{1}{D_0} \right) \qquad (32)$$
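
A minimal Python sketch of Eqs. (29)–(32) follows. The Coulomb conversion constant (kcal Å mol⁻¹ e⁻²), the default dielectric constants and the two-atom test geometry are assumptions introduced for the example; the function names are hypothetical.

```python
import math

COULOMB = 332.0637  # assumed unit conversion, kcal Å / (mol e^2); not from the text


def f_gb(r, b_i, b_j):
    """Eq. (30): interpolation between sqrt(B_i B_j) at r = 0 and r at large r."""
    return math.sqrt(r * r + b_i * b_j * math.exp(-r * r / (4.0 * b_i * b_j)))


def gb_polarization_energy(charges, coords, born_radii, d_in=1.0, d_out=80.0):
    """Eq. (29): double sum over all pairs, including the i = j self terms."""
    prefactor = -0.5 * (1.0 / d_in - 1.0 / d_out) * COULOMB
    total = 0.0
    for q_i, x_i, b_i in zip(charges, coords, born_radii):
        for q_j, x_j, b_j in zip(charges, coords, born_radii):
            total += q_i * q_j / f_gb(math.dist(x_i, x_j), b_i, b_j)
    return prefactor * total


def gb_self_energy(q, born_radius, d_in=1.0, d_out=80.0):
    """Eq. (32): self-polarization energy of a single charge."""
    return -q * q / (2.0 * born_radius) * (1.0 / d_in - 1.0 / d_out) * COULOMB


# Hypothetical ion pair 4 Å apart with assumed Born radii.
charges = [1.0, -1.0]
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
radii = [2.0, 3.0]

print("G_pol (kcal/mol):", gb_polarization_energy(charges, coords, radii))
print("self terms:      ", [gb_self_energy(q, b) for q, b in zip(charges, radii)])
```

For a single ion the double sum reduces to the self term of Eq. (32), which is the Born formula for a sphere of radius Bi.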

Comparing Eqs. (31) and (24) one obtains a formal way to define the Poisson-ideal
(or FAMBE-ideal) effective Born radius Bi of atom i of the protein in a particular
conformation

$$B_i = -\frac{q_i^2}{2 g^{FM}_i} \left( \frac{1}{D_I} - \frac{1}{D_0} \right) \qquad (33)$$
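
Equation (33) is simply Eq. (32) inverted for Bi. A short sketch, assuming energies in kcal/mol, charges in units of e and a hypothetical Coulomb conversion constant, shows the round trip:

```python
COULOMB = 332.0637  # assumed unit conversion, kcal Å / (mol e^2); not from the text


def self_energy_from_radius(q, born_radius, d_in=1.0, d_out=80.0):
    """Eq. (32): self-polarization energy for a given Born radius."""
    return -q * q * COULOMB / (2.0 * born_radius) * (1.0 / d_in - 1.0 / d_out)


def poisson_ideal_born_radius(q, g_self, d_in=1.0, d_out=80.0):
    """Eq. (33): Born radius implied by a self-polarization energy g_self (negative)."""
    return -q * q * COULOMB / (2.0 * g_self) * (1.0 / d_in - 1.0 / d_out)


# Hypothetical self-energy, as would come from a Poisson (FAMBE) calculation.
g_self = self_energy_from_radius(0.5, 3.2)
print("recovered Born radius (Å):", poisson_ideal_born_radius(0.5, g_self))
```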

Salt effect correction is included in the GB model by the simple substitution [12]

$$\frac{1}{D_I} - \frac{1}{D_0} \;\rightarrow\; \frac{1}{D_I} - \frac{\exp\left(-\kappa f_{GB}(r_{ij})\right)}{D_0} \qquad (34)$$

where κ is the Debye-Hückel screening parameter. The GB model can be
thought of as an interpolation that seeks a relatively simple analytical formula which,
for real molecular conformations, reproduces as closely as possible the results
of the Poisson equation. The GB model using the Poisson-ideal Born atomic radii
Bi provides an accurate approximation of the Poisson polarization free energy of
proteins [38, 105], with errors within ~1–3%. Calculating the set of Poisson-ideal
Born radii on the basis of Eq. (33), i.e. by solving the Poisson equation, is impractical
[12]; therefore a rapid and still reasonable approximation of the effective Born
radii to their Poisson-ideal values is needed. If accurate effective Born radii can
be computed for each atom of the molecule at low CPU cost, then the computational
advantage of the analytical GB model over the numerical FD or BE solution
becomes obvious.
The original GB method [135] estimates the effective Born radii Bi by an expression
using the Coulomb field approximation (CFA) for the electrostatic field in the solvent and
the protein volume. The CFA self-polarization free energy G_i^CFA of a charge qi is

$$G^{CFA}_i = \frac{q_i^2}{2 \times 4\pi} \left( \frac{1}{D_0} - \frac{1}{D_I} \right) \int_{r > SEV} \frac{dV}{|\mathbf{r} - \mathbf{r}_i|^4} \qquad (35)$$

where SEV is the solvent excluded volume. The effective Born radius in the CFA
is defined as

$$B_i^{-1} = R_{i,vdW}^{-1} - \frac{1}{4\pi} \int_{r > R_{vdw,i}}^{SEV} \frac{dV}{|\mathbf{r} - \mathbf{r}_i|^4} \qquad (36)$$

where Rvdw,i is the van der Waals radius of atom i. The CFA is exact for
a charge located at the center of a spherical volume of excluded solvent. A further
approximation is the evaluation of the volume integral of the CFA energy density in Eq. (36)
by numerical integration [135] over the volume of the van der Waals spheres of the
solute atoms instead of the SEV, i.e.

$$B_i^{-1} = R_{i,vdW}^{-1} - \frac{1}{4\pi} \int_{r > R_{vdw,i}}^{VDW} \frac{dV}{|\mathbf{r} - \mathbf{r}_i|^4} \qquad (37)$$

Closed-form analytical expressions for the volume integral of Eq. (37) over a set of
overlapping spheres have been derived in the pair-wise approximation [49, 124]. The
GB model with the HTC [49] Born radii formula, Eq. (37), has been developed for small
molecules, where it was found to reproduce solvation energies and individual charge-charge
interactions quite well [33, 49] if reduced values of the atomic van der Waals
radii, R*i,vdw = Ri,vdw − 0.09 Å, are used. For macromolecules, the HTC approximation
tends to underestimate the Born radii of buried atoms [105] because the
integration procedure for Eq. (37) treats small vacuum-filled crevices between the
VDW spheres of protein atoms as being filled with water. The HTC formula assigns
Born radii for medium size proteins in a quite narrow interval, ~1.5–4.0 Å, while
the range of values for the Poisson-ideal Born radii is much larger, ~1.5–12 Å.

3.9 Improved Generalized Born Methods

Improved GB models try to increase the accuracy of the estimated atomic Born radii.
The GBSV/MS model [59, 81] uses (i) a definition of the protein volume as a union of
smoothed solvent exclusion functions centered on atoms, to approximate the rigorous
SEV more accurately but still computationally efficiently, and (ii) a corrected CFA
for the self-polarization free energy of charged atoms [38, 81].
The corrected GBSV models demonstrate great improvement over the Coulomb
field approximation for the calculated effective Born radii. The GBSV model [98,
99] agrees well with Poisson equation calculations of the polarization free energy,
showing relative errors of about 3–5%. The analytical
OBC Born radii model [106] defines Born radii by an empirical function of the volume
integral of Eq. (37) with empirical parameters optimized for a training set of
proteins, and moderately improves the accuracy of Born radii estimation for proteins. The
corrected GB models are implemented in the modern simulation packages AMBER
and CHARMM. Their algorithmic simplicity and reasonable accuracy have made
them commonly used in many applications [107]. A recent study [28] presented the
GBSV/MS2 model as an empirical expression for Born radii with three parameters. The
empirical parameters of the GBSV/MS2 model are optimized by minimizing the
root-mean-square deviation (RMSD) between GB and Poisson results for the effective
Born radii and self-polarization free energies of all atoms of 22 small proteins. The
average relative unsigned error of the GBSV/MS2 Born radii,
⟨|Bi(GBSV/MS2) − Bi(Poisson-ideal)|/Bi(Poisson-ideal)⟩, is equal to ~0.25 for buried
atoms with Bi > 4 Å. However, many buried atoms still have effective Born radii in the
GBSV/MS and GBSV/MS2 models that are lower by up to a factor of 2.0 compared to the
Poisson-ideal Born radii.
During the last decade Levy's group developed an analytical version of the GB model
[42–45, 85]. The AGBNP2 (Analytical GB NonPolar) model [45] is based on
HTC pairwise descreening and introduces innovations to the nonpolar and electrostatic
components of the solvation free energy. The AGBNP method approximates the
solute volume as a set of overlapping atomic spheres with continuous density, which
in turn are approximated by the Gaussian density functions proposed by Grant and
Pickup [47]. The model defines analytically the self-volume and VDW surface of
atom i with a set of empirically adjusted switching functions. The Born radii of
the AGBNP model are obtained by analytical evaluation of the integral of Eq. (37)
over the volume occupied by the solute atoms [44]. The AGBNP2 model [45] introduces
a method to approximate the true solvent excluded volume by the VDW integration
volume of Eq. (37) using empirically increased van der Waals radii and
rescaling factors, while keeping the analytical expressions obtained for intersecting
VDW spheres. The average ratio Bi(AGBNP2)/Bi(SEV) is ~1.2–2.0, while the ratio
Bi(AGBNP)/Bi(SEV) is ~1.4–3.0, for buried atoms with Born radii Bi(SEV) > 5 Å.
The AGBNP2 model is implemented in the MD package and shows reasonable
performance on a large set of test proteins [45].
A simple and quite accurate expression for computing the effective Born radii, the R6
Born radii method, was proposed in the study [33],

$$B_i^{-1} = \left( R_{i,vdW}^{-3} - \frac{3}{4\pi} \int_{r > R_{vdw,i}}^{SEV} \frac{dV}{|\mathbf{r} - \mathbf{r}_i|^6} \right)^{1/3} \qquad (38)$$

The R6 radii formula is exact for any location of a charged atom within a perfectly
spherical solute in the limit D0/DI ≫ 1 [1, 99]. It has been shown that R6 Born radii
computed by accurate numerical integration over the exact MS or SEV [99] are in
very close agreement with the Poisson-ideal Born radii. The study of [1] suggests a new
analytical method (AR6) to compute the effective Born radii as an empirical function
based on the R6 integral of Eq. (38), with a pairwise VDW approximation of the SEV
molecular volume and several molecular volume correction terms to approximate
more exactly the true solvent excluded volume in the vicinity of the atom in question.
The AR6 effective Born radii are defined by an empirical function with several fitted
parameters. The RMSD between the inverse
effective AR6 and Poisson-ideal Born radii for the medium size protein lysozyme is
about 0.064. The Born radii of buried atoms with Bi > 3.3 Å are estimated
by the AR6 model with errors of more than 20%, and the error increases up to 50%
for deeply buried atoms with Bi > 6 Å. For small drug-like molecules
the AR6 model with the cavity term of Eq. (13) and the van der Waals solvation term of
Eq. (14) reproduces the experimental solvation free energies with good accuracy;
the RMSD error is 1.73 kcal/mol.
An accurate and fast version of the MSR6 method [1] for calculating the
volume integral of Eq. (38) has been developed recently [153]. The atomic Born radius
Bi(MSR6) of an atom at position ri is defined by the integral over the protein MS [98]

$$B_i^{-1} = \left( \frac{1}{4\pi} \int_S \frac{(\mathbf{s} - \mathbf{r}_i)\cdot \mathbf{n}(\mathbf{s})}{|\mathbf{s} - \mathbf{r}_i|^6}\, ds \right)^{1/3} \qquad (39)$$

where n(s) is the normal vector to the MS at the point s. The MSR6 formula, Eq. (39),
follows from Eq. (38). It has been shown that when the MSR6 atomic Born radii
are computed by accurate numerical integration over the exact MS [98] they are
in very close agreement with the Poisson-ideal Born radii. Calculation of the surface
integral in Eq. (39) with the uniform tessellation of the protein MS by surface elements
used by Aguilar et al. [1] is a procedure of numerical complexity O(N^(5/3)) for
a protein with N atoms. The fast method for calculating the surface integral in
Eq. (39) is based on the FAMBE adaptive tessellation of the protein MS by
multi-sized boundary elements. The FAMBE adaptive tessellation reduces the numerical
complexity of calculating atomic Born radii to the order of O(N log N), because
the number of multi-sized surface elements scales as O(log N) [150]. Furthermore,
the MSR6 approximation of Eq. (39) has been empirically corrected, giving the
corrected approximation, MSR6c,

$$B_i(\mathrm{MSR6c}) = 0.9129\, B_i(\mathrm{MSR6}) + 0.0969 \qquad (40)$$

where Bi(MSR6) is the Born radius (in Å) defined by Eq. (39) over the protein MS
calculated by the SIMS method [146] with a solvent probe radius of 2.0 Å. This value
of the solvent probe radius was found to be optimal for approximating the dielectric
surface interface so as to reproduce the explicit water solvent polarization free energy [1].
Figure 2 shows that the correlation between the two sets of radii, Bi(MSR6c) and the
FAMBE-ideal Bi(FAMBE), is very high, R² = 0.9989. The corrected MSR6 method
gives atomic Born radii which agree with the FAMBE-ideal atomic Born radii with an
average error of 2.5%, i.e. practically within the numerical accuracy of the solution of the
Poisson equation due to the finite size of boundary elements or 3D grid [145].
Calculation of the almost FAMBE-ideal atomic Born radii Bi(MSR6c) is approximately
100 times faster than calculation of the FAMBE-ideal atomic Born radii by the FAMBE
method, i.e. by solving Eq. (23).
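
The surface integral of Eq. (39) can be tested on a geometry with a known answer. The sketch below, which is an illustration and not the SIMS/FAMBE tessellation, takes a charge displaced by d from the centre of a sphere of radius A, for which the integral of Eq. (39) has the closed-form value B = (A² − d²)/A. By symmetry the surface integral reduces to a one-dimensional integral over u = cos θ, evaluated here with the midpoint rule. The same routine with the power 4 in place of 6 gives the Coulomb-field analogue (via the divergence theorem applied to the CFA volume integral), which is exact only for d = 0. The final lines apply the empirical correction of Eq. (40) purely to show its form; the correction was fitted for protein molecular surfaces, not for this test sphere.

```python
import math


def born_radius_in_sphere(radius, offset, power=6, n_steps=20000):
    """Born radius of a charge displaced by `offset` from the centre of a sphere.

    Evaluates (1/4pi) * surface integral of (s - r_i).n / |s - r_i|**power;
    power=6 corresponds to Eq. (39), power=4 to the Coulomb-field analogue.
    """
    if not (radius > offset >= 0.0):
        raise ValueError("the charge must lie inside the sphere")
    h = 2.0 / n_steps
    total = 0.0
    for k in range(n_steps):
        u = -1.0 + (k + 0.5) * h  # u = cos(theta), midpoint of the k-th interval
        dist_sq = radius ** 2 + offset ** 2 - 2.0 * radius * offset * u
        total += (radius - offset * u) / dist_sq ** (power / 2.0)
    surface_integral = 2.0 * math.pi * radius ** 2 * total * h
    inverse_power = surface_integral / (4.0 * math.pi)  # equals B**-(power - 3)
    return inverse_power ** (-1.0 / (power - 3))


def msr6c(born_radius_msr6):
    """Eq. (40): empirical linear correction of the MSR6 Born radius (Å)."""
    return 0.9129 * born_radius_msr6 + 0.0969


A, d = 10.0, 5.0  # hypothetical sphere radius and charge offset, Å
b_r6 = born_radius_in_sphere(A, d, power=6)
b_cfa = born_radius_in_sphere(A, d, power=4)
b_closed_form = (A * A - d * d) / A

print("R6 surface integral:", b_r6)
print("closed form:        ", b_closed_form)
print("Coulomb-field value:", b_cfa)
print("Eq. (40) applied:   ", msr6c(b_r6))
```

In this test the Coulomb-field value exceeds the R6 value for the off-centre charge, while both coincide with A when d = 0.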

Fig. 2 Comparison of FAMBE-ideal atomic Born radii B(FAMBE) with atomic Born radii
B(MSR6) (red open circles) and B(MSR6c) (blue open squares), for several conformations of
the proteins BPTI, HEWL and RnaseA. The B(MSR6) radii are calculated using Eq. (39); the
B(MSR6c) radii are calculated using Eq. (40). The diagonal lines correspond to exact equality
between B(MSR6c) and B(FAMBE)

4 Protein Ionization

4.1 Potential of Mean Force of Equilibrium Titration

Transport of a protein molecule from the gas phase into a water proton bath is accompanied
by (de)protonation and ionization of titratable residues. The work required for the
equilibrium ionization is the free energy of ionization Ginz, Eq. (6), i.e. the
implicit titration potential of mean force (IT-PMF) for the protein in the water proton
bath. A rigorous statistical mechanical formulation of the IT-PMF has been considered
by Baptista et al. [7] in terms which eliminate the explicit reference to a variable
number of protons. The IT-PMF free energy G_inz^0(x, pH) of protein ionization (from
the neutral gas phase state) at a given pH in the water-proton bath is defined as

$$G^0_{inz}(x, pH) = -kT \ln \sum_{n,z} \exp\left[ \left( n(z)\mu - G^0(x, z) \right) / kT \right] \qquad (41)$$

where G^0(x, z) is the free energy of the protein in the ionization microstate z = (z1, …, zξ)
relative to the reference (neutral) state z0 in water, for the conformation x,

$$G^0(x, z) = G(x, z) - G(x, z_0) \qquad (42)$$

n(z) is the total number of bound protons in the ionization microstate z, and μ is the chemical
potential of protons, μ = −(kT ln 10) pH. A canonical MD simulation of a protein with
the free energy described by Eq. (41) at constant temperature is a constant pH MD
(CpHMD) simulation of the titratable system in the implicit titration potential of mean
force. To perform such a simulation the free energy Ginz(x, pH) should be expressed
in terms of quantities that can be computed on the fly. The first implementation of the
implicit titration potential Ginz(x, pH) for the CpHMD method, developed by Baptista
et al. [7], was based on the mean field approximation for the ionization degrees and the
Tanford-Kirkwood spherical model [138] for the protein.
An accurate implementation of the IT-PMF is provided by the FAMBEpH method
[150, 153], which generalizes the FAMBE method [145] for calculating the free energies
of solvent polarization Gpol(x) and protein ionization Ginz(x, pH). The MSR6c
method, Eqs. (39)–(40), is used for fast evaluation of the Born atomic radii. The
GB method with MSR6c Born radii allows one to calculate the solvent polarization and
protein ionization free energies and to perform analytical calculation of all electrostatic
atomic forces for MD simulation. The FAMBEpH and GB MSR6c methods provide
(i) the solvation free energies of the ionizable residues in water, (ii)
a realistic estimate of the average ionization degrees and their pair correlations, and
(iii) the free energy of ionization and the respective atomic forces due to the IT-PMF.
The IT-PMF gives an instant equilibrium response of the proton bath at a given pH;
therefore CpHMD with the IT-PMF can be more effective than the commonly
used explicit stochastic titration method, which considers a vast number of randomly
generated ionization microstates [90, 97, 159].

4.2 Practical Calculation of Potential of Mean Force of Implicit Titration

The ionization free energy Ginz(x, pH) can be calculated by the thermodynamic integration
method as a titration process from zero hydrogen-ion concentration to a given
value of pH via the Tanford-Schellman integral [126, 137]

$$\frac{\partial G_{inz}(x, pH)}{\partial pH} = kT (\ln 10) \sum_{i=1}^{\xi} \theta_i \langle z_i(x, pH) \rangle \qquad (43)$$

where ⟨zi(x, pH)⟩ is the average ionization degree of site i in the protein in conformation
x; the parameter θi is equal to 1 or −1 if the ionizing group is a base or an acid,
respectively. Integrating over pH one obtains a practically tractable expression [150,
162] for the free energy of ionization

$$\Delta G_{inz}(x, pH) - \Delta G_{inz}(x, \infty) = kT (\ln 10) \sum_{i=1}^{\xi} \theta_i \int_{\infty}^{pH} \left[ \langle z_i(x, pH) \rangle - \langle z_{i,mod}(pH) \rangle \right] dpH \qquad (44)$$

where the functions ⟨zi(x, pH)⟩ and ⟨zi,mod(pH)⟩ are the average ionization degrees
of site i in the protein in conformation x and in the isolated model compound,
respectively. The energy ΔGinz is the free energy of ionization of the protein relative
to the total free energy of ionization of all titratable residues of the respective
model compounds, i.e. isolated amino acids

$$\Delta G_{inz}(x, pH) = G_{inz}(x, pH) - G_{inz,mod}(x, pH) \qquad (45)$$

For site i in protein conformation x at a given pH, the average ionization
degrees ⟨zi(x, pH)⟩ are calculated by a Monte Carlo random walk in the space of
ionization microstates z

$$\langle z_i(x, pH) \rangle = \frac{1}{Z_{inz}} \sum_{z}^{2^{\xi}} \delta(z_i) \exp\left[ \left( n(z)\mu - G^0(x, z, pH) \right) / kT \right] \qquad (46)$$

where δ(zi) is the occupation (0, 1) of the ionization microstate zi and Zinz is the partition
function over all ionization microstates. It has been shown [150] that direct calculation of
the free energy from the partition function in Eq. (41) and calculation by the integral,
Eq. (44), give closely coinciding numerical values for the protein BPTI.
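
For a handful of sites, the sums in Eqs. (41) and (46) can be evaluated by exhaustive enumeration instead of a Monte Carlo walk. The sketch below does this for a toy model of two acidic sites; the value of kT, the pKa values, the proton-counting convention (one bound proton on each neutral acid) and the microstate energies G⁰(z) = kT ln 10 · Σ pKa,i zi are all assumptions of the example, chosen so that uncoupled sites reproduce the Henderson-Hasselbalch titration curve.

```python
import itertools
import math

KT = 0.593  # kcal/mol, roughly 298 K (assumed)


def titration_pmf(n_sites, ph, g0, n_protons, kT=KT):
    """Return the IT-PMF of Eq. (41) and the average ionization degrees of Eq. (46).

    All 2**n_sites binary microstates are enumerated explicitly, which replaces
    the Monte Carlo walk and is feasible only for a few sites.
    """
    mu = -kT * math.log(10.0) * ph  # chemical potential of protons
    weights = {
        z: math.exp((n_protons(z) * mu - g0(z)) / kT)
        for z in itertools.product((0, 1), repeat=n_sites)
    }
    partition = sum(weights.values())
    g_inz = -kT * math.log(partition)
    degrees = [
        sum(w for z, w in weights.items() if z[i] == 1) / partition
        for i in range(n_sites)
    ]
    return g_inz, degrees


PKA = (4.0, 6.5)  # hypothetical model pKa values of two acidic sites


def toy_g0(z, coupling=0.0):
    """Assumed microstate energy relative to the neutral state (z_i = 1 is ionized)."""
    intrinsic = sum(KT * math.log(10.0) * pka * zi for pka, zi in zip(PKA, z))
    return intrinsic + coupling * z[0] * z[1]


def toy_n_protons(z):
    """Assumed proton count: each neutral (z_i = 0) acid carries one bound proton."""
    return sum(1 - zi for zi in z)


g_inz, degrees = titration_pmf(2, 5.0, toy_g0, toy_n_protons)
print("IT-PMF (kcal/mol):", g_inz)
print("ionization degrees at pH 5:", degrees)
```

Passing a non-zero `coupling` (for example through `lambda z: toy_g0(z, 1.0)`) shifts the titration curves of both sites, which is the kind of site-site interaction that makes the enumeration, or its Monte Carlo replacement, necessary.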
The total energy Ginz(x, pH) of Eq. (41) can be expressed relative to any
reference ionization microstate zr. Assuming that G_inz^r(x, pH) is the free energy
of ionization of the protein at a given pH with respect to the reference ionization
microstate zr, from Eq. (41) one obtains

$$G^r_{inz}(x, pH) + G(x, z_r, pH) = G^0_{inz}(x, pH) + G(x, z_0, pH) \qquad (47)$$

It follows from Eqs. (41) and (47) that the energy G_inz^r(x, pH) has a minimal absolute
value if the reference ionization microstate zr is equal to the most probable ionization
microstate zp with minimal energy G(x, zp, pH). The most probable ionization
microstate zp is thereby the optimal one-state approximation of the equilibrium ensemble of
ionization states.
Finally, the total free energy G(x, pH) of a protein in the water-proton bath can be
expressed relative to the most probable ionization microstate zp

$$G(x, pH) = U^p_{mol}(x) + G^p_{cav}(x) + G^p_{pol}(x) + G^p_{inz}(x, pH) \qquad (48)$$

The first three terms of this equation describe the physically real protein structure
in the ionization microstate zp. The IT-PMF G_inz^p(x, pH) has a minimal amplitude
for the optimal ionization microstate and describes the correction due to the deviation of the
microstate zp from the equilibrium ensemble of ionization microstates.

4.3 Calculation of Ionization Equilibria

The protonation state of a protein with ξ protonatable sites is represented as a vector
$z = (z_1, z_2, \ldots, z_\xi)$, where $z_i$ denotes the protonation state of site i. It should be
noted that the neutral state of a site i is not necessarily unique because of proton
tautomerism [9]; $z_i$ can therefore take more than the two values 0 and 1 available to a site
without tautomerism. Carboxyl sites (Asp, Glu, C-terminus) have four tautomers, with the
proton bonded to either carboxyl oxygen atom in either the syn or the anti conformation
(dihedral HOD1-OD1-CD-OD2). The syn:anti pair is assigned the ratio 94.5:5.5 on the basis of
experimental and theoretical data [9, 97] for the isolated amino acid with blocked termini
in water. The two tautomers of histidine carry the proton on either the Nδ1 or the Nε2 atom;
the Nδ1:Nε2 pair is assigned the ratio 30:70 measured for His with blocked termini [139]. The
proton tautomers of neutral Tyr, Cys, Lys and the N-terminus are equivalent. Because of
this tautomerism the neutral state of a site is not unique, whereas the fully charged state
of the protein is unique and can be taken as the reference state. The ionization states are
therefore $z_i = 0, 1, \ldots, \tau_i$, where $z_i = 0$ refers to the ionized state and the remaining
$\tau_i$ values refer to alternative tautomers with different proton positions.
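
The tautomer populations quoted above translate into tautomer-specific (microscopic) pKa values through the relation given as Eq. (54) later in this section. The following is a minimal sketch under stated assumptions: θ = −1 is used for acidic sites and θ = +1 for basic sites, each syn/anti population is split equally between the two carboxyl oxygens, and the macroscopic pKa values (4.0 for a carboxyl group, 6.5 for His) are purely illustrative and not taken from this chapter. The function and dictionary names are hypothetical.

```python
import math

def tautomer_pka(macroscopic_pka, fraction, theta):
    """Microscopic pKa of one neutral tautomer, following the form of Eq. (54).

    theta = -1 for an acidic site, +1 for a basic site (assumed convention).
    """
    return macroscopic_pka - theta * math.log10(fraction)

# Neutral-tautomer fractions built from the ratios quoted in the text.
# Assumption: each syn/anti population is shared equally by the two oxygens.
CARBOXYL = {"syn-OD1": 0.4725, "syn-OD2": 0.4725, "anti-OD1": 0.0275, "anti-OD2": 0.0275}
HISTIDINE = {"ND1-H": 0.30, "NE2-H": 0.70}

def all_tautomer_pkas(macroscopic_pka, fractions, theta):
    """Microscopic pKa of every tautomer listed in `fractions`."""
    return {name: tautomer_pka(macroscopic_pka, f, theta) for name, f in fractions.items()}

if __name__ == "__main__":
    # Illustrative macroscopic pKa values, not taken from the chapter.
    print(all_tautomer_pkas(4.0, CARBOXYL, theta=-1))
    print(all_tautomer_pkas(6.5, HISTIDINE, theta=+1))
```

A quick consistency check is that the microscopic constants recombine into the macroscopic one: for a base the tautomer acid dissociation constants add up to the macroscopic constant, and for an acid their reciprocals do.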
The free energy of dissociation of hydrogen ions from amino acid side chains Si
of the protein can be defined relative to the dissociation of hydrogen ions from the
isolated amino acids Si considering the thermodynamic cycle
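$$
\begin{array}{ccc}
\sum_i S_i(0) + n(z)\,\mathrm{H}^{+} & \xrightarrow{\;\Delta G_S(z)\;} & \sum_i S_i(z_i) \\
\downarrow \Delta G(0) & & \downarrow \Delta G(z) \\
P(X, 0) + n(z)\,\mathrm{H}^{+} & \xrightarrow{\;\Delta G_P(X, z)\;} & P(X, z)
\end{array}
\tag{49}
$$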

where $P(X, z)$ is the protein in the macroscopic conformation X and the fixed ionization
state z; $S_i(z_i)$ is the model compound of site i in the state $z_i$; $\Delta G_S(z)$ is the free
energy of protonation (deprotonation) of the model compounds by n(z) protons (n(z) may be
positive or negative) starting from the fully ionized state $S_i(0)$; $\Delta G_P(X, z)$ is the
free energy of the protonation reaction of the protein starting from its fully ionized state
$P(X, 0)$; and $\Delta G(0)$ and $\Delta G(z)$ are the free energy differences between the model
compounds and the protein in the fully ionized and in the protonated states, respectively,

$$
\Delta G(z) = G_P(X, z) - \sum_i G_{S_i}(z_i)
\tag{50}
$$

where $G_P(X, z)$ and $G_{S_i}(z_i)$ are the free energies of the protein and of the model compound
in the fixed ionization states z and $z_i$, respectively. The fundamental assumption behind the
use of model compounds is that the quantum contribution to the (de)protonation of site
$S_i$ in the protein is the same as in the corresponding model compound $S_i$, so that only
classical contributions (from the molecular mechanical model) need to be considered in
Eq. (41). The free energy of a molecule in solvent in a fixed ionization state (i.e. with
fixed atomic charges) and in a particular macroscopic conformation is given by the expression
[147]

$$
G(X, z) = \langle U_m(x, z) + W(x, z_i) \rangle_X - T S_{conf}(X, z)
\tag{51}
$$

where $U_m$ is the molecular internal potential energy in vacuum, W is the solvation free
energy, Eq. (2), $S_{conf}$ is the conformational entropy of the molecule in the given
macroscopic conformation X, and the average is taken over all microstates x of the
conformation X. From the thermodynamic cycle (49) one can write

$$
\Delta G_P(X, z) = \Delta G_S(z) + \Delta G(z) - \Delta G(0)
\tag{52}
$$

The model compounds in solution contribute independently to the energy $\Delta G_S(z)$;
thus

$$
\Delta G_S(z) = \ln(10)\,kT \sum_{i, z_i}^{\xi} \theta_i \left[ \mathrm{pH} - \mathrm{p}K_{S_i}(z_i) \right]
\tag{53}
$$

where $\mathrm{p}K_{S_i}(z_i)$ is the pKa value of the deprotonation (protonation) reaction involving
the neutral tautomeric form $S_i(z_i)$, which is related to the macroscopic experimental pKa
by [9, 92]

$$
\mathrm{p}K_{S_i}(z_i) = \mathrm{p}K_{S_i} - \theta_i \log f_i(z_i)
\tag{54}
$$

where $f_i(z_i)$ is the fraction of the tautomer $z_i$ among all neutral tautomers of the model
compound $S_i$ and $\mathrm{p}K_{S_i}$ is the macroscopic pKa of the model compound [103].
Modern practice [92, 102] is to treat the thermodynamic cycle (49) under the following
approximations: (1) the protein is frozen in a particular conformational microstate x;
(2) the protein is considered as a set of ξ + 1 nonoverlapping fragments, namely the ξ
protonatable amino acids plus the remaining nonprotonatable background (B); (3) the total
protein free energy of Eq. (51) is approximated by the molecular-mechanical or electrostatic
energy of the protein in solution. The electrostatic energy is calculated with the linear
Poisson-Boltzmann equation in the continuum dielectric model,

$$
G(X, z) \approx U_{el}(x, z) = U^{coul}_{m}(x, z) + G_{pol}(x, z_i)
\tag{54}
$$

where $U^{coul}_{m}$ is the molecular electrostatic energy in vacuum and $G_{pol}$ is the solvent
polarization free energy, Eq. (24). The linearity of the Poisson-Boltzmann equation
implies that the superposition principle holds for these fragments, which gives for the
energy $U^{P}$ of the protein

$$
U^{P}(x, z) = U^{P}_{BB}(x) + \sum_{i}^{\xi} U^{P}_{iB}(x, z_i) + \sum_{i}^{\xi} U^{P}_{ii}(x, z_i)
+ \sum_{i>j}^{\xi} U^{P}_{ij}(x, z_i, z_j)
\tag{55}
$$

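where $U^{P}_{\alpha\beta}$ denotes the energy of interaction between fragments α and β. Finally,
the free energy $\Delta G_P$ of the microstate z of the protein protonation reaction is

$$
\begin{aligned}
\Delta G_P(x, z) = {} & \ln(10)\,kT \sum_{i}^{\xi} \sum_{z_i} \delta(z, i)
\left[ \theta_i \left( \mathrm{pH} - \mathrm{p}K_{S_i} \right) - \log f_i(z_i) \right] \\
& + \sum_{i}^{\xi} \sum_{z_i} \delta(z, i)
\left[ U^{P}_{iB}(x, z_i) + U^{P}_{ii}(x, z_i) - U^{S}_{ii}(x, z_i) \right] \\
& + \sum_{i>j}^{\xi} \sum_{z_i, z_j} \delta(z, i)\,\delta(z, j)\, U^{P}_{ij}(x, z_i, z_j)
\end{aligned}
\tag{56}
$$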

where $\delta(z, i) = 0, 1$ is the occupation number of the state i in the ionization
microstate z, and $\theta_i = -1, 1, 0$ if the state i is an acid, a base or a neutral tautomer,
respectively. The first sum in Eq. (56) is the protonation energy of the model compounds
corrected by the entropy factor, Eq. (54), due to the neutral tautomer fraction $f_i(z_i)$;
the second sum is the effect of the protein environment on the ionizable site i in the state
$z_i$; the third sum is the interaction energy of the ionizable sites i and j in the isomeric
states $z_i$ and $z_j$. A similar expression for the free energy of an ionization microstate is
used by Song et al. [132] in the MCCE2 method, which considers both neutral-state
tautomerism and side chain rotamers.
The probability p(x, z) of finding the protein in the conformation x and the ionization
state z is given by the Boltzmann factor

$$
p(x, z) = \frac{\exp\left[ -\Delta G_P(x, z)/kT \right]}{Z_{inz}}
\tag{57}
$$

The Boltzmann distribution of the ionization states z is calculated by the Monte Carlo
method [132, 150]. A random walk in the ionization phase space consists of randomly
choosing a move from a set of predefined move types, e.g. a single-site flip (acid, base,
tautomer) and several types of double-site flips, i.e. a base/base, acid/acid or
tautomer/tautomer opposite flip, base/acid annihilation (creation), etc. An effective way
to generate the equilibrium distribution of ionization states is to start the calculation
at a high (or low) pH, where all acid (base) groups are charged. The MC simulation then
proceeds in small steps of ~0.25–0.5 pH units over a wide pH range, e.g. (−10, 20). The pKa
of a titratable residue can be determined as the root of the equation for the average
occupation of the ionized state, $\langle \delta(z, i)(\mathrm{pH}) \rangle_z = 1/2$ [150], or by
fitting the titration curve $\langle \delta(z, i)(\mathrm{pH}) \rangle_z$ to the
Henderson-Hasselbalch equation [132].
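
The Monte Carlo titration described above can be sketched in a few lines. This is a simplified illustration, not the FAMBEpH or MCCE2 implementation: it uses only single-site flips, two-state sites without tautomers, and a hypothetical energy model in which z_i = 1 means "protonated" (so the low-pH starting state is fully protonated). All function names and the coupling values are assumptions made for this example.

```python
import math
import random

LN10 = math.log(10.0)

def energy_kt(z, pks, w, ph):
    """Microstate energy in kT; z[i] = 1 means site i is protonated (assumed model)."""
    e = sum(z_i * LN10 * (ph - pk) for z_i, pk in zip(z, pks))
    for i in range(len(z)):
        for j in range(i):
            e += w[i][j] * z[i] * z[j]     # lower-triangular pair couplings
    return e

def mc_occupations(pks, w, ph, z, rng, n_sweeps=4000):
    """Metropolis sampling with single-site flips; returns mean protonation per site.

    The state list z is updated in place so that it can seed the next pH value.
    """
    n = len(pks)
    e = energy_kt(z, pks, w, ph)
    totals = [0] * n
    for _ in range(n_sweeps):
        for _ in range(n):
            i = rng.randrange(n)
            z[i] ^= 1                      # trial flip of one site
            e_new = energy_kt(z, pks, w, ph)
            if e_new <= e or rng.random() < math.exp(e - e_new):
                e = e_new                  # accept the move
            else:
                z[i] ^= 1                  # reject: undo the flip
        for i in range(n):
            totals[i] += z[i]
    return [t / n_sweeps for t in totals]

def titration_curve(pks, w, ph_values, seed=0):
    """Scan pH from the first to the last value, carrying the state along."""
    rng = random.Random(seed)
    z = [1] * len(pks)                     # low-pH start: everything protonated
    return [mc_occupations(pks, w, ph, z, rng) for ph in ph_values]

def half_point(ph_values, fractions):
    """pH where the curve crosses 1/2, by linear interpolation between scan points."""
    points = list(zip(ph_values, fractions))
    for (p0, f0), (p1, f1) in zip(points, points[1:]):
        if (f0 - 0.5) * (f1 - 0.5) <= 0.0 and f0 != f1:
            return p0 + (0.5 - f0) * (p1 - p0) / (f1 - f0)
    return None

if __name__ == "__main__":
    ph_values = [0.5 * k for k in range(21)]          # pH 0 to 10 in steps of 0.5
    pks = [4.0, 6.5]
    w = [[0.0, 0.0], [1.5, 0.0]]                      # hypothetical coupling in kT
    curve = titration_curve(pks, w, ph_values)
    for site in range(len(pks)):
        print(site, half_point(ph_values, [row[site] for row in curve]))
```

Recomputing the full energy at every trial move keeps the sketch short; a production code would update only the terms that involve the flipped site, and would add the double-site moves listed in the text to improve sampling of strongly coupled pairs.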

4.4 Constant pH MD Method with the Potential of Mean Force of Implicit Titration

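Molecular dynamics of a protein molecule at fixed pH in the potential defined by
Eq. (48) is atomic dynamics in the instantaneous optimal ionization microstate $z_p$,
driven by the conventional intramolecular atom-atom interactions, the implicit solvation
potential and the IT-PMF $\Delta G^{p}_{inz}(x, \mathrm{pH})$ at the given pH. The atomic forces due
to the IT-PMF $\Delta G^{p}_{inz}(x, \mathrm{pH})$, Eq. (46), can be calculated analytically [153]

$$
\begin{aligned}
\frac{\partial \Delta G^{p}_{inz}(x; \mathrm{pH})}{\partial r_k}
&= \left\langle \frac{\partial G^{0}(x, z, \mathrm{pH})}{\partial r_k} \right\rangle_z
 - \sum_{i=1}^{\xi} \delta^{p}_{i} \frac{\partial \Delta g_i(x)}{\partial r_k}
 - \frac{1}{2} \sum_{ij}^{\xi} \delta^{p}_{i} \delta^{p}_{j} \frac{\partial \Delta w_{ij}(x)}{\partial r_k} \\
&= \sum_{i=1}^{\xi} \left( \langle \delta_i \rangle - \delta^{p}_{i} \right) \frac{\partial \Delta g_i(x)}{\partial r_k}
 + \frac{1}{2} \sum_{ij}^{\xi} \left( \langle \delta_i \delta_j \rangle - \delta^{p}_{i} \delta^{p}_{j} \right)
 \frac{\partial \Delta w_{ij}(x)}{\partial r_k}
\end{aligned}
\tag{58}
$$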

where $\Delta g_i(x)$ is the electrostatic energy of ionization of the titratable group i,
$\Delta w_{ij}(x)$ is the energy of the pair interaction of the titratable groups i and j,
$\langle \delta_i \rangle$ is the average occupation of the state i, $\delta^{p}_{i}$ is the occupation of
the state i in the most probable (optimal) ionization microstate, and
$\langle \delta_i \delta_j \rangle$ is the pair correlation of the occupations of the titratable groups
i and j; the averages are calculated by the FAMBEpH method [150]. The gradients of
$\Delta g_i(x)$ and $\Delta w_{ij}(x)$ with respect to the atomic coordinates $r_k$ are calculated
efficiently in the framework of the GB method with the Born radii defined by the MSR6c
method, Eqs. (39), (40).
The CpHMD-IT method is implemented as a sequential algorithm [153] consisting of the
following five steps: (1) for a given protein conformation $x_0$ at the time $t_0$, the optimal
ionization microstate $z_p$, the average occupation degrees $\langle \delta_i \rangle$, the pair
correlation matrix $\langle \delta_i \cdot \delta_j \rangle$ and the PMF $\Delta G^{p}_{inz}(x, \mathrm{pH})$ are
calculated using the FAMBEpH method [150]; (2) the molecular topology of the protein
molecule is initialized in the optimal ionization microstate $z_p$; (3) each newly bound
proton is assigned a velocity equal to that of the respective heavy atom; (4) an MD
simulation of the protein molecule in the fixed ionization microstate $z_p$ is run in the
force field defined by Eq. (48) for the time $\tau_{zfix}$ ~ 2–4 ps; (5) return to step (1).
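
The control flow of these five steps can be written as a short driver loop. The sketch below shows only the sequencing; it is not the BISON or FAMBEpH code. The four callables (`titrate`, `build_topology`, `inherit_proton_velocities`, `run_md`) are hypothetical placeholders whose names and signatures are assumptions made for this example, and the toy stand-ins in the demo exist solely so that the skeleton runs.

```python
def cphmd_it(x, v, ph, n_cycles, titrate, build_topology,
             inherit_proton_velocities, run_md, tau_fix_ps=2.0):
    """Alternate implicit titration and fixed-microstate MD (control flow only).

    All four callables are hypothetical placeholders supplied by the caller:
      titrate(x, ph) -> (z_p, mean_occ, pair_corr)                          step 1
      build_topology(z_p) -> topology                                       step 2
      inherit_proton_velocities(v, z_old, z_new) -> v                       step 3
      run_md(x, v, topology, mean_occ, pair_corr, ph, tau_fix_ps) -> (x, v) step 4
    """
    z_old = None
    history = []
    for _ in range(n_cycles):                       # step 5: loop back to step 1
        z_p, mean_occ, pair_corr = titrate(x, ph)
        topology = build_topology(z_p)
        v = inherit_proton_velocities(v, z_old, z_p)
        x, v = run_md(x, v, topology, mean_occ, pair_corr, ph, tau_fix_ps)
        history.append((z_p, x))
        z_old = z_p
    return history

if __name__ == "__main__":
    # Toy stand-ins so the skeleton runs: one coordinate, one titratable site.
    def titrate(x, ph):
        z_p = (1,) if ph < 4.0 + x else (0,)        # most probable state of the single site
        return z_p, [float(z_p[0])], [[float(z_p[0])]]

    def build_topology(z_p):
        return {"protonated": z_p}

    def inherit_proton_velocities(v, z_old, z_new):
        return v                                    # nothing to do in a one-coordinate toy

    def run_md(x, v, topology, mean_occ, pair_corr, ph, tau_fix_ps):
        target = 1.0 if topology["protonated"][0] else -1.0
        return x + 0.5 * (target - x), v            # relaxation step standing in for MD

    print(cphmd_it(0.0, 0.0, 4.5, 3, titrate, build_topology,
                   inherit_proton_velocities, run_md))
```

Passing the averages and pair correlations into `run_md` reflects the fact that the IT-PMF forces of Eq. (58) depend on them while the microstate itself stays fixed during the MD segment.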
The CpHMD-IT simulations were carried out at a constant temperature of 300 K
using the in-house MD program BISON [151]. The optimal ionization microstate $z_p$, the
average ionization degrees $\langle \delta_i \rangle$, the pair correlation matrix
$\langle \delta_i \cdot \delta_j \rangle$ and the PMF $\Delta G^{p}_{inz}(x, \mathrm{pH})$ were calculated using the
FAMBEpH method [150] with a salt concentration of 0.15 M and the dielectric constants
$D_0 = 80$ and $D_I = 16$. The large value of $D_I$ used to calculate the ionization equilibrium
for a fixed protein conformation x accounts for reorganization due to nonstructural
responses (e.g., charge redistribution due to ionization) not captured by the current
method [8]. The AMBER99 force field [155] was used to calculate the intramolecular energy
and forces. A consistent set of atomic charges for protein residues in the neutral and
ionized states was computed by the RESP method [5]. The intramolecular electrostatic and
solvent polarization energies and all electrostatic atomic forces of Eq. (48) were
calculated by the GB method with salt effects, using the almost FAMBE-ideal atomic Born
radii $B_i$(MSR6c) with the dielectric constants $D_0 = 80$, $D_I = 1$ and a salt concentration
of 0.15 M. The optimal update time step for the atomic Born radii is $\tau_B$ = 0.02–0.04 ps,
which allows one to generate stable CpHMD-IT trajectories with an RMSD of about 2 Å from
the crystal structure [153] for the test proteins BPTI, HEWL and RNase A (Fig. 3).

Fig. 3 Comparison of the PMFs W(FAMBE) of the FAMBE method with the PMFs W(MSR6c) of the
Generalized Born model with almost-ideal atomic Born radii B(MSR6c) for pairs of atoms from
several conformations of the proteins BPTI, HEWL and RNase A. The diagonal solid line
corresponds to exact equality between the values of the two PMFs

5 Examples of Simulations with Continuum Electrostatic Models

5.1 Advantages of Implicit Solvent Models

Implicit solvent models have several advantages over the explicit molecular representation
of water in MD simulations [106, 122, 148]: (i) implicit models describe an instantaneous
solvent dielectric response, which eliminates the need for the lengthy equilibration of
water that is necessary in explicit water simulations; (ii) the absence of the solvent
reorganization energy barriers and dynamical viscosity associated with an explicit water
environment allows the solute molecule to explore the available conformational phase space
more quickly; (iii) the implicit dielectric continuum model corresponds to solvation in an
infinite volume of solvent, which avoids possible artifacts of electrostatic interactions
between solute replicas in the periodic systems typically used with explicit solvent
models [58]; (iv) the implicit titration method describes an instantaneous response of the
proton bath and eliminates the need to sample a vast number of ionization microstates to
model the equilibrium ionization state; (v) estimating the free energies of solvated
structures is much more straightforward than with explicit water models; (vi) the
computational cost of implicit models is considerably smaller than that of simulations
representing water explicitly. Realistic implicit models of electrostatic effects therefore
find wide application in biomolecular simulations. A reliable implicit solvent model should
be carefully optimized in conjunction with a particular force field to reproduce the
experimental solvation energies of a representative set of small molecules, the potentials
of mean force of the interactions between pairs of protein side chains in explicit solvent,
and the secondary structure equilibria of peptides [27, 28, 45].

5.2 Free Energy of Protein Decoys

The growing gap between the number of known protein sequences and the number
of structures solved by X-ray or NMR methods increases the interest in developing
reliable computational methods to predict and validate unknown structures. All-atom force
fields and implicit solvation models are a valuable tool for refining and scoring protein
models produced by coarse-grained methods such as TASSER [166], 3D-SHOTGUN [39],
ROSETTA [20], etc. These methods produce sets of models that contain relatively accurate
native-like structures, but they are usually unable to identify the native-like
conformations reliably among the other, non-native conformations. A necessary requirement
for a free energy prediction method is that it must recognize the native state of the
protein, or a set of similar native-like conformations, as the models with the lowest free
energies.
Tests on a set of misfolded proteins have shown that the solvation term and its
electrostatic component are important parts of the total free energy of a protein in
solvent and improve the success rate of discriminating the native structure from decoys
[147–149]. The CHARMM 19 force field with a GB solvent model was able to identify the
misfolded structures with more than 90% accuracy [37]. A high success rate was reported for
a discrimination test on a set of protein decoys performed by Felts et al. [40] using local
energy minimization with the OPLS all-atom force field and the GBNP implicit solvent
model [43]: the native structures had the lowest free energy for almost 90% of the proteins
considered [40]. Later, more rigorous tests [161] showed that a long MD relaxation of
protein decoys with the AMBER/GB force field led to a significant deterioration of the
discriminative ability of the force field. The lowest-energy structures were obtained from
the short ~5 ps native MD trajectories for 70% of the proteins, while a longer relaxation
of up to ~2 ns decreased the success rate of discriminating the native structures to 20%.
With the FAMBE method of calculating the solvation free energy <W(x, q)>, Eq. (5), the
native structures were found to be more stable than the decoy structures for 100% of the
proteins of the Park and Levitt [108] decoy set and of a set of CASP3 protein models
[147, 149]. Calculations on a large set of misfolded proteins have led to the conclusion
that the total electrostatic energy of a protein in water, i.e. the sum of the internal
electrostatic energy and the solvent polarization energy, is minimal for the native and
native-like protein conformations [149], Fig. 4. Long-range electrostatic interactions in
solution are thus an essential factor defining the global protein fold, the free energy
landscape and probably the folding pathway for partially folded protein structures.
Experimental studies of protein stability and charge-charge interactions lead to the
conclusion that the global long-range charge-charge interactions in a protein might be more
important than the interactions between adjacent charged residues [48, 86, 136].
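
The success rates quoted above count the proteins for which the native structure has the lowest computed free energy within its decoy set. A minimal sketch of that bookkeeping follows; the function names are hypothetical, the Z-score is an additional commonly used measure that the text does not mention, and the energies in the demo are invented numbers for illustration only.

```python
import statistics

def native_rank(native_energy, decoy_energies):
    """Rank of the native structure (1 = lowest energy in the set)."""
    return 1 + sum(1 for e in decoy_energies if e < native_energy)

def native_z_score(native_energy, decoy_energies):
    """Energy gap of the native structure in units of the decoy standard deviation."""
    mean = statistics.mean(decoy_energies)
    spread = statistics.pstdev(decoy_energies)
    return (native_energy - mean) / spread

def success_rate(decoy_sets):
    """Fraction of proteins whose native structure has the lowest energy.

    decoy_sets maps a protein name to (native_energy, list_of_decoy_energies).
    """
    hits = sum(1 for native, decoys in decoy_sets.values()
               if native_rank(native, decoys) == 1)
    return hits / len(decoy_sets)

if __name__ == "__main__":
    # Invented energies (kcal/mol) for two hypothetical proteins.
    sets = {
        "protein_A": (-120.0, [-101.5, -110.2, -95.0]),
        "protein_B": (-80.0, [-82.3, -70.1, -75.6]),
    }
    print(success_rate(sets))
    print({name: native_z_score(n, d) for name, (n, d) in sets.items()})
```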
The discriminative ability of a force field and solvation model depends on the quality
of the protein decoy set and on the protocol used to compute the free energies of the
decoys [5, 143]. When local energy minimization or a short MD trajectory is replaced by a
long MD trajectory on the ns time scale, the decoy conformations become well relaxed within
the given force field and solvation model, unfavorable atom-atom contacts disappear, and
discriminating the native-like structure from a set of competing decoys becomes a real
challenge. It was shown [5] that, on a high-quality, independently generated decoy set, the
ECEPP05 force field [4] combined with the FAMBEpH solvation-ionization model, Eq. (9), [150]
and structure relaxation achieves a superior success rate of ~89% compared to other, less
realistic solvation models. This result confirms the importance of a reliable model for the
electrostatic energy of a protein in water. Because of their long-range nature,
electrostatic interactions depend to a much larger extent than the short-range van der
Waals interactions on the global distribution of charged and neutral residues over the
protein volume and on the shape of the protein molecular surface.

Fig. 4 The total electrostatic energy of protein decoys versus the RMSD of the decoys
from the native structures

6 Predictions of pKa Values of Ionizable Groups

6.1 Modern Methods of pKa Calculations

An accurate prediction of pKa values is crucial for reliable modeling of virtually all
biological processes. Current methods of pKa prediction have reached an average
accuracy (RMSD from experimental data) of less than 1 pH unit, as reported in
benchmarking papers [11, 34, 69, 83, 128, 134, 140, 141]. However, the reported
benchmark databases consist predominantly of pKa values of surface-exposed ionizable
groups, while an analysis of failures showed that the most problematic predictions are
those for buried amino acids. The first pKa-cooperative meeting [2, 102] indicated that
none of the existing methods can predict the pKa values of buried amino acids with the same
level of accuracy, i.e. ~1 pK unit [160]. Ionization of surface amino acids has a
negligible effect on protein stability because of screening by water. Ionization of a
buried group, in contrast, could in principle reduce protein stability by more than ten
kcal/mol. Such an energy change is comparable to a typical folding free energy and could
cause partial unfolding or significant structural changes. Any attempt to predict the pKa
value of such groups from a static 3D structure is therefore potentially wrong. For
accurate pKa predictions, methods have to be able to model the induced structural
rearrangement, or protein structure reorganization, and the dielectric response.
The most successful practical methods for calculating the pKa values of ionizable
groups in proteins are based on the continuum electrostatic model described in the
previous sections and take into account neutral-state tautomers and conformational
sampling [2]. The conformational sampling can be performed in two different ways.
The first is uncorrelated sampling from a set of predefined conformational states, which
are not correlated with the ionization microstates. The second is conformational sampling
by molecular dynamics at constant pH, in which the conformational states are correlated on
the fly with the ionization microstates.

6.2 Predefined Uncorrelated Sampling of Protein Conformations

Methods of predefined conformational sampling use a set of side chain rotamers
and a restricted set of perturbations of backbone conformations. In the MCCE2 method
[132, 160], the extended set of ionization-conformation states is constructed as a
combination of the ionization states of each titratable group, its tautomeric forms and
its side-chain (main-chain) conformational states. The preselected set of M extended
states is subjected to Monte Carlo sampling to generate the Boltzmann distribution of the
extended states, using an energy function similar to the one described by Eq. (56).
Look-up energy tables in the form of several symmetric M × M matrices are calculated for
the electrostatic and non-bonded LJ interactions. The electrostatic interaction matrix is
obtained by solving the Poisson-Boltzmann equation with DelPhi [125] for each extended
state. The electrostatic interactions in the MCCE2 method are calculated with a protein
dielectric constant $D_I = 4.0$, $D_0 = 80.0$ and experimental salt concentrations. In total,
340 pKa values were calculated for 36 proteins ranging in size from 56 to 324 residues.
MCCE2 conformers with alternative hydrogen positions and side chain conformers improve the
calculated pKa values. MCCE2 also adds a side chain conformer search optimized by global
packing as well as local minimization. The accuracy of the MCCE2 pKa predictions differs
between surface-exposed residues (desolvation penalty < 2 kcal/mol) and buried residues
(desolvation penalty > 2 kcal/mol), with RMSDs between calculated and experimental values
of 0.78 and 1.31 pK units, respectively. About 10% of the calculated pKa values have
absolute errors > 2 pK units. The MCCE2 R² value for the correlation between experimental
and calculated pKa values is 0.53, which is quite low. The improved, hybrid version of
MCCE2 extends the conformational sampling by intensive generation of side-chain and
main-chain conformations through MD simulations of the protein with ionizable buried
residues [160] in the all-neutral and all-ionized states. The hybrid MCCE2 method shows
some minor improvement over the original MCCE2 method.
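
The accuracy figures quoted in this section (RMSD, fraction of large errors, R²) are simple functions of the paired calculated and experimental values. The sketch below shows one way to compute them; the function names are hypothetical, R² is taken here as the squared Pearson correlation coefficient, which is one common reading of the quantity reported, and the numbers in the demo are invented.

```python
import math

def rmsd(calc, expt):
    """Root-mean-square deviation between calculated and experimental pKa values."""
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, expt)) / len(calc))

def fraction_above(calc, expt, threshold=2.0):
    """Fraction of sites whose absolute error exceeds `threshold` pK units."""
    return sum(1 for c, e in zip(calc, expt) if abs(c - e) > threshold) / len(calc)

def r_squared(calc, expt):
    """Square of the Pearson correlation coefficient between the two series."""
    n = len(calc)
    mean_c, mean_e = sum(calc) / n, sum(expt) / n
    s_ce = sum((c - mean_c) * (e - mean_e) for c, e in zip(calc, expt))
    s_cc = sum((c - mean_c) ** 2 for c in calc)
    s_ee = sum((e - mean_e) ** 2 for e in expt)
    return s_ce * s_ce / (s_cc * s_ee)

if __name__ == "__main__":
    # Invented values for illustration only.
    calculated = [3.6, 4.9, 6.1, 10.9, 7.4]
    experimental = [4.0, 4.4, 6.5, 10.4, 10.0]
    print(rmsd(calculated, experimental))
    print(fraction_above(calculated, experimental))
    print(r_squared(calculated, experimental))
```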

6.3 Correlated Sampling of Protein Conformations

One of the major factors affecting the modeling of protein protonation is the
coupling between ionization and conformational states, which is explicitly addressed
by the constant-pH molecular dynamics (CpHMD) methods [7, 24, 73, 153, 159]. The CpHMD
methods inherit the accuracy problems of the underlying atom-atom force field and of
the parameters of the PB or GB methods used to compute the protonation free energies
in the continuum dielectric model. The constant-pH MD methods can be classified
into two categories: (i) methods of explicit titration [8, 10, 36, 90, 92, 97, 159], which
consider physical discrete ionization microstates z, and (ii) methods of implicit
continuous titration [72, 74, 82], which work with continuous average ionization degrees
<z> of the titratable groups. Progress in the molecular simulation of pH-dependent
biological processes and in the prediction of the pKa values of protein residues was
reviewed recently [75, 156]. Methods of explicit titration perform a random walk in the
discrete space of ionization microstates using the Monte Carlo method. For a given
protein conformation x, a Markov chain of ionization microstates $z_\alpha$ is generated
by the Metropolis method on the basis of the free energy difference
$\Delta G(x, z_1, z_2, \mathrm{pH})$ between two ionization microstates $z_1$ and $z_2$. A general MD
method is then applied to sample the conformational space x of the protein in the accepted
ionization microstate. Thus, by periodically repeating the MC sampling of the ionization
states z and the MD sampling of the conformational states x, a distribution of states
(x, z) corresponding to the grand canonical ensemble of ionization-conformational
microstates is generated [10]. Methods of such explicit stochastic titration differ from
one another in several details, such as (i) the method used to calculate the energy
difference $\Delta G(x, z_1, z_2, \mathrm{pH})$ between two ionization microstates $z_1$ and $z_2$,
(ii) the MC method used to sample the ionization microstates, and (iii) the MD program
and/or protocol of the MD simulation at a given ionization microstate z. The GROMACS MD
package [17] has been used for MD with explicit water at constant temperature and pressure
to study ionization-conformation coupling in decalysine [90], cytochrome c3 [91]
and lysozyme [92]; the continuum electrostatic model was used for the MC sampling
of ionization microstates. Methods that employ an explicit solvent model for the MD
simulation and the CEP model for calculating the protonation state energies are
computationally expensive, and MC trial moves are attempted relatively infrequently, which
causes long convergence times for systems with multiple titration sites. The GB implicit
solvent model has been employed in both the MC and MD steps with the CHARMM MD package
[36]. The McCammon group [159] used the GB solvent model for both the MC step
and the MD simulations with the AMBER8 package [25]. Predictions of the pKa values of
titratable residues were obtained from a set of 5 ns MD simulations at 300 K with about
5 × 10^5 MC trials, each changing the ionization microstate of one randomly chosen residue,
repeated every 10 fs. This hybrid MD/MC constant-pH simulation scheme has a limitation:
the frequent (~10 fs) periodic abrupt switches of the protonation state introduce
discontinuities in the energy and atomic forces and may result in conformational and
energetic instabilities during the MD sampling of conformational states.
Recent works [6, 75, 82] rely on the explicit λ-titration method, which uses the
λ-dynamics method [78] to simulate proton binding to and release from a set of titratable
sites. The replica exchange (REX) protocol [74] enhances the sampling of protonation
and conformational states. After all REX-CpHMD cycles have been completed for a wide pH
range, the titration coordinates are converted into the probabilities of the protonated
(unprotonated) state of each site. The calculated pKa values of the residues are obtained
by fitting the probability of (de)protonation versus pH to the Henderson-Hasselbalch
equation. The REX-CpHMD method with an improved GBSW solvent model and salt screening,
implemented in the CHARMM molecular modeling package, was used for titration simulations
of 10 proteins. The experimental pKa values of the residues of these proteins were
reproduced with an RMSD of 0.6–1.2 pK units, with maximum errors of 1.0–4.2 pK units
for buried residues. Recently [6], the REX-CpHMD method was used to predict
extreme pKa shifts in staphylococcal nuclease mutants. The highly perturbed experimental
pKa values were predicted with an average unsigned error of 1.5 pK units,
while the maximum error is still ~4 pK units for buried residues.
The recently developed CpHMD method with the implicit titration potential of mean
force [153], described in Sect. 4.4, has been tested on three proteins: BPTI, HEWL and
RNase A. The implicit model of the water-proton bath provides an efficient
way to study the thermodynamics of biomolecular systems as a function of pH, Fig. 5.

Fig. 5 Dependence of the average ionization free energy $\Delta G_{inz}(\mathrm{pH})$ of the protein
HEWL on pH. Solid line: calculated values; filled black bars: standard deviations
(fluctuations) of the free energy $\Delta G_{inz}(x, \mathrm{pH})$ over the ensemble of protein
structures x at a given pH, calculated over 2 ns trajectories; open circles and dotted
line: experimental free energy of ionization $\Delta G_{exp}(\mathrm{pH})$ computed from the
experimental titration curve
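
The Henderson-Hasselbalch fit used in the REX-CpHMD work to turn protonation probabilities into pKa values can be illustrated with a linearized least-squares fit. This is a sketch under stated assumptions, not the procedure of any specific package: it uses the generalized form f = 1 / (1 + 10^{n(pH − pKa)}) with a Hill coefficient n, fits log10(f / (1 − f)) linearly in pH, and discards points closer than a hypothetical cutoff `eps` to 0 or 1, where sampling noise dominates the logarithm.

```python
import math

def fit_henderson_hasselbalch(ph_values, protonated_fraction, eps=0.02):
    """Least-squares fit of log10(f / (1 - f)) = n * (pKa - pH); returns (pKa, n)."""
    xs, ys = [], []
    for ph, f in zip(ph_values, protonated_fraction):
        if eps < f < 1.0 - eps:            # keep only the transition region
            xs.append(ph)
            ys.append(math.log10(f / (1.0 - f)))
    if len(xs) < 2:
        raise ValueError("need at least two points in the transition region")
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    hill = -slope                          # slope of the line is -n
    pka = intercept / hill                 # intercept of the line is n * pKa
    return pka, hill

if __name__ == "__main__":
    # Synthetic titration curve with pKa = 6.3 and n = 0.8 (illustrative values).
    ph_values = [4.0 + 0.5 * k for k in range(11)]
    fractions = [1.0 / (1.0 + 10.0 ** (0.8 * (ph - 6.3))) for ph in ph_values]
    print(fit_henderson_hasselbalch(ph_values, fractions))
```

A Hill coefficient that differs noticeably from 1 indicates that the site does not titrate independently, for example because it is coupled to neighboring titratable groups.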

7 Limitations of Current Electrostatic Models

The theoretical framework of the current electrostatic models is based on three
components: (i) a continuum dielectric model of the protein with a low uniform dielectric
constant $D_I$ in the interior protein volume and the bulk solvent dielectric constant $D_0$
in the outside volume; (ii) the linear Poisson-Boltzmann equation; and (iii) an empirical
atom-atom force field for the CpHMD simulations.
The assumption of a uniform dielectric constant in the protein volume has limited
accuracy, because the protein environment, local flexibility and dielectric response are
not uniform throughout the protein volume [19]. Moreover, the dielectric response can be
modulated by small internal cavities [96], presumably filled with water molecules. The pKa
values of protein surface residues tend to be very similar to the pKa values of isolated
amino acids in water and are governed by the negligible desolvation at the highly flexible
protein-water interface. They are predicted optimally by a model with a high protein
dielectric constant, $D_I$ = 16–20 [9, 36, 150]. The pKa shifts of the buried ionizable
groups in staphylococcal nuclease (SNase) are always in the direction that promotes the
neutral form of the ionizable group. This suggests that the pKa values are determined
primarily by the desolvation of the buried groups, which appears to be poorly
counterbalanced by factors that stabilize the charged states of the residues [102]. The
apparent dielectric constant varies through the protein volume from about 20 for surface
residues to about 8 for buried residues, as shown by estimates of the required desolvation
penalty using the GB model and the experimental pKa values of buried lysine residues in
SNase mutants [60]. An MD simulation of the dielectric properties of solvated proteins
showed that the dielectric response differs between the surface and the hydrophobic core
regions of a protein [19], with an average protein dielectric constant of ~14–15.
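
The link between the apparent dielectric constant and the size of a desolvation-driven pKa shift can be made concrete with the simple Born model of ion transfer. This is an order-of-magnitude illustration only, and not the GB estimate of Ref. [60]: the Born formula, the ion radius of 2 Å and the function names are assumptions introduced for this example.

```python
import math

COULOMB = 332.06          # kcal * Angstrom / (mol * e^2)
R_GAS = 1.987e-3          # kcal / (mol * K)

def born_desolvation(charge, radius, d_in, d_out=80.0):
    """Born estimate (kcal/mol) of moving an ion of `radius` Angstrom from d_out to d_in."""
    return COULOMB * charge ** 2 / (2.0 * radius) * (1.0 / d_in - 1.0 / d_out)

def pka_shift_magnitude(delta_g, temperature=300.0):
    """Size of the pKa shift that corresponds to a free energy penalty delta_g (kcal/mol)."""
    return delta_g / (math.log(10.0) * R_GAS * temperature)

if __name__ == "__main__":
    # Hypothetical unit charge of radius 2 Angstrom in media of decreasing polarity.
    for d_in in (20.0, 8.0, 4.0):
        penalty = born_desolvation(1.0, 2.0, d_in)
        print(d_in, round(penalty, 2), round(pka_shift_magnitude(penalty), 2))
```

The penalty always disfavors the charged form, so in this picture the pKa of a buried acid moves up and that of a buried base moves down, which is the direction of the shifts described above for SNase.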
The linear Poisson-Boltzmann equation has limited accuracy in accounting for ion-ion
correlation and salt effects for proteins with a highly charged surface, e.g. when the
pH is far from the isoelectric point. The counter-ion condensation effect becomes
significant under such conditions [150] and certainly cannot be ignored. In addition, the
atomic radii that define the solute-solvent dielectric interface depend on the atomic
charges [55].
The quality of the calculation of pH-dependent properties of proteins by CpHMD
simulation depends on the overall accuracy of the atom-atom force field and the implicit
solvent model. An important issue is the accuracy of the PMF between pairs of polar or
charged side chains forming salt bridges or hydrogen bonds as a function of the separation
distance. Chen and Brooks [27] found that an accurate balance between the nonpolar and
electrostatic terms of an implicit solvation model is important for reproducing the
experimental side chain solvation energies and the PMFs of side chain-side chain
interactions. In other words, the electrostatic model used to calculate the energies of
the ionization states z and the atom-atom force field used for conformational sampling
should be mutually adjusted and optimized.
Modern implicit electrostatic solvent models offer a number of options for further
improvement toward faster and more accurate approximations of the most detailed explicit
solvent models. It is likely that improvements in the implicit solvent models, accompanied
by careful optimization of their empirical parameters, will make them a standard,
well-defined and powerful option in modern simulation packages for computational
structural biology.

Acknowledgements This work was supported by grant #12-04-00135a from the Russian Fund of
Basic Research, by grant #130-2012 from the Siberian Branch of the Russian Academy of Science
and by the exchange visitor program P-1-00043 of Cornell University.

References

1. Aguilar, B., Shadrach, R., Onufriev, A.V.: Reducing the secondary structure bias in the
generalized Born model via R6 effective radii. J. Chem. Theory Comput. 6, 3613–3630 (2010)
2. Alexov, E., Mehler, E.L., Baker, N., Baptista, A.M., et al.: Progress in the prediction of pKa
values in proteins. Proteins 79, 3260–3275 (2011)
3. Aqvist, J., Hansson, T.: On the validity of electrostatic linear response in polar solvent. J.
Phys. Chem. 100, 9512–9521 (1996)
4. Arnautova, E.Y., Jagielska, A., Scheraga, H.A.: A new force field ECEPP05 for peptides,
proteins and organic molecules. J. Phys. Chem. B 110, 5025–5044 (2006)
5. Arnautova, E.Y., Vorobjev, Y.N., Vila, J.A., Scheraga, H.A.: Identifying native-like protein
structures with scoring functions based on all-atom ECEPP force fields, implicit solvent
models and structure relaxation. Proteins 77, 38–51 (2009)
6. Arthur, E.J., Yesselman, J.D., Brooks III, C.L.: Predicting extreme pKa shifts in staphylococcal
nuclease mutants with constant pH molecular dynamics. Proteins 79, 3276–3286 (2011)
7. Baptista, M., Martel, P.J., Petersen, S.B.: Simulation of protein conformation freedom as a
function of pH: constant-pH molecular dynamics using implicit titration. Proteins 27, 523–544
(1997)
8. Baptista, M., Martel, P.J., Soares, C.M.: Simulation of electron-proton coupling with a Monte
Carlo method: application to cytochrome c(3) using continuum electrostatics. Biophys. J. 76,
2978–2998 (1999)
9. Baptista, M., Soares, C.M.: Some theoretical and computational aspects of inclusion of proton
tautomerism in the protonation equilibrium of proteins. J Phys. Chem. B 105, 293–309 (2001)
10. Baptista, A.M., Teixeira, V.H., Soares, C.M.: Constant-pH molecular dynamics using stochastic
titration. J. Chem. Phys. 117, 4184–4200 (2002)
11. Bashford, D., Gerwert, K.: Electrostatic calculations of the pKa values of ionizable groups in
bacteriorhodopsin. J. Mol. Biol. 224, 473–486 (1992)
12. Bashford, D., Case, D.A.: Generalized Born models of macromolecular solvation effects.
Annu. Rev. Phys. Chem. 51, 129–152 (2000)
13. Beglov, D., Roux, B.: An integral equation to describe the solvation of polar molecules in
liquid water. J. Chem. Phys. 104, 8678–8689 (1996)
14. Beglov, D., Roux, B.: Solvation of complex molecules in a polar liquid: an integral equation
theory. J. Phys. Chem. 101, 7821–7826 (1997)
15. Ben-Naim, A., Marcus, Y.: Solvation thermodynamics of nonionic solutes. J. Chem. Phys.
81, 2016–2027 (1984)
16. Ben-Naim, A.: Solvent effects on protein association and protein folding. Biopolymers 29,
567–596 (1990)
17. Berendsen, H.J.C., Van der Spoel, D., Van Drunen, R.: GROMACS: a message passing parallel
molecular dynamics implementation. Comput. Phys. Commun. 91, 43–56 (1995)
18. Bharadwaj, R., Windemuth, A., Sridharan, S., Honig, B., Nicholls, A.: The fast multipole
boundary element method for molecular electrostatics: an optimal approach for large systems.
J. Comput. Chem. 16, 898–913 (1995)
19. Boresch, S., Ringhofer, S., Hochtl, P., Steinhauser, O.: Toward better description and under-
standing of biomolecular solvation. Biophys. Chem. 78, 43–68 (1999)
20. Bradley, P., Misura, K.M., Baker, D.: Towards high-resolution de novo structure prediction
for small proteins. Science 309, 1868–1871 (2005)
21. Brooks III, C.L., Karplus, M., Pettitt, B.M.: Proteins: a theoretical perspective of dynamics,
structure and thermodynamics. In: Prigogine, I., Rice, S.A. (eds.) Advances in Chemical
Physics, vol. LXXI. Wiley, New York (1988)
22. Brooks, B.R., Brooks III, C.L., Mackerell, A.D., Nilsson, L., Petrella, R.J., Roux, B., Won, Y.,
Archontis, G., Bartels, C., Boresch, S., Caflisch, A., Caves, L., Cui, Q., Dinner, A.R., Feig, M.,
Fischer, S., Gao, J., Hodoscek, M., Im, W., Kuczera, K., Lazaridis, T., Ma, J., Ovchinnikov, V.,
Paci, E., Pastor, R.W., Post, C.B., Pu, J.Z., Schaefer, M., Tidor, B., Venable, R.M., Woodcock,
H.L., Wu, X., Yang, W., York, D.M., Karplus, M.: CHARMM: the biomolecular simulation
program. J. Comput. Chem. 30, 1545–1615 (2009)
23. Bogusz, S., Cheatham III, T.E., Brooks, R.R.: Removal of pressure and free energy artifacts
in charged periodic system via net charge corrections to the Ewald potential. J. Chem. Phys.
108, 7070–7084 (2007)
24. Bürgi, R., Kollman, P.A., Van Gunsteren, V.F.: Simulating proteins at constant pH: an approach
combining molecular dynamics and Monte Carlo simulations. Proteins 47, 469–480 (2002)
25. Case, D.A., Darden, T., Cheatham III, T.E., Simmerling, C., Wang, J., Merz, K.M., Wang, B.,
Pearlman, D.A., Duke, R.E., Crowley, M., Brozell, S., Luo, R., Tsui, V., Gohlke, H., Mongan,
J., Hornak, V., Caldwell, J.W., Ross, W.S., Kollman, P.A.: Amber8. University of California,
San Francisco (2004)
26. Chen, J., Brooks, C.: Critical importance of length-scale dependence in implicit modeling of
hydrophobic interactions. J. Am. Chem. Soc. 129, 2444–2445 (2007)
27. Chen, J., Brooks, C.: Implicit modeling of nonpolar solvation for simulating protein folding
and conformational transitions. Phys. Chem. Chem. Phys. 10, 471–481 (2008)
28. Chen, J.: Effective approximation of molecular volume using atom-centered dielectric func-
tions in generalized Born models. J. Chem. Theory Comput. 6, 2790–2803 (2010)
29. Chothia, C.H.: Hydrophobic bonding and accessible area in proteins. Nature 248, 338–339
(1974)
30. Connolly, M.L.: Analytical molecular surface calculation. J. Appl. Crystallogr. 16, 548–558
(1983)
31. Connolly, M.L.: Solvent-accessible surfaces of proteins and nucleic acids. Science 221,
709–713 (1983)
32. Connolly, M.L.: Computation of molecular volume. J. Am. Chem. Soc. 107, 1118–1124
(1985). http://www.netsci.org/Science/Compchem/feature14e.html
33. Curutchet, C., Cramer, C.J., Truhlar, D.G., Ruiz-Lopez, M.F., Rinaldi, D., Orozco, M., Luque,
F.J.: Electrostatic component of solvation: comparison of SCRF continuum models. J. Com-
put. Chem. 24, 284–297 (2003)
34. Davies, M.N., Toseland, C.P., Moss, D.S., Flower, D.R.: Benchmarking pKa prediction. BMC
Biochem. 7, 18–30 (2006)
35. Douglas, C.C.: Multigrid methods in science and engineering. Comput. Sci. Eng. 3, 55–68
(1996)
36. Dlugosz, M., Antosiewicz, J.M.: Constant pH molecular dynamics simulations: test case of
succinic acid. Chem. Phys. 302, 161–170 (2004)
37. Dominy, B.N., Brooks, C.L.: Identifying native-like protein structures using physics-based
potentials. J. Comput. Chem. 23, 147–160 (2002)
38. Feig, M., Onufriev, A., Lee, M., Im, W.: Performance comparison of Generalized Born and
Poisson methods in the calculation of electrostatic solvation energies for protein structures.
J. Comput. Chem. 25, 265–284 (2004)
39. Fischer, D.: 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins
51, 434–444 (2003)
40. Felts, A.K., Gallicchio, E., Wallqvist, A., Levy, R.M.: Distinguishing native conformations
of proteins from decoys with an effective free energy estimator based on the OPLS all-atom
force field and the surface generalized Born solvent model. Proteins 48, 404–422 (2002)
41. Fogolari, F., Esposito, G., Viglino, P., Molinari, H.: Molecular mechanics and dynamics of
biomolecules using a solvent continuum model. J. Comput. Chem. 22, 1830–1842 (2001)
42. Gallicchio, E., Kubo, M.M., Levy, R.M.: Enthalpy-entropy and cavity decomposition of alkane
hydration free energies: numerical results and implications for theories of hydrophobic sol-
vation. J. Phys. Chem. B. 104, 6271–6285 (2000)
43. Gallicchio, E., Zhang, L.Y., Levy, R.M.: The SGB/NP hydration free energy model based on
the surface generalized Born solvent reaction field and novel nonpolar hydration free energy
estimators. J. Comput. Chem. 23, 517–529 (2002)
44. Gallicchio, E., Levy, R.: AGBNP: an analytic implicit solvent model suitable for molecular
dynamics simulations and high-resolution modeling. J. Comput. Chem. 25, 479–499 (2004)
45. Gallicchio, E., Paris, K., Levy, R.: The AGBNP2 implicit solvation model. J. Chem. Theory
Comput. 5, 2544–2564 (2009)
46. Goel, N.S., Gang, F., Ko, Z.: Electrostatic field in inhomogeneous dielectric media. Indirect
boundary element method. J. Comput. Phys. 118, 172–179 (1995)
47. Grant, J.A., Pickup, B.T.: A Gaussian description of molecular shape. J. Phys. Chem. 99,
3503–3510 (1995)
48. Gribenko, A.V., Patel, M.M., Liu, J., McCallum, S.A., Makhatadze, G.I.: Rational stabilization
of enzymes by computational redesign of surface charge-charge interactions. Proc. Natl. Acad.
Sci. U.S.A. 106, 2601–2606 (2009)
49. Hawkins, G.D., Cramer, C.J., Truhlar, D.G.: Parametrized models of aqueous free energies
of solvation based on pairwise descreening of solute atomic charges from a dielectric
medium. J. Phys. Chem. 100, 19824–19836 (1996)
50. Hermann, R.B.: Theory of hydrophobic bonding. II. The correlation of hydrocarbon solubility
in water with solvent cavity surface area. J. Phys. Chem. 76, 2754–2759 (1972)
51. Holst, M., Kozack, R.E., Saied, F., Subramaniam, S.: Treatment of electrostatic effects in
proteins: multigrid-based Newton iterative method for solution of the full nonlinear Poisson-
Boltzmann equation. Proteins 18, 231–245 (1994)
52. Holst, M., Saied, F.: Numerical solution of the nonlinear Poisson-Boltzmann equation: devel-
oping more robust and efficient methods. J. Comput. Chem. 16, 337–364 (1995)
53. Holst, M., Baker, N., Wang, M.: Adaptive multilevel finite element solution of the Pois-
son–Boltzmann equation I. Algorithms and examples. J. Comput. Chem. 21, 1319–1342
(2000)
54. Honig, B., Sharp, K., Yang, A.S.: Macroscopic models of aqueous solutions: biological and
chemical applications. J. Phys. Chem. 97, 1101–1109 (1993)
55. Hou, G., Zhu, X., Cui, Q.: An implicit solvent model for SCC-DFTB with charge-dependent
radii. J. Chem. Theory Comput. 6, 2303–2314 (2010)
56. Hummer, G., Pratt, L.R., Garcia, A.E.: Hydration free energy of water. J. Phys. Chem. 99,
14188–14194 (1995)
57. Hummer, G., Pratt, L.R., Garcia, A.E.: Free energy of ionic hydration. J. Phys. Chem. 100,
1206–1215 (1996)
58. Hünenberger, P.H., McCammon, J.A.: Effect of artificial periodicity in simulations of
biomolecules under Ewald boundary conditions: a continuum electrostatic study. Biophys.
Chem. 78, 69–88 (1999)
59. Im, W., Lee, M.S., Brooks III, C.L.: Generalized Born model with a simple smoothing func-
tion. J. Comput. Chem. 24, 1691–1702 (2003)
60. Isom, D.G., Castaneda, C.A., Cannon, B.R., Garcia-Moreno, B.E.: Large shifts in pKa values
of lysine residues buried inside a protein. PNAS 108, 5260–5265 (2011)
61. Jackson, J.D.: Classical electrodynamics. Wiley, New York (1975)
62. Jackson, R.M., Sternberg, J.E.: Application of scaled particle theory to model the hydrophobic
effect: implications for molecular association and protein stability. Protein Eng. 7, 371–383
(1994)
63. Jackson, R.M., Sternberg, J.E.: A continuum model for protein-protein interactions: applica-
tions to the docking problem. J. Mol. Biol. 250, 258–275 (1995)
64. Jayaram, B., Fine, R., Sharp, K., Honig, B.: Free energy calculations of ion hydration: an
analysis of the Born model in terms of microscopic simulations. J. Phys. Chem. 93, 4320–4327
(1989)
65. Jorgensen, W.L., Madura, J.D.: Temperature and size dependence for Monte Carlo simulations
of TIP4P water. Mol. Phys. 56, 1381–1392 (1985)
66. Jorgensen, W.L., Maxwell, D.S., Tirado-Rives, J.J.: Development and testing of the OPLS
all-atom force field on conformational energetics and properties of organic liquids. J. Am.
Chem. Soc. 118, 11225–11236 (1996)
67. Jorgensen, W., Tirado-Rives, J.: Free energies of hydration from a generalized born model
and an all-atom force field. J. Phys. Chem. B 108, 16264–16270 (2004)
68. Juffer, A.H., Botta, E.F.F., Bert, A.M., van Keulen, B.A.M., van der Ploeg, A., Berendsen,
H.J.C.: The electric potential of a macromolecule in a solvent: a fundamental approach. J.
Comput. Phys. 97, 144–171 (1991)
69. Juffer, A.H., Eisenhaber, F., Hubbard, S.J., Walther, D., Argos, P.: Comparison of atomic
solvation parametric sets: applicability and limitations in protein folding and binding. Protein
Sci. 4, 2499–2509 (1995)
70. Kar, P., Wei, Y., Hansmann, U.E., Höfinger, S.: Systematic study of the boundary composition
in Poisson Boltzmann calculations. J. Comput. Chem. 28, 2538–2544 (2007)
71. Karplus, M., McCammon, A.: Molecular dynamics simulations of biomolecules. Nat. Struct.
Biol. 9, 646–652 (2002)
72. Khandogin, J., Brooks III, C.L.: Constant pH molecular dynamics with proton tautomerism.
Biophys. J. 89, 141–157 (2005)
73. Khandogin, J., Chen, J., Brooks III, C.L.: Exploring atomistic details of pH-dependent peptide
folding. PNAS 103, 18546–18550 (2006)
74. Khandogin, J., Brooks III, C.L.: Toward the accurate first-principles prediction of ionization
equilibria in proteins. Biochemistry 45, 9363–9373 (2006)
75. Khandogin, J., Brooks III, C.L.: Molecular simulations of pH-mediated biological processes.
Annu. Rep. Comput. Chem. 3, 3–12 (2007)
76. Kollman, P.: Free energy calculations: applications to chemical and biochemical phenomena.
Chem. Rev. 93, 2395–2417 (1993)
77. Kollman, P., Massova, I., Reyes, C., Kuhn, B., Huo, S., Chong, L., Lee, M., Lee, T., Duan, Y.,
Wang, L., Donini, O., Cieplak, P., Srinivasan, J., Case, D., Cheatham III, T.E.: Calculating
structures and free energies of complex molecules: combining molecular mechanics and
continuum models. Acc. Chem. Res. 33, 889–897 (2000)
78. Kong, X., Brooks III, C.L.: λ-dynamics: a new approach to free energy calculations. J. Chem.
Phys. 105, 2414–2423 (1996)
79. Landau, L.D., Lifshitz, E.M.: Electrodynamics of Continuous Media. V. 8. Course of theo-
retical physics. Translated from the Russian. Pergamon Press, Oxford (1988)
80. Lee, M.R., Duan, Y., Kollman, P.A.: Use of MM-PB/SA in estimating the free energies of
proteins: application to native, intermediates, and unfolded villin headpiece. Proteins 39,
309–316 (2000)
81. Lee, M.S., Feig, M., Salsbury Jr., F.R., Brooks III, C.L.: New analytic approximation to the
standard molecular volume definition and its application to generalized Born calculations. J.
Comput. Chem. 24, 1348–1356 (2003)
82. Lee, M.S., Salsbury Jr., F.R., Brooks III, C.L.: Constant-pH molecular dynamics using con-
tinuous titration coordinates. Proteins 56, 738–752 (2004)
83. Lee, M.S., Olson, M.A.: Protein folding simulations combining self-guided Langevin dynam-
ics and temperature-based replica exchange. J. Chem. Theory Comput. 6, 2477–2487 (2010)
84. Levy, R.M., Belhadj, M., Kitchen, D.B.: Gaussian fluctuation formula for electrostatic free
energy changes in solution. J. Chem. Phys. 95, 3627–3633 (1991)
85. Levy, R.M., Zhang, L.Y., Gallicchio, E., Felts, A.: On the nonpolar hydration free energy of
proteins: surface area and continuum solvent models for the solute-solvent interaction energy.
J. Am. Chem. Soc. 125, 9523–9530 (2003)
86. Loladze, V.V., Makhatadze, G.I.: Energetics of charge-charge interactions between residues
adjacent in sequence. Proteins 79, 3494–3499 (2011)
87. Lounnas, V., Pettitt, B.M., Phillips Jr., B.M.: A global model of protein-water interface.
Biophys. J. 66, 601–614 (1994)
88. Lu, B., Cheng, X.L., Hang, J.F., McCammon, A.: Order N algorithm for computation
of electrostatic interactions in biomolecular systems. Proc. Natl. Acad. Sci. U.S.A. 103,
19314–19319 (2006)
89. Lu, B., McCammon, A.: Improved boundary element method for Poisson-Boltzmann electrostatic
potential and force calculations. J. Chem. Theory Comput. 3, 1134–1142 (2007)
90. Machuqueiro, M., Baptista, A.M.: Constant-pH molecular dynamics with ionic strength
effects: Protonation–Conformation coupling in decalysine. J. Phys. Chem. 110, 2927–2933
(2006)
91. Machuqueiro, M., Baptista, A.M.: Molecular dynamics at constant pH and reduction potential:
application to cytochrome c3. J. Am. Chem. Soc. 131, 12586–12594 (2009)
92. Machuqueiro, M., Baptista, A.M.: Is the prediction of pKa values by the constant-pH molec-
ular dynamics being hindered by inherited problems? Proteins 79, 3437–3447 (2011)
93. Madura, J.D., Davis, M.E., Gilson, M.K., Wade, R.C., Luty, B.A., McCammon, J.A.: Bio-
logical application of electrostatic calculations and Brownian dynamics simulations. Rev.
Comput. Chem. 5, 229–267 (1994)
94. McDowell, S.C., Špackova, N., Šponer, J., Walter, N.G.: Molecular dynamics simulations of
RNA: an in silico single molecule approach. Biopolymers 85, 169–184 (2007)
95. McKenney, A., Greengard, L.: A fast Poisson solver for complex geometries. J. Comput.
Phys. 118, 348–355 (1995)
96. Meyer, T., Kieseritzky, G., Knapp, E.W.: Electrostatic pKa computations in protein: role of
internal cavities. Proteins 79, 3320–3332 (2011). https://doi.org/10.1002/prot.23092
97. Mongan, J., Case, D.A., McCammon, J.A.: Constant pH molecular dynamics in generalized
Born implicit solvent. J. Comput. Chem. 25, 2038–2064 (2004)
98. Mongan, J., Simmerling, C., McCammon, J., Case, D., Onufriev, A.: A generalized Born
model with a simple, robust molecular volume correction. J. Chem. Theory Comput. 3,
156–159 (2007)
99. Mongan, J., Svrcek-Seiler, W.A., Onufriev, A.: Analysis of integral expressions for effective
Born radii. J Chem. Phys. 127, 18510–18521 (2007)
100. Nina, M., Beglov, D., Roux, B.: Atomic radii for continuum electrostatic calculations based
on molecular dynamics free energy simulations. J. Phys. Chem. 101, 5239–5248 (1997)
101. Nina, M., Im, W., Roux, B.: Optimized atomic radii for protein continuum electrostatic solvation
forces. Biophys. Chem. 78, 89–96 (1999)
102. Nielsen, J.E., Gunner, M.R., Garcia-Moreno, B.E.: The pKa Cooperative: a collaborative
effort to advance structure-based calculation of pKa values and electrostatic effects in proteins.
Proteins 79, 3249–3259 (2011)
103. Nozaki, Y., Tanford, C.: Examination of titration behavior. Methods Enzymol. 11, 715–734
(1967)
104. Novotny, J., Bruccoleri, R.E., Davis, M., Sharp, K.A.: Empirical free energy calculations: a
blind test and further improvements of the method. J. Mol. Biol. 268, 401–411 (1997)
105. Onufriev, A., Case, D., Bashford, D.: Effective Born radii in the generalized Born approximation:
the importance of being perfect. J. Comput. Chem. 23, 1297–1304 (2002)
106. Onufriev, A., Bashford, D., Case, D.: Exploring protein native states and large scale conformational
changes with modified generalized Born model. Proteins 55, 383–394 (2004)
107. Onufriev, A.: Implicit solvent models in molecular dynamics simulations: a brief overview.
Annu. Rep. Comp. Chem. 4, 125–137 (2008)
108. Park, B.H., Levitt, M.: Decoys of globular proteins. J. Mol. Biol. 258, 367–392 (1996)
109. Perrot, G.B., Cheng, B., Gibson, K.D., Vila, J., Palmer, K.A., Nayeem, A., Maigret, B.,
Scheraga, H.A.: MSEED: a program for rapid analytical determination of accessible surface
areas and their derivatives. J. Comput. Chem. 13, 1–11 (1992)
110. Pellegrini, E., Field, M.J.: A generalized-born solvation model for macromolecular hybrid-
potential calculations. J. Phys. Chem. A 106, 1316–1326 (2002)
111. Pierotti, R.A.: A scaled particle theory of aqueous and non-aqueous solutions. Chem. Rev.
76, 717–726 (1976)
112. Postma, J.P.M., Berendsen, H.J.C., Haak, J.R.: Thermodynamics of cavity formation in water.
Faraday Symp. Chem. Soc. 17, 55–67 (1982)
113. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical recipes in C. Cam-
bridge University Press, Cambridge (1988)
114. Radmer, R.J., Kollman, P.A.: Free energy calculation methods: a theoretical and empirical
comparison of numerical errors and a new method for qualitative estimates of free energy
changes. J. Comput. Chem. 18, 902–919 (1997)
115. Rashin, A.A.: Hydration phenomena, classical electrostatics, and the boundary element
method. J. Phys. Chem. 94, 1725–1733 (1990)
116. Rashin, A.A., Young, L., Topol, I.A.: Quantitative evaluation of hydration thermodynamics
with continuum model. Biophys. Chem. 51, 359–374 (1994)
117. Richards, F.M.: Areas, volume, packing and protein structures. Annu. Rev. Biophys. Biophys.
Chem. 19, 301–332 (1977)
118. Rick, S.W., Berne, B.J.: The aqueous solvation of water: a comparison of continuum methods
with molecular dynamics. J. Am. Chem. Soc. 116, 3949–3954 (1994)
119. Ripoll, D.R., Vorobjev, Y.N., Liwo, A., Vila, J.A., Scheraga, H.A.: Coupling between folding
and ionization equilibria: effect of pH on the conformational preferences of polypeptides. J.
Mol. Biol. 264, 770–783 (1996)
120. Rocchia, W., Sridharan, S., Nicholls, A., Alexov, E., Chiabrera, A., Honig, B.: Rapid grid-
based construction of the molecular surface and the use of induced surface charge to calculate
reaction field energies: applications to the molecular systems and geometric objects. J. Com-
put. Chem. 23, 128–137 (2002)
121. Roux, B., Yu, H.A., Karplus, M.: Molecular basis for the Born model of ion solvation. J. Phys.
Chem. 94, 4683–4688 (1990)
122. Roux, B., Simonson, T.: Implicit solvent models. Biophys. Chem. 78, 1–20 (1999)
123. Sanner, M.F., Olson, A.J., Spehner, J.C.: Reduced surface: an efficient way to compute molec-
ular surfaces. Biopolymers 38, 305–320 (1996)
124. Schaefer, M., Froemmel, C.: A precise analytical method for calculating the electrostatic
energy of macromolecules in aqueous solution. J. Mol. Biol. 216, 1045–1066 (1990)
125. Sharp, K.A., Honig, B.: Electrostatic interactions in macromolecules: theory and applications.
Annu. Rev. Biophys. Chem. 19, 301–332 (1990)
126. Schellman, J.A.: Macromolecular binding. Biopolymers 14, 999–1018 (1975)
127. Scheraga, H.A.: Theory of hydrophobic interactions. J. Biomol. Struct. Dyn. 16, 447–460
(1998)
128. Simmerling, C., Strockbine, B., Roitberg, A.E.: All-atom structure prediction and folding
simulations of a stable protein. J. Am. Chem. Soc. 124, 11258–11259 (2002)
129. Simonson, T., Brünger, A.: Solvation free energies estimated from macroscopic continuum
theory: an accuracy assessment. J. Phys. Chem. 98, 4683–4694 (1994)
130. Sitkoff, D., Sharp, K.A., Honig, B.: Accurate calculation of hydration free energies using
macroscopic solvent models. J. Phys. Chem. 98, 1978–1988 (1994)
131. Sobolewski, E., Makowski, M., Czaplewski, C., Liwo, A., Oldziej, S., Scheraga, H.A.: Potential
of mean force of hydrophobic association: dependence on solute size. J. Phys. Chem. B
111, 10765–10774 (2007)
132. Song, W., Mao, J., Gunner, M.R.: MCCE2: Improved protein pKa calculations with extensive
side chain rotamer sampling. J. Comput. Chem. 30, 2231–2247 (2011)
133. Srinivasan, J., Cheatham, T.E., Cieplak, P., Kollman, P.A., Case, D.A.: Continuum solvent
studies of stability of DNA, RNA and phosphoramide DNA helices. J. Am. Chem. Soc.
120, 9401–9409 (1998)
134. Stanton, C., Houk, K.: Benchmarking pKa prediction methods for residues in proteins. J.
Chem. Theory Comput. 3, 951–966 (2008)
135. Still, W.C., Tempczyk, A., Hawley, R.C., Hendrickson, T.: Semianalytical treatment of solvation
for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6129 (1990)
136. Strickler, S.S., Gribenko, A.V., Keiffer, T.R., Tomlinson, J., Reihle, T., Loladze, V.V.,
Makhatadze, G.I.: Protein stability and surface electrostatics: a charged relationship.
Biochemistry 45, 2761–2766 (2006)
137. Tanford, C.: Protein denaturation: part C. Theoretical models for denaturation. Adv. Protein
Chem. 24, 1–95 (1970)
138. Tanford, C., Roxby, R.: The interpretation of protein titration curves. Application to lysozyme.
Biochemistry 11, 2192–2198 (1972)
139. Tanokura, M.: 1H-NMR study of the tautomerism of the imidazole ring of histidine residues:
1. Microscopic pK values and molar ratios of tautomers in histidine containing peptides.
Biochim. Biophys. Acta 742, 576–585 (1983)
140. Teixeira, V.H., Cunha, C.A., Machuqueiro, M., Oliveira, A.S.V., Victor, B.L., Soares, C.M.,
Baptista, A.M.: On the use of different dielectric constants for computing individual and
pairwise terms in Poisson-Boltzmann studies of protein ionization equilibrium. J. Phys. Chem.
B 109, 14691–14706 (2005)
141. Tomasi, J., Persico, M.: Molecular interactions in solution: overview of methods based on
continuum distribution of the solvent. Chem. Rev. 94, 2027–2094 (1994)
142. Varshney, A., Brooks, F.P., Wright, W.V.: Computing smooth molecular surface. IEEE Com-
put. Graph. Appl. 14, 19–25 (1994)
143. Vila, J., Ripoll, D.R., Arnautova, Y.A., Vorobjev, Y.N., Scheraga, H.A.: Coupling between
conformation and proton binding in proteins. Proteins 61, 56–68 (2005)
144. Vorobjev, Y.N., Grant, J.A., Scheraga, H.A.: A combined iterative and boundary element
approach for solution of the nonlinear Poisson-Boltzmann equation. J. Am. Chem. Soc. 114,
3189–3196 (1992)
145. Vorobjev, Y.N., Scheraga, H.A.: A fast adaptive multigrid boundary element method for
macromolecular electrostatics in a solvent. J. Comput. Chem. 18, 569–583 (1997)
146. Vorobjev, Y.N., Hermans, J.: SIMS, computation of a smooth invariant molecular surface.
Biophys. J. 73, 722–732 (1997)
147. Vorobjev, Y.N., Almagro, J.C., Hermans, J.: Discrimination between native and intentionally
misfolded conformation of proteins: ES/IS, new method for calculating conformational free
energy that uses both dynamics simulations with an explicit solvent and implicit solvent
continuum model. Proteins 32, 399–413 (1998)
148. Vorobjev, Y.N., Hermans, J.: ES/IS: estimation of conformational free energy by combining
dynamics simulations with explicit solvent with an implicit solvent continuum model. Biophys.
Chem. 78, 195–205 (1999)
149. Vorobjev, Y.N., Hermans, J.: Free energies of protein decoys provide insight into determinant
of protein stability. Protein Sci. 10, 2498–2506 (2001)
150. Vorobjev, Y.N., Vila, J., Scheraga, H.A.: FAMBE-pH: a fast and accurate method to compute
the total solvation free energies of proteins. J. Phys. Chem. B 112, 11122–11136 (2008)
151. Vorobjev, Y.N.: Blind docking method combining search of low-resolution binding sites with
ligand pose refinement by molecular dynamics-based global optimization. J. Comput. Chem.
31, 1080–1092 (2010)
152. Vorobjev, Y.N.: Advances in implicit models of water solvent to compute conformational free
energy and molecular dynamics of proteins at constant pH. Adv. Protein Chem. Struct. Biol.
85, 282–322 (2011)
153. Vorobjev, Y.N.: Potential of mean force of water-proton bath and molecular dynamic simula-
tion of proteins at constant pH. J. Comput. Chem. 33, 832–842 (2012)
154. Wagoner, J., Baker, N.: Assessing implicit models for nonpolar mean solvation forces: the
importance of dispersion and volume terms. Proc. Nat. Acad. Sci. U.S.A. 103, 8331–8336
(2006)
155. Wang, J., Cieplak, P., Kollman, P.A.: How well does a restrained electrostatic potential (RESP)
model perform in calculating conformational energies of organic and biological molecules?
J. Comput. Chem. 21, 1049–1074 (2000)
156. Wallace, J.A., Shen, J.K.: Predicting pKa values with continuous constant pH molecular
dynamics. Methods Enzymol. 466, 455–475 (2009)
157. Wallqvist, W., Berne, B.J.: Molecular dynamics study of the dependence of water solvation
free energy on solute curvature and surface area. J. Phys. Chem. 99, 2885–2892 (1995)
158. Wallqvist, W., Berne, B.J.: Computer simulation of hydrophobic hydration forces on stacked
plates at short range. J. Phys. Chem. 99, 2893–2899 (1995)
159. Williams, S.L., Oliveira, C.A.F., McCammon, J.A.: Coupling constant pH molecular dynamics
with accelerated molecular dynamics. J. Chem. Theory. Comput. 6, 560–568 (2010)
160. Witham, S., Talley, K., Wang, L., Zhang, Z., Sarkar, S., Gao, D., Yang, W., Alexov, E.:
Developing of hybrid approaches to predict pKa values of ionizable groups. Proteins 79,
3389–3399 (2011)
161. Wroblewska, L., Skolnick, J.: Can a physics-based, all-atom potential find a protein’s native
structure among misfolded structures? I. Large scale AMBER benchmarking. J. Comput.
Chem. 28, 2059–2066 (2007)
162. Yang, S.A., Honig, B.: On the pH dependence of protein stability. J. Mol. Biol. 231, 459–474
(1993)
163. Yoon, B.J., Lenhoff, A.M.: A boundary element method for molecular electrostatics with
electrolyte effects. J. Comput. Chem. 11, 1080–1086 (1990)
164. Zauhar, R.J.: SMART: a solvent-accessible triangulated surface generator for molecular graphics
and boundary element applications. J. Comput. Aided Mol. Des. 9, 149–159 (1995)
165. Zauhar, R.J., Varnek, A.A.: Fast and space-efficient boundary element method for computing
electrostatics and hydration effects in large molecules. J. Comput. Chem. 17, 864–877 (1996)
166. Zhang, Y., Skolnick, J.: Automated structure prediction of weakly homologous proteins on a
genomic scale. Proc. Natl. Acad. Sci. U.S.A. 101, 7594–7599 (2003)
167. Zhou, Z., Payne, P., Vasquez, M., Kuhn, N., Levitt, M.: Finite-difference solution of the
Poisson-Boltzmann equation: complete elimination of self-energy. J. Comput. Chem. 17,
1344–1353 (1996)
168. Zhou, Y.C., Feig, M., Wei, G.W.: Highly accurate biomolecular electrostatics in continuum
dielectric environments. J. Comput. Chem. 29, 87–97 (2008)
Optimizations of Protein Force Fields

Yoshitake Sakae and Yuko Okamoto

Y. Sakae
Department of Theoretical and Computational Molecular Science,
Institute for Molecular Science, Okazaki, Aichi 444-8585, Japan
e-mail: sakae@tb.phys.nagoya-u.ac.jp
Y. Sakae · Y. Okamoto (corresponding author)
Department of Physics, Graduate School of Science, Nagoya University,
Nagoya, Aichi 464-8602, Japan
e-mail: okamoto@phys.nagoya-u.ac.jp
Y. Okamoto
Structural Biology Research Center, Graduate School of Science,
Nagoya University, Nagoya, Aichi 464-8602, Japan
Y. Okamoto
Center for Computational Science, Graduate School of Engineering,
Nagoya University, Nagoya, Aichi 464-8603, Japan
Y. Okamoto
Information Technology Center, Nagoya University, Nagoya, Aichi 464-8601, Japan
© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_7

Abstract In this chapter we review our work on force fields for molecular simulations
of protein systems. We first discuss the functional forms of the force fields and
present some extensions of the conventional ones. We then present various methods
for force-field parameter optimization. Finally, we give some examples of applications
of these parameter optimization methods and compare them with the
results from the existing force fields.

1 Introduction

Computer simulations of protein folding into native structures can succeed when
both of the following requirements are met: (1) potential energy functions
(or force fields) for the protein systems are sufficiently accurate and (2) sufficiently
powerful conformational sampling methods are available. Professor Harold A. Scheraga
has been one of the most important pioneers in studies of both
requirements [1, 2]. With the development of generalized-ensemble algorithms
(for reviews, see, e.g., Refs. [3–6]) and related methods, Requirement (2) seems to be
almost fulfilled. In this chapter, we therefore concentrate on Requirement (1).
There are several well-known all-atom (or united-atom) force fields, such as
AMBER [7–11], CHARMM [12–14], OPLS [15, 16], GROMOS [17, 18], GROMACS [19, 20],
and ECEPP [21, 22]. Generally, the force-field parameters are
determined from experimental results for small molecules and from quantum
chemistry calculations on small peptides such as alanine dipeptide.
However, simulations with different force-field parameters give different
results. We have performed detailed comparisons of three versions of AMBER (ff94
[7], ff96 [8], and ff99 [9]), CHARMM [12], OPLS-AA/L [16], and GROMOS [17]
by generalized-ensemble simulations of two small peptides in explicit solvent [23,
24]. We saw that these force fields behaved clearly differently, especially
with respect to secondary-structure-forming tendencies. Folding simulations of
the two peptides with an implicit solvent model showed similar results [25–27]. For
instance, the ff94 [7] and ff96 [8] versions of AMBER yield very different
secondary-structure-forming tendencies, although these force fields differ
only in the main-chain torsion-energy terms. Many researchers have thus studied
the main-chain torsion-energy terms and their force-field parameters. For example,
newer force-field parameters for the main-chain torsion-energy terms of the φ and
ψ angles have been developed, e.g., AMBER ff99SB [10], AMBER
ff03 [11], CHARMM22/CMAP [13, 14], and OPLS-AA/L [16]. Force-field
optimization methods thus concentrate mainly on the torsion-energy terms.
These modifications of the torsion energy are usually based on quantum chemistry
calculations [13, 14, 28–31] or NMR experimental results [32, 33].
We have proposed a new main-chain torsion-energy term, which is represented
by a double Fourier series in two variables, the main-chain dihedral angles φ and
ψ [34, 35]. This expression gives a natural representation of the torsion energy in
the Ramachandran space [36] in the sense that any two-dimensional energy surface
periodic in both φ and ψ can be expanded by the double Fourier series. We can
then easily control secondary-structure-forming tendencies by modifying the main-
chain torsion-energy surface. We have presented preliminary results for AMBER
ff94 and AMBER ff96 [34, 35].
Moreover, we have introduced several methods for optimizing force-field
parameters [25–27, 38, 39]. These methods are based on the minimization of score
functions by simulations in the force-field parameter space, where the score functions
are derived from the protein coordinate data in the Protein Data Bank (PDB). Our
methods differ from most previous knowledge-based optimization methods
mainly in two points: we use only the PDB data without introducing decoys, which
are required in, e.g., the Z-score method [37], and we estimate our score functions
from a larger number of larger proteins rather than from one or a few small peptides
such as alanine dipeptide. One of the score functions is the sum of the squares of the
forces acting on each atom in the proteins with the structures from the PDB [25–27].
Other score functions are taken from the root-mean-square deviations between the
original PDB structures and the corresponding minimized structures [38, 39].
Optimizations of Protein Force Fields 205

We have also proposed a new type of the main-chain torsion-energy terms for
protein systems, which can have amino-acid-dependent force-field parameters [40].
As an example of this formulation, we applied this approach to the AMBER ff03
force field and determined new amino-acid-dependent main-chain torsion-energy
parameters for ψ (N–Cα–C–N) and ψ′ (Cβ–Cα–C–N) by using our optimization
method in Refs. [25–27].
In this chapter, we review our work on protein force fields. In Sect. 2 the details of
the new main-chain torsion-energy terms and the methods for refinements of force-
field parameters are given. In Sect. 3 examples of the applications of these methods
are presented. Section 4 is devoted to conclusions.

2 Methods

2.1 General Force Field for Protein Systems

The all-atom force fields for protein systems such as AMBER, CHARMM, OPLS,
and ECEPP use essentially the same functional forms for the potential energy except
for minor differences. The commonly used total conformational potential energy
E conf is given by

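$$E_{\mathrm{conf}} = E_{\mathrm{BL}} + E_{\mathrm{BA}} + E_{\mathrm{torsion}} + E_{\mathrm{nonbond}} , \tag{1}$$

where

$$E_{\mathrm{BL}} = \sum_{\text{bond length } \ell} K_{\ell} \, (\ell - \ell_{\mathrm{eq}})^2 , \tag{2}$$

$$E_{\mathrm{BA}} = \sum_{\text{bond angle } \theta} K_{\theta} \, (\theta - \theta_{\mathrm{eq}})^2 , \tag{3}$$

$$E_{\mathrm{torsion}} = \sum_{\text{dihedral angle } \Phi} \sum_{n} \frac{V_n}{2} \left[ 1 + \cos(n\Phi - \gamma_n) \right] , \tag{4}$$

$$E_{\mathrm{nonbond}} = \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{332 \, q_i q_j}{\varepsilon \, r_{ij}} \right] . \tag{5}$$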

Here, $E_{\mathrm{BL}}$, $E_{\mathrm{BA}}$, and $E_{\mathrm{torsion}}$ represent the bond-stretching term, the
bond-bending term, and the torsion-energy term, respectively. The bond-stretching
and bond-bending energies are given by harmonic terms with the force constants,
$K_{\ell}$ and $K_{\theta}$, and the equilibrium positions, $\ell_{\mathrm{eq}}$ and $\theta_{\mathrm{eq}}$. The torsion energy is,
on the other hand, described by the Fourier series in Eq. (4), where the sum is taken
over all dihedral angles Φ, n is the number of waves, $\gamma_n$ is the phase, and $V_n$ is the
Fourier coefficient. The nonbonded energy in Eq. (5) is represented by the Lennard-Jones
and Coulomb terms between pairs of atoms, i and j, separated by the distance $r_{ij}$
(in Å). The parameters $A_{ij}$ and $B_{ij}$ in Eq. (5) are the coefficients for the Lennard-
Jones term, $q_i$ (in units of electronic charges) is the partial charge of the i-th atom,
and ε is the dielectric constant, where we usually set ε = 1 (the value in vacuum).
The factor 332 in the electrostatic term is a constant to express energy in units of
kcal/mol. Hence, we have five classes of force-field parameters, namely, those in
the bond-stretching term ($K_{\ell}$ and $\ell_{\mathrm{eq}}$), those in the bond-bending term ($K_{\theta}$ and $\theta_{\mathrm{eq}}$),
those in the torsion term ($V_n$ and $\gamma_n$), those in the Lennard-Jones term ($A_{ij}$ and $B_{ij}$),
and those in the electrostatic term ($q_i$).
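
As a minimal illustration of how the terms in Eqs. (4) and (5) are evaluated, the following Python sketch computes the torsion energy of a single dihedral angle and the nonbonded energy of a single atom pair. The function names and the numerical values in the usage lines are hypothetical and are not taken from any particular force field.

```python
import math

def torsion_energy(phi, terms):
    """Eq. (4) for one dihedral angle phi (radians).

    terms is a list of (n, V_n, gamma_n) tuples: number of waves,
    Fourier coefficient (kcal/mol), and phase (radians).
    """
    return sum(0.5 * v_n * (1.0 + math.cos(n * phi - gamma_n))
               for n, v_n, gamma_n in terms)

def nonbonded_pair_energy(r, a_ij, b_ij, q_i, q_j, eps=1.0):
    """Eq. (5) for one atom pair: Lennard-Jones plus Coulomb (kcal/mol).

    r is the interatomic distance in Angstrom, charges are in units of
    the electronic charge, and eps is the dielectric constant.
    """
    lennard_jones = a_ij / r**12 - b_ij / r**6
    coulomb = 332.0 * q_i * q_j / (eps * r)
    return lennard_jones + coulomb

# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
e_tor = torsion_energy(0.0, [(1, 2.0, 0.0)])
e_nb = nonbonded_pair_energy(2.0, 4096.0, 64.0, 0.5, -0.5)
```

Summing `torsion_energy` over all dihedral angles and `nonbonded_pair_energy` over all pairs i < j gives the corresponding contributions to the total energy in Eq. (1).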
Equation (1) represents a standard set of the potential energy terms. As mentioned
above, there are minor differences in the energy functions among different force
fields. For instance, the Urey-Bradley term is used in CHARMM and OPLS, but
not in AMBER. In our parameter refinement methods, we try to optimize a certain
set of parameters in the existing force fields without changing the functional forms.
Therefore, if the original force field has non-standard terms, then the optimized one
also has them.

2.2 New Torsion-Energy Terms

2.2.1 Representation by a Double Fourier Series [34, 35]

Separating the contributions E(φ, ψ) of the backbone dihedral angles φ and ψ from
the rest of the torsion terms E rest , we can write the torsion energy term in Eq. (4) as

E torsion = E(φ, ψ) + E rest , (6)

where we have

$$E(\phi, \psi) = \sum_{m} \frac{V_m}{2} \left[ 1 + \cos(m\phi - \gamma_m) \right] + \sum_{n} \frac{V_n}{2} \left[ 1 + \cos(n\psi - \gamma_n) \right] . \tag{7}$$

For example, the coefficients for the cases of six force fields namely, AMBER
parm94, AMBER parm96, AMBER parm99, CHARMM27, OPLS-AA, and OPLS-
AA/L, are summarized in Table 1, and we can explicitly write E(φ, ψ) in Eq. (7) as
follows:

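$$E_{\mathrm{parm94}}(\phi, \psi) = 2.7 - 0.2 \cos 2\phi - 0.75 \cos \psi - 1.35 \cos 2\psi - 0.4 \cos 4\psi , \tag{8}$$

$$E_{\mathrm{parm96}}(\phi, \psi) = 2.3 + 0.85 \cos \phi - 0.3 \cos 2\phi + 0.85 \cos \psi - 0.3 \cos 2\psi , \tag{9}$$

$$E_{\mathrm{parm99}}(\phi, \psi) = 5.35 + 0.8 \cos \phi - 0.85 \cos 2\phi - 1.7 \cos \psi - 2.0 \cos 2\psi , \tag{10}$$

$$E_{\mathrm{CHARMM}}(\phi, \psi) = 0.8 - 0.2 \cos \phi + 0.6 \cos \psi , \tag{11}$$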

Table 1 Torsion-energy parameters for the backbone dihedral angles φ and ψ for AMBER parm94,
AMBER parm96, AMBER parm99, CHARMM27, OPLS-AA, and OPLS-AA/L in Eq. (7)
Force field | φ: m | φ: V_m/2 (kcal/mol) | φ: γ_m (radians) | ψ: n | ψ: V_n/2 (kcal/mol) | ψ: γ_n (radians)
parm94 | 2 | 0.2 | π | 1 | 0.75 | π
 | | | | 2 | 1.35 | π
 | | | | 4 | 0.4 | π
parm96 | 1 | 0.85 | 0 | 1 | 0.85 | 0
 | 2 | 0.3 | π | 2 | 0.3 | π
parm99 | 1 | 0.8 | 0 | 1 | 1.7 | π
 | 2 | 0.85 | π | 2 | 2.0 | π
charmm | 1 | 0.2 | π | 1 | 0.6 | 0
opls-aa | 1 | −1.1825 | 0 | 1 | 0.908 | 0
 | 2 | 0.456 | π | 2 | 0.611 | π
 | 3 | −0.425 | 0 | 3 | 0.7905 | 0
opls-aal | 1 | −0.298 | 0 | 1 | 0.3715 | 0
 | 2 | 0.1395 | π | 2 | 1.254 | π
 | 3 | −2.4565 | 0 | 3 | −0.4025 | 0

$$E_{\mathrm{OPLS\text{-}AA}}(\phi, \psi) = 1.158 - 1.1825 \cos \phi - 0.456 \cos 2\phi - 0.425 \cos 3\phi + 0.908 \cos \psi - 0.611 \cos 2\psi + 0.7905 \cos 3\psi , \tag{12}$$

$$E_{\mathrm{OPLS\text{-}AA/L}}(\phi, \psi) = 0.81885 - 0.298 \cos \phi - 0.1395 \cos 2\phi - 2.4565 \cos 3\phi + 0.3715 \cos \psi - 1.254 \cos 2\psi - 0.4025 \cos 3\psi . \tag{13}$$
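
The step from the parameters in Table 1 to the explicit expressions above can be checked numerically. The following Python sketch, given only as an illustration, evaluates Eq. (7) with the AMBER parm94 entries of Table 1 and compares the result with the closed form in Eq. (8). The function and variable names are hypothetical.

```python
import math

# (number of waves, V/2 in kcal/mol, phase in radians), copied from Table 1
# for AMBER parm94.
PARM94_PHI = [(2, 0.2, math.pi)]
PARM94_PSI = [(1, 0.75, math.pi), (2, 1.35, math.pi), (4, 0.4, math.pi)]

def backbone_torsion(phi, psi, phi_terms, psi_terms):
    """Eq. (7): sum of two one-dimensional Fourier series (angles in radians)."""
    e_phi = sum(half_v * (1.0 + math.cos(m * phi - gamma))
                for m, half_v, gamma in phi_terms)
    e_psi = sum(half_v * (1.0 + math.cos(n * psi - gamma))
                for n, half_v, gamma in psi_terms)
    return e_phi + e_psi

def e_parm94_closed_form(phi, psi):
    """Eq. (8), written out explicitly."""
    return (2.7 - 0.2 * math.cos(2 * phi) - 0.75 * math.cos(psi)
            - 1.35 * math.cos(2 * psi) - 0.4 * math.cos(4 * psi))

# The two expressions should agree at any point of the Ramachandran space.
for phi, psi in [(0.0, 0.0), (1.0, -2.0), (-1.0, 0.8)]:
    difference = (backbone_torsion(phi, psi, PARM94_PHI, PARM94_PSI)
                  - e_parm94_closed_form(phi, psi))
    assert abs(difference) < 1e-12
```

The constant 2.7 in Eq. (8) is the sum of the four V/2 values, and each phase of π turns the corresponding "1 + cos" term into "1 − cos".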

The backbone torsion-energy term E(φ, ψ) in Eq. (7) is a sum of two one-
dimensional Fourier series: one is for φ and the other for ψ. The two variables φ and
ψ are decoupled, and no correlation of φ and ψ can be incorporated. On the other
hand, any periodic function of φ and ψ with period 2π can be expanded by a dou-
ble Fourier series. As a simple generalization of E(φ, ψ), we therefore proposed to
express this backbone torsion energy by the following double Fourier series [34, 35]:


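$$\mathcal{E}(\phi, \psi) = a + \sum_{m=1}^{\infty} (b_m \cos m\phi + c_m \sin m\phi) + \sum_{n=1}^{\infty} (d_n \cos n\psi + e_n \sin n\psi)$$
$$\qquad + \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} ( f_{mn} \cos m\phi \cos n\psi + g_{mn} \cos m\phi \sin n\psi + h_{mn} \sin m\phi \cos n\psi + i_{mn} \sin m\phi \sin n\psi ) . \tag{14}$$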

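Here, m and n are the numbers of waves, and $a$, $b_m$, $c_m$, $d_n$, $e_n$, $f_{mn}$, $g_{mn}$, $h_{mn}$, and $i_{mn}$
are the Fourier coefficients. This equation includes cross terms in φ and ψ, while the
original term in Eq. (7) has no mixing of φ and ψ. Therefore, our new torsion-energy
term can represent more complex energy surfaces than the conventional ones. The
Fourier coefficients, by definition, are given by

$$c = \frac{1}{\alpha} \int_{-\pi}^{\pi} d\phi \int_{-\pi}^{\pi} d\psi \; \mathcal{E}(\phi, \psi) \, x(\phi, \psi)$$
$$\quad = \left( \frac{\pi}{180} \right)^{2} \frac{1}{\alpha} \int_{-180}^{180} d\tilde{\phi} \int_{-180}^{180} d\tilde{\psi} \; \mathcal{E}\!\left( \frac{\pi}{180} \tilde{\phi}, \frac{\pi}{180} \tilde{\psi} \right) x\!\left( \frac{\pi}{180} \tilde{\phi}, \frac{\pi}{180} \tilde{\psi} \right) , \tag{15}$$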

where α are the normalization constants and x(φ, ψ) are the basis functions for the
Fourier series. Table 2 summarizes these coefficients and functions. Here, φ and ψ are
given in radians, and φ̃ and ψ̃ are in degrees (φ = (π/180) φ̃, ψ = (π/180) ψ̃). Hereafter,
angular quantities without tilde and with tilde are in radians and in degrees, respectively.
Finally, ℰ(φ, ψ) in Eq. (14) and $E_{\mathrm{rest}}$ in Eq. (6) define our torsion-energy term in
Eq. (1) [instead of Eq. (4)]:

$$E_{\mathrm{torsion}} = \mathcal{E}(\phi, \psi) + E_{\mathrm{rest}} . \tag{16}$$

The double Fourier series in Eq. (14) is particularly useful, because it describes
the backbone torsion-energy surface in the Ramachandran space. The Fourier series
can express the torsion-energy surface ℰ(φ, ψ) that was obtained by any method
including quantum chemistry calculations [13, 14, 16, 28–31].
Moreover, one can refine the existing backbone torsion-energy term and con-
trol the secondary-structure-forming tendencies of the force fields. For example,
α-helix is obtained for (φ̃, ψ̃) ≈ (−57°, −47°), 3₁₀-helix for (φ̃, ψ̃) ≈ (−49°,
−26°), π-helix for (φ̃, ψ̃) ≈ (−57°, −70°), parallel β-sheet for (φ̃, ψ̃) ≈ (−119°,
113°), antiparallel β-sheet for (φ̃, ψ̃) ≈ (−139°, 135°), and so on [36]. Hence, if

Table 2 Fourier coefficients c, normalization constants α, and the basis functions x(φ, ψ) for the
double Fourier series of the backbone torsion energy ℰ(φ, ψ) in Eqs. (14) and (15)
c | α | x(φ, ψ)
a | 4π² | 1
b_m | 2π² | cos mφ
c_m | 2π² | sin mφ
d_n | 2π² | cos nψ
e_n | 2π² | sin nψ
f_mn | π² | cos mφ cos nψ
g_mn | π² | cos mφ sin nψ
h_mn | π² | sin mφ cos nψ
i_mn | π² | sin mφ sin nψ

the existing force field gives, say, too little α-helix-forming tendency compared
to experimental results, one can lower the backbone torsion-energy surface near
(φ̃, ψ̃) = (−57°, −47°) in order to enhance α-helix formation.
We can thus write

$$\mathcal{E}(\phi, \psi) = E(\phi, \psi) - f(\phi, \psi) , \tag{17}$$

where E(φ, ψ) is the existing backbone torsion-energy term that we want to refine
and f (φ, ψ) is a function that has peaks around the corresponding regions where
specific secondary structures are to be enhanced. There are many possible choices
for f(φ, ψ). For instance, one can use the following function when one wants to
lower the torsion-energy surface in a single region near (φ, ψ) = (φ0, ψ0):

$$f(\phi, \psi) = \begin{cases} A \exp\!\left( \dfrac{B}{(\phi - \phi_0)^2 + (\psi - \psi_0)^2 - r_0^{\,2}} \right) , & \text{for } (\phi - \phi_0)^2 + (\psi - \psi_0)^2 < r_0^{\,2} , \\[2ex] 0 , & \text{otherwise} , \end{cases} \tag{18}$$

where A, B, and r0 are constants that we adjust for refinement. In this case, the energy
surface is lowered by f (φ, ψ) in a circular region of radius r0 , which is centered at
(φ, ψ) = (φ0 , ψ0 ). Note that we should also impose periodic boundary conditions
on f (φ, ψ).
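
As an illustration only, the localized function of Eq. (18) together with the periodic boundary conditions can be sketched in Python as follows. Wrapping each angular difference into [−π, π) is one possible way to impose periodicity and is an assumption of this sketch, as are the function names and the values of A, B, and r0 in the usage lines.

```python
import math

def wrap(delta):
    """Map an angular difference (radians) into the interval [-pi, pi)."""
    return (delta + math.pi) % (2.0 * math.pi) - math.pi

def bump(phi, psi, phi0, psi0, a, b, r0):
    """Eq. (18) with periodic boundary conditions in phi and psi (radians).

    Returns A*exp(B / (d^2 - r0^2)) inside the circle of radius r0 around
    (phi0, psi0) and 0 outside, where d is the wrapped angular distance.
    """
    d2 = wrap(phi - phi0) ** 2 + wrap(psi - psi0) ** 2
    if d2 >= r0 ** 2:
        return 0.0
    return a * math.exp(b / (d2 - r0 ** 2))

# Hypothetical parameters: a well centered on the alpha-helix region.
phi0, psi0 = math.radians(-57.0), math.radians(-47.0)
a, b, r0 = 1.0, 0.5, math.radians(40.0)
depth_at_center = bump(phi0, psi0, phi0, psi0, a, b, r0)  # A * exp(-B / r0^2)
```

For positive A and B the function is non-negative, largest at the center, and goes smoothly to zero at the edge of the circle, so subtracting it as in Eq. (17) lowers the surface only locally.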
We then express ℰ(φ, ψ) in Eq. (17) in terms of the double Fourier series in
Eq. (14), where the Fourier coefficients are obtained from Eq. (15). Hence, we can
fine-tune the backbone torsion-energy term by the above procedure so that it yields
correct secondary-structure-forming tendencies.
A remark about the computation time is now in order. It may appear that the
introduction of the double Fourier series greatly increases the computation time,
because the number of terms is much larger. However, because most
of the computation time for the force-field evaluations is spent in the calculations
of distances between pairs of atoms in the system, the increase in computation time
due to the double Fourier series is essentially negligible compared to these main
computational efforts.

2.2.2 Amino-Acid-Dependent Main-Chain Torsion-Energy Terms [40]

By writing the dihedral-angle dependence of the parameters explicitly, we can rewrite
the torsion-energy term in Eq. (4) as

$$E_{\mathrm{torsion}} = \sum_{\Phi} \sum_{n} \frac{V_n(\Phi)}{2} \left[ 1 + \cos\!\left( n\Phi - \gamma_n(\Phi) \right) \right] , \tag{19}$$

where the first summation is taken over all dihedral angles Φ (both in the main
chain and in the side chains), n is the number of waves, γn is the phase, and Vn is
the Fourier coefficient. Namely, the energy term E torsion has γn (Φ) and Vn (Φ) as
force-field parameters.

We can further write the torsion-energy term as

$$E_{\mathrm{torsion}} = E_{\mathrm{torsion}}^{(\mathrm{MC})} + E_{\mathrm{torsion}}^{(\mathrm{SC})} , \tag{20}$$

where $E_{\mathrm{torsion}}^{(\mathrm{MC})}$ and $E_{\mathrm{torsion}}^{(\mathrm{SC})}$ are the torsion-energy terms for dihedral angles around
main-chain bonds and around side-chain bonds, respectively. Examples of the dihedral
angles in $E_{\mathrm{torsion}}^{(\mathrm{MC})}$ are φ (C–N–Cα–C), ψ (N–Cα–C–N), φ′ (Cβ–Cα–N–C), ψ′
(Cβ–Cα–C–N), and ω (Cα–C–N–Cα). The force-field parameters in $E_{\mathrm{torsion}}^{(\mathrm{SC})}$ can readily
depend on amino-acid residues. However, those in $E_{\mathrm{torsion}}^{(\mathrm{MC})}$ are usually taken to
be independent of amino-acid residues and the common parameter values are used
for all the amino-acid residues (except for proline). This is because the amino-acid
dependence of the force field is believed to be taken care of by the very existence of
side chains. In Table 3, we list examples of the parameter values for ψ (N–Cα–C–N)
and ψ′ (Cβ–Cα–C–N) in general AMBER force fields.
However, this amino-acid independence of the main-chain torsion-energy terms
is not an absolute requirement, because we are representing the entire force field
by a rather small number of classical-mechanical terms. In order to reproduce the
exact quantum-mechanical contributions, one can introduce amino-acid dependence
on any force-field term including the main-chain torsion-energy terms. Hence, we
can generalize $E_{\mathrm{torsion}}^{(\mathrm{MC})}$ in Eq. (20) from the expression in Eq. (19) to the following
amino-acid-dependent form:

Table 3 Torsion-energy parameters ($V_n$ and $\gamma_n$) for the main-chain dihedral angles ψ and ψ′ in
Eq. (19) for the original AMBER ff94, ff96, ff99, ff99SB, and ff03 force fields. The values are
common among the amino-acid residues for each force field. Only the parameters for non-zero $V_n$
are listed
Force field | ψ (N–Cα–C–N): n | V_n/2 | γ_n | ψ′ (Cβ–Cα–C–N): n | V_n/2 | γ_n
ff94 | 1 | 0.75 | π | 2 | 0.07 | 0
 | 2 | 1.35 | π | 4 | 0.10 | 0
 | 4 | 0.40 | π | | |
ff96 | 1 | 0.85 | 0 | 2 | 0.07 | 0
 | 2 | 0.30 | π | 4 | 0.10 | 0
ff99 | 1 | 1.70 | π | 2 | 0.07 | 0
 | 2 | 2.00 | π | 4 | 0.10 | 0
ff99SB | 1 | 0.45 | π | 1 | 0.20 | 0
 | 2 | 1.58 | π | 2 | 0.20 | 0
 | 3 | 0.55 | π | 3 | 0.40 | 0
ff03 | 1 | 0.6839 | π | 1 | 0.7784 | π
 | 2 | 1.4537 | π | 2 | 0.0657 | π
 | 3 | 0.4615 | π | 3 | 0.0560 | 0
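
$$E_{\mathrm{torsion}}^{(\mathrm{MC})} = \sum_{k=1}^{20} \sum_{\Phi_{\mathrm{MC}}^{(k)}} \sum_{n} \frac{V_n\!\left( \Phi_{\mathrm{MC}}^{(k)} \right)}{2} \left[ 1 + \cos\!\left( n \Phi_{\mathrm{MC}}^{(k)} - \gamma_n\!\left( \Phi_{\mathrm{MC}}^{(k)} \right) \right) \right] , \tag{21}$$

where k (= 1, 2, ..., 20) is the label for the 20 kinds of amino-acid residues and $\Phi_{\mathrm{MC}}^{(k)}$
are dihedral angles around the main-chain bonds in the k-th amino-acid residue.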

2.3 Optimization of Force-Field Parameters

2.3.1 Use of Force Acting on Each Atom with the PDB Coordinates
[25–27, 41]

In the previous section, we presented functional forms of the force fields. Given a
fixed set of force-field functions, we try to optimize a certain set of parameters in the
force fields without changing the functional forms.
Our optimization method for these force-field parameters is now described [25].
We first retrieve N native structures (one structure per protein) from PDB. We try to
choose proteins from different folds (such as all α-helix, all β-sheet, α/β, etc.) and
different homology classes as much as possible. If the force-field parameters are of
ideal values, then all the chosen native structures are stable without any force acting
on each atom in the molecules on the average. Hence, we expect

$$F = 0 , \tag{22}$$

where

$$F = \sum_{m=1}^{N} \frac{1}{N_m} \sum_{i_m=1}^{N_m} \left| \vec{f}_{i_m} \right|^{2} , \tag{23}$$

and

$$\vec{f}_{i_m} = - \frac{\partial E_{\mathrm{tot}}^{\{m\}}}{\partial \vec{x}_{i_m}} . \tag{24}$$

Here, $N_m$ is the total number of atoms in molecule m, $E_{\mathrm{tot}}^{\{m\}}$ is the total potential
energy for molecule m, $\vec{x}_{i_m}$ is the Cartesian coordinate vector of atom $i_m$, and $\vec{f}_{i_m}$ is the
force acting on atom $i_m$. In reality, F ≠ 0, and because F ≥ 0, we can optimize the
force-field parameters by minimizing F with respect to these parameters. In practice,
we perform a simulation in the force-field parameter space for this minimization.
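
To make the definition of F concrete, the following Python sketch evaluates Eqs. (23) and (24) for a set of molecules. It is a toy illustration under stated assumptions: forces are estimated by central finite differences rather than analytically, a single energy function is used for all molecules, and the two-atom harmonic "force field" with its parameters is hypothetical.

```python
import math

def numerical_forces(energy, coords, h=1e-5):
    """Central-difference estimate of f_i = -dE/dx_i, Eq. (24).

    coords is a list of [x, y, z] positions; energy maps such a list to a float.
    """
    forces = []
    for i in range(len(coords)):
        f_i = []
        for k in range(3):
            plus = [list(c) for c in coords]
            minus = [list(c) for c in coords]
            plus[i][k] += h
            minus[i][k] -= h
            f_i.append(-(energy(plus) - energy(minus)) / (2.0 * h))
        forces.append(f_i)
    return forces

def force_score(energy, molecules):
    """Eq. (23): sum over molecules of the mean squared force per atom."""
    total = 0.0
    for coords in molecules:
        forces = numerical_forces(energy, coords)
        total += sum(fx * fx + fy * fy + fz * fz
                     for fx, fy, fz in forces) / len(coords)
    return total

def toy_bond_energy(coords, k=100.0, r_eq=1.5):
    """Hypothetical two-atom harmonic bond, E = k (r - r_eq)^2."""
    r = math.dist(coords[0], coords[1])
    return k * (r - r_eq) ** 2

relaxed = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]    # at equilibrium, F is about 0
stretched = [[0.0, 0.0, 0.0], [1.6, 0.0, 0.0]]  # |f| = 2k * 0.1 = 20 on each atom
```

In this toy case `force_score(toy_bond_energy, [stretched])` is close to 400, and changing `r_eq` towards 1.6 lowers it, which is the sense in which minimizing F adapts the parameters to the reference structures.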
Proteins are usually in aqueous solution, and hence we also have to incorporate
some kind of solvent effects. Because the force-field parameter optimizations are
expected to improve as the total number of proteins (N) increases, we want to
keep the cost of calculating the solvent effects low. Here, we employ the
generalized-Born/surface area (GB/SA) terms for the solvent contributions [42, 43].
Hence, we use in Eq. (24) (we suppress the label m for each molecule)

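$$E_{\mathrm{tot}} = E_{\mathrm{conf}} + E_{\mathrm{solv}} , \tag{25}$$

where

$$E_{\mathrm{solv}} = E_{\mathrm{GB}} + E_{\mathrm{SA}} , \tag{26}$$

$$E_{\mathrm{GB}} = -166 \left( 1 - \frac{1}{\varepsilon_s} \right) \sum_{i,j} \frac{q_i q_j}{\sqrt{ r_{ij}^{2} + \alpha_{ij}^{2} \, e^{-D_{ij}} }} , \tag{27}$$

$$E_{\mathrm{SA}} = \sum_{k} \sigma_k A_k . \tag{28}$$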

Namely, in the GB/SA model, the total solvation free energy in Eq. (26) is given by
the sum of a solute-solvent electrostatic polarization term, a solvent-solvent cavity
term, and a solute-solvent van der Waals term. The solute-solvent electrostatic polar-
ization term can be calculated by the generalized Born equation in Eq. (27), where
$\alpha_{ij} = \sqrt{\alpha_i \alpha_j}$, $\alpha_i$ is the so-called Born radius of atom i, $D_{ij} = r_{ij}^{2} / (2\alpha_{ij})^{2}$, and $\varepsilon_s$
is the dielectric constant of bulk water (we take $\varepsilon_s$ = 78.3). The solvent-solvent cav-
ity term and the solute-solvent van der Waals term can be approximated by the term
in Eq. (28) that is proportional to the solvent-accessible surface area. Here, $A_k$ is
the total solvent-accessible surface area of atoms of type k and $\sigma_k$ is an empirically
determined proportionality constant [42, 43].
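
A direct transcription of the generalized Born term in Eq. (27) is sketched below in Python, for illustration only. It assumes that the sum runs over all ordered pairs including i = j, which is the usual reading of the prefactor 166 = 332/2, and that the Born radii are already known. Computing the Born radii and the surface areas in Eq. (28) is outside the scope of this sketch, and the function name is hypothetical.

```python
import math

def gb_energy(charges, born_radii, coords, eps_s=78.3):
    """Generalized Born polarization energy, Eq. (27), in kcal/mol.

    charges are in units of the electronic charge; born_radii and coords
    are in Angstrom. The double sum includes the self terms i = j (assumed).
    """
    prefactor = -166.0 * (1.0 - 1.0 / eps_s)
    total = 0.0
    n_atoms = len(charges)
    for i in range(n_atoms):
        for j in range(n_atoms):
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            alpha2 = born_radii[i] * born_radii[j]   # alpha_ij^2 = alpha_i * alpha_j
            d_ij = r2 / (4.0 * alpha2)               # D_ij = r_ij^2 / (2 alpha_ij)^2
            total += charges[i] * charges[j] / math.sqrt(r2 + alpha2 * math.exp(-d_ij))
    return prefactor * total

# A single ion of unit charge and Born radius 2 Angstrom reduces to the Born formula.
single_ion = gb_energy([1.0], [2.0], [[0.0, 0.0, 0.0]])
```

For a single ion only the self term survives and the expression reduces to −166 (1 − 1/εs) q²/α, which gives a quick sanity check of the prefactor and of the definition of $\alpha_{ij}$.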
The flowchart of our method for the optimization of force-field parameters is
shown in Fig. 1.
In Step 1 of the flowchart we try to obtain as many structures as possible from
PDB. The number is limited by the computer power that we have available in our
laboratory. We want to choose proteins with different sizes (numbers of amino acids),
different folds, and different homology classes as much as possible. We also want to
use only those with high experimental resolutions. Note that only atomic coordinates
of proteins are extracted from PDB (and coordinates from other molecules such as
crystal water are neglected).
If we use data from X-ray experiments, hydrogen atoms are missing, and thus
in Step 2 we have to add hydrogen coordinates. Many protein simulation software
packages provide routines that add hydrogen atoms to the PDB coordinates,
and one can use any of these routines.
We now have N protein coordinates ready, but usually such “raw data” result in
very high total potential energy and strong forces will be acting on some of the atoms
in the molecules. This is because the hydrogen coordinates that we added as above are
not based on experimental results and have rather large uncertainties. The coordinates
of heavy atoms from PDB also have experimental errors. We take the position that
we leave the coordinates of heavy atoms as they are in PDB as much as possible,
and adjust the hydrogen coordinates to reduce this mismatch. This is why we want

1. Retrieve N native structures (one structure per protein) from PDB

2. Add hydrogen atoms if not available in PDB

3. Refine each structure in 2. by minimizing the total potential energy
(with the optimized force-field parameters if already optimized)
with respect to their coordinates with predefined constraints on coordinates

4. Optimize the first set of force-field parameters by minimizing F in Eq. (23) (calculated
from the refined structures obtained in 3.) with respect to these first set of parameters

5. Refine each structure in 2. by minimizing the total potential energy
(with the optimized force-field parameters) with respect to their coordinates
with predefined constraints on coordinates

6. Optimize the second set of force-field parameters by minimizing F in Eq. (23) (calculated
from the refined structures obtained in 5.) with respect to these second set of parameters

Convergent? No: repeat the refinement and optimization steps. Yes: new force-field parameters

Fig. 1 The flowchart of our method for the optimization of force-field parameters

to include as many PDB data as possible with high experimental resolutions (so that
the effects of experimental errors in PDB may be minimal). We thus minimize the
total potential energy E tot = E conf + E solv + E constr with respect to the coordinates
for each protein conformation, where E constr is the constraint energy term that is
imposed on the heavy atoms in PDB (it is referred to as the “predefined constraints”
in Steps 3 and 5 in Fig. 1):

$$E_{\mathrm{constr}} = \sum_{\text{heavy atom}} K_x \, (\vec{x} - \vec{x}_0)^2 . \tag{29}$$

Here, K x is the force constant of the restriction, and x0 are the original coordinate
vectors of heavy atoms in PDB. Because we are searching for the nearest local-
minimum states, usual minimization routines such as the conjugate-gradient method
and Newton-Raphson method can be employed here. As one can see from Eq. (29),
the coordinates of hydrogen atoms will be mainly adjusted, but unnaturally displaced
heavy-atom coordinates will also be modified.
Given N sets of “ideal” reference coordinates in Step 3 of the flowchart, we now
optimize the first set of force-field parameters in Step 4. In Eq. (1) we have five classes
of force-field parameters as mentioned above. Namely, the force-field parameters are
those in the bond-stretching term ($K_{\ell}$ and $\ell_{\mathrm{eq}}$), those in the bond-bending term ($K_{\theta}$
and $\theta_{\mathrm{eq}}$), those in the torsion term ($V_n$ and $\gamma_n$), those in the Lennard-Jones term ($A_{ij}$
and $B_{ij}$), and those in the electrostatic term ($q_i$). Because they are of very different
nature, we believe that it is better to optimize these classes of force-field parameters
separately (as in Steps 4, 6, and so on in Fig. 1). Note also that if we optimize all
the parameters simultaneously, the null result (with all the parameter values equal to
zero) is a solution to Eq. (22). This is the main reason why we optimize each class
of parameters separately.
For each set of force-field parameters, the optimization is carried out by minimiz-
ing F in Eq. (23) with respect to these parameters. Here, E tot in Eq. (24) is given by
Eq. (25). For this purpose usual minimization routines such as the conjugate-gradient
method are not adequate, because we need a global optimization. One should employ
more powerful methods such as simulated annealing [44] and generalized-ensemble
algorithms [4]. We perform this minimization simulation in the above parameter
space to obtain the parameter values that give the global minimum of F.
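
The idea of a simulation in parameter space can be illustrated with a generic simulated-annealing loop. The Python sketch below is a textbook Metropolis scheme with a geometric cooling schedule, not the protocol used in the works reviewed here. The score function, step size, temperatures, and number of steps are hypothetical placeholders; in practice the score would be F in Eq. (23) evaluated for the trial parameters.

```python
import math
import random

def simulated_annealing(score, start, step=0.1, t_start=1.0, t_end=1e-3,
                        n_steps=5000, seed=0):
    """Minimize score(params) by Metropolis moves with geometric cooling.

    Returns the best parameter vector visited and its score.
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = score(current)
    best, best_score = list(current), current_score
    for k in range(n_steps):
        # Temperature decreases geometrically from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (k / (n_steps - 1))
        trial = list(current)
        index = rng.randrange(len(trial))
        trial[index] += rng.uniform(-step, step)   # perturb one parameter
        trial_score = score(trial)
        delta = trial_score - current_score
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score

def toy_score(params):
    """Hypothetical stand-in for F: a quadratic bowl with minimum at (1, -2)."""
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2

best_params, best_value = simulated_annealing(toy_score, [0.0, 0.0])
```

Accepting some uphill moves at high temperature is what allows such a search to leave local minima, which is the reason a plain conjugate-gradient minimization is not adequate for this step.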
These processes are repeated until the optimized force-field parameters converge.
We can, in principle, optimize all the force-field parameters following the flowchart
in Fig. 1. In the examples given below, however, we just optimize two classes of the
force-field parameters for simplicity; namely, the partial charges and the backbone
torsion-energy parameters. For the optimization of the partial charges (qi ), we impose
a condition that the total charge of each amino acid remains constant, which is the
usual assumption adopted by the force fields of Eq. (1) based on classical mechanics.
As for the main-chain torsion-energy parameters, we use the following functional
form for each backbone dihedral angle φ and ψ [see Eq. (4)]:

$$E_{\Phi = \phi, \psi} = \frac{V_a}{2} \left[ 1 + \cos(n_a \Phi - \gamma_a) \right] + \frac{V_b}{2} \left[ 1 + \cos(n_b \Phi - \gamma_b) \right] + \frac{V_c}{2} \left[ 1 + \cos(n_c \Phi - \gamma_c) \right] . \tag{30}$$

We optimize only the parameters ($V_a$, $V_b$, and $V_c$) and fix the number of waves ($n_a$,
$n_b$, and $n_c$) and the phases ($\gamma_a$, $\gamma_b$, and $\gamma_c$) as in the original force field. This torsion-
energy parameter optimization strongly depends on the value of the force constant
$K_x$ of the constraint energy in Eq. (29). The larger the value of $K_x$ is, the larger
those of $V_a$, $V_b$, and $V_c$ tend to be. In order to minimize such dependence, we impose
the constraint that the total area enclosed by the curve of $|E_{\Phi}|$ (from Φ = −180° to
180°) remains less than or equal to the original value during the optimization.
We believe that these two classes of parameters have the most uncertainty among
all the force-field parameters. This is because partial charges are usually obtained
by quantum chemistry calculations of an isolated amino acid in vacuum separately,
which is a very different condition from that in amino acids of proteins in aqueous
solution, and because the torsion-energy term is the most problematic (for instance,
the parm94, parm96, and parm99 versions of AMBER differ mainly in backbone
torsion-energy parameters).
Moreover, when we perform the optimizations of force-field parameters by using
F in Eq. (23), we can neglect unnaturally large forces acting on atoms in order to
remove the errors of PDB structures. Namely, we can exclude the term for $\vec{f}_{i_m}$ in
Eq. (23) that satisfies

$$\left| \vec{f}_{i_m} \right| > f_{\mathrm{cut}} . \tag{31}$$

We determine the cutoff value $f_{\mathrm{cut}}$ by using the following function:

$$\Phi\mathrm{RMSD} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \Phi_i^{\mathrm{native}} - \Phi_i^{\mathrm{min}} \right)^2 } . \tag{32}$$

Here, n is the total number of backbone dihedral angles (φ and ψ angles) in all
molecules, $\Phi_i^{\mathrm{native}}$ is the i-th backbone dihedral angle of the native structures and $\Phi_i^{\mathrm{min}}$
is the corresponding i-th backbone dihedral angle of the minimized structures using
the trial force-field parameters. The optimal value of $f_{\mathrm{cut}}$ is chosen so that ΦRMSD
is the minimal value with $f_{\mathrm{cut}} \le f_{\mathrm{cut}}^{\mathrm{max}}$, where $f_{\mathrm{cut}}^{\mathrm{max}}$ is obtained in an appropriate way
(see an example below).

2.3.2 Use of CRMSD [39]

We now describe our second method for optimizing the force-field parameters. We
use N proteins again from PDB. If the force-field parameters are of ideal values,
we expect that all the chosen native structures minimized by the ideal force field do
not change after minimizations. Namely, we believe that force-field parameters are
better if they have smaller deviations obtained by minimizations of protein structures.
Hence, we expect

$$\mathrm{CRMSD} = 0 , \tag{33}$$

where

$$\mathrm{CRMSD} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{RMSD}_i . \tag{34}$$

Here, $\mathrm{RMSD}_i$ is the root-mean-square deviation of coordinates between the native
structure of protein i and the corresponding minimized structure using the trial force-
field parameters. In reality, CRMSD ≠ 0, and because CRMSD ≥ 0, we expect that
we can optimize the force-field parameters by minimizing CRMSD with respect to
these force-field parameters. In practice, we perform a simulation in the force-field
parameter space for this minimization. Namely, in the previous method we minimize
F in Eq. (23), and in the present method we minimize CRMSD in Eq. (34) instead.
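
As a small illustration of Eq. (34), the following Python sketch averages per-protein coordinate RMSD values. It assumes that the native and minimized coordinates are already expressed in a common frame, so no superposition is performed. Whether and how structures are superposed before computing each RMSD is not specified here and is left out. The function names and the sample coordinates are hypothetical.

```python
import math

def rmsd(reference, structure):
    """Root-mean-square deviation between two coordinate lists (same atom order).

    No superposition is applied; the structures are assumed to share a frame.
    """
    squared = sum((x - y) ** 2
                  for p, q in zip(reference, structure)
                  for x, y in zip(p, q))
    return math.sqrt(squared / len(reference))

def crmsd(natives, minimized):
    """Eq. (34): average of RMSD_i over the N proteins."""
    values = [rmsd(a, b) for a, b in zip(natives, minimized)]
    return sum(values) / len(values)

# Two hypothetical "proteins": one unchanged, one shifted by 1 Angstrom along x.
native_1 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
native_2 = [[0.0, 0.0, 0.0], [0.0, 1.5, 0.0]]
minimized_1 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
minimized_2 = [[1.0, 0.0, 0.0], [1.0, 1.5, 0.0]]
score = crmsd([native_1, native_2], [minimized_1, minimized_2])  # (0 + 1) / 2
```

Each evaluation of this score for a trial parameter set requires minimizing all N structures first, which is where most of the cost of this method lies.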

2.3.3 Use of ΦRMSD [38]

We now describe our third method for optimizing the force-field parameters. We
first select N proteins from PDB as in the previous two methods. If the force-field
parameters are of ideal values, we expect that all the chosen native structures min-
imized by the ideal force field do not change. Namely, we believe that force-field
parameters are better, if they have lower deviations obtained from minimizations of
protein structures. Hence, we expect

$$\Phi\mathrm{RMSD} = 0 , \tag{35}$$

where

$$\Phi\mathrm{RMSD} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \Phi_i^{\mathrm{native}} - \Phi_i^{\mathrm{min}} \right)^2 } . \tag{36}$$

Here, n is the total number of backbone dihedral angles (φ and ψ angles) in all
molecules, $\Phi_i^{\mathrm{native}}$ is the i-th backbone dihedral angle of the native structures and $\Phi_i^{\mathrm{min}}$
is the corresponding i-th backbone dihedral angle of the minimized structures using
the trial force-field parameters. In reality, ΦRMSD ≠ 0, and because ΦRMSD ≥ 0, we
expect that we can optimize the force-field parameters by minimizing ΦRMSD with
respect to these force-field parameters. In practice, we perform a simulation in the
force-field parameter space for this minimization.
However, our first aim is to determine the balance of secondary-structure-forming
tendencies such as helix structure and β-sheet structure. Moreover, it is difficult
to perform the minimization of ΦRMSD in a wider force-field parameter space until
ΦRMSD is close to 0 because of the computational cost. Therefore, we only focus on
secondary-structure regions of helix structure and β-sheet structure in the amino-acid
sequence. Namely, we only consider the backbone dihedral angles of residues in the
native structures that are identified by the DSSP program [45] as belonging to
one of α-helix, 3₁₀-helix, π-helix, and β-sheet structures. We calculate two kinds
of ΦRMSD for secondary structures, namely, $\Phi\mathrm{RMSD}_{\mathrm{helix}}$ and $\Phi\mathrm{RMSD}_{\beta}$. Here,
$\Phi\mathrm{RMSD}_{\mathrm{helix}}$ stands for ΦRMSD of backbone dihedral angles of residues which
have helix structures in the native structures, and $\Phi\mathrm{RMSD}_{\beta}$ means that of only
β-sheet structures in the native structures. Using these two ΦRMSDs, we want to
optimize the torsion-energy parameters, which will have better balance of secondary-
structure-forming tendencies. We propose the following combination:

$$\Phi\mathrm{RMSD}_{\mathrm{2ndry}} = \lambda \, \Phi\mathrm{RMSD}_{\mathrm{helix}} + \Phi\mathrm{RMSD}_{\beta} , \tag{37}$$

where we have introduced a fixed scaling factor λ.
Finally, by minimizing $\Phi\mathrm{RMSD}_{\mathrm{2ndry}}$ with respect to the force-field parameters,
we can obtain the optimized force-field parameters.
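
The dihedral-angle score of Eq. (36) and the combination in Eq. (37) can be sketched in Python as follows. Wrapping each angular difference into [−180°, 180°) is an assumption of this sketch, made so that, for example, 179° and −179° count as 2° apart. The function names, the sample angles, and the value of λ are hypothetical.

```python
import math

def wrap_deg(delta):
    """Map an angular difference in degrees into [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0

def phi_rmsd(native_angles, minimized_angles):
    """Eq. (36) for two equally long lists of backbone dihedral angles (degrees)."""
    squared = sum(wrap_deg(a - b) ** 2
                  for a, b in zip(native_angles, minimized_angles))
    return math.sqrt(squared / len(native_angles))

def phi_rmsd_2ndry(helix_native, helix_min, beta_native, beta_min, lam):
    """Eq. (37): lam * PhiRMSD_helix + PhiRMSD_beta."""
    return (lam * phi_rmsd(helix_native, helix_min)
            + phi_rmsd(beta_native, beta_min))

# Hypothetical angles: helix residues drift by 3 degrees, sheet residues by 4.
helix_native, helix_min = [-57.0, -47.0], [-60.0, -44.0]
beta_native, beta_min = [-139.0, 135.0], [-135.0, 131.0]
combined = phi_rmsd_2ndry(helix_native, helix_min, beta_native, beta_min, lam=2.0)
```

With these sample values the helix and sheet contributions are 3° and 4°, so λ = 2 gives 10°. The factor λ sets how strongly deviations in helical residues are weighted relative to those in β-sheet residues.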

2.3.4 Use of Short MD Simulations [46]

We now describe our fourth method for optimizing the force-field parameters. In this
method, we prepare M protein structures, which are some experimentally determined
conformations. For these proteins, we perform MD simulations, which start from the
experimental conformations, by using a trial force field. We try to perform MD
simulations with varied values of force-field parameters. After that, we estimate the
“S” value defined by the following function from the trajectories of the M proteins
obtained from the trial MD simulations:

$$S = \sum_{i=1}^{M} \left( \frac{n_i^{\mathrm{S \to U}}}{N_i^{\mathrm{S}}} + \frac{n_i^{\mathrm{U \to S}}}{N_i^{\mathrm{U}}} \right) . \tag{38}$$

Here, $n_i^{\mathrm{S \to U}}$ is the number of amino acids in protein i whose structures
in PDB (initial conformation) had some secondary structure (such as α-helix, 3₁₀-
helix, π-helix, and β structures) but transformed into unstructured coil structures
without any secondary structure after a short MD simulation. Likewise, $n_i^{\mathrm{U \to S}}$
is the number of amino acids in protein i whose structures in PDB had coil
structures but transformed to have some secondary structure after an MD simulation.
$N_i^{\mathrm{S}}$ is the total number of amino acids in protein i which have some secondary
structure in PDB, and $N_i^{\mathrm{U}}$ is the total number of amino acids in protein i which have
coil structures in PDB.
When we calculate the S values for the conformations obtained from MD sim-
ulations by using trial force-field parameters, the parameter set that yields the
minimum S value is considered to give the optimized force field.
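
For illustration, the S value of Eq. (38) can be computed from per-residue secondary-structure assignments before and after the MD simulation. The Python sketch below assumes that each protein is described by two equally long strings of one-letter codes with "C" denoting coil, and that a term with a zero denominator is skipped. The encoding, the function name, and the sample strings are hypothetical.

```python
def s_value(proteins, coil="C"):
    """Eq. (38) for a list of (before, after) secondary-structure strings.

    Any character other than `coil` counts as "some secondary structure".
    """
    total = 0.0
    for before, after in proteins:
        n_structured = sum(1 for c in before if c != coil)    # N_i^S
        n_coil = len(before) - n_structured                   # N_i^U
        s_to_u = sum(1 for b, a in zip(before, after)
                     if b != coil and a == coil)              # n_i^{S->U}
        u_to_s = sum(1 for b, a in zip(before, after)
                     if b == coil and a != coil)              # n_i^{U->S}
        if n_structured:
            total += s_to_u / n_structured
        if n_coil:
            total += u_to_s / n_coil
    return total

# One hypothetical protein: two helical residues unfold, two coil residues form strand.
example = s_value([("HHHHCCCC", "HHCCCCEE")])  # 2/4 + 2/4
```

A transition between two different structured states (for instance helix to strand) does not contribute in this sketch, because Eq. (38) counts only changes between structured and coil states.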

3 Examples of Optimizations of Force-Field Parameters

3.1 New Torsion-Energy Terms

3.1.1 Representation by a Double Fourier Series [34, 35]

We now present various examples of our refinements of force-field parameters. We
first consider the following truncated double Fourier series [see Eq. (14)]:

$$\mathcal{E}(\phi, \psi) = a + b_1 \cos \phi + c_1 \sin \phi + b_2 \cos 2\phi + c_2 \sin 2\phi + b_3 \cos 3\phi + c_3 \sin 3\phi$$
$$\quad + d_1 \cos \psi + e_1 \sin \psi + d_2 \cos 2\psi + e_2 \sin 2\psi + d_3 \cos 3\psi + e_3 \sin 3\psi$$
$$\quad + f_{11} \cos \phi \cos \psi + g_{11} \cos \phi \sin \psi + h_{11} \sin \phi \cos \psi + i_{11} \sin \phi \sin \psi$$
$$\quad + f_{21} \cos 2\phi \cos \psi + g_{21} \cos 2\phi \sin \psi + h_{21} \sin 2\phi \cos \psi + i_{21} \sin 2\phi \sin \psi$$
$$\quad + f_{12} \cos \phi \cos 2\psi + g_{12} \cos \phi \sin 2\psi + h_{12} \sin \phi \cos 2\psi + i_{12} \sin \phi \sin 2\psi$$
$$\quad + f_{22} \cos 2\phi \cos 2\psi + g_{22} \cos 2\phi \sin 2\psi + h_{22} \sin 2\phi \cos 2\psi + i_{22} \sin 2\phi \sin 2\psi . \tag{39}$$

This function has 29 Fourier-coefficient parameters. We will see below that this
number of Fourier terms is sufficient for most of our purposes.
We first check how well the truncated Fourier series in Eq. (39) can reproduce the
six original backbone torsion-energy terms in Eqs. (8)–(13). Because these functions
are already sums of one-dimensional Fourier series and are thus subsets of the double
Fourier series in Eq. (14), the Fourier coefficients in Eq. (15) can be calculated
analytically and agree with those in Eqs. (8)–(13) except for the last one (that for
cos 4ψ) in Eq. (8). This term is missing in Eq. (39). These cases thus give us a good
test of the numerical integrations in Eq. (15). The numerical integrations were evaluated
as follows. We divided the Ramachandran space (−180° < φ̃ < 180°, −180° < ψ̃ < 180°)
into unit square cells of side length ε̃ (in degrees). Hence, there are (360/ε̃)²
unit cells altogether. The double integral on the right-hand side of Eq. (15) was
approximated by the sum of
$E\!\left(\tfrac{\pi}{180}\tilde\phi, \tfrac{\pi}{180}\tilde\psi\right) x\!\left(\tfrac{\pi}{180}\tilde\phi, \tfrac{\pi}{180}\tilde\psi\right) \times (\tilde\varepsilon)^2$,
where each product
$E\!\left(\tfrac{\pi}{180}\tilde\phi, \tfrac{\pi}{180}\tilde\psi\right) x\!\left(\tfrac{\pi}{180}\tilde\phi, \tfrac{\pi}{180}\tilde\psi\right)$
was evaluated at one of the four corners of each unit cell. We tried two values of
ε̃ (1° and 10°). Both cases gave almost complete agreement of the Fourier coefficients
with the results of the analytical integrations (see, for example, Table 4).
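
The corner-sampled sum described above can be sketched as follows. This is a minimal illustration, not the authors' code. Each coefficient is obtained by projecting the energy onto one basis function and dividing by the discrete norm of that basis function, which is an assumption standing in for the normalization of Eq. (15). The test energy uses the nonzero org94 coefficients of Table 4 plus a cos 4ψ term whose amplitude (0.4) is taken only for illustration.

```python
import numpy as np

def energy(phi, psi):
    """Illustrative parm94-like backbone torsion energy (kcal/mol), angles in radians."""
    return (2.7 - 0.2 * np.cos(2 * phi) - 0.75 * np.cos(psi)
            - 1.35 * np.cos(2 * psi) - 0.4 * np.cos(4 * psi))

def project(energy_fn, basis_fn, bin_deg):
    """Corner-sampled sum over unit cells of side bin_deg, normalized by the basis norm."""
    n_bins = int(round(360.0 / bin_deg))
    # One corner (the lower-left one) of every unit cell in [-180, 180) degrees.
    corners = np.deg2rad(-180.0 + bin_deg * np.arange(n_bins))
    phi, psi = np.meshgrid(corners, corners, indexing="ij")
    cell_area = bin_deg ** 2
    basis = basis_fn(phi, psi)
    numerator = np.sum(energy_fn(phi, psi) * basis) * cell_area
    norm = np.sum(basis * basis) * cell_area
    return numerator / norm

BASIS = {
    "a": lambda phi, psi: np.ones_like(phi),
    "b2": lambda phi, psi: np.cos(2 * phi),
    "d1": lambda phi, psi: np.cos(psi),
    "d2": lambda phi, psi: np.cos(2 * psi),
    "f11": lambda phi, psi: np.cos(phi) * np.cos(psi),
}

def coefficients(bin_deg):
    """Return the projected coefficients for the basis functions listed in BASIS."""
    return {name: project(energy, fn, bin_deg) for name, fn in BASIS.items()}
```

For a purely trigonometric energy such as this one, both bin sizes recover the input coefficients to rounding error, and the cos 4ψ term does not leak into the retained coefficients. Differences between the 1° and 10° bins appear only for non-trigonometric functions such as the modified force fields.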
In Fig. 2 we compare the six original backbone torsion-energy surfaces with those
of the corresponding double Fourier series in Eq. (39). Hereafter, the primed labels
for figures such as (a′) indicate that the results are those of the double Fourier series.
As can be seen from Fig. 2, the backbone torsion-energy surfaces are in complete
agreement for all force fields except for AMBER parm94, for which we see a slight
difference between Fig. 2a and a′. As discussed above, this slight
difference for AMBER parm94 reflects the fact that the cos 4ψ term in Eq. (8) is
missing in the truncated double Fourier series in Eq. (39).
Optimizations of Protein Force Fields 219

Table 4 Fourier coefficients in Eq. (39) obtained from the numerical evaluations of the integrals in
Eq. (15). “org94” stands for the original AMBER parm94 force field. “mod94(α)” and “mod94(β)”
stand for AMBER parm94 force fields that were modified to enhance α-helix structures and β-sheet
structures, respectively, by Eqs. (17) and (18). The bin size ε̃ is the length of the sides of each unit
square cell for the numerical integration in Eq. (15)

               Bin size ε̃ = 1°                      Bin size ε̃ = 10°
Coefficient    org94       mod94(α)    mod94(β)    org94       mod94(α)    mod94(β)
a               2.700000    2.308359    1.916719    2.700000    2.308370    1.916742
b1              0.000000   −0.330937    0.781150    0.000000   −0.331053    0.781041
c1              0.000000    0.509599    0.930938    0.000000    0.509517    0.930809
b2             −0.200000   −0.101549   −0.115937   −0.200000   −0.101513   −0.115970
c2              0.000000    0.221123   −0.476745    0.000000    0.221100   −0.476558
b3              0.000000   −0.018073    0.031693    0.000000   −0.018084    0.031714
c3              0.000000   −0.002862   −0.018298    0.000000   −0.003036   −0.018310
d1             −0.750000   −1.164401   −0.052959   −0.750000   −1.164500   −0.052874
e1              0.000000    0.444390   −0.995478    0.000000    0.444289   −0.995599
d2             −1.350000   −1.333115   −1.184428   −1.350000   −1.333073   −1.184340
e2              0.000000    0.241460    0.454905    0.000000    0.241451    0.455147
d3              0.000000   −0.014220    0.035349    0.000000   −0.014143    0.035324
e3              0.000000   −0.011515    0.009472    0.000000   −0.011671    0.009465
f11             0.000000   −0.342789   −0.680493    0.000000   −0.343087   −0.680497
g11             0.000000    0.367596    0.971845    0.000000    0.367697    0.971851
h11             0.000000    0.527849   −0.810980    0.000000    0.527949   −0.810985
i11             0.000000   −0.566049    1.158199    0.000000   −0.565751    1.158206
f21             0.000000    0.090016   −0.064642    0.000000    0.090168   −0.064636
g21             0.000000   −0.096530    0.092318    0.000000   −0.096472    0.092309
h21             0.000000    0.202178    0.366601    0.000000    0.202421    0.366565
i21             0.000000   −0.216810   −0.523561    0.000000   −0.216596   −0.523509
f12             0.000000    0.012329   −0.142682    0.000000    0.012385   −0.142712
g12             0.000000    0.176308   −0.392017    0.000000    0.176622   −0.392098
h12             0.000000   −0.018984   −0.170042    0.000000   −0.019013   −0.170077
i12             0.000000   −0.271490   −0.467187    0.000000   −0.271321   −0.467284
f22             0.000000   −0.000586   −0.002453   −0.000001   −0.000585   −0.002451
g22             0.000000   −0.008378   −0.006738    0.000000   −0.008397   −0.006733
h22             0.000000   −0.001316    0.013909    0.000000   −0.001317    0.013897
i22             0.000000   −0.018817    0.038215    0.000000   −0.018867    0.038183

We now consider the double Fourier series of non-trigonometric functions. The
functions are those in Eqs. (17) and (18). We try to fine-tune the six original force
fields by subtracting f (φ, ψ) in Eq. (18) from the original functions. The criterion
for fine-tuning is, for instance, whether the refined force fields yield better agreement
of the secondary-structure-forming tendencies with experimental implications than
the original ones. For this we need good experimental data. Because the purpose here
is to test whether or not we can control the secondary-structure-forming tendencies,
we simply consider extreme cases where we try to modify the existing force fields
so that desired secondary structures may be obtained regardless of the tendencies of
the original force fields. Note that the six original force fields have quite different
preferences for α-helix and β-sheet structures [23–27].
The function f(φ, ψ) in Eq. (18) reduces the value of E(φ, ψ) in a circle of radius
r0 with the center located at (φ0, ψ0). We used r̃0 = 100° and B̃ = 5000 (degrees)².
The coefficient A is calculated by Eq. (18) from the other parameters f(φ̃0, ψ̃0), r̃0,
and B̃. Namely, we have

$$
A = f(\tilde\phi_0, \tilde\psi_0)\, \exp\!\left(\frac{\tilde B}{\tilde r_0^{\,2}}\right). \tag{40}
$$

Fig. 2 Backbone-torsion-energy surfaces of six force fields. The backbone dihedral angles φ̃ and
ψ̃ are in degrees. a, b, c, d, e, and f are those of the original AMBER parm94, the original AMBER
parm96, the original AMBER parm99, the original CHARMM 27, the original OPLS-AA, and the
original OPLS-AA/L, respectively. a′–f′ are those of a–f, respectively, that were expressed by the
truncated double Fourier series in Eq. (39). The contour lines are drawn every 0.5 kcal/mol

We used (φ̃0 , ψ̃0 ) = (−57◦ , −47◦ ) and (φ̃0 , ψ̃0 ) = (−130◦ , 125◦ ) in order to
enhance α-helix-forming tendency and β-sheet-forming tendency, respectively. The
central values f (φ̃0 , ψ̃0 ) that we used were 3.0 and 6.0 kcal/mol for enhancing α-
helix and β-sheet, respectively, in the case of AMBER parm94, AMBER parm99,
CHARMM27, and OPLS-AA/L. They were both 3.0 kcal/mol in the case of AMBER
parm96 and OPLS-AA.
We remark that the large value of f (φ̃0 , ψ̃0 ), 6.0 kcal/mol, that was necessary to
enhance β-sheet in the case of AMBER parm94, AMBER parm99, CHARMM27,
and OPLS-AA/L reflects the fact that their original force fields favor α-helix.
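
To make the size of the coefficient A concrete, the short sketch below evaluates it for the two central values quoted above. It assumes that Eq. (40) reads A = f(φ̃0, ψ̃0) exp(B̃/r̃0²), which is the dimensionless combination of the stated parameters. The function name `amplitude` is hypothetical.

```python
import math

def amplitude(central_value, r0_deg=100.0, b_deg2=5000.0):
    """Coefficient A (kcal/mol) under the assumed form A = f0 * exp(B / r0**2)."""
    return central_value * math.exp(b_deg2 / r0_deg ** 2)

# Central values f0 of 3.0 and 6.0 kcal/mol, as used for the alpha-helix and beta-sheet cases.
a_helix = amplitude(3.0)
a_sheet = amplitude(6.0)
```

With r̃0 = 100° and B̃ = 5000 (degrees)², the exponent is 0.5, so A is about 1.65 times the central value in both cases.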
In Fig. 3a1–f1 we compare the six backbone torsion-energy surfaces modified
according to Eq. (17), which reduced the torsion energy in the α-helix region, with
those of the corresponding double Fourier series in Eq. (39). In Fig. 3a1–f1, α-helix
is enhanced from the original AMBER parm94 (a1), AMBER parm96 (b1), AMBER
parm99 (c1), CHARMM27 (d1), OPLS-AA (e1), and OPLS-AA/L (f1). In Fig. 4a1–
f1 we show the case of the β-sheet region, and β-sheet is enhanced from the original
AMBER parm94 (a1), AMBER parm96 (b1), AMBER parm99 (c1), CHARMM27
(d1), OPLS-AA (e1), and OPLS-AA/L (f1).

Fig. 3 Backbone-torsion-energy surfaces of six force fields that were modified by Eqs. (17),
(18) and (39). a1–f1 are those of the AMBER parm94, AMBER parm96, AMBER parm99,
CHARMM 27, OPLS-AA, and OPLS-AA/L force fields, respectively, that were modified to enhance
α-helix structures. a1′–f1′ are those of the AMBER parm94, AMBER parm96, AMBER
parm99, CHARMM 27, OPLS-AA, and OPLS-AA/L force fields, respectively, that were expanded by
the truncated double Fourier series in Eq. (39)

These modified backbone torsion-energy functions were expanded by the truncated
double Fourier series in Eq. (39) by evaluating the corresponding Fourier coefficients
from Eq. (15). For the numerical integration we again tried two values of the
bin size ε̃ (1° and 10°). The obtained Fourier coefficients are summarized in Table 4,
for example, in the case of AMBER parm94. For comparison, the Fourier coefficients
of the original AMBER parm94 force field (before modifications) are also listed. We
see that the two choices of the bin size ε̃ gave essentially the same results (agreeing
to about three digits).
In Figs. 3a1′–f1′ and 4a1′–f1′ we show the backbone torsion-energy surfaces
represented by the truncated double Fourier series. Comparing these with the original
ones in Figs. 3a1–f1 and 4a1–f1, we find that the overall features of the energy
surfaces are well reproduced by the Fourier series. If more accuracy is desired, we can
simply increase the number of Fourier terms in the expansion. As we will see below,
the present accuracy of the Fourier series was sufficient for the purpose of controlling
the secondary-structure-forming tendencies towards α-helix or β-sheet.

Fig. 4 Backbone-torsion-energy surfaces of six force fields that were modified by Eqs. (17),
(18) and (39). a1–f1 are those of the AMBER parm94, AMBER parm96, AMBER parm99,
CHARMM 27, OPLS-AA, and OPLS-AA/L force fields, respectively, that were modified to enhance
β-sheet structures. a1′–f1′ are those of the AMBER parm94, AMBER parm96, AMBER
parm99, CHARMM 27, OPLS-AA, and OPLS-AA/L force fields, respectively, that were expanded by
the truncated double Fourier series in Eq. (39)

We examined the effects of the above modifications of the backbone torsion-energy
terms in AMBER parm94, AMBER parm96, AMBER parm99, CHARMM27,
OPLS-AA, and OPLS-AA/L (towards specific secondary structures) by performing
folding simulations of two peptides, namely, the C-peptide of ribonuclease A and the
C-terminal fragment of the B1 domain of streptococcal protein G, which is sometimes
referred to as G-peptide [47]. The C-peptide has 13 residues and its amino-acid
sequence is Lys-Glu-Thr-Ala-Ala-Ala-Lys-Phe-Glu-Arg-Gln-His-Met. This peptide
has been extensively studied by experiments and is known to form an α-helix
structure [48, 49], as shown in Fig. 5a. Because the charges at peptide termini are known to
affect helix stability [48, 49], we blocked the termini by a neutral COCH3– group and
a neutral –NH2 group. The G-peptide has 16 residues and its amino-acid sequence is
Gly-Glu-Trp-Thr-Tyr-Asp-Asp-Ala-Thr-Lys-Thr-Phe-Thr-Val-Thr-Glu. The termini
were kept in the usual zwitterionic states, following the experimental conditions [47,
50, 51]. This peptide is known from experiments to form a β-hairpin structure [47, 50,
51], as shown in Fig. 5b.

Fig. 5 The structures of C-peptide (a) and G-peptide (b) obtained from the experimental results
(PDB IDs are a 1A5P and b 1PGA). The figures were created with DS Visualizer v1.5 [52]

Simulated annealing [44] MD simulations were performed for both peptides from
fully extended initial conformations, where the 12 versions of the truncated double
Fourier series (which were described in Table 4 and in Figs. 3a1′–f1′ and 4a1′–f1′)
were used for the backbone torsion-energy terms of the AMBER parm94, AMBER
parm96, AMBER parm99, CHARMM27, OPLS-AA, and OPLS-AA/L force fields.
For comparison, the simulations with the original force fields were also performed.
The unit time step was set to 1.0 fs. Each simulation was carried out for 1 ns (hence,
it consisted of 1,000,000 MD steps). The temperature during MD simulations was
controlled by Berendsen’s method [53]. For each run the temperature was decreased
exponentially from 2000 to 250 K. We modified and used the program package
TINKER version 4.1 [54] for all the simulations. As for solvent effects, we used
the GB/SA model [42, 43] included in the TINKER program package. For both
peptides, these folding simulations were repeated 60 times with different sets of
randomly generated initial velocities.
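
An exponential cooling schedule of the kind described above can be written in one line. The sketch below assumes the common form T(n) = T_start (T_end/T_start)^(n/N) for MD step n out of N. The exact schedule used in the modified TINKER code is not given in the text, and the function name is hypothetical.

```python
def annealing_temperature(step, n_steps=1_000_000, t_start=2000.0, t_end=250.0):
    """Target temperature (K) at a given MD step for an assumed exponential schedule."""
    fraction = step / n_steps
    return t_start * (t_end / t_start) ** fraction

# Target temperatures at the start, the midpoint, and the end of a 1 ns run (1.0 fs time step).
t_first = annealing_temperature(0)
t_middle = annealing_temperature(500_000)
t_last = annealing_temperature(1_000_000)
```

Under this form the midpoint temperature is the geometric mean of the two end temperatures, so the run spends most of its time well below the starting temperature.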
In Fig. 6, we show seven (out of 60) lowest-energy final conformations of C-
peptide and G-peptide obtained by the simulated annealing MD simulations, for
example, in the case of AMBER parm94.
In the figure, we see that all conformations of the original AMBER parm94 (except
for conformations 2 and 4 of G-peptide) and all conformations of its force field
modified towards α-helix are α-helix structures (conformations 2 and 4 are
$3_{10}$-helix structures). The results show that the original AMBER parm94 favors α-helix
structures, and moreover, its force field modified towards α-helix favors α-helix
structures more than the original force field in the sense that the obtained helices are
more extended (and almost entirely helical). On the other hand, AMBER parm94
modified towards β-sheet favors β structures strongly. The results for other force
fields were similar.
Therefore, regardless of the secondary-structure-forming tendencies of the orig-
inal force fields, our modifications of the backbone torsion-energy term succeeded
in enhancing the desired secondary structures.

3.1.2 Amino-Acid-Dependent Main-Chain Torsion-Energy Terms [40]


We present the results of our optimizations of the force-field parameters
$V_1(\Phi_{\rm MC}^{(k)})$ for the main-chain angles $\Phi_{\rm MC}^{(k)} = \psi^{(k)}$ (N–Cα–C–N)
and $\psi'^{(k)}$ (Cβ–Cα–C–N) in Eq. (21). We did this for the case of the AMBER ff03
force field. We determined these $V_1(\Phi_{\rm MC}^{(k)})$ values for the 19 amino-acid
residues except for proline.

Fig. 6 Seven lowest-energy final conformations of C-peptide (a–a″) and G-peptide (b–b″) obtained
from six sets of 60 simulated annealing MD runs. a and b are the results of the original AMBER
parm94. a′ and b′ are the results of AMBER parm94 with the truncated double Fourier series
that was modified to enhance α-helix structures. a″ and b″ are the results of AMBER
parm94 with the truncated double Fourier series that was modified to enhance
β-sheet structures. The conformations are ordered in increasing order of energy for each case.
The figures were created with DS Visualizer v1.5 [55]

At first, we chose 100 PDB files with resolution 2.0 Å or better, with amino-acid sequence
similarity of 30.0 % or lower, and with fewer than 200 residues (the average
number of residues is 117.0) from PDB-REPRDB [56]. We selected the number of
each fold (all α, all β, α/β, and α + β) in the 100 proteins based on the number
of folds given by SCOP (version 1.73 in November 2007) [65]. Namely, we used
29 all α, 18 all β, 16 α/β, and 37 α + β proteins (see Table 5 and Fig. 7). We
then refined these selected 100 structures. We added hydrogen atoms to the PDB
coordinates by using the AMBER11 program package [57]. We then minimized the
total potential energy $E_{\rm total} = E_{\rm conf} + E_{\rm solv} + E_{\rm constr}$ with respect to the
coordinates for each protein conformation, where $E_{\rm constr}$ is the harmonic constraint
energy term ($E_{\rm constr} = \sum_{\rm heavy\ atom} K_x (\mathbf{x} - \mathbf{x}_0)^2$), and $E_{\rm solv}$ is the
solvation energy term. Here, $K_x$ is the force constant of the restriction and $\mathbf{x}_0$ are
the original coordinate vectors of heavy atoms in PDB. As one can see from $E_{\rm constr}$,
the coordinates of hydrogen atoms and unnaturally displaced heavy atoms will be mainly
adjusted as described above. We performed this minimization for all the 100 protein
structures separately and obtained 100 refined structures by using $K_x$ = 100 (kcal/mol).
As for the solvation energy term $E_{\rm solv}$, we used the GB/SA solvent included in the
AMBER program package (igb = 5 and gbsa = 1) [58, 59].

Table 5 100 proteins used in the optimization of force-field parameters

Fold    PDB ID  Chain      PDB ID  Chain      PDB ID  Chain      PDB ID  Chain
All α   1DLW    A          1N1J    B          1U84    A          1HBK    A
        1TX4    A          1V54    E          1SK7    A          1TQG    A
        1V74    B          1DVO    A          1HFE    S          1J0P    A
        1Y02    A71-114    1IJY    A          1I2T    A          1G8E    A
        1VKE    C          1FS1    A109-149   1D9C    A          1AIL    A
        1Q5Z    A          1T8K    A          1OR7    C          1NG6    A
        1C75    A          2LIS    A          1NH2    B          1Q2H    A
        1NKP    A
All β   1XAK    A          1T2W    A          1GMU    C1-70      1AYO    A
        1PK6    A          1NLQ    C          1BEH    A          1UA8    A
        1UXZ    A          1UB4    C          1LGP    A          1CQY    A
        1PM4    A          1OU8    A          1V76    A          1UT7    B
        1OA8    D          1IFG    A
α/β     1IO0    A          1U7P    A          1JKE    C          1MXI    A
        1LY1    A          1NRZ    A          1IM5    A          1VC1    A
        1OGD    A          1IIB    A          1PYO    D          1MUG    A
        1H75    A          1K66    A          1COZ    A          1D4O    A
α+β     1VCC    A          1PP0    B          1PZ4    A          1TU1    A
        1Q2Y    A          1M4J    A          1N9L    A          1LQV    B
        1A3A    A          1K2E    A          1TT8    A          1HUF    A
        1SXR    A          1CYO    A          1KAF    A          1ID0    A
        1UCD    A          1F46    B          1KPF    A          1BYR    A
        1Y60    D          1SEI    A          1RL6    A          1WM3    A
        1FTH    A          1APY    B          1JID    A          1N13    E
        1LTS    C          1JYO    F          1E87    A          1UGI    A
        1MWP    A          1PCF    A          1MBY    A          1IHR    B
        1H6H    A

Fig. 7 Structures of 100 proteins in Table 5 which were used in the optimization of force-field
parameters
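
The harmonic constraint term used in this refinement acts only on heavy atoms, so hydrogen atoms are free to relax. The sketch below evaluates such a term with NumPy. It is an illustration only: the function name, the array layout, and the boolean heavy-atom mask are assumptions, and the actual minimization was carried out inside AMBER11.

```python
import numpy as np

def constraint_energy(coords, reference, heavy_mask, k_x=100.0):
    """Sum of k_x * |x - x0|^2 over heavy atoms; hydrogens contribute nothing."""
    displacement = coords[heavy_mask] - reference[heavy_mask]
    return k_x * float(np.sum(displacement ** 2))

# Toy system: two heavy atoms and one hydrogen (coordinates in angstroms).
reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.0, 0.0]])
heavy_mask = np.array([True, True, False])
moved = reference.copy()
moved[1, 0] += 0.1   # heavy atom displaced by 0.1 angstrom
moved[2, 1] += 1.0   # hydrogen displaced by 1.0 angstrom, not penalized
e_constr = constraint_energy(moved, reference, heavy_mask)
```

In this toy case only the 0.1 Å heavy-atom displacement is penalized, which is why hydrogens and badly placed heavy atoms absorb most of the adjustment during minimization.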
For these refined protein structures, we performed the optimization of the force-field
parameters $V_1^{(k)}$ of the ψ and ψ′ angles for the AMBER ff03 force field by using the
function F in Eq. (23), evaluated with the total potential energy function
($E_{\rm total} = E_{\rm conf} + E_{\rm solv}$), for the Monte Carlo simulations in the parameter space.
Here, we used AMBER11 [57] for the force calculations in Eq. (24). In principle, the 37
parameters (see Table 6) should be optimized simultaneously by simulations in the
37-dimensional parameter space. However, here, for simplicity, we just optimized two
parameters, $V_1(\psi^{(k)})$ and $V_1(\psi'^{(k)})$, for each
amino-acid residue k separately, keeping the other V1 values at their original values.
In order to obtain the optimal parameters, we performed Monte Carlo simulations of
two parameters (V1 of ψ and ψ′) for the 19 amino-acid residues except for proline.
In Table 6, the optimized parameters are listed.
In order to test the validity of the force-field parameters obtained by our opti-
mization method, we performed the folding simulations using two peptides, namely,
C-peptide and G-peptide.

Table 6 Optimized V1/2 parameters for the main-chain dihedral angles ψ and ψ′ for the 19
amino-acid residues (except for proline) in Eq. (21). The rest of the parameters are taken to be the
same as in the original ff03 force field. The original amino-acid-independent values are also listed
for reference

                 ψ (N–Cα–C–N)    ψ′ (Cβ–Cα–C–N)
original ff03     0.6839          0.7784
Ala               0.122           0.150
Arg               0.409           0.200
Asn              −0.074          −0.162
Asp              −0.137           0.182
Cys               0.361           0.089
Gln               0.144          −0.024
Glu               0.180           0.152
Gly               0.258           –
His               0.020           0.237
Ile               0.643           0.194
Leu               0.382           0.257
Lys               0.222           0.042
Met               0.141           0.346
Phe              −0.010           0.553
Ser              −0.248           0.475
Thr               0.512           0.328
Trp               0.027           0.477
Tyr               0.082           0.652
Val               0.142           0.590

For the folding simulations, we used replica-exchange molecular dynamics
(REMD) [60]. REMD is one of the generalized-ensemble algorithms and achieves high
conformational sampling efficiency by allowing configurations to heat up and cool
down while maintaining proper Boltzmann distributions. We again used the AMBER11
program package [57]. The unit time step was set to 2.0 fs, and the bonds
involving hydrogen atoms were constrained by the SHAKE algorithm [61]. Each
simulation was carried out for 30.0 ns (hence, it consisted of 15,000,000 MD steps) with
16 replicas by using Langevin dynamics. The exchange procedure for each replica
was performed every 3000 MD steps. The temperatures were distributed exponentially:
650, 612, 577, 544, 512, 483, 455, 428, 404, 380, 358, 338, 318, 300, 282, and
266 K. As for solvent effects, we used the GB/SA model in the AMBER program
package (igb = 5 and gbsa = 1) [58, 59]. The initial conformations for each peptide
were fully extended ones for all the replicas. The REMD simulations were performed
with different sets of randomly generated initial velocities for each replica.
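
The exponentially distributed temperatures listed above are consistent with a geometric ladder between 650 K and 266 K over 16 replicas. The sketch below regenerates such a ladder and the basic step counts. The function name is hypothetical, and the geometric form is an assumption that agrees with the listed values to within rounding to the nearest kelvin.

```python
def temperature_ladder(t_max=650.0, t_min=266.0, n_replicas=16):
    """Geometric temperature ladder: consecutive temperatures share a constant ratio."""
    ratio = (t_min / t_max) ** (1.0 / (n_replicas - 1))
    return [t_max * ratio ** i for i in range(n_replicas)]

ladder = temperature_ladder()

# Step bookkeeping for a 30.0 ns run with a 2.0 fs time step (integer femtoseconds).
total_steps = 30_000_000 // 2
exchange_attempts = total_steps // 3000

# Temperatures quoted in the text, for comparison.
listed = [650, 612, 577, 544, 512, 483, 455, 428, 404, 380, 358, 338, 318, 300, 282, 266]
```

A constant ratio between neighboring temperatures is a common choice because it keeps the exchange acceptance roughly uniform across the ladder when the heat capacity varies slowly with temperature.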

Fig. 8 α-helicity (a-1) and β-strandness (a-2) of C-peptide and α-helicity (b-1) and β-strandness
(b-2) of G-peptide as functions of the residue number at 300 K. These values were obtained from
the REMD simulations. Normal and dotted curves stand for the optimized and the original AMBER
ff03 force fields, respectively

In Fig. 8, the α-helicity and β-strandness of the two peptides obtained from the REMD
simulations are shown. We checked the secondary-structure formations by using
the DSSP program [45], which is based on the formations of the intra-main-chain
hydrogen bonds. As is shown in Fig. 8, for the original AMBER ff03 force field,
the α-helicity is clearly higher than the β-strandness not only in C-peptide but also
in G-peptide. Namely, the original AMBER ff03 force field clearly favors α-helix
and does not favor β-structure. On the other hand, for the optimized force field,
in the case of C-peptide, the α-helicity is higher than the β-strandness, and in the
case of G-peptide, the β-strandness is higher than the α-helicity. We conclude that
these results obtained from the optimized force field are in better agreement with the
experimental results than those from the original force field. In Fig. 9, the
$3_{10}$-helicity and π-helicity of the two peptides obtained from the REMD simulations
are shown. For the $3_{10}$-helicity, there is no large difference between the two force
fields in C-peptide, and in the case of G-peptide, the value for the optimized force
field is slightly lower than that for the original force field. The π-helicity is almost
zero for both the original and optimized force fields in the two peptides.

Fig. 9 $3_{10}$-helicity (a-1) and π-helicity (a-2) of C-peptide and $3_{10}$-helicity (b-1) and π-helicity
(b-2) of G-peptide as functions of the residue number at 300 K. These values were obtained from
the REMD simulations. Normal and dotted curves stand for the optimized and the original AMBER
ff03 force fields, respectively

In Fig. 10, the α-helicity and β-strandness as functions of temperature for the two
peptides obtained from the REMD simulations are shown. For α-helicity, the values
of both force fields decrease gradually from low temperature to high temperature
in the case of C-peptide. On the other hand, in the case of G-peptide, there are
small peaks at around 300 and 358 K for the original and optimized force fields,
respectively. For β-strandness, in the case of C-peptide, it is almost zero for both
force fields. In the case of G-peptide, for the optimized force field, there is clearly a
peak around 300 K.

Fig. 10 α-helicity (a-1) and β-strandness (a-2) of C-peptide and α-helicity (b-1) and β-strandness
(b-2) of G-peptide as functions of temperature. These values were obtained from the REMD
simulations. Normal and dotted curves stand for the optimized and the original AMBER ff03 force
fields, respectively

3.2 Optimization of Force-Field Parameters

3.2.1 Use of Force Acting on Each Atom in the PDB Coordinates [25–27, 41]

We now present the results of our force-field optimizations. In Step 1 of the flowchart
in Fig. 1, we chose 100 PDB files (N = 100) from X-ray experiments with resolution
1.8 Å or better and with fewer than 200 residues (the average number of residues is
120.4) from PISCES [62]. Their PDB codes are 2LIS, 1EP0, 1TIF, 1EB6, 1C1L,
1CCW, 2PTH, 1I6W, 1DBF, 1KPF, 1LRI, 1AAP, 1C75, 1CC8, 1FK5, 1KQR, 1K1E,
1CZP, 1GP0, 1KOI, 1IQZ, 3EBX, 1I40, 1EJG, 1AMM, 1I07, 1GK8, 1GVP, 1M4I,
1EYV, 1E29, 1I2T, 1VCC, 1FM0, 1EXR, 1GUT, 1H4X, 1GBS, 1B0B, 119L, 1IFC,
1DLW, 1EAJ, 1GGZ, 1JR8, 1RB9, 1VAP, 1JZG, 1M55, 1EN2, 1C9O, 2ERL, 1EMV,
1F41, 1EW6, 2TNF, 1IFR, 1JSE, 1KAF, 1HZT, 1HQK, 1FXL, 1BKR, 1ID0, 1LQV,
1G2R, 1KR7, 1QTN, 1D4O, 1EAZ, 2CY3, 1UGI, 1IJV, 3VUB, 1BZP, 1JYR, 1DZK,
1QFT, 1UTG, 2CPG, 1I6W, 1C7K, 1I8O, 1LO7, 1LNI, 1EQO, 1NDD, 1HD2, 3PYP,
1FD3, 1DK8, 1WHI, 1FAZ, 4FGF, 2MHR, 1JB3, 2MCM, 1IGD, 1C5E, and 1JIG.
In Step 2 of the flowchart, we used the routine in the TINKER package to add
hydrogen atoms to the PDB coordinates. The force fields that we optimized are the
AMBER parm94 version [7], parm96 version [8], parm99 version [9], CHARMM
version 22 [12], and OPLS-AA [15]. We have optimized only two sets of parameters.
The first set is the partial-charge parameters [qi in Eqs. (5) and (27)]. In order to
simplify the constraint-imposing processes on the total charge, we did not optimize
the charge of one of the hydrogen atoms (HN) in proline when it is located at the
N-terminus. In the original X-ray data, hydrogen coordinates are missing, and in the
case of neutral histidine it is non-trivial to determine whether Nδ and Nε are
protonated or not. Because we want to deal with as many PDB data as possible, we treated
all the histidine residues as positively charged histidine for simplicity. Among the five
force fields, AMBER has the largest number of remaining partial-charge parameters
(602). We thus optimized these 602 parameters for all the five force fields. The second
set of parameters that we optimized is the backbone torsion-energy parameters [Va ,
Vb , and Vc in Eq. (30)] and there are six such parameters (three each for φ and ψ).
As explained in detail above, the coordinates of the 100 protein molecules have
been prepared (Steps 1 and 2 of the flowchart in Fig. 1). The coordinate refinement
in Step 3 of the flowchart was then carried out with the constraint in Eq. (29) on the
heavy atoms. As for the force constant K x in Eq. (29), we have some freedom for
the choice of the values. Our choice is: K x should be of the same order as K l in the
bond-stretching term in Eq. (2). The force constant K l in AMBER varies from 1662
to 656 kcal/mol/Å2 , and that in CHARMM varies from 1732 to 650 kcal/mol/Å2 .
Hence, in our first trial we set K x = 100 kcal/mol/Å2 .
In Step 4 of the flowchart, we performed the optimization of the 602 partial-charge
parameters by MC simulated annealing. Namely, we minimized F in Eq. (23) by MC
simulated annealing simulations of these parameters (the parameters were updated
and the updates were accepted or rejected according to the Metropolis criterion). For
this we introduced an effective “temperature” for the parameter space. The simulation
run consisted of 50,000 MC sweeps with the temperature decreased exponentially
from 20 to 0.01. The simulation was repeated 10 times with different initial random
numbers. We found that F decreased quickly in the beginning until about 5000 MC
sweeps and then it decreased very slowly for all force fields; the total number of MC
sweeps (50,000) seemed sufficient. The optimized partial charges were taken from
those that resulted in the lowest F value.
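
The parameter-space annealing in Step 4 follows the usual Metropolis pattern: propose a change to one parameter, accept or reject it according to the change in F, and lower the effective temperature exponentially. The sketch below shows that pattern on a toy problem. The quadratic `objective` merely stands in for F in Eq. (23), the step size and the target values are arbitrary, and the total-charge constraints are omitted. The sweep count is reduced for brevity, while the temperature range follows the text.

```python
import math
import random

def objective(params, target):
    """Toy stand-in for F: squared distance from a known target parameter set."""
    return sum((p - t) ** 2 for p, t in zip(params, target))

def anneal(target, n_sweeps=3000, t_start=20.0, t_end=0.01, step=0.1, seed=0):
    """Metropolis simulated annealing in parameter space; returns the best set found."""
    rng = random.Random(seed)
    params = [0.0] * len(target)
    current = objective(params, target)
    best_params, best_value = list(params), current
    for sweep in range(n_sweeps):
        # Effective temperature decreases exponentially from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (sweep / (n_sweeps - 1))
        for i in range(len(params)):  # one sweep = one trial update per parameter
            old = params[i]
            params[i] = old + rng.gauss(0.0, step)
            trial = objective(params, target)
            delta = trial - current
            if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
                current = trial
                if current < best_value:
                    best_params, best_value = list(params), current
            else:
                params[i] = old  # reject: restore the previous value
    return best_params, best_value

target = [1.0, -2.0, 0.5, 3.0, -1.0]
best_params, best_value = anneal(target)
```

Keeping the lowest value seen, rather than the final state, mirrors the choice described above of taking the parameters from the run that gave the lowest F.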
In Tables 7, 8 and 9, three examples (glycine, alanine, and glutamic acid) of the
obtained partial charges together with the original force-field values are listed. We
see from these tables that the values of the partial charges have not changed much.
Although the sign of the partial charges remains the same for those with large magni-
tude, charges with small magnitude sometimes change their signs (see, for example,
CA of glycine and CG of glutamic acid).
In Step 5 of the flowchart, the original coordinates obtained in Step 2 were again
refined with the constraints in Eq. (29), but this time the optimized parameters from
Step 4 were used. This time we used the value K x = 500 kcal/mol/Å2 . For all force
fields, the average RMSD of the 100 proteins was 0.012 Å, and the coordinates of
heavy atoms had changed little.

Table 7 Partial-charge parameters of glycine. AMB, CHA, and OPLS respectively stand for the
original AMBER, CHARMM version 22, and OPLS-AA force fields. Opt(94), Opt(96), Opt(99),
Opt(CH), and Opt(OP) are the optimized AMBER parm94, AMBER parm96, AMBER parm99,
CHARMM version 22, and OPLS-AA, respectively

Atom   AMB      Opt(94)  Opt(96)  Opt(99)  CHA      Opt(CH)  OPLS     Opt(OP)
N      −0.4157  −0.3471  −0.3614  −0.3506  −0.4700  −0.4381  −0.5000  −0.5153
CA     −0.0252   0.0175   0.0148   0.0166  −0.0200   0.0185   0.0800   0.0909
C       0.5973   0.5526   0.5698   0.5577   0.5100   0.5309   0.5000   0.6459
HN      0.2719   0.2492   0.2509   0.2480   0.3100   0.3004   0.3000   0.2615
O      −0.5679  −0.5980  −0.5977  −0.5983  −0.5100  −0.5491  −0.5000  −0.5546
HA      0.0698   0.0629   0.0618   0.0633   0.0900   0.0687   0.0600   0.0358
Total   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

Table 8 Partial-charge parameters of alanine. See the caption in Table 7

Atom   AMB      Opt(94)  Opt(96)  Opt(99)  CHA      Opt(CH)  OPLS     Opt(OP)
N      −0.4157  −0.3354  −0.3483  −0.3407  −0.4700  −0.3909  −0.5000  −0.5224
CA      0.0337   0.0545   0.0547   0.0511   0.0700   0.0427   0.1400   0.1301
C       0.5973   0.5141   0.5240   0.5235   0.5100   0.5215   0.5000   0.6687
HN      0.2719   0.2323   0.2346   0.2317   0.3100   0.2709   0.3000   0.2610
O      −0.5679  −0.5703  −0.5599  −0.5778  −0.5100  −0.5417  −0.5000  −0.5567
HA      0.0823   0.0901   0.0912   0.0900   0.0900   0.0741   0.0600   0.0786
CB     −0.1825  −0.0453  −0.0470  −0.0501  −0.2700  −0.2718  −0.1800  −0.0701
HB      0.0603   0.0200   0.0169   0.0241   0.0900   0.0984   0.0600   0.0036
Total   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

Table 9 Partial-charge parameters of glutamic acid. See the caption in Table 7

Atom   AMB      Opt(94)  Opt(96)  Opt(99)  CHA      Opt(CH)  OPLS     Opt(OP)
N      −0.5163  −0.4248  −0.4376  −0.4302  −0.4700  −0.3961  −0.5000  −0.5401
CA      0.0397   0.0583   0.0553   0.0554   0.0700   0.0423   0.1400   0.1320
C       0.5366   0.4728   0.4873   0.4817   0.5100   0.5249   0.5000   0.6538
HN      0.2936   0.2595   0.2620   0.2590   0.3100   0.2845   0.3000   0.2626
O      −0.5819  −0.6181  −0.6107  −0.6248  −0.5100  −0.5603  −0.5000  −0.5777
HA      0.1105   0.1232   0.1232   0.1221   0.0900   0.0837   0.0600   0.0670
CB      0.0560   0.1226   0.1170   0.1217  −0.1800  −0.1634  −0.1200  −0.0517
HB     −0.0173  −0.0333  −0.0334  −0.0300   0.0900   0.0943   0.0600   0.0418
CG      0.0136  −0.0678  −0.0716  −0.0659  −0.2800  −0.2870  −0.2200  −0.2185
HG     −0.0425  −0.0300  −0.0297  −0.0299   0.0900   0.1160   0.0600   0.0437
CD      0.8054   0.8293   0.8340   0.8292   0.6200   0.5465   0.7000   0.7320
OE     −0.8188  −0.8142  −0.8163  −0.8142  −0.7600  −0.7479  −0.8000  −0.8152
Total  −1.0000  −1.0000  −1.0000  −1.0000  −1.0000  −1.0000  −1.0000  −1.0000

In Step 6 of the flowchart, we carried out the optimization of the six torsion-energy
parameters (Va, Vb, and Vc in Eq. (30) for both φ and ψ) by minimizing F
in Eq. (23) with MC simulated annealing simulations in this parameter space. The
simulation run consisted of 10,000 MC sweeps with the temperature decreasing from
1000 to 1.0. The simulation was repeated six times with different random numbers.
We stopped after six trials because the convergence was very good. The optimized
torsion-energy parameters were taken from those that resulted in the lowest F value.
The obtained torsion-energy parameters are listed in Tables 10 and 11.
In the present work, we stopped our process in Step 6 of the flowchart and did not
iterate the optimizations.
In order to examine how much the torsion-energy terms have changed after opti-
mizations, we depict them in Fig. 11 (we remark that the error of factor 2 in the
ordinate of Fig. 5e1 in Ref. [26] is corrected here). Although the behaviors of the
original force fields are quite different, those of the optimized force fields are rather
similar. For example, the optimized torsion-energy curves for φ angles have two max-
imum peaks around φ ∼ −60◦ and +60◦ and a local minimum at φ = 0◦ , while those
for ψ angle have two peaks around ψ ∼ −100◦ and +100◦ and a local minimum at
ψ = 0◦ (the exceptions are those for CHARMM version 22 and OPLS-AA, which
give the global maximum and a local maximum, respectively, at ψ = 0◦ ). These

Table 10 Torsion parameters of φ angle. Parm94, Parm96, Parm99, CHARMM, and OPLS are
AMBER parm94, AMBER parm96, AMBER parm99, CHARMM version 22, and OPLS-AA force
fields, respectively. “Optimized” stands for the corresponding optimized force field
Force field Va na γa Vb nb γb Vc nc γc
Parm94 0.200 2 180.0 – – – – – –
Optimized 0.191 1 0.0 0.146 2 180.0 −0.223 3 0.0
Parm96 0.850 1 0.0 0.300 2 180.0 – – –
Optimized 1.182 1 0.0 0.359 2 180.0 −0.410 3 0.0
Parm99 0.800 1 0.0 0.850 2 180.0 – – –
Optimized 1.380 1 0.0 0.599 2 180.0 −0.330 3 0.0
CHARMM 0.200 1 180.0 – – – – – –
Optimized −0.047 1 180.0 0.240 2 180.0 −0.015 3 0.0
OPLS −2.365 1 0.0 0.912 2 180.0 −0.850 3 0.0
Optimized 0.502 1 0.0 1.811 2 180.0 −0.567 3 0.0

Table 11 Torsion parameters of ψ angle. See the caption in Table 10


Force field Va na γa Vb nb γb Vc nc γc
Parm94 0.750 1 180.0 1.350 2 180.0 0.400 4 180.0
Optimized −0.368 1 180.0 1.658 2 180.0 0.265 4 180.0
Parm96 0.850 1 0.0 0.300 2 180.0 – – –
Optimized 0.039 1 0.0 1.011 2 180.0 0.104 3 0.0
Parm99 1.700 1 180.0 2.000 2 180.0 – – –
Optimized 0.228 1 180.0 1.684 2 180.0 −0.031 3 0.0
CHARMM 0.600 1 0.0 – – – – – –
Optimized 0.321 1 0.0 0.028 2 180.0 0.251 3 0.0
OPLS 1.816 1 0.0 1.222 2 180.0 1.581 3 0.0
Optimized 0.880 1 0.0 1.479 2 180.0 0.952 3 0.0
Optimizations of Protein Force Fields 235

[Fig. 11: panels (a1)–(e1) and (a2)–(e2); ordinate: energy (kcal/mol); abscissa: torsion angle from −180° to 180°]

Fig. 11 Backbone torsion-energy curves as functions of φ (in degrees) and ψ (in degrees). The
force fields are AMBER parm94 (a), AMBER parm96 (b), AMBER parm99 (c), CHARMM version
22 (d), and OPLS-AA (e). The results for the original force fields are represented by dotted curves,
and those for the optimized force fields are by solid curves
236 Y. Sakae and Y. Okamoto

results suggest that our optimizations of the torsion-energy term tend to converge
towards a common function. One remark is in order: the optimized CHARMM
parameters are the most distinct from the other optimized parameter sets, in the
sense that they give the global maximum at ψ = 0°, whereas the maxima for the
other cases lie around ψ ∼ −100° and +100°.
In Fig. 12 the potential-energy surfaces of the alanine dipeptide (ACE-ALA-NME)
are shown for the 10 force-field parameters: the original AMBER parm94, AMBER
parm96, AMBER parm99, CHARMM version 22, OPLS-AA, and the corresponding
optimized parameters. According to the ab initio quantum mechanical calculations,
there exist three local-minimum states on the energy surface [7]. They are the conformers
C7eq, C5, and C7ax, which correspond to (φ, ψ) ∼ (−80°, +80°), (−160°, +160°),
and (+75°, −60°), respectively (C7eq is the global-minimum state). We remark that
these are the results of quantum chemistry calculations in vacuum, so it is not
clear how reliably they represent the dipeptide in aqueous solution.
The results of all five original force fields in Fig. 12a1–e1 seem to satisfy the above
conditions. Namely, there are three local-minimum states at the locations of C7eq, C5,
and C7ax, and the global-minimum state is C7eq. As for the results of the optimized
force fields in Fig. 12a2–e2, those for CHARMM version 22 and OPLS-AA also
satisfy the above conditions. Those of the optimized AMBER force fields are less
consistent with the quantum mechanical calculations: C7eq is no longer the
global-minimum state, although it remains a local-minimum state. In particular, the
optimized AMBER parm99 seems to be in the greatest disagreement, in the sense that
the C7eq state has almost disappeared.
We now present another example of the refinement of our backbone torsion energy
in Eq. (14). We consider the following truncated Fourier series:

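\[
\begin{aligned}
E(\phi,\psi) = {} & a + b_1\cos\phi + c_1\sin\phi + b_2\cos 2\phi + c_2\sin 2\phi \\
& + d_1\cos\psi + e_1\sin\psi + d_2\cos 2\psi + e_2\sin 2\psi \\
& + f_{11}\cos\phi\cos\psi + g_{11}\cos\phi\sin\psi \\
& + h_{11}\sin\phi\cos\psi + i_{11}\sin\phi\sin\psi .
\end{aligned}
\tag{41}
\]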

This function has 13 Fourier-coefficient parameters. We will see below that this
number of Fourier terms is sufficient for most of our purposes [34, 35], but that
for some cases a larger number of Fourier terms is preferable.
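The truncated series of Eq. (41) is straightforward to evaluate numerically. The following is a minimal sketch, not the authors' code: the function and dictionary names are our own, the coefficient values are copied from the "Optimized" column of Table 12, and the energy unit is assumed to be kcal/mol as in the rest of the section.

```python
import math

# Coefficients copied from the "Optimized" column of Table 12.
# The dictionary name and layout are illustrative assumptions.
OPTIMIZED = {
    "a": 0.000, "b1": 0.835, "b2": -0.088, "c1": -0.327, "c2": 0.100,
    "d1": 0.287, "d2": 0.019, "e1": -0.160, "e2": -0.054,
    "f11": -0.427, "g11": 0.247, "h11": 0.114, "i11": 0.603,
}


def fourier_torsion_energy(phi_deg, psi_deg, c=OPTIMIZED):
    """Evaluate the 13-term double Fourier series of Eq. (41).

    Angles are given in degrees; the result is in the units of the
    coefficients (assumed here to be kcal/mol).
    """
    phi = math.radians(phi_deg)
    psi = math.radians(psi_deg)
    return (
        c["a"]
        + c["b1"] * math.cos(phi) + c["c1"] * math.sin(phi)
        + c["b2"] * math.cos(2 * phi) + c["c2"] * math.sin(2 * phi)
        + c["d1"] * math.cos(psi) + c["e1"] * math.sin(psi)
        + c["d2"] * math.cos(2 * psi) + c["e2"] * math.sin(2 * psi)
        + c["f11"] * math.cos(phi) * math.cos(psi)
        + c["g11"] * math.cos(phi) * math.sin(psi)
        + c["h11"] * math.sin(phi) * math.cos(psi)
        + c["i11"] * math.sin(phi) * math.sin(psi)
    )
```

At φ = ψ = 0 all sine terms vanish, so the value reduces to a + b1 + b2 + d1 + d2 + f11, which is a quick sanity check on any coefficient set.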
We optimized the force-field parameters of this double Fourier series by using our
optimization method. First, we chose 100 PDB files from PDB-REPRDB [56].
We added hydrogen atoms to the PDB coordinates by using the TINKER program
package [54].
In our optimization method, F in Eq. (23) was minimized by Monte Carlo (MC)
simulations in the space of the 13 backbone-torsion-energy parameters, with 3000
MC steps per run. The initial values of the 13 parameters were all set to zero.
For each f_cut value, we performed the MC optimization 10 times with different
seeds for the random numbers. After that, the parameter set with the minimum F value
was selected from the 10 parameter sets obtained for each f_cut

Fig. 12 Potential-energy surfaces of alanine dipeptide. The force fields are the original AMBER
parm94 (a1), AMBER parm96 (b1), AMBER parm99 (c1), CHARMM version 22 (d1), and OPLS-AA (e1),
and the corresponding optimized parameters (a2)–(e2). The contour maps were evaluated every 10°
of φ and ψ angles and plotted every 1 kcal/mol, after minimizing the total potential energy in
vacuum with the backbone structures fixed. The bluer the color is, the lower the potential energy
surface is. As the potential-energy value increases, the color changes from blue to green, to
yellow, and to red

value. The overall parameter distributions were essentially the same for the 10 runs.
The maximum f_cut value was taken to be $f_{\mathrm{cut}}^{\max} = 9.0$, which was
selected from the peak point in the distribution of the forces acting on each atom in the
100 protein structures (Fig. 13). For each of the parameter sets obtained, ΦRMSD was
calculated by using Eq. (32). Here, if the difference between $\Phi_i^{\mathrm{native}}$
and $\Phi_i^{\mathrm{min}}$ of a backbone dihedral angle in a protein was more than 20°,
the value was ignored, because about 90% of the differences between
$\Phi_i^{\mathrm{native}}$ and $\Phi_i^{\mathrm{min}}$ are less than 20°; in other words,
we wanted to consider the majority of the differences of the backbone dihedral angles.
In Fig. 14, the distribution of the backbone dihedral angles in the 100 protein
structures is shown. After calculating ΦRMSD for the several f_cut values, we selected
f_cut = 8.5, which gave the minimum value of ΦRMSD.
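The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Eq. (32) is not reproduced in this section, so the sketch assumes a plain root-mean-square over the retained angle differences, and it wraps differences onto [−180°, 180°) so that, for example, 170° and −175° are treated as 15° apart. The function names and the returned "fraction kept" are ours.

```python
import math


def wrap_deg(delta):
    """Map an angular difference in degrees onto the interval [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def dihedral_rmsd(native_deg, minimized_deg, cutoff_deg=20.0):
    """RMS difference of backbone dihedral angles, ignoring large deviations.

    Differences whose magnitude exceeds cutoff_deg are dropped, as described
    in the text. Returns (rmsd, fraction_of_angles_kept).
    Assumption: a plain RMS; the normalization of Eq. (32) may differ.
    """
    diffs = [wrap_deg(m - n) for n, m in zip(native_deg, minimized_deg)]
    kept = [d for d in diffs if abs(d) <= cutoff_deg]
    if not kept:
        raise ValueError("no dihedral differences within the cutoff")
    rmsd = math.sqrt(sum(d * d for d in kept) / len(kept))
    return rmsd, len(kept) / len(diffs)
```

Returning the retained fraction alongside the RMSD makes it easy to check a statement such as "about 90% of the differences are less than 20°" for a given data set.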
In Table 12, the optimized double-Fourier-coefficient parameters and the corresponding
original AMBER ff94 and ff96 force-field parameters are listed. Note that the original
AMBER ff94 also contains a Fourier term with periodicity four (see Table 11), which
lies outside the truncated series of Eq. (41). Therefore, the coefficient set listed
for the original AMBER ff94 is not complete. In Fig. 15, these
backbone-torsion-energy surfaces on the Ramachandran space are illustrated.
In order to test the validity of the force-field parameters obtained by our opti-
mization methods, we performed folding simulations using two peptides, namely,
C-peptide and G-peptide.

Fig. 13 The distribution of the absolute value of the forces acting on each atom in the 100
protein structures, which were obtained from PDB

Fig. 14 The distribution of the absolute value of the backbone dihedral angles Φ (φ and ψ) in
the 100 protein structures, which were obtained from PDB

Table 12 Fourier coefficients in Eq. (39) obtained from the numerical evaluations of the integrals
in Eq. (15). “org94” and “org96” stand for the original AMBER ff94 and the original AMBER
ff96, respectively, and “optimized” stands for the optimized force field obtained by our
optimization method. The original AMBER ff94 also has a Fourier term with periodicity four,
which is not listed here. Therefore, this coefficient set of the original AMBER ff94 is not complete
Coefficient   org94    org96    Optimized
a             2.700    2.300    0.000
b1            0.000    0.850    0.835
b2           −0.200   −0.300   −0.088
c1            0.000    0.000   −0.327
c2            0.000    0.000    0.100
d1           −0.750    0.850    0.287
d2           −1.350   −0.300    0.019
e1            0.000    0.000   −0.160
e2            0.000    0.000   −0.054
f11           0.000    0.000   −0.427
g11           0.000    0.000    0.247
h11           0.000    0.000    0.114
i11           0.000    0.000    0.603

For the folding simulations, we used REMD [60]. We used the TINKER program
package [54] modified by us for the folding simulations. The unit time step was
set to 1.0 fs. Each simulation was carried out for 5.0 ns (hence, it consisted of
5,000,000 MD steps) with 32 replicas. The temperature during the MD simulations
was controlled by the Nosé–Hoover method [63]. The replica temperatures were
distributed exponentially from 700 to 250 K. As for solvent effects, we used the
GB/SA model [42, 43] included in the TINKER program package [54].
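For readers unfamiliar with REMD, the exchange step between two replicas at neighboring temperatures is normally decided by a Metropolis criterion. The sketch below shows that standard criterion for temperature REMD; it is not taken from the modified TINKER code, which is not shown in the text, and the function names are our own. Energies are assumed to be in kcal/mol and temperatures in K.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exchange_probability(e_i, t_i, e_j, t_j):
    """Metropolis acceptance probability for swapping two replicas.

    Replica i has potential energy e_i at temperature t_i, replica j has
    e_j at t_j. With beta = 1/(k_B T) and
    delta = (beta_j - beta_i) * (e_i - e_j),
    the swap is accepted with probability min(1, exp(-delta)).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_j - beta_i) * (e_i - e_j)
    return 1.0 if delta <= 0.0 else math.exp(-delta)


def attempt_exchange(e_i, t_i, e_j, t_j, rng=random):
    """Return True if the swap is accepted for one random draw."""
    return rng.random() < exchange_probability(e_i, t_i, e_j, t_j)
```

A swap that moves the lower-energy conformation to the lower temperature is always accepted; the reverse swap is accepted only with a Boltzmann-weighted probability.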
We checked the secondary-structure formations by the DSSP program [45]. In
Fig. 16, the helicity and strandness of C-peptide which were obtained with the opti-
mized force field, the original AMBER ff94, and the original AMBER ff96 are
shown. In comparison with the original AMBER ff94, the helicity of the optimized
force field decreases, and in comparison with the original AMBER ff96, that of the

Fig. 15 The backbone-torsion-energy surfaces of the optimized force field (a), the original AMBER
ff94 (b), and the original AMBER ff96 (c)

optimized force field increases. As for the strandness, the original AMBER ff94 is
almost zero, and both the optimized force field and the original AMBER ff96 have
low strandness.
In Fig. 17, the helicity and strandness of G-peptide which were obtained with the
optimized force field, the original AMBER ff94, and the original AMBER ff96 are
shown. The helicity of the original AMBER ff94 is clearly high, as in the
case of C-peptide. On the other hand, the helicities of both the optimized force field
and the original AMBER ff96 are lower than that of the original
AMBER ff94. However, in comparison with the original AMBER ff96, the optimized
force field slightly favors the helix structure in the region around amino-acid residues
6–8. In the experimental results, there is a turn region around residues 7–10 in
G-peptide, and the backbone-torsion angles of the turn conformation are similar to those
of the helix structure. Therefore, we consider that this tendency is not in disagreement
with the experimental results. For the strandness, the original AMBER ff94 is also

Fig. 16 Helicity (a) and strandness (b) of C-peptide as functions of the residue number. These
values are obtained from the REMD [60] simulations at 300 K. Normal, dashed, and dotted lines
stand for the optimized force field, the original AMBER ff94, and the original AMBER ff96,
respectively. There is only one secondary structural element (an α-helix in residues 4–12) in the
native structure (PDB ID: 1A5P). See Fig. 5a


Fig. 17 Helicity (a) and strandness (b) of G-peptide as functions of the residue number. These
values are obtained from the REMD [60] simulations at 300 K. Normal, dashed, and dotted lines
stand for the optimized force field, the original AMBER ff94, and the original AMBER ff96,
respectively. There is only one secondary structural element (a β-hairpin; β-strands are in residues
2–6 and residues 11–15) in the native structure (PDB ID: 1PGA). See Fig. 5b

almost zero as in the case of C-peptide, and both the optimized force field and the
original AMBER ff96 have higher values of the strandness than those of the helicity.
In Fig. 17b, the strandness decreases in the region around residues 7–8, in agreement
with the experiments.
These secondary-structure-forming tendencies of the optimized force field for the
two peptides agree better with the experimental implications than those of the
original AMBER ff94 and ff96 force fields. Therefore, our improvement methods
succeeded in enhancing the accuracy of the AMBER force field.

3.2.2 Use of CRMSD [39]

We now present the results of the applications of our optimization method in
Sect. 2.3.2, which we here refer to as Method 2, as well as that in Sect. 2.3.1, which
we refer to as Method 1.
First, we chose 100 PDB files from PDB-REPRDB [56]. Next, we refined
these selected 100 structures. We added hydrogen atoms to the PDB coordinates by
using the TINKER program package. We then minimized the total potential energy
E_total = E_conf + E_solv + E_constr with respect to the coordinates for each protein
conformation, where E_constr is the constraint energy term in Eq. (29). We performed this
minimization for all the 100 protein structures separately and obtained 100 refined
structures.
The force field that we optimized is the OPLS-UA [64]. The torsion-energy term
E torsion (Φ) for this force field is given by Eq. (4). We performed the force-field
parameter optimizations that correspond to the following torsion angles by Methods
1 and/or 2.
1. N–Cα –Cβ –Cγ and C–Cα –Cβ –Cγ (χ1 ) by Method 2

2. C–N–Cα –C (φ), N–Cα –C–N (ψ), C–N–Cα –Cβ and N–C–Cα –Cβ by Methods 1
and 2

3. C–N–Cα –Cβ by Method 2

4. N–Cα –C–N by Method 2

5. Cα –Cβ –Cγ –Cδ (χ2 of Glu) by Methods 1 and 2

Here, we also optimized the force-field parameters of χ2 of Glu. The reason is
given below.
In Method 1, F in Eq. (23) was minimized by Monte Carlo (MC) simulated annealing
in the space of the torsion-energy parameters; runs of 10,000 MC steps
were performed 10 times. Here, we neglected the improper-torsion-energy
contributions to E_conf in Eq. (25). In order to make a better force field, we would
have to optimize many force-field parameters. However, we ignored the uncertainty of
the improper-torsion-energy parameters in this optimization, because we wanted to focus
on the torsion-energy parameters and Method 1 is very sensitive to the energy of dihedral
angles. As an example, one of the results of the Method 1 simulations is shown
in Fig. 18.
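The general shape of such a simulated-annealing search in parameter space can be sketched as follows. This is a generic illustration, not the authors' procedure: the objective F of Eq. (23) is not reproduced in this section, so the caller must supply it, and the geometric cooling schedule, the single-parameter trial move, and all default values (`t_start`, `t_end`, `step_size`) are assumptions of ours.

```python
import math
import random


def simulated_annealing(objective, x0, n_steps=10000, t_start=1.0,
                        t_end=1e-3, step_size=0.1, seed=0):
    """Minimize objective(x) over a list of parameters by MC simulated annealing.

    Returns (best_parameters, best_value) seen during the run.
    """
    rng = random.Random(seed)
    x = list(x0)
    f = objective(x)
    best_x, best_f = list(x), f
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        frac = step / max(n_steps - 1, 1)
        temp = t_start * (t_end / t_start) ** frac
        # Trial move: perturb one randomly chosen parameter.
        trial = list(x)
        k = rng.randrange(len(trial))
        trial[k] += rng.uniform(-step_size, step_size)
        f_trial = objective(trial)
        # Metropolis acceptance at the current annealing temperature.
        if f_trial <= f or rng.random() < math.exp(-(f_trial - f) / temp):
            x, f = trial, f_trial
            if f < best_f:
                best_x, best_f = list(x), f
    return best_x, best_f
```

Repeating the call with different `seed` values and keeping the run with the lowest returned value mirrors the "performed 10 times" protocol described above.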
In Method 2, the lowest CRMSD value was selected from about 10 to 30
optimization runs with different initial conditions. In order to calculate CRMSD, the
minimizations of the 100 proteins were performed using these new parameter sets. In
Table 13, all the optimized torsion-energy parameters are listed. As one can see in
Table 13, most of the original OPLS-UA parameters that were subject to the
optimization are zero.

Fig. 18 Time series of Monte Carlo simulated annealing simulations in force-field parameter space
of torsion-energy for OPLS-UA. The ordinate is the value of F in Eq. (23)

Table 13 Original (org) and optimized (opt) torsion-energy parameters of OPLS-UA
V1/2 γ1 V2/2 γ2 V3/2 γ3
org opt org opt org opt
N–Cα–Cβ–Cγ (χ1) 0.5 or 1.0 1.950 0.0
C–Cα–Cβ–Cγ (χ1) 0.5 or 1.0 1.950 0.0
C–N–Cα–C (φ) 0.0 −0.662 0.0 0.0 0.277 π 0.0 −0.050 0.0
N–Cα–C–N (ψ) 0.0 0.974 0.0 0.0 0.576 π 0.0 −0.083 0.0
C–N–Cα–Cβ 0.0 0.811 0.0 0.0 0.328 π 0.0 0.155 0.0
N–C–Cα–Cβ 0.0 0.215 0.0 0.0 0.036 π 0.0 0.015 0.0
Cα–Cβ–Cγ–Cδ (χ2 of Glu) 0.0 0.565 0.0 0.0 0.177 π 2.0 −0.025 0.0

In comparison with Method 1, Method 2 can optimize force-field parameters
appropriately even if there are some errors in PDB structures. However, the
computational cost of Method 2 is much larger than that of Method 1. Therefore, we could
not apply Method 2 to the global optimization in the force-field-parameter space. The
force-field parameters of the backbone-torsion angles need global optimization,
because we consider these parameters to be the most problematic. Thus, we first
performed the global optimization of the backbone-torsion parameters by using
Method 1. After that, Method 2 was applied only to the local region of the parameter
space that was identified as relevant by Method 1.
In order to test the validity of the force-field parameters obtained by our opti-
mization methods, we performed folding simulations using two peptides, namely,
C-peptide and G-peptide.
Glu is the only amino acid that appears twice in each of the two peptides. Therefore, we
consider Glu to be the most important amino acid, and the χ2 parameters were
optimized for it. (Of course, we expect that the force field becomes better
if the remaining force-field parameters of other amino acids are also optimized.)
For the folding simulations, we used REMD [60]. We used the TINKER program
package [54] modified by us for the folding simulations. The unit time step was set

to 1.0 fs. Each simulation was carried out for 10 ns (hence, it consisted of 10,000,000
MD steps) with 16 replicas. The temperature during MD simulations was controlled
by Nosé-Hoover method [63]. The temperature was distributed exponentially: 700,
662, 625, 591, 558, 528, 499, 471, 446, 421, 398, 376, 355, 336, 317, and 300 K. As for
solvent effects, we used the GB/SA model [42, 43] included in the TINKER program
package [54]. These folding simulations were repeated 10 times with different sets
of randomly generated initial velocities.
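An "exponentially distributed" temperature set can be read as a geometric progression between the highest and lowest temperatures. The sketch below assumes that reading (the function name is ours, and the text does not say how the values were rounded); with 16 replicas between 700 and 300 K it reproduces the temperatures listed above to within about 1 K.

```python
def temperature_ladder(t_max, t_min, n_replicas):
    """Geometric temperature ladder from t_max down to t_min.

    Successive temperatures share a constant ratio, so that
    T_k = t_max * (t_min / t_max) ** (k / (n_replicas - 1)).
    """
    ratio = (t_min / t_max) ** (1.0 / (n_replicas - 1))
    return [t_max * ratio ** k for k in range(n_replicas)]


ladder = temperature_ladder(700.0, 300.0, 16)
print([round(t) for t in ladder])
```

The same function with `temperature_ladder(700.0, 250.0, 32)` gives a 32-replica ladder of the kind used in the previous subsection.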
In Fig. 19, the helicity and strandness of C-peptide which were obtained with the
original OPLS-UA and its optimized force field are shown. These values are the
averages of the 10 REMD simulations at 300 K. In comparison with the original
OPLS-UA, the helicity of the optimized force field increases for residues 6–12.
The strandness is almost zero for both the
original and the optimized OPLS-UA force fields.
In Fig. 20, the helicity and strandness of G-peptide with the original OPLS-UA
and its optimized force fields are shown. In comparison with the original OPLS-UA,


Fig. 19 Helicity (a) and strandness (b) of C-peptide as functions of the residue number. These
values are the average of the 10 independent REMD [60] simulations at 300 K. Normal and dotted
lines stand for the optimized and original OPLS-UA force fields, respectively


Fig. 20 Helicity (a) and strandness (b) of G-peptide as functions of the residue number. These
values are the average of the 10 independent REMD [60] simulations at 300 K. Normal and dotted
lines stand for the optimized and original OPLS-UA force fields, respectively

the helicity of the optimized force field decreases for residues 8–15, and the
strandness of the optimized force field clearly increases in the two regions of
residues 2–6 and 9–15. In the experimental results, there is a turn region around
residues 7–10, and there are five intra-backbone hydrogen-bond pairs in G-peptide,
namely, between residue pairs 2–15, 3–14, 4–13, 5–12, and 6–11. In Fig. 20b, the
strandness decreases in the region around residues 7–8, in agreement with the experiments.
These results show that the optimized force field favors helix structures more than
the original OPLS-UA in the case of C-peptide and favors β structures more than the
original OPLS-UA in the case of G-peptide. We see that these secondary-structure-
forming-tendencies of the optimized force field are better than those of the original
OPLS-UA.
In Figs. 21 and 22, we show the 20 lowest-energy conformations of C-peptide
and G-peptide obtained by the REMD simulations in the case of the original and


Fig. 21 Twenty lowest-energy conformations of C-peptide obtained from 10 sets of REMD [60]
simulation runs. a and b are the results of the original and optimized OPLS-UA force field, respec-
tively. The conformations are ordered in the increasing order of energy for each case. The figures
were created with DS Visualizer v1.5 [52]

Fig. 22 Twenty lowest-energy conformations of G-peptide obtained from 10 sets of REMD [60]
simulation runs. a and b are the results of the original and optimized OPLS-UA force field, respec-
tively. The conformations are ordered in the increasing order of energy for each case. The figures
were created with DS Visualizer v1.5 [52]

optimized OPLS-UA force fields, respectively. In Fig. 21a, five conformations (Nos.
11, 13, 16, 18, and 19) have α-helix structures for the original OPLS-UA in the case
of C-peptide. In Fig. 21b, 18 conformations (all conformations except for Nos. 2 and
12) have α-helix structures for the optimized OPLS-UA in the case of C-peptide.
From these results, we can see that the optimized OPLS-UA force field favors the
α-helix structure more than the original OPLS-UA force field in the case of C-peptide.
In Fig. 22a, 11 conformations have α-helix structures for the original OPLS-UA in
the case of G-peptide. In Fig. 22b, seven conformations have α-helix structures, and
eight conformations have β-hairpin structures for the optimized OPLS-UA in the
case of G-peptide. In Fig. 22b, two conformations (Nos. 3 and 16) out of the eight
β-hairpin conformations have the right hydrogen bond formations that are inferred
by the experiments. Namely, conformation No. 3 has three native-like hydrogen

bonds between residue pairs 3–14, 4–13, and 5–12, and conformation No. 16 has
two native-like hydrogen bonds between residue pairs 3–14 and 4–13. These results
for G-peptide show that the optimized OPLS-UA force field does not favor α-helix
structure and clearly favors β-hairpin structure more than the original OPLS-UA
force field.
These secondary-structure-forming tendencies of the optimized OPLS-UA force
field for the two peptides agree better with the experimental implications than those
of the original OPLS-UA force field. Therefore, our optimization methods succeeded
in enhancing the accuracy of the OPLS-UA force field.

3.2.3 Use of ΦRMSD [38]

We now present the results of the applications of our optimization method of force-
field parameters in Sect. 2.3.3.
First, we chose 100 PDB files from PDB-REPRDB [56]. We selected the number
of proteins of each fold class (all α, all β, α/β, and α + β) among the 100 proteins
based on the number of folds given by SCOP (version 1.73, November 2007) [65].
Namely, we used 29 all α, 18 all β, 16 α/β, and 37 α + β proteins (the list is slightly
different from that in Table 5).
The force field that we optimized is the AMBER parm96 version [8]. The
backbone-torsion-energy term $E_{\mathrm{torsion}}(\Phi, \Psi)$ for this force field is given by

\[
E_{\mathrm{torsion}}(\Phi, \Psi) = \frac{V_1^{\phi}}{2}\,[1 + \cos\phi] + \frac{V_2^{\phi}}{2}\,[1 - \cos 2\phi]
 + \frac{V_1^{\psi}}{2}\,[1 + \cos\psi] + \frac{V_2^{\psi}}{2}\,[1 - \cos 2\psi],
\tag{42}
\]

where we have $V_1^{\phi} = 1.7$, $V_2^{\phi} = 0.6$, $V_1^{\psi} = 1.7$, and $V_2^{\psi} = 0.6$.
Here, we have optimized only two parameters in the backbone-torsion-energy term, namely,
$V_1^{\psi}$ and $V_2^{\psi}$ for the ψ angle. As described above, AMBER parm94 and AMBER parm96 have
quite different secondary-structure-forming tendencies, although these force fields
differ only in the backbone torsion-energy terms for rotations of the φ and ψ angles.
Moreover, we can easily imagine that the force-field parameters $V_1^{\psi}$ and $V_2^{\psi}$ for the ψ angle
are important for the secondary-structure-forming tendencies, because the energy
surface in the Ramachandran space is quite sensitive to this energy term in the helix
and β-sheet regions. Namely, if the torsion-energy term for the ψ angle changes, the
stabilities of the helix structure region and the β-sheet region in the Ramachandran space
change. Therefore, we considered some trial force-field parameters for $V_1^{\psi}$ and $V_2^{\psi}$,
which are given by the following equations:

\[ V_1^{\mathrm{trial}} = 1.7 \cdot 0.2\,i = 0.34\,i, \tag{43} \]

\[ V_2^{\mathrm{trial}} = 0.6 \cdot 0.2\,i = 0.12\,i. \tag{44} \]



Here, i is any real number. When i is 5, the force-field parameters V1trial and V2trial of
the ψ angle are equal to those of the original AMBER parm96. From our experience, if
i has a small value (i < 5), the force field favors helix structure, and if i has a large
value (i > 5), the force field favors β-sheet structure (see also Figs. 23 and 24). We
calculated the ΦRMSD2ndry values in Eq. (37) for several trial force-field parameters
obtained by changing i in Eqs. (43) and (44).
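The one-parameter family of Eqs. (43) and (44) is simple to tabulate. The short sketch below (the function name and keyword arguments are ours) maps a value of i to the pair of trial parameters and confirms that i = 5 recovers the original AMBER parm96 values of 1.7 and 0.6.

```python
def trial_parameters(i, v1_ref=1.7, v2_ref=0.6, scale=0.2):
    """Trial psi-torsion parameters of Eqs. (43) and (44) for a given i.

    v1_ref and v2_ref are the original AMBER parm96 values quoted in the
    text; scale = 0.2 makes i = 5 reproduce them.
    """
    return v1_ref * scale * i, v2_ref * scale * i


for i in (3.0, 4.7, 5.0, 7.0):
    v1, v2 = trial_parameters(i)
    print(f"i = {i:3.1f}: V1trial = {v1:.3f}, V2trial = {v2:.3f}")
```

The values i = 3.0, 4.7, 5.0, and 7.0 are the ones labeled para3, optimized, original, and para7 in Figs. 23 and 24.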
We performed the energy minimizations by using the TINKER program package [54]; each
minimization was terminated when the root-mean-square (RMS) potential-energy gradient
fell below 0.1 kcal/mol/Å. For solvent effects, we used the GB/SA solvent model
in TINKER.

[Fig. 23: (a) helicity (%) and (b) strandness (%), each on a scale from 0 to 80, versus residue number; curves: Optimized, Original, Para3, Para7]

Fig. 23 Helicity (a) and strandness (b) of C-peptide as functions of the residue number. These values
are the averages of the 10 independent REMD [60] simulations at 300 K. Optimized, original, para3,
and para7 stand for the optimized AMBER parm96 (i = 4.7), original AMBER parm96 (i = 5.0),
trial force field para3 (i = 3.0), and trial force field para7 (i = 7.0), respectively

[Fig. 24: (a) helicity (%) and (b) strandness (%), each on a scale from 0 to 80, versus residue number; curves: Optimized, Original, Para3, Para7]

Fig. 24 Helicity (a) and strandness (b) of G-peptide as functions of the residue number. These
values are the averages of the 10 REMD [60] simulations at 300 K. Optimized, original, para3, and
para7 stand for the optimized AMBER parm96 (i = 4.7), original AMBER parm96 (i = 5.0), trial
force field para3 (i = 3.0), and trial force field para7 (i = 7.0), respectively

The results of ΦRMSDhelix and ΦRMSDβ are shown in Fig. 25a, b, respectively.
In these calculations, if the differences of the backbone-dihedral angles between
$\Phi_i^{\mathrm{native}}$ and $\Phi_i^{\mathrm{min}}$ in Eq. (36) were more than 30°, they
were ignored, assuming that the uncertainties in those angles are too large. We see that
ΦRMSDhelix decreases gradually with a decrease in i. If i decreases, the torsion energy of
the helix structure region in the Ramachandran space also decreases. On the other hand, ΦRMSDβ
decreases gradually with an increase in i. If i increases, the torsion energy of the β
structure region in the Ramachandran space decreases. Hence, this result is
reasonable. However, ΦRMSDβ reaches the global minimum when i is 6.5. If i is larger

[Fig. 25: (a) ΦRMSDhelix, (b) ΦRMSDβ, and (c) ΦRMSD2ndry versus i, for i from −20 to 20]

Fig. 25 Distributions of ΦRMSDhelix (a), ΦRMSDβ (b), and ΦRMSD2ndry (c) obtained from the
minimization of 100 proteins using the trial force-field parameters V1trial and V2trial as functions of
the number i

than 6.5, ΦRMSDβ increases gradually. This result implies that ΦRMSDβ does
not track the parameters V1trial and V2trial completely.
For ΦRMSDhelix and ΦRMSDβ in Fig. 25a, b, we can clearly see the difference in behavior.
The noteworthy point obtained from these results is that ΦRMSD can distinguish
between helix structure and β structure.
We combined ΦRMSDhelix and ΦRMSDβ by Eq. (37). Here, in order to have
roughly equal contributions from both terms, we can set the value of the scaling
factor λ to be, for example, the ratio of the coefficients of variation:

\[
\lambda = \frac{\sigma_{\beta} / \mu_{\beta}}{\sigma_{\mathrm{helix}} / \mu_{\mathrm{helix}}} .
\tag{45}
\]

Here, μhelix and μβ are the averages and σhelix and σβ are the corresponding standard
deviations for ΦRMSDhelix and ΦRMSDβ. For the calculations, we have chosen a
small number of i values in a range i_min ≤ i ≤ i_max. For i_min = 0 and i_max = 10, we
obtained λ = 6.857, and this fixed value was used for all the calculations in the
present work.
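The scaling factor of Eq. (45) is a ratio of two coefficients of variation and can be computed directly from the two series of RMSD values sampled over i. The sketch below uses Python's standard `statistics` module; the function names are ours, and the population standard deviation is an assumption (when both series contain the same number of points, the choice between population and sample standard deviation cancels in the ratio).

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def scaling_factor(rmsd_helix, rmsd_beta):
    """Scaling factor lambda of Eq. (45).

    rmsd_helix and rmsd_beta are the PhiRMSD_helix and PhiRMSD_beta values
    obtained for the same set of i values.
    """
    return coefficient_of_variation(rmsd_beta) / coefficient_of_variation(rmsd_helix)
```

A value of λ much larger than 1, such as the 6.857 quoted above, indicates that ΦRMSDβ varies far more strongly with i (relative to its mean) than ΦRMSDhelix does.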
In Fig. 25c, the combined result is shown. The smallest ΦRMSD2ndry is obtained
at i = 4.7; namely, the obtained force-field parameters are V1trial = 1.598 and
V2trial = 0.564. These values are slightly smaller than those of the original AMBER
parm96, which corresponds to i = 5. We can therefore expect the newly obtained
force-field parameters to favor helix structure slightly more and β-sheet structure
slightly less than the original AMBER parm96.
In order to test the validity of the force-field parameters obtained by our opti-
mization method, we performed the folding simulations using two peptides, namely,
C-peptide and G-peptide.
For the folding simulations, we used REMD [60]. We used the TINKER program
package [54] modified by us for the folding simulations. The unit time step was set to
1.0 fs. Each simulation was carried out for 2 ns (hence, it consisted of 2,000,000 MD
steps) with 16 replicas and repeated 10 times. The temperature during MD simula-
tions was controlled by Berendsen’s method [53]. The temperature was distributed
exponentially: 700, 662, 625, 591, 558, 528, 499, 471, 446, 421, 398, 376, 355, 336,
317, and 300 K. As for solvent effects, we used the GB/SA model [42, 43] included
in the TINKER program package [54]. These folding simulations were performed
with different sets of randomly generated initial velocities.
In Fig. 23, the helicity and strandness of C-peptide which were obtained with the
original AMBER parm96 and its optimized force field are shown. These values are
the averages of the 10 REMD simulations at 300 K. In comparison with the original
AMBER parm96, the helicity of the optimized force field is similar. However, the
helicity of Thr3, Ala4, and Ala5 of the optimized force field slightly increases. In
comparison with the original AMBER parm96, the strandness of the optimized force
field decreases except for that at Ala6, Lys7, and Phe8.
In Fig. 24, the helicity and strandness of G-peptide with the original AMBER parm96
and its optimized force field are shown. In comparison with the original AMBER
parm96, the helicity of the optimized force field slightly increases and the strandness
of the optimized force field slightly decreases. For the trial force fields para3 and para7,
the secondary-structure-forming tendencies are similar to the case of C-peptide.
These results clearly show that the optimized force field favors helix structures
and disfavors β structures in comparison with the original AMBER parm96.
We can see that these secondary-structure-forming tendencies of the optimized force
field are better than those of the original AMBER parm96, because it is known that
AMBER parm96 favors the β structure slightly too much [23–27].
We also performed the folding simulations with two extreme cases of the trial
force fields, namely, para3 (i = 3.0) and para7 (i = 7.0) (see Figs. 23 and 24) for
comparisons. The trial force field para3 strongly favors helix structure and
clearly disfavors β structure. On the other hand, the trial force field para7 has the
opposite tendency. According to the results of ΦRMSDhelix and

ΦRMSDβ in Fig. 25a, b, ΦRMSDhelix decreases gradually with a decrease in i, and
ΦRMSDβ reaches the global minimum when i is 6.5. Namely, we can see that the
values of ΦRMSDhelix and ΦRMSDβ are closely related to the stabilities of helix structure
and β structure.

3.2.4 Use of Short MD Simulations [46]

We present the results of the applications of our optimization method in Sect. 2.3.4 to
the AMBER ff99SB force field. First, we chose 31 PDB files (M = 31) from PDB-REPRDB
[56] with a resolution of 2.0 Å or better, an amino-acid sequence similarity of 30.0% or
lower, and from 40 to 111 residues (the average number of residues is 86.7).
The PDB IDs of these 31 proteins are 1LDD, 1HBK, 1Y02, 1I2T, 1U84,
2ERL, 1TQG, 1O82, 1V54, 1XAK, 1GMU, 1O5U, 1NLQ, 1WHO, 1CQY, 1H75,
1GMX, 1IIB, 1VC1, 1AY7, 1KAF, 1KPF, 1BM8, 1MK0, 1EW4, 1OSD, 1VCC,
1OPD, 1CYO, 1CTF, and 1N9L. We added hydrogen atoms to the PDB coordinates
by using the AMBER11 program package. After adding the hydrogen atoms,
we performed short potential-energy minimizations while restraining the heavy
atoms. We used the obtained conformations as the initial structures (experimental
structures). We performed MD simulations for these proteins. Each simulation was
carried out for 40.0 ps by using Langevin dynamics at 300 K (hence, it consisted of
20,000 MD steps; the unit time step was set to 2.0 fs, and the bonds involving hydrogen
were constrained by the SHAKE algorithm [61]). A nonbonded cutoff of 20 Å was used.
As for solvent effects, we used the GB/SA model [58] included in
the AMBER program package (igb = 5). These simulations were performed with
different sets of initial velocities of the atoms in the 31 proteins. For
the whole process, we used the AMBER11 program package [57]. As trial force-field
parameters, we used the parameters V1 of the ψ (N–Cα–C–N) and ψ′ (Cβ–Cα–C–N)
angles for the torsion-energy term in Eq. (4). We performed the simulations by using 14
and 15 values of the V1 parameters of ψ and ψ′, respectively, and the simulations
with each set of parameter values were performed five times by changing the initial
velocities of the atoms in the 31 proteins. Namely, we calculated $n_i^{S\to U}$ and
$n_i^{U\to S}$ in Eq. (38) as the averages of $n_i^{S\to U}$ and $n_i^{U\to S}$ over 10
trajectories from 20.0 to 40.0 ps of the five simulations. These results are shown in
Fig. 26. We determined the optimized force-field parameters in the order of ψ′ and ψ,
by searching for the minimum value of S in Fig. 26. The V1 parameter for ψ changed
from 0.45 to 0.31, and the V1 parameter for ψ′ changed from 0.20 to −1.60.
In order to test the validity of the force-field parameters obtained by our opti-
mization method, we performed the folding simulations using two peptides, namely,
C-peptide and G-peptide.
For test simulations, we used REMD [60]. We used the AMBER11 program
package [57]. The unit time step was set to 2.0 fs, and the bonds involving hydrogen
were constrained by SHAKE algorithm [61]. Each simulation was carried out for 30.0
ns (hence, it consisted of 15,000,000 MD steps) with 32 replicas by using Langevin
dynamics. The replica exchange was tried every 3,000 steps. The temperature was
distributed exponentially: 600, 585, 571, 557, 544, 530, 517, 505, 492, 480, 469, 457,
446, 435, 425, 414, 404, 394, 385, 375, 366, 357, 348, 340, 332, 324, 316, 308, 300,
293, 286, and 279 K. As for solvent effects, we used the GB/SA model [58] included
in the AMBER program package (igb = 5). These simulations were performed with
different sets of randomly generated initial velocities.

Fig. 26 S values [defined in Eq. (38)] obtained from MD simulations of 31 proteins with the force
fields which have different V1 parameter values for ψ′ (Cβ–Cα–C–N) (a) and ψ (N–Cα–C–N) (b)
angles

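The replica temperatures listed above are consistent with a geometric (exponentially spaced) ladder between 600 K and 279 K. The following minimal Python sketch assumes the spacing rule T_k = T_max (T_min/T_max)^(k/(n−1)); the function name `geometric_ladder` is illustrative and is not part of the AMBER package.

```python
def geometric_ladder(t_max, t_min, n_replicas):
    """Return n_replicas temperatures spaced geometrically from t_max down to t_min."""
    ratio = (t_min / t_max) ** (1.0 / (n_replicas - 1))
    return [t_max * ratio ** k for k in range(n_replicas)]


# 32 replicas between 600 K and 279 K, as in the test simulations
ladder = geometric_ladder(600.0, 279.0, 32)
rounded = [round(t) for t in ladder]
print(rounded)
```

With a constant ratio between neighboring temperatures, the spacing narrows toward the low-temperature end of the ladder.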
In Fig. 27, α helicity and strandness of two peptides obtained from the test sim-
ulations are shown. For the original AMBER ff99SB force field, the α helicity is
clearly larger than the strandness in not only C-peptide but also G-peptide. Namely,
the original AMBER ff99SB force field clearly favors α-helix structure and does not
favor β structure. On the other hand, for the optimized force field, in the case of
C-peptide, the α helicity is larger than the strandness, and in the case of G-peptide,
the strandness is larger than the α helicity. These results obtained
from the optimized force field are in better agreement with the experimental results
than those from the original force field.

4 Conclusions

In this chapter we reviewed our works on force fields for molecular simulations
of protein systems. We first discussed the functional forms of the force fields and
presented some extensions of the conventional ones. Because the main-chain torsion-energy
terms are the most problematic among the force-field terms in the existing
force fields, we mainly considered the main-chain torsion-energy terms. We have
generalized them into the double Fourier series in φ and ψ. We have also introduced
the amino-acid dependence on these terms.
Given the functional forms, we then presented four methods for force-field
parameter optimizations. Our methods use the coordinates from PDB, which were
determined by experiments. All four optimization methods minimize some
score function with respect to the force-field parameters. In the first method, the
score function was the sum of forces acting on atoms with the coordinates from PDB.
In the second method, it was the average of PDB coordinate RMSD between before
and after energy minimizations. In the third method, it was the RMSD of backbone
dihedral angles between before and after energy minimizations. In the fourth method,
it was the sum of residues which changed secondary structures before and after short
MD simulations, starting from the coordinates from PDB. The computational cost
of the first method is much smaller than that of the remaining three methods, but we have
to be careful because the results can depend on the values of the force constants for
the restraining potential of heavy atom coordinates. If one has ample computation
time, the remaining three methods are recommended because they do not use the
restraining potential.

Fig. 27 α helicity (a-1) and strandness (a-2) of C-peptide and α helicity (b-1) and strandness
(b-2) of G-peptide as functions of the residue number. These values are obtained from REMD [60]
simulations at 300 K. Normal and dotted lines stand for the optimized and original AMBER ff99SB
force field, respectively

Some examples of our applications of these parameter optimization methods were
given and they were compared with the results from the existing force fields. It turned
out that all the examples resulted in improvement of the existing force fields. We thus
believe that we are at least on the right track.

Our optimization methods for the force-field parameters are quite general and
they can be readily applied to any new energy terms whenever they are introduced
in the future.

Acknowledgements The computations were performed on the computers at the Research Cen-
ter for Computational Science, Institute for Molecular Science, Information Technology Center,
Nagoya University, and Center for Computational Sciences, University of Tsukuba. This work was
supported, in part, by the Grants-in-Aid for the Academic Frontier Project, “Intelligent Information
Science”, for Scientific Research on Innovative Areas (“Fluctuations and Biological Functions”),
and for the Next Generation Super Computing Project, Nanoscience Program and
Computational Materials Science Initiative from the Ministry of Education, Culture, Sports, Science and
Technology (MEXT), Japan.

References

1. Liwo, A., Czaplewski, C., Ołdziej, S., Scheraga, H.A.: Curr. Opin. Struct. Biol. 18, 134
(2008)
2. Scheraga, H.A.: Ann. Rev. Biophys. 40, 1 (2011)
3. Hansmann, U.H.E., Okamoto, Y.: Curr. Opin. Struct. Biol. 9, 177 (1999)
4. Mitsutake, A., Sugita, Y., Okamoto, Y.: Biopolymers 60, 96 (2001)
5. Okamoto, Y.: J. Mol. Graphics Model. 22, 425 (2004)
6. Mitsutake, A., Mori, Y., Okamoto, Y.: Biomolecular Simulations: Methods and Protocols. In:
Monticelli, L., Salonen, E. (eds.), pp. 153–195. Humana Press, New York (2012)
7. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz Jr., K.M., Ferguson, D.M.,
Spellmeyer, D.C., Fox, T., Caldwell, J.W., Kollman, P.A.: J. Am. Chem. Soc. 117, 5179 (1995)
8. Kollman, P.A., Dixon, R., Cornell, W., Fox, T., Chipot, C., Pohorille, A.: Computer Simulations
of Biological Systems In: van Gunsteren, W.F., Weiner, P.K., Wilkinson, A.J., vol. 3, pp. 83–96,
Kluwer/ESCOM, Dordrecht (1997)
9. Wang, J., Cieplak, P., Kollman, P.A.: J. Comput. Chem. 21, 1049 (2000)
10. Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., Simmerling, C.: Proteins 65, 712
(2006)
11. Duan, Y., Wu, C., Chowdhury, S., Lee, M.C., Xiong, G., Zhang, W., Yang, R., Cieplak, P., Luo,
R., Lee, T.: J. Comput. Chem. 24, 1999 (2003)
12. MacKerell, Jr., A.D., Bashford, D., Bellott, M., Dunbrack Jr., R.L., Evanseck, J.D., Field, M.J.,
Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera, K., Lau,
F.T.K., Mattos, C., Michnick, S., Ngo, T., Nguyen, D.T., Prodhom, B., Reiher III., W.E., Roux,
B., Schlenkrich, M., Smith, J.C., Stote, R., Straub, J., Watanabe, M., Wiorkiewicz-Kuczera, J.,
Yin, D., Karplus, M.: J. Phys. Chem. B 102, 3586 (1998)
13. MacKerell Jr., A., Feig, M., Brooks III, C.: J. Comput. Chem. 25, 1400 (2004)
14. MacKerell Jr., A., Feig, M., Brooks III, C.: J. Am. Chem. Soc. 126, 698 (2004)
15. Jorgensen, W.L., Maxwell, D.S., Tirado-Rives, J.: J. Am. Chem. Soc. 118, 11225 (1996)
16. Kaminski, G.A., Friesner, R.A., Tirado-Rives, J., Jorgensen, W.L.: J. Phys. Chem. B 105, 6474
(2001)
17. Gunsteren, W.F., Billeter, S.R., Eising, A.A., Hünenberger, P.H., Krüger, P., Mark, A.E., Scott,
W.R.P., Tironi, I.G.: Vdf Hochschulverlag AG an der ETH Zürich, Zürich, (1996)
18. Oostenbrink, C., Villa, A., Mark, A.E., van Gunsteren, W.F.: J. Comput. Chem. 25, 1656 (2004)
19. Berendsen, H.J.C., van der Spoel, D., van Drunen, R.: Comput. Phys. Commun. 91, 43 (1995)
20. Lindahl, E., Hess, B., van der Spoel, D.: J. Mol. Model. 7, 306 (2001)
21. Némethy, G., Gibson, K.D., Palmer, K.A., Yoon, C.N., Paterlini, G., Zagari, A., Rumsey, S.,
Scheraga, H.A.: J. Phys. Chem. 96, 6472 (1992)
22. Arnautova, Y.A., Jagielska, A., Scheraga, H.A.: J. Phys. Chem. B 110, 5025 (2006)
23. Yoda, T., Sugita, Y., Okamoto, Y.: Chem. Phys. Lett. 386, 460 (2004)
24. Yoda, T., Sugita, Y., Okamoto, Y.: Chem. Phys. 307, 269 (2004)
25. Sakae, Y., Okamoto, Y.: Chem. Phys. Lett. 382, 626 (2003)
26. Sakae, Y., Okamoto, Y.: J. Theor. Comput. Chem. 3, 339 (2004)
27. Sakae, Y., Okamoto, Y.: J. Theor. Comput. Chem. 3, 359 (2004)
28. Simmerling, C., Strockbine, B., Roitberg, A.E.: J. Am. Chem. Soc. 124, 11258 (2002)
29. Duan, Y., Wu, C., Chowdhury, S., Lee, M.C., Xiong, G., Zhang, W., Yang, R., Cieplak, P., Luo,
R., Lee, T., Caldwell, J., Wang, J., Kollman, P.: J. Comput. Chem. 24, 1999 (2003)
30. Iwaoka, M., Tomoda, S.: J. Comput. Chem. 24, 1192 (2003)
31. Kamiya, N., Watanabe, Y., Ono, S., Higo, J.: Chem. Phys. Lett. 401, 312 (2005)
32. Best, R.B., Hummer, G.: J. Phys. Chem. B 113, 9004 (2009)
33. Mittal, J., Best, R.B.: Biophys. J. 99, L26 (2010)
34. Sakae, Y., Okamoto, Y.: J. Phys. Soc. Jpn. 75, 054802 (9 pages) (2006)
35. Sakae, Y., Okamoto, Y.: Mol. Sim. 36, 138 (2010)
36. Ramachandran, G.N., Sasisekharan, V.: Adv. Protein Chem. 23, 283 (1968)
37. Tanaka, S., Scheraga, H.A.: Macromolecules 9, 945 (1976)
38. Sakae, Y., Okamoto, Y.: Mol. Sim. 36, 159 (2010)
39. Sakae, Y., Okamoto, Y.: Mol. Sim. 36, 1148 (2010)
40. Sakae, Y., Okamoto, Y.: e-print: arXiv:1206.3909 [cond-mat.stat-mech]; submitted for publi-
cation
41. Sakae, Y., Okamoto, Y.: Mol. Sim. (In press)
42. Still, W.C., Tempczyk, A., Hawley, R.C., Hendrickson, T.: J. Am. Chem. Soc. 112, 6127 (1990)
43. Qiu, D., Shenkin, P.S., Hollinger, F.P., Still, W.C.: J. Phys. Chem. A 101, 3005 (1997)
44. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Science 220, 671 (1983)
45. Kabsch, W., Sander, C.: Biopolymers 22, 2577 (1983)
46. Sakae, Y., Okamoto, Y. (In preparation)
47. Honda, S., Kobayashi, N., Munekata, E.: J. Mol. Biol. 295, 269 (2000)
48. Shoemaker, K.R., Kim, P.S., Brems, D.N., Marqusee, S., York, E.J., Chaiken, I.M., Stewart,
J.M., Baldwin, R.L.: Proc. Natl. Acad. Sci. U.S.A. 82, 2349 (1985)
49. Osterhout Jr., J.J., Baldwin, R.L., York, E.J., Stewart, J.M., Dyson, H.J., Wright, P.E.: Bio-
chemistry 28, 7059 (1989)
50. Blanco, F.J., Rivas, G., Serrano, L.: Nature Struct. Biol. 1, 584 (1994)
51. Kobayashi, N., Honda, S., Yoshii, H., Uedaira, H., Munekata, E.: FEBS Lett. 366, 99 (1995)
52. Accelrys discovery studio visualizer. Software available at http://www.accelrys.com/
53. Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., DiNola, A., Haak, J.R.: J. Chem. Phys.
81, 3684 (1984)
54. Tinker program package. Software available at http://dasher.wustl.edu/tinker/
55. URL http://www.accelrys.com/
56. Noguchi, T., Onizuka, K., Akiyama, Y., Saito, M.: In: Proceeding of the Fifth International
Conference on Intelligent Systems for Molecular Biology, AAAI press, Menlo Park, CA (1997)
57. Case, D.A., Cheatham, T., Darden, T., Gohlke, H., Luo, R., Merz Jr., K.M., Onufriev, A.,
Simmerling, C., Wang, B., Woods, R.: J. Comput. Chem. 26, 1668 (2005)
58. Onufriev, A., Bashford, D., Case, D.A.: Proteins 55, 383 (2004)
59. Weiser, J., Shenkin, P.S., Still, W.C.: J. Comput. Chem. 20, 217 (1999)
60. Sugita, Y., Okamoto, Y.: Chem. Phys. Lett. 314, 141 (1999)
61. Ryckaert, J.P., Ciccotti, G., Berendsen, H.J.C.: J. Comput. Phys. 23, 327 (1977)
62. Wang, G., Dunbrack Jr., R.L.: Bioinformatics 19, 1589 (2003)
63. Hoover, W.G.: Phys. Rev. A 31, 1695 (1985)
64. Jorgensen, W.L., Tirado-Rives, J.: J. Am. Chem. Soc. 110, 1657 (1988)
65. Levitt, M., Chothia, C.: Nature 261, 552 (1976)
Enhanced Sampling for Biomolecular
Simulations

Workalemahu Berhanu, Ping Jiang and Ulrich H. E. Hansmann

Abstract The use of computer simulations as “virtual microscopes” is limited by
sampling difficulties that arise from the large dimensionality and the complex energy
landscapes of biological systems, leading to poor convergence even in folding
simulations of single proteins. In this chapter we discuss a few strategies to enhance
sampling in biomolecular simulations, and present some recent applications.

1 Introduction

Proteins are crucial components of the molecular machinery in cells, responsible
for transporting molecules, catalyzing biochemical reactions, or fighting infections.
Despite the remarkable progress in experimental techniques for producing
and characterizing proteins, a detailed understanding of the folding and interaction of
proteins is still missing. Hence, there is a need for reliable computational tools
that can complement experiments in describing protein folding and function from
physical interactions within a protein, and between a protein and the surrounding
environment. Such tools could lead to new insights into the molecular workings of
cells as needed in many medical and biotechnological applications. Shaw and
co-workers [1] have demonstrated that it is possible to study reversible folding of small
proteins in atomistic detail at the time scale observed in experiments. However, their
study was based on specialized hardware, and the extensive CPU usage required is out
of reach for most academic institutions. In addition, the size of proteins that can be
studied with such a brute-force approach is limited. This is because the complex
form of the forces leads to a rough energy landscape with a vast number of local
minima acting as traps, and as a result the computational requirements for sampling
the energy landscape increase exponentially with the size of the system [2].

W. Berhanu · U. H. E. Hansmann
Department of Chemistry and Biochemistry, University of Oklahoma,
Norman 73019-5251, USA
e-mail: wgberhanu@gmail.com
U. H. E. Hansmann
e-mail: uhansmann@ou.edu
P. Jiang (B)
Tiandao, Education, Shanghai, People’s Republic of China
e-mail: ping.jiang@tiandaoedu.com

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_8

In principle one can think of two approaches to overcome these numerical
difficulties. One is to utilize simplified or coarse-grained models since they lead by
design to an energy landscape with a reduced number of valleys. However, while such
models allow a much faster evaluation of the energy, the problem of poor sampling and
slow convergence will likely reappear for sufficiently large proteins as roughness
is an intrinsic characteristic of protein energy landscapes. The other approach to
obtain sufficient sampling of the conformational space is the use of enhanced
sampling techniques that can quickly find local minima but avoid trapping. Such methods
“flatten” the energy landscape by reducing barriers. While they change the
dynamics and therefore often do not allow one to study directly the kinetics of protein
folding, association, or aggregation, this is a small price to pay for faster and more
accurate calculation of thermal averages and free energy landscapes.
This chapter is organized as follows: we start with a short review of a number of
advanced simulation techniques before discussing shortcomings and open problems.
Recent applications demonstrate what can be done when using these approaches on
high-performance computing systems. We finish this short review with a summary
and outlook.

2 Advanced Simulation Techniques

The sampling difficulties in protein simulations at physiological temperature are due
to the roughness of the protein energy landscape, where crossing of an energy barrier
of height $\Delta E$ is suppressed by a factor $\propto \exp(-\Delta E/k_B T)$ ($k_B$ is the Boltzmann
constant and $T$ is the temperature of the system). Hence, raising the temperature $T$
makes it easier for a protein to cross energy barriers, but at the same time it becomes
more difficult to find low-energy configurations. Simulations at high temperature
can induce thermal unfolding of a protein, which is sometimes interpreted as
time-reversed folding [3, 4]. While this approach has been used in the past with some
success [3, 4], it is not clear whether it is in general a valid approach. For instance,
the C-fragment of TOP7 folds by a non-trivial pathway that involves caching of an
N-terminal segment in an adjunct helix. Only when all other parts of the protein
are folded and in place does the N-terminal segment unfold and re-fold to a strand
that completes the final structure in a three-stranded sheet. Time-reversed unfolding
trajectories at high temperature do not show the caching mechanism that governs
folding of this protein. An interpretation of unfolding as time-reversed folding may
be restricted to simple two-state folders and associated with a nucleation mechanism
as observed, for instance, for CI2 [3, 4].
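To give a feeling for the size of the barrier-crossing suppression factor, the following minimal Python sketch evaluates exp(−ΔE/k_B T) at 300 K and 600 K. The barrier height of 5 kcal/mol is a hypothetical, illustrative value that does not come from the text; `K_B` is the Boltzmann constant in kcal/(mol K).

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boltzmann_factor(delta_e, temperature):
    """Suppression factor exp(-delta_e / (k_B T)) for a barrier of height delta_e (kcal/mol)."""
    return math.exp(-delta_e / (K_B * temperature))


barrier = 5.0  # kcal/mol, hypothetical barrier height
for temperature in (300.0, 600.0):
    print(temperature, boltzmann_factor(barrier, temperature))

ratio = boltzmann_factor(barrier, 600.0) / boltzmann_factor(barrier, 300.0)
```

For this assumed barrier, doubling the temperature raises the factor by roughly 66 times, which illustrates why high-temperature runs cross barriers so much more readily.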
One possibility to ensure sampling of low-energy configurations and avoid
trapping in local minima is the use of improved updates that guide the simulation and/or allow for
larger time steps in the integrator in molecular dynamics simulations, or collective
moves in Monte Carlo. One example is hybrid Monte Carlo [5, 6], where a short
molecular dynamics run provides a trial configuration, which is then accepted or
rejected according to the Metropolis criterion. This allows a larger step size in the
molecular dynamics trajectory as the Metropolis step corrects for the discretization
errors. Another example is Rugged Metropolis (RM) [7], which uses information
from a simulation at a higher temperature to bias a Monte Carlo simulation at a lower
temperature. Assume a range of temperatures

$$T_1 > T_2 > \dots > T_r > \dots > T_{f-1} > T_f . \tag{1}$$

Results from the simulation at the highest temperature, $T_1$, are used to construct an
estimator of the probability density function

$$\bar{\rho}(x_1, \dots, x_n; T_1)$$

that biases the simulation at $T_2$. In turn, this simulation provides a bias for the one at
$T_3$, and the procedure is continued iteratively down to $T_f$. Here, one uses the approximation

$$\bar{\rho}(x_1, \dots, x_n; T_r) = \prod_{i=1}^{n} \bar{\rho}_i^{1}(x_i; T_r), \tag{2}$$

where $\bar{\rho}_i^{1}(x_i; T_r)$ are estimators of the reduced one-variable probability densities

$$\rho_i^{1}(x_i; T) = \int \prod_{j \neq i} dx_j \, \rho(x_1, \dots, x_n; T) . \tag{3}$$

Recursively, the estimated probability density function

$$\bar{\rho}(x_1, \dots, x_n; T_{r-1})$$

is generated as an approximation of $\rho(x_1, \dots, x_n; T_r)$. The acceptance step in the
(biased) Metropolis procedure at temperature $T_r$ is now given by

$$P_{RM} = \min\left[1, \frac{\exp(-\beta E')\, \bar{\rho}(x_1, \dots, x_n; T_{r-1})}{\exp(-\beta E)\, \bar{\rho}(x'_1, \dots, x'_n; T_{r-1})}\right], \tag{4}$$

where primed quantities refer to the trial configuration.

Improved updates such as Rugged Metropolis have been tested successfully in
simulations of small peptides. While in general the gain in efficiency is not enough to
make folding simulations of protein domains (usually consisting of 50–200 residues)
feasible, they can be combined readily with the generalized-ensemble techniques
described in the following sections, further increasing their efficiency.
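The biased acceptance step of Eq. (4) can be written compactly in code. The Python sketch below assumes that the density estimate from the next-higher temperature is supplied as a product of one-variable densities, as in Eq. (2), and that trial configurations are drawn according to that estimate. The function names, the callable-based interface and the Gaussian-shaped example densities are illustrative choices, not a published implementation.

```python
import math


def product_density(x, one_var_densities):
    """Eq. (2): product of one-variable density estimates evaluated at x = (x_1, ..., x_n)."""
    rho = 1.0
    for x_i, rho_i in zip(x, one_var_densities):
        rho *= rho_i(x_i)
    return rho


def rm_acceptance(beta, e_old, e_new, rho_old, rho_new):
    """Eq. (4): min(1, exp(-beta E') rho(x) / (exp(-beta E) rho(x'))).

    rho_old and rho_new are the estimated densities (from the simulation at T_{r-1})
    at the current configuration x and at the trial configuration x'.
    """
    log_ratio = -beta * (e_new - e_old) + math.log(rho_old) - math.log(rho_new)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


# Hypothetical two-variable example with unnormalized Gaussian-shaped estimates
densities = [lambda a: math.exp(-0.5 * a * a),
             lambda b: math.exp(-0.5 * (b - 1.0) ** 2)]
x_old, x_new = (0.0, 1.0), (0.5, 0.5)
p = rm_acceptance(1.0, e_old=0.0, e_new=0.3,
                  rho_old=product_density(x_old, densities),
                  rho_new=product_density(x_new, densities))
```

When the two density values are equal, the expression reduces to the ordinary Metropolis probability, which is a convenient sanity check.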

2.1 Generalized-Ensemble Techniques

2.1.1 Energy Landscape Paving

The idea behind all generalized-ensemble techniques can be seen most easily for
the global optimization method energy landscape paving (ELP) [8], which relies on
low-temperature Monte Carlo simulations with an effective energy:

$$w(\tilde{E}) = e^{-\tilde{E}/k_B T} \quad \text{with} \quad \tilde{E} = E + f(H(q, t)). \tag{5}$$

Here, $T$ is a (low) temperature and $f(H(q, t))$ is a function of the histogram $H(q, t)$
in a pre-chosen “order parameter” or “reaction coordinate” $q$. The weight of a local
minimum state decreases the longer the system stays in that state, until the
local minimum is no longer favored, after which the system will again explore higher
energies. We have evaluated the efficiency of ELP in simulations of the 20-residue
trp-cage protein, whose structure we could “predict” within a root-mean-square deviation
(rmsd) of 1 Å [9]. Energy landscape paving also allows
zero-temperature simulations [9]. For $T \to 0$ only moves with $\Delta\tilde{E} \le 0$ will be accepted.
If one chooses $\tilde{E} = E + cH(q, t)$, the acceptance criterion is given by:

$$\Delta E + c\,\Delta H(q, t) \le 0 \;\leftrightarrow\; c\,\Delta H(q, t) \le -\Delta E, \tag{6}$$

where $E$ is the “physical” energy. Hence, energy landscape paving can overcome
any energy barrier even at $T = 0$. The waiting time for such a move is proportional
to the height of the barrier that needs to be crossed. The factor $c$ sets the time scale,
and in this sense the $T = 0$ form of ELP is parameter-free.
However, the weight factor is time dependent, and therefore ELP violates detailed
balance. Hence, the method cannot be used to calculate thermodynamic averages.
Detailed balance is fulfilled only for $f(H(q, t)) = f(H(q))$, in which case ELP
reduces to one of the generalized-ensemble methods [10] generating a random walk
through order parameter space (energy, for instance), control parameter space
(temperature), or model space (i.e., different energy functions).
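The paving mechanism can be illustrated on a one-dimensional toy problem. The Python sketch below assumes the linear choice f(H) = cH, takes the coordinate x itself as the order parameter q, and uses a hypothetical double-well potential; the step size, bin width and all other parameter values are illustrative.

```python
import math
import random
from collections import defaultdict


def energy(x):
    """Hypothetical double well: minima at x = -1 and +1, barrier of height 5 at x = 0."""
    return 5.0 * (x * x - 1.0) ** 2


def elp_run(n_steps, kT=0.1, c=0.1, step=0.1, bin_width=0.1, seed=0):
    """Energy landscape paving with effective energy E + c*H(q, t), where q = x (binned)."""
    rng = random.Random(seed)
    hist = defaultdict(int)  # time-dependent histogram H(q, t)
    x = -1.0                 # start in the left well
    trajectory = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        e_tilde_old = energy(x) + c * hist[math.floor(x / bin_width)]
        e_tilde_new = energy(x_new) + c * hist[math.floor(x_new / bin_width)]
        delta = e_tilde_new - e_tilde_old
        # Metropolis step with the effective energy of Eq. (5)
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            x = x_new
        hist[math.floor(x / bin_width)] += 1  # pave the bin currently occupied
        trajectory.append(x)
    return trajectory


traj = elp_run(20000)
print(min(traj), max(traj))
```

Because each visit raises the effective energy of the occupied bin, the walker fills up the starting well and then crosses the barrier, even though the barrier is many times k_B T at the chosen temperature.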

2.1.2 Random Walks in Order Parameter Space

We first consider generalized-ensemble techniques that realize random walks in order
parameter space, leading to a broad distribution of a pre-chosen physical quantity.
This allows one to sample both low- and high-energy states with sufficient probability.
For simplicity only ensembles that lead to flat distributions in one variable will be
considered. Extensions to higher dimensions are straightforward [11]. One of the
earliest realizations of this idea is umbrella sampling [12], but now more common
are multicanonical sampling [13] and methods derived from it. The first application of
these techniques to protein simulations can be found in Ref. [14], where a Monte
Carlo technique was used. Later, it was also adapted to molecular dynamics [15].
In multicanonical simulations, configurations with energy $E$ are assigned a weight
$w_{mu}(E)$ such that the distribution of energies

$$P_{mu}(E) \propto n(E)\, w_{mu}(E) = \text{const}, \tag{7}$$

where $n(E)$ is the spectral density. Since all energies appear with equal probability,
a free random walk in the energy space is enforced and the simulation can overcome
any entrapment in one of the many local minima. For a wide range of temperatures
it is now possible to obtain a canonical distribution by re-weighting techniques [16]:

$$P_B(T, E) \propto P_{mu}(E)\, w_{mu}^{-1}(E)\, e^{-\beta E}, \tag{8}$$

since a large range of energies is sampled. This allows one to calculate the expectation
value of any physical quantity $O$ at temperature $T$ by

$$\langle O \rangle_T = \frac{\int dE \, O(E)\, P_B(T, E)}{\int dE \, P_B(T, E)} . \tag{9}$$
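In practice, Eqs. (8) and (9) are applied to the time series of a multicanonical run: each sampled configuration k contributes with the factor w_mu^{-1}(E_k) e^{−βE_k}. The Python sketch below assumes that the logarithm of the multicanonical weight is available as a function. The toy system (three independent two-level units with n(E) = 1, 3, 3, 1), the idealized flat sample and all names are illustrative assumptions.

```python
import math


def reweighted_average(observables, energies, log_w_mu, beta):
    """Canonical average <O>_T from multicanonical samples, following Eqs. (8) and (9).

    Each sample k gets the factor exp(-beta*E_k) / w_mu(E_k), evaluated in log space
    and shifted by its maximum for numerical stability.
    """
    log_factors = [-beta * e - log_w_mu(e) for e in energies]
    shift = max(log_factors)
    factors = [math.exp(lf - shift) for lf in log_factors]
    return sum(o * f for o, f in zip(observables, factors)) / sum(factors)


# Toy system: energy levels E = 0..3 with degeneracies n(E) = 1, 3, 3, 1
degeneracy = {0: 1, 1: 3, 2: 3, 3: 1}


def log_w_mu(e):
    return -math.log(degeneracy[e])  # ideal multicanonical weight, w_mu = 1/n(E)


# Idealized flat multicanonical sample: every level visited equally often
samples = [0, 1, 2, 3] * 100
beta = 1.0
mean_energy = reweighted_average(samples, samples, log_w_mu, beta)
print(mean_energy)
```

For this toy system the exact canonical result is 3e^{−β}/(1 + e^{−β}), so the reweighted value can be checked directly.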

The drawback of multicanonical sampling is that the weights $w_{mu}(E) \propto n^{-1}(E)$
are not known a priori, and one needs estimates of them for a numerical simulation.
Calculation of the weights is usually done by an iterative procedure [14, 17, 18]. An
example is the so-called Wang-Landau sampling [19], where the transition probability
between two conformations with energies $E_1$ and $E_2$ is given by the ratio of the
(time-dependent) estimators $n(E)$ of the density of states

$$p(E_1 \to E_2) = \min\left[\frac{n(E_1)}{n(E_2)}, 1\right] . \tag{10}$$

Each time an energy level is visited, the estimator is updated according to

$$n(E) \to n(E)\, f, \tag{11}$$

where, initially, $n(E) = 1$ and $f = f_0 = e^1$. Once the desired energy range is
covered, the factor $f$ is refined,

$$f_1 = \sqrt{f_0}, \quad f_{n+1} = \sqrt{f_n}, \tag{12}$$

until $\ln f$ reaches some small value.
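The update rules of Eqs. (10) to (12) can be sketched in a few lines of Python. The example below uses a hypothetical toy model (independent two-level units, for which the exact density of states is a binomial coefficient) and works with ln n(E) to avoid overflow. The histogram flatness test (80 % of the mean) is a common convention that is assumed here; the text only requires that the desired energy range be covered before f is refined.

```python
import math
import random


def wang_landau(n_spins=8, ln_f_final=1e-3, flatness=0.8, check_every=2000, seed=1):
    """Estimate ln n(E) for n_spins two-level units, with E = number of excited units."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    ln_n = [0.0] * (n_spins + 1)   # ln n(E); n(E) = 1 initially
    ln_f = 1.0                     # f_0 = e^1
    while ln_f > ln_f_final:
        hist = [0] * (n_spins + 1)
        flat = False
        while not flat:
            for _ in range(check_every):
                i = rng.randrange(n_spins)
                new_energy = energy + (1 - 2 * spins[i])  # a flip changes E by +1 or -1
                # Eq. (10): accept with probability min(n(E_1)/n(E_2), 1)
                log_ratio = ln_n[energy] - ln_n[new_energy]
                if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                    spins[i] = 1 - spins[i]
                    energy = new_energy
                ln_n[energy] += ln_f                      # Eq. (11) in logarithmic form
                hist[energy] += 1
            flat = min(hist) >= flatness * sum(hist) / len(hist)
        ln_f *= 0.5                                       # Eq. (12): f -> sqrt(f)
    return [v - ln_n[0] for v in ln_n]                    # normalize so that n(0) = 1


ln_n_estimate = wang_landau()
print([round(v, 2) for v in ln_n_estimate])
```

The estimate can be compared with ln C(8, E); the residual error is controlled by the final value of ln f.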


In multicanonical simulations the computational effort increases with the number
of residues like $\approx N^4$ (when measured in Metropolis updates) [20]. In general, the
computational effort in simulations increases with $\approx X^2$, where $X$ is the variable in
which one wants a flat distribution. This is because generalized-ensemble simulations
realize by construction of the ensemble a 1D random walk in the chosen quantity
$X$. In the multicanonical algorithm the reaction coordinate $X$ is the potential energy,
$X = E$. Since $E \propto N^2$, the above scaling relation for the computational effort, $\approx N^4$,
is recovered. Hence, multicanonical sampling is not always the optimal
generalized-ensemble algorithm in protein simulations. A better scaling of the computer time
with the size of the molecule may be obtained by choosing a more appropriate reaction
coordinate for our ensemble than the energy.
This is the motivation behind the various other existing realizations of the
generalized-ensemble approach. All aim at sampling a broad range of energies so
that the simulation will overcome energy barriers and escape from local
minima. For instance, in Ref. [21] it was proposed that configurations be updated
according to a special choice of the Tsallis generalized mechanics formalism [22]
(the Tsallis parameter $q$ is chosen as $q = 1 + 1/n_F$):

$$w(E) = \left(1 + \frac{\beta (E - E_0)}{n_F}\right)^{-n_F} . \tag{13}$$

Here $E_0$ is an estimator for the ground-state energy and $n_F$ is the number of degrees of
freedom of the system. The weight reduces in the low-energy region to the canonical
Boltzmann weight $\exp(-\beta E)$. This is because $E - E_0 \to 0$ for $T \to 0$ ($\beta \to \infty$),
leading to $\beta (E - E_0)/n_F \ll 1$. On the other hand, high-energy regions are no longer
exponentially suppressed but only according to a power law, which enhances
excursions to high-energy regions.
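The two limits of Eq. (13) are easy to verify numerically. The Python sketch below compares the Tsallis-type weight with the Boltzmann weight measured relative to E_0; the parameter values (β = 1, E_0 = 0, n_F = 100, reduced units) are hypothetical.

```python
import math


def tsallis_weight(energy, e0, beta, n_f):
    """Eq. (13): w(E) = (1 + beta*(E - E0)/n_F)**(-n_F)."""
    return (1.0 + beta * (energy - e0) / n_f) ** (-n_f)


def boltzmann_weight(energy, e0, beta):
    """Canonical weight measured relative to E0: exp(-beta*(E - E0))."""
    return math.exp(-beta * (energy - e0))


beta, e0, n_f = 1.0, 0.0, 100   # hypothetical values in reduced units
for e in (0.1, 1.0, 10.0, 100.0):
    print(e, tsallis_weight(e, e0, beta, n_f), boltzmann_weight(e, e0, beta))
```

Close to E_0 the two weights agree, while at E − E_0 = 100 the power-law weight exceeds the exponential one by many orders of magnitude.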
In stochastic tunneling [23], conformations are weighted by $w(E) = \exp(f(E)/k_B T)$.
Here, $f(E)$ is a non-linear transformation of the potential energy onto the
interval [0, 1] and $T$ is a low temperature. The energy in the stochastic tunneling
technique is transformed dynamically, dependent on the simulation history. The
transformation is designed so that the system is automatically cooled down near the
local minima and heated up in the high-energy region, allowing efficient tunneling
through the barriers [23]. Such a transformation can be realized by

$$f(E) = e^{-(E - E_0)/n_F}, \tag{14}$$

where $E_0$ is again an estimate of the ground state and $n_F$ is the number of degrees
of freedom of the system. Note that the location of all minima is preserved. The
efficiency of this algorithm for protein-folding simulations was demonstrated in
Ref. [24]. As a broad range of energies is sampled, one can again use re-weighting
techniques [16] to calculate thermodynamic quantities over a large range of
temperatures. In contrast to other generalized-ensemble techniques, the weights are explicitly
given. One needs only to find an estimator for the ground-state energy $E_0$, which is
easier than the determination of weights for other generalized ensembles.
In the context of molecular dynamics the generalized-ensemble idea is utilized in
the metadynamics method, where Gaussian-shaped repulsive potentials

$$U_{bias}(s, t) = \sum_{t_i} h \exp\left(-\frac{|s - s(t_i)|^2}{2w^2}\right)$$

are added iteratively to the energy function. The parameters
$h$ and $w$ determine the size and shape of the Gaussians centered at the updated points $s(t_i)$
of the reaction coordinates in order to discourage the system from revisiting the
configurations [25]. The overall contribution from these auxiliary potentials flattens
the underlying curvatures of the free energy wells, therefore leading to a random
walk. The original free energy potentials are recovered by $-U_{bias}(s, t)$.
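The bookkeeping behind the history-dependent bias is simple. The Python sketch below stores the deposited Gaussian centers for a single reaction coordinate and evaluates the bias and the resulting force. The class name and the values of h and w are hypothetical, and coupling to an actual molecular dynamics integrator is left out.

```python
import math


class GaussianBias:
    """U_bias(s, t) = sum_i h * exp(-(s - s_i)**2 / (2 * w**2)) for one reaction coordinate."""

    def __init__(self, height, width):
        self.height = height
        self.width = width
        self.centers = []   # values s(t_i) at which Gaussians have been deposited

    def deposit(self, s):
        self.centers.append(s)

    def value(self, s):
        return sum(self.height * math.exp(-(s - c) ** 2 / (2.0 * self.width ** 2))
                   for c in self.centers)

    def force(self, s):
        """Minus the derivative of the bias with respect to s (repulsive from each center)."""
        return sum(self.height * (s - c) / self.width ** 2
                   * math.exp(-(s - c) ** 2 / (2.0 * self.width ** 2))
                   for c in self.centers)


bias = GaussianBias(height=0.5, width=0.2)   # hypothetical h and w
for visited in (-1.0, -1.0, -0.9):
    bias.deposit(visited)
free_energy_estimate = -bias.value(-1.0)     # estimate of the free energy as -U_bias(s, t)
print(free_energy_estimate, bias.force(-1.0))
```

In a simulation, `deposit` would be called at regular intervals with the current value of the reaction coordinate, and `force` would be added to the physical force acting along that coordinate.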

2.1.3 Random Walks in Control Parameter Space

Another way of generating a generalized ensemble is through enforcing in the
simulation a random walk in a control parameter, most often temperature. For instance,
in simulated tempering, temperature is treated as an independent dynamic variable
[26] and is sampled uniformly by updating both temperature and configuration with
a weight:

$$w_{ST}(T, E) = e^{-E/k_B T - g(T)} . \tag{15}$$

Here, the function $g(T)$ is chosen so that the probability distribution of temperature
is given by

$$P_{ST}(T) = \int dE \, n(E)\, e^{-E/k_B T - g(T)} = \text{const}. \tag{16}$$

Physical quantities have to be sampled for each temperature point separately, and
expectation values at intermediate temperatures are calculated by re-weighting
techniques [16].
As with the previously discussed generalized-ensemble methods, the weight
$w_{ST}(T, E)$ is not known a priori, since it requires knowledge of the parameters
$g(T)$, and their estimator has to be calculated. It can again be obtained by an iterative
procedure. In the simplest version the improved estimator $g^{(i)}(T)$ for the $i$-th
iteration is calculated from the histogram of the temperature distribution $H_{ST}^{(i-1)}(T)$ of
the preceding simulation as follows:

$$g^{(i)}(T) = g^{(i-1)}(T) + \log H_{ST}^{(i-1)}(T). \tag{17}$$

In this procedure one uses that the histogram entering the $i$-th iteration is given by

$$H_{ST}^{(i-1)}(T) = e^{-g^{(i-1)}(T)} Z_i(T), \tag{18}$$

where $Z_i(T) = \int dE \, n(E) \exp(-E/k_B T)$ is an estimate for the canonical partition
function at temperature $T$. Setting $\exp(g^{(i)}(T)) = Z_i(T)$ leads to the iterative
relationship of Eq. (17).
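The logic of the iteration in Eq. (17) can be checked on a system whose partition function is known. The Python sketch below uses a hypothetical toy density of states with k_B = 1 and replaces the sampled histogram by the exact temperature distribution, so that a single update already yields a flat distribution. In a real simulation the histogram is noisy and several iterations are needed.

```python
import math

# Toy density of states (hypothetical): E = 0..3 with n(E) = 1, 3, 3, 1; k_B = 1
DENSITY = {0: 1.0, 1: 3.0, 2: 3.0, 3: 1.0}
TEMPERATURES = [0.5, 1.0, 2.0, 4.0]


def partition_function(temperature):
    """Z(T) = sum_E n(E) exp(-E / T)."""
    return sum(n * math.exp(-e / temperature) for e, n in DENSITY.items())


def temperature_distribution(g):
    """P_ST(T) proportional to exp(-g(T)) Z(T), Eq. (16), normalized over the temperature list."""
    p = [math.exp(-g_t) * partition_function(t) for g_t, t in zip(g, TEMPERATURES)]
    total = sum(p)
    return [x / total for x in p]


def update_g(g, histogram):
    """Eq. (17): g_new(T) = g_old(T) + log H_ST(T)."""
    return [g_t + math.log(h) for g_t, h in zip(g, histogram)]


g0 = [0.0] * len(TEMPERATURES)
h0 = temperature_distribution(g0)   # idealized, noise-free histogram of the first run
g1 = update_g(g0, h0)
p1 = temperature_distribution(g1)   # temperature distribution after the update
print(p1)
```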

It is easy to see that the factor $g(T)$ drops out once one considers more than one
copy of the system. This is the idea behind the replica exchange method (or parallel
tempering) [27], which was first applied to protein science in Ref. [28]. Assuming
we have $N$ non-interacting replicas of the molecule, each at a different temperature
$T_i$, standard MC or MD moves are performed in parallel and independently at these
$N$ temperatures. At certain time points, conformational exchanges are attempted between
neighboring temperatures $T_i$ and $T_{i+1}$, and the exchange moves are accepted
with probability

$$w(C_{old} \to C_{new}) = \min\left(1, \exp\left(-\beta_i E(C_j) - \beta_j E(C_i) + \beta_i E(C_i) + \beta_j E(C_j)\right)\right) \tag{19}$$

$$= \min\left(1, \exp(\Delta\beta\, \Delta E)\right), \tag{20}$$

where $\Delta\beta = \beta_j - \beta_i$ and $\Delta E = E(C_j) - E(C_i)$.

The result of the exchange of conformations is the faster convergence of the Markov
chain than in regular canonical simulations since the resulting random walk in tem-
peratures allows the configurations to move out of local minima and to cross energy
barriers. Hence, the temperature distribution should be chosen such that any relevant
energy barrier can be crossed at the highest temperature.
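The exchange step of Eqs. (19) and (20) translates directly into code. The Python sketch below implements the acceptance probability and a swap attempt between neighboring temperatures. The function names, the list-based bookkeeping and the example energies are hypothetical; `K_B` is the Boltzmann constant in kcal/(mol K).

```python
import math
import random


def exchange_probability(beta_i, beta_j, energy_i, energy_j):
    """Eqs. (19)-(20): min(1, exp(dbeta * dE)), dbeta = beta_j - beta_i, dE = E(C_j) - E(C_i)."""
    exponent = (beta_j - beta_i) * (energy_j - energy_i)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_exchange(betas, energies, configs, i, rng=random):
    """Try to swap the configurations held at neighboring temperatures i and i + 1."""
    p = exchange_probability(betas[i], betas[i + 1], energies[i], energies[i + 1])
    if rng.random() < p:
        configs[i], configs[i + 1] = configs[i + 1], configs[i]
        energies[i], energies[i + 1] = energies[i + 1], energies[i]
        return True
    return False


K_B = 0.0019872041   # kcal/(mol K)
betas = [1.0 / (K_B * t) for t in (300.0, 308.0)]
# Hypothetical potential energies (kcal/mol) of the two replicas
p_swap = exchange_probability(betas[0], betas[1], energy_i=-120.0, energy_j=-118.0)
print(p_swap)
```

If the configuration at the higher temperature happens to have the lower energy, the exponent is positive and the swap is always accepted.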
There is no clear consensus on the optimal frequency of exchange attempts. One
opinion is that exchanges should be performed often, but no more often than the
potential energy autocorrelation time [29, 30]. The other argument is that exchange
moves should be attempted every few steps [31, 32]. It has been also suggested to
use multiplexed layers of replicas (n layers, each with M temperatures). In this mul-
tiplexed replica exchange method, replicas are exchanged both within and between
layers [33]. This offers a way of using more computing units on massively parallel
computers without the need of adding more temperatures.
Expectation values of a physical quantity $A$ are calculated as usual according to:

$$\langle A \rangle_{T_i} = \frac{1}{MES} \sum_{k}^{MES} A(C_i(k)), \tag{21}$$

where MES is the number of measurements taken for the i-th temperature. Values
for intermediate temperatures are calculated using reweighting techniques [16]. Note
that parallel tempering does not require Boltzmann weights. The method can be
combined easily with generalized-ensemble techniques [28]. Obviously, the method
is also not restricted to temperature but can be used with any control parameter, for
instance, pH [34] or pressure.

2.1.4 Random Walks in Model Space

Finally, one can enhance sampling of low energy configurations also by performing
a random walk through an ensemble of systems with altered energy functions. In
that way, information is exchanged between varying stages of coarse graining or
different local environments. This is the idea behind “model hopping” [35], the “Hamiltonian
exchange method” [36] and related approaches [37]. Consider, for instance, that the
energy function can be separated into two terms: $E = E_A + a E_B$. As in parallel
tempering, “model hopping” considers N non-interacting copies of the molecule,
but adjacent copies are now exchanged with probability

$$
\begin{aligned}
w(C_{old} \to C_{new}) = \min\bigl(1, \exp\{-\beta[ & E_A(C_j) + a_i E_B(C_j) + E_A(C_i) + a_j E_B(C_i) \qquad (22) \\
& - E_A(C_i) - a_i E_B(C_i) - E_A(C_j) - a_j E_B(C_j)]\}\bigr). \qquad (23)
\end{aligned}
$$

Here, $\Delta a = a_j - a_i$ and $\Delta E_B = E_B(C_j) - E_B(C_i)$, in terms of which the exponent
simplifies to $\beta\,\Delta a\,\Delta E_B$. Configurations perform a random walk on a ladder of
models with $a_1 = 1 > a_2 > a_3 > \cdots > a_N$ that differ by
the relative contributions of $E_B$ to the total energy E of the molecule.
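The exchange rule above can be illustrated with a minimal Python sketch. It assumes the simplified exponent β Δa ΔE_B; the function and argument names are hypothetical and are not taken from Ref. [35].

```python
import math
import random


def model_hopping_acceptance(beta, a_i, a_j, e_b_i, e_b_j):
    """Probability of swapping configurations C_i and C_j between the models
    with couplings a_i and a_j: min(1, exp(beta * delta_a * delta_E_B))."""
    delta_a = a_j - a_i
    delta_e_b = e_b_j - e_b_i  # E_B(C_j) - E_B(C_i)
    exponent = beta * delta_a * delta_e_b
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_swap(beta, a_i, a_j, e_b_i, e_b_j, rng=random.random):
    """Metropolis decision for one attempted swap (True means accept)."""
    return rng() < model_hopping_acceptance(beta, a_i, a_j, e_b_i, e_b_j)
```

Only the E_B energies of the two configurations enter, because the E_A terms cancel in the exponent.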
Take as an example the barriers in the energy landscape of proteins that arise
from van der Waals repulsion between atoms that come too close. Assuming that
such barriers are a main reason for slow sampling in protein simulations, we have
considered a version of “model hopping” where the contributions from the van der
Waals energy become successively smaller. While the “physical” system is on one
side of the ladder (at $a_1 = 1$), the non-physical model at the other end of the ladder
(at $a_N \ll 1$) allows, in the extreme case, atoms to share the same position in space. As the
protein “tunnels” through van der Waals energy barriers, sampling of low-energy
configurations is enhanced in the “physical” model (at $a_1 = 1$). With this realization
of “model hopping” we have “predicted” the structure of a 46-residue protein A in
an all-atom simulation to within a root-mean-square deviation (rmsd) of 3.2 Å [35].
Model Hopping also allows guiding a simulation by information obtained from
homologous structures [38]. Usually, such spatial constraints introduce an additional
roughness into the energy landscape which often leads to extremely slow convergence
of the simulation. This problem is circumvented in our approach through a random
walk in an ensemble of replicas that differ by the strength of the constraints which
are coupled to the system. We have demonstrated the usefulness of this approach on
some examples of the CASP6 competition [38].

2.2 Advancing Generalized-Ensemble Techniques

While there has been much progress in advancing the generalized-ensemble approach,
folding simulations are still limited in their scope. Aggregation, oligomer assembly
and intra-oligomer conformational rearrangements are examples of processes that
need faster algorithms: even for relatively simple systems such as polyglutamine
repeats, sampling poses a formidable challenge [39, 40]. The importance
and severity of the problem motivate our search for further methodological advances.
2.2.1 Improving the Efficiency of Generalized-Ensemble Sampling

The computational efficiency of replica-exchange and generalized-ensemble techniques
is often worse than their theoretical optimum. The reason for this sub-optimal
efficiency is bottlenecks and barriers that lead to slow relaxation. In
parallel tempering, convergence is evaluated by the frequency of statistically
independent configurations at the lowest temperature. A lower bound for this number is the
rate of round trips $n_{rt}$ between the lowest and highest temperatures, $T_1$ and $T_N$. We
define $n_{up}(i)$ and $n_{dn}(i)$ as the number of replicas at temperature $T_i$ that came from
$T_1$ and $T_N$, respectively. The fraction of replicas moving up is given by:

$$f_{up}(i) = \frac{n_{up}(i)}{n_{up}(i) + n_{dn}(i)} \tag{24}$$

and describes the probability of stationary flow between temperatures $T_1$ and $T_N$.
Maximizing the number of round trips $n_{rt}$ results in a linear flow distribution [41]:

$$f_{up}^{opt}(i) = i/N \tag{25}$$
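As an illustration of how the flow fraction of Eq. 24 and the number of round trips might be measured, the following Python sketch labels each replica by the terminal temperature it visited last. The data layout (one list of temperature indices per replica, with index 0 for $T_1$ and index N-1 for $T_N$) and the function names are assumptions made for this example only.

```python
def flow_fraction(trajectories, n_temps):
    """Estimate f_up(i) = n_up(i) / (n_up(i) + n_dn(i)).

    trajectories[r][t] is the temperature index of replica r at step t.
    A replica counts as "up" after visiting index 0 and as "dn" after
    visiting index n_temps - 1; unlabeled replicas are not counted.
    """
    n_up = [0] * n_temps
    n_dn = [0] * n_temps
    for traj in trajectories:
        label = None
        for i in traj:
            if i == 0:
                label = "up"
            elif i == n_temps - 1:
                label = "dn"
            if label == "up":
                n_up[i] += 1
            elif label == "dn":
                n_dn[i] += 1
    return [
        n_up[i] / (n_up[i] + n_dn[i]) if n_up[i] + n_dn[i] > 0 else None
        for i in range(n_temps)
    ]


def count_round_trips(traj, n_temps):
    """Count completed trips lowest -> highest -> lowest temperature."""
    trips = 0
    seen_bottom = False
    reached_top = False
    for i in traj:
        if i == 0:
            if seen_bottom and reached_top:
                trips += 1
            seen_bottom = True
            reached_top = False
        elif i == n_temps - 1 and seen_bottom:
            reached_top = True
    return trips
```

Temperatures that no labeled replica has visited yet are reported as `None` rather than as a fraction.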

Explicit solvent simulations of proteins are dominated by the water molecules.
As a result, the heat capacity C is constant, and the system can be approximated by
a D = 2C harmonic oscillator. Based on this approximation, one finds that the
optimal temperature distribution is the one with the number of replicas given by

$$N^{opt} \approx 1 + 0.594 \sqrt{C} \ln(T_{max}/T_{min}) \tag{26}$$

and the temperatures distributed according to

$$T_i^{opt} = T_{min} \left(\frac{T_{max}}{T_{min}}\right)^{\frac{i-1}{N-1}}, \tag{27}$$

where $T_{max}$ is the highest temperature and $T_{min}$ is the lowest temperature. Both quantities
have to be chosen in advance [42].
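A short Python sketch of this recipe follows. It assumes that the prefactor 0.594 multiplies the square root of the heat capacity C, with C taken as a dimensionless number (in units of $k_B$), and that the temperatures form the geometric ladder of Eq. 27. The function names are hypothetical.

```python
import math


def optimal_replica_count(heat_capacity, t_min, t_max):
    """Estimate 1 + 0.594 * sqrt(C) * ln(T_max / T_min).

    Assumption: heat_capacity is dimensionless (units of k_B). The result is
    a real number; round it up to obtain an integer number of replicas.
    """
    return 1.0 + 0.594 * math.sqrt(heat_capacity) * math.log(t_max / t_min)


def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures T_i = T_min * (T_max / T_min) ** ((i - 1) / (N - 1))."""
    if n_replicas < 2:
        raise ValueError("need at least two replicas")
    ratio = t_max / t_min
    # The loop index starts at 0, so the exponent is i / (N - 1).
    return [t_min * ratio ** (i / (n_replicas - 1)) for i in range(n_replicas)]
```

For example, `geometric_ladder(300.0, 600.0, 3)` places the middle temperature at 300·√2 K, so that neighboring temperatures always have the same ratio.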
If the relaxation at a particular temperature is slower than hopping in temperature,
the state space partitions into disjoint free energy basins forming a tree-like
hierarchical network. Because of this broken ergodicity an optimized temperature
distribution needs to be found iteratively [43],

$$\int_{T_1}^{T_j^{k}} \eta^{(opt)}(T)\, dT = j/N, \tag{28}$$

where 1 < j < N and k marks the iteration. The two terminal temperatures $T_1$ and
$T_N$ are kept fixed, and

$$\eta^{(opt)}(T) = C' \sqrt{\frac{1}{\Delta T} \frac{df}{dT}}, \tag{29}$$

with the normalization constant $C'$ chosen so that

$$\int_{T_1}^{T_N} \eta^{(opt)}(T)\, dT = 1. \tag{30}$$

This will again lead to a linear flow distribution, but the acceptance probabilities are
no longer constant. One can also show that in the case of broken ergodicity
weight optimization of flow through order parameter space (for instance, energy)
leads to a distribution that is no longer flat [41, 43].
A direct measurement of the flow distribution is computationally costly as indi-
vidual replicas have to cross the full ladder of nodes many times. Such “tunneling”
events are especially rare in early stages of the control parameter optimization when
round trip times are largest. For this reason, we have proposed to estimate the flow
distribution from measurements of mean first passage times of replicas crossing only
part of the ladder. In our simulations, this procedure led to temperature sets that are
more stable upon iteration than those from flows measured directly [44].
Traditionally, the temperature replica exchange method is implemented with
synchronous exchanges, which has been a major limiting factor for its efficiency.
Synchronizing the attempted moves wastes computation time because the periodic
synchronization causes the overall simulation to run at the speed of the slowest
processor, and the centralized coordination step does not scale to many processors.
Asynchronous replica exchange attempts to escape this problem by performing
exchange moves for pairs of replicas independently of the other replicas, thereby
removing the need for the processor synchronization found in conventional synchronous
implementations [45]. Because it does not involve a centralized synchronization step,
the algorithm is scalable to an arbitrary number of processors and is not limited
by the slowest processor. The method is suitable for integration in dynamic
simulation environments, such as computational grids, in which processors dynamically
join and leave the calculation [45].

2.2.2 Velocity-Rescaling Improved Replica Exchange Molecular Dynamics

In a molecular dynamics simulation, the energy

$$E(x, v) = E_{pot}(x) + E_{kin}(v) \quad\text{with}\quad E_{kin}(v) = \frac{1}{2} \sum_i m_i v_i^2 \tag{31}$$

is the sum of the potential energy E pot , which depends only on the coordinates x,
and the kinetic energy E kin that is solely a function of the velocities v. Scaling all
velocities by a factor r changes the kinetic energy by:

$$E_{kin}(r v) = r^2 E_{kin}(v). \tag{32}$$

In standard replica exchange molecular dynamics this relation is used by scaling the
velocities after a successful exchange with a factor [46]

$$r_{(1,2)} = \sqrt{T_{(2,1)}/T_{(1,2)}}, \tag{33}$$

that depends on the temperatures $T_1$ and $T_2$ of the two replicas that are exchanged. The
rescaling of the velocities leads to $v_{(1,2)}^{new} = v_{(2,1)}^{old}$, and therefore $\Delta E_{kin} = 0$. Hence,
the probability for an exchange is given only by the difference of potential energies
of the two replicas

$$w(1 \leftrightarrow 2) = \exp(\Delta\beta\, \Delta E_{pot}). \tag{34}$$
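A minimal Python sketch of such an exchange attempt is given below. It makes several assumptions: the probability is capped at one (Metropolis form), Δβ = β_2 − β_1 and ΔE_pot = E_pot,2 − E_pot,1, velocities are plain lists of components, and energies are in kcal/mol. All names are hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def remd_exchange(t1, t2, e_pot1, e_pot2, v1, v2, rng=random.random):
    """Attempt to swap the configurations held at temperatures t1 and t2.

    Returns (accepted, new_v1, new_v2), where new_v1 are the velocities of
    configuration 1 after it has moved to t2 (and vice versa for new_v2).
    """
    beta1 = 1.0 / (K_B * t1)
    beta2 = 1.0 / (K_B * t2)
    exponent = (beta2 - beta1) * (e_pot2 - e_pot1)
    if exponent < 0.0 and rng() >= math.exp(exponent):
        return False, v1, v2
    r1 = math.sqrt(t2 / t1)  # configuration 1 moves from t1 to t2
    r2 = math.sqrt(t1 / t2)  # configuration 2 moves from t2 to t1
    return True, [r1 * v for v in v1], [r2 * v for v in v2]
```

Because the kinetic energy scales with the square of the factor, the kinetic energy of each configuration is multiplied by the ratio of its new to its old temperature.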

Microcanonical replica exchange simulations call for a different scaling [47, 48].
By definition of the ensemble, one has to ensure that $\Delta E = 0$. Assuming $E_1 < E_2$,
and scaling parameters $r_1$ and $r_2$ given by

$$r_{(1,2)} = \sqrt{\frac{E_{(2,1)} - E_{pot}(x_{(1,2)})}{E_{(1,2)} - E_{pot}(x_{(1,2)})}} = \sqrt{\frac{E_{kin}(v_{(2,1)}) \pm \Delta E_{pot}}{E_{kin}(v_{(1,2)})}}, \tag{35}$$

two configurations are exchanged with probability one:

$$E_1(x_1, v_1) = E_{pot}(x_1) + E_{kin}(v_1) = E_{pot}(x_2) + r_2^2 E_{kin}(v_2) \tag{36}$$

and

$$E_2(x_2, v_2) = E_{pot}(x_2) + E_{kin}(v_2) = E_{pot}(x_1) + r_1^2 E_{kin}(v_1). \tag{37}$$

Such rejection-free moves are possible for $E_{pot}(x_2) < E_1$, a restriction that does
not violate detailed balance. Molecular dynamics time evolution between exchange
moves ensures ergodicity. Hence, for sufficiently long simulation times the sampling
will lead to the correct distribution:

$$P(E_{pot}; E) \propto \Omega_{pot}(E_{pot})\, E_{kin}^{n_f/2}, \tag{38}$$

where Ω is the density of states and $n_f$ is the number of degrees of freedom.
The above scaling leading to rejection-free sampling has been used in Ref. [48]
to study the trp-cage protein with an implicit solvent. However, this approach is
not restricted to microcanonical simulations. Instead, it can be generalized to the
more commonly used canonical ensemble without changes of the functional form of
Eq. 35.
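The energy bookkeeping behind this rejection-free move can be sketched in Python as follows. The sketch assumes that each scaling factor is the square root of the ratio of the required to the current kinetic energy, which is what Eqs. 36 and 37 demand; the function name is hypothetical.

```python
import math


def microcanonical_scaling(e_pot1, e_kin1, e_pot2, e_kin2):
    """Return (r1, r2) for a rejection-free swap, or None if impossible.

    After the swap, configuration 1 must carry total energy E2 and
    configuration 2 must carry total energy E1.
    """
    e1 = e_pot1 + e_kin1
    e2 = e_pot2 + e_kin2
    new_kin1 = e2 - e_pot1  # kinetic energy configuration 1 needs at E2
    new_kin2 = e1 - e_pot2  # kinetic energy configuration 2 needs at E1
    if new_kin1 <= 0.0 or new_kin2 <= 0.0:
        return None  # e.g. E_pot(x2) >= E1: no velocity scaling can fix this
    return math.sqrt(new_kin1 / e_kin1), math.sqrt(new_kin2 / e_kin2)
```

With $E_{pot}(x_1) = -10$, $E_{kin}(v_1) = 4$, $E_{pot}(x_2) = -8$ and $E_{kin}(v_2) = 5$ (arbitrary units), the swap leaves replica 1 at $E_1 = -6$ and replica 2 at $E_2 = -3$.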
The search for more efficient replica exchange schemes is an active area of research
[49, 50], especially for the case of explicit solvent simulations of proteins [51, 52].
Inspired by Okur et al. [51] we have proposed in Ref. [53] to circumvent the problem
by a hybrid method. We assume that the energy of the system can be written
as

$$E = E_{pot} + E_{kin} \quad\text{with}\quad E_{pot} = P_{pp} + P_{pw} + P_{ww} \quad\text{and}\quad E_{kin} = K_p + K_w, \tag{39}$$

where $P_{pp}$ marks the contribution from interactions solely between atoms in the
protein, $P_{ww}$ denotes those arising from water-water interactions, and $P_{pw}$ stands
for water-protein interactions. Between exchange moves the system evolves with the
energy function given by Eq. 39. However, for exchange moves we utilize in addition
an implicit solvent term $P_{is}$ that is an approximation for $P_{ww} + P_{pw}$. The difference
between the two solvation terms is given by

$$H = P_{ww} + P_{pw} - P_{is}. \tag{40}$$

The “true” potential energy $E_{pot}$ can be approximated by a quantity $Q = P_{pp} + P_{is}$,
leading to:

$$E_{pot} = Q + H. \tag{41}$$
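Exchange moves are as usual accepted with probability

$$w(1 \leftrightarrow 2) = \min(1, \exp(D)) \quad\text{with}\quad D = \Delta\beta\,\Delta Q - \beta_1 \left(\hat{E}_{kin}^{(1)} - E_{kin}^{(1)} + \Delta H\right) - \beta_2 \left(\hat{E}_{kin}^{(2)} - E_{kin}^{(2)} - \Delta H\right), \tag{42}$$

where $E_{kin}^{(1)}$ and $\hat{E}_{kin}^{(1)}$ are the kinetic energies at temperature $T_1$ before and after an
exchange move, respectively. Rescaling the velocities according to

$$v^{(2)} \leftrightarrow \hat{v}^{(1)} = v^{(2)} \sqrt{\frac{E_{kin}^{(1)} - \Delta H}{E_{kin}^{(2)}}} \quad\text{and}\quad v^{(1)} \leftrightarrow \hat{v}^{(2)} = v^{(1)} \sqrt{\frac{E_{kin}^{(2)} + \Delta H}{E_{kin}^{(1)}}} \tag{43}$$

leads to

$$\hat{E}_{kin}^{(1)} = E_{kin}^{(1)} - \Delta H \quad\text{and}\quad \hat{E}_{kin}^{(2)} = E_{kin}^{(2)} + \Delta H. \tag{44}$$

Exchange moves are now accepted with a probability of the same form as in Okur
et al. [51]:

$$w(1 \leftrightarrow 2) = \min(1, \exp(\Delta\beta\,\Delta Q)) \quad\text{with}\quad Q = P_{pp} + P_{is}. \tag{45}$$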

However, the velocity rescaling improves on that method by relating the solvation
energy measured with the explicit solvent to the one calculated with the implicit
solvent. We have shown for the 20-residue Trp-cage protein that the number of replicas
in explicit solvent replica exchange molecular dynamics can be reduced from
40 to 10 [53]. As the contribution of solvent-solvent interactions increases
faster than the protein-protein and protein-solvent terms, one can expect a more dramatic
improvement for larger proteins. This makes it worthwhile to evaluate and improve velocity
rescaling as a way to advance explicit solvent simulations and other applications
of replica exchange.
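The cancellation that turns Eq. 42 into Eq. 45 can be checked numerically with the Python sketch below. The sign conventions for Δβ, ΔQ and ΔH are not spelled out here, so the code simply treats them as given numbers; the function names are hypothetical.

```python
import math


def rescaled_kinetic_energies(e_kin1, e_kin2, delta_h):
    """Kinetic energies at T1 and T2 after the rescaling of Eq. 44."""
    return e_kin1 - delta_h, e_kin2 + delta_h


def velocity_scale_factors(e_kin1, e_kin2, delta_h):
    """Factors of Eq. 43: the velocities arriving at T1 come from replica 2
    and vice versa. Requires e_kin1 > delta_h and e_kin2 > -delta_h."""
    new1, new2 = rescaled_kinetic_energies(e_kin1, e_kin2, delta_h)
    return math.sqrt(new1 / e_kin2), math.sqrt(new2 / e_kin1)


def exchange_exponent(beta1, beta2, delta_q, delta_h,
                      e_kin1, e_kin2, e_kin1_new, e_kin2_new):
    """Exponent D of Eq. 42, with delta_beta taken as beta2 - beta1 (assumption)."""
    delta_beta = beta2 - beta1
    return (delta_beta * delta_q
            - beta1 * (e_kin1_new - e_kin1 + delta_h)
            - beta2 * (e_kin2_new - e_kin2 - delta_h))
```

Inserting the rescaled kinetic energies makes both bracketed terms vanish, so that D reduces to ΔβΔQ whatever the value of ΔH.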

2.2.3 Replica-Exchange-with-Tunneling

A recent extension of the velocity rescaling idea is Replica-Exchange-with-Tunneling
(RET), which aims to “tunnel” through the unfavorable “transition state” generated
through the exchange move by a four-step procedure:
1. In the first step, the configurations A (B) evolve on two neighboring replicas over
a short microcanonical molecular dynamics trajectory to configurations A′ (B′),
without the total energies $E_1$ and $E_2$ changing on the two replicas. Note,
however, that while total energy is conserved, potential and kinetic energy will
interconvert on each of the two replicas.
2. Next, the configurations A′ and B′ are exchanged, and the associated velocities
are rescaled according to the following equations, such that the energy at each
replica (temperature) remains constant before and after the exchange, $E_1' = E_1$
and $E_2' = E_2$:

$$v_A'' = v_A' \sqrt{\frac{E_2 - E_{pot}(q_A')}{E_{kin}(v_A')}} \quad\text{and}\quad v_B'' = v_B' \sqrt{\frac{E_1 - E_{pot}(q_B')}{E_{kin}(v_B')}} \tag{46}$$

3. After the exchange, the two replicas evolve again by microcanonical molecular
dynamics. While the total energies $E_1$ and $E_2$ on the two replicas do not change,
the exchange between potential and kinetic energy will lead to final states $\hat{B}$ on
replica 1 and $\hat{A}$ on replica 2 that have potential energies comparable to the
corresponding configurations before the exchange move, and velocity distributions
as one would expect for the given temperatures at each replica.
4. The final configurations on each replica are now either accepted or rejected
according to the following Metropolis criterion:

$$\exp\left\{-\beta_1 \left(E_{pot}(\hat{q}_B) - E_{pot}(q_A)\right) - \beta_2 \left(E_{pot}(\hat{q}_A) - E_{pot}(q_B)\right)\right\} \quad\text{with}\quad \beta = 1/k_B T. \tag{47}$$

If rejected, molecular dynamics simulations continue with the original configurations
A (B). However, in both cases, new velocity distributions are randomly
drawn according to the temperatures on the respective replica.
The acceptance criterion of Eq. 47 in the final step of the RET move is derived by
writing the probability to find configurations with potential energy $E_{pot}(q_A)$ and total
energy $E_1$ as

$$P(E_{pot}(q_A), E_1) \propto \Omega(E_{pot}(q_A)) \times E_{kin}(v_A)^{3N/2} = \Omega(E_{pot}(q_A)) \times \left(E_1 - E_{pot}(q_A)\right)^{3N/2}, \tag{48}$$

with N the number of particles and $\Omega(E_{pot}(q_A))$ the density of states with potential
energy $E_{pot}$. As the total energy at $T_1$ and $T_2$ is conserved, the acceptance probability
for the RET move is one. However, the Metropolis-Hastings algorithm, which ensures
convergence to the correct distribution, requires the product of acceptance and proposal
probability. The latter is the probability to start at temperature $T_1$ ($T_2$) in a
configuration with coordinates $q_A$ ($q_B$) and to pick a configuration with coordinates
$\hat{q}_B$ ($\hat{q}_A$), and is given by

$$\left(\frac{E_1 - E_{pot}(\hat{q}_B)}{E_1 - E_{pot}(q_A)}\right)^{3N/2} \times \left(\frac{E_2 - E_{pot}(\hat{q}_A)}{E_2 - E_{pot}(q_B)}\right)^{3N/2}. \tag{49}$$

Hence, the Metropolis-Hastings criterion for accepting the RET move is in general
given by:

$$w(C^{old} \to C^{new}) = \min\left(1, \left(\frac{E_1 - E_{pot}(\hat{q}_B)}{E_1 - E_{pot}(q_A)}\right)^{3N/2} \times \left(\frac{E_2 - E_{pot}(\hat{q}_A)}{E_2 - E_{pot}(q_B)}\right)^{3N/2}\right). \tag{50}$$

This equation is cumbersome to evaluate. However, as both functions on the right side
of Eq. 48 grow strongly with their arguments, the distribution of potential energies
$P(E_{pot}, E)$ is for large N a sharply peaked function, and a saddle-point expansion
will lead to

$$P(E_{pot}, E) \propto \Omega(E_{pot}) \exp\left\{-\beta_E E_{pot} - \frac{3N}{2} \left(\frac{E_{pot} - \hat{E}_{pot}}{E - \hat{E}_{pot}}\right)^2 + O\left[\left(\frac{E_{pot} - \hat{E}_{pot}}{E - \hat{E}_{pot}}\right)^3\right]\right\}, \tag{51}$$

with the inverse microcanonical temperature $\beta_E = 1/k_B T_E = d \ln \Omega(E)/dE$ and
$\hat{E}_{pot}$ the most probable potential energy. Hence, for sufficiently large N and long
enough trajectories, the RET acceptance criterion of Eq. 50 reduces to Eq. 47 which
can be evaluated more easily [54].
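The relation between the two criteria can be made concrete with a Python sketch that evaluates both. Two assumptions are made: the probability of Eq. 47 is capped at one, and Eq. 50 is evaluated in log space to avoid overflow for large N. Function and argument names are hypothetical.

```python
import math


def ret_acceptance_approx(beta1, beta2, epot_a, epot_b, epot_a_hat, epot_b_hat):
    """Eq. 47, capped at one: only potential-energy differences enter."""
    exponent = (-beta1 * (epot_b_hat - epot_a)
                - beta2 * (epot_a_hat - epot_b))
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def ret_acceptance_exact(e1, e2, n_particles,
                         epot_a, epot_b, epot_a_hat, epot_b_hat):
    """Eq. 50: product of kinetic-energy ratios raised to the power 3N/2."""
    kin = (e1 - epot_b_hat, e1 - epot_a, e2 - epot_a_hat, e2 - epot_b)
    if min(kin) <= 0.0:
        raise ValueError("kinetic energies must be positive")
    log_ratio = 1.5 * n_particles * (math.log(kin[0] / kin[1])
                                     + math.log(kin[2] / kin[3]))
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
```

For N = 1000 particles with a kinetic energy of 3N/2 in units of $k_B T_1$ on replica 1, a potential-energy increase of one $k_B T_1$ gives nearly the same acceptance probability with both expressions, in line with the large-N argument above.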
We have shown in Ref. [54] through simulations of the trp-cage protein, an often
used toy model for evaluating new sampling techniques, that the RET move indeed
increases the flow of replicas through temperature by allowing the system to “tunnel”
through unfavorable “transition states” generated by the exchange move. Both regular
replica exchange molecular dynamics (REMD) and RET lead to the same thermodynamic
averages; but depending on the number of replicas we could achieve a twelve
times larger sampling efficiency for RET than seen in regular REMD. Thermalization
is especially faster for RET when too large a spacing in temperature leads, for
regular REMD, to very low acceptance rates. As described above, this is a persistent
problem in replica-exchange molecular dynamics of proteins in an explicit solvent,
where the large number of water molecules leads to the need for very small spacing
in temperature (and therefore a large number of replicas).

2.3 Multiscale Sampling

Another approach to enhance sampling of protein configurations is multiscale sampling.
Simplified or coarse-grained models lead by design to an energy landscape
with a reduced number of valleys, and often allow in addition for a much faster
evaluation of energies. A reduced model therefore makes it possible to observe
long-time-scale changes that would require an infeasible simulation time in all-atom
models. The coarse-grained potentials are designed to reproduce the thermodynamic
and structural properties of the corresponding all-atom system. However, the fine
details lost in coarse-grained models can be critical to the accurate description of
realistic molecular behavior. For example, structure prediction of a pathologically
important enzyme is usually performed with reduced models for a fast outcome,
but the subsequent drug screening requires more detail in the side-chain arrangement in
the active site. Multiscale simulations attempt to overcome this problem by combining
coarse-grained with all-atom simulations, altering the resolution of the system
studied in a stepwise way.
Obviously, combining different coarse-graining levels requires a scheme for back-mapping
to the detailed degrees of freedom. The difficulty of back-mapping is evident:
coarse graining largely averages over a fine-grained model, so the reverse
is not one-to-one but maps a single coarse-grained structure to a fine-grained
ensemble. The high-resolution ensemble generated in the normal back-mapping
mode does not necessarily have the correct statistical properties. As an extension of
parallel tempering, Zuckerman and coworkers developed the resolution exchange
algorithm in which several simulations of differing resolutions are conducted in
parallel and exchanges of configurations are attempted periodically between
neighboring resolutions [55]. Instead of using high temperature to smoothen the
rugged potential energy landscape, resolution exchange uses a coarse-grained model
to effectively sample the conformational space. The method guarantees canonical
sampling at the atomic resolution level by using the following exchange acceptance
criterion:

$$P_{RM} = \min\left[1, \frac{\pi_H(\phi_a, x_b)\, \pi_L(\phi_b)}{\pi_H(\phi_b, x_b)\, \pi_L(\phi_a)}\right] \tag{52}$$

The configuration of a coarse-grained model is described by a set of coordinates
φ and that of a fine model is described by a larger set of coordinates including
not only φ but also x, which represents the extra degrees of freedom. If the two
configurations before exchange are $\phi_a$ and $\{\phi_b, x_b\}$, the trial configurations are simply
$\phi_b$ and $\{\phi_a, x_b\}$. That is, only the coarse-grained part of the configuration is
subjected to exchange. Subscripts H and L denote high resolution and low resolution,
respectively, and the corresponding potential energies are $U_H$ and $U_L$. Then
the probability of the configurations a and b before exchange is the product of
the probability of having configuration a, $\pi_L = \exp(-\beta_L U_L(\phi_a))/Z_L$, and that of having b,
$\pi_H = \exp(-\beta_H U_H(\phi_b, x_b))/Z_H$. Similarly, the probability after exchange is the
product of $\pi_L = \exp(-\beta_L U_L(\phi_b))/Z_L$ and $\pi_H = \exp(-\beta_H U_H(\phi_a, x_b))/Z_H$, where $Z_H$
and $Z_L$ are partition functions. In sum, the exchange criterion can be written as
Eq. 52. The criterion satisfies detailed balance and therefore ensures the canonical
distribution at any resolution.
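Because the partition functions cancel in the ratio of Eq. 52, the criterion can be evaluated from four potential energies alone. The Python sketch below shows this; the function and argument names are hypothetical.

```python
import math


def resolution_exchange_acceptance(beta_h, beta_l,
                                   u_h_old, u_h_new, u_l_old, u_l_new):
    """Acceptance probability of Eq. 52.

    u_h_old = U_H(phi_b, x_b), u_h_new = U_H(phi_a, x_b),
    u_l_old = U_L(phi_a),      u_l_new = U_L(phi_b).
    Z_H and Z_L cancel, leaving a single Boltzmann-type exponent.
    """
    exponent = (-beta_h * (u_h_new - u_h_old)
                - beta_l * (u_l_new - u_l_old))
    return 1.0 if exponent >= 0.0 else math.exp(exponent)
```

A steric clash created by placing $\phi_a$ next to the unchanged fine coordinates $x_b$ appears as a large `u_h_new` and drives the probability toward zero, which is the practical problem discussed next.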
A practical problem of the resolution exchange method is that when the system
studied is larger than a dipeptide, trial exchanges are easily rejected.
Lyman et al. have found that the rejection rate depends on both the number and type of
the degrees of freedom of coordinates x. They employed an incremental coarse-graining
scheme that coarse-grains one residue at a time [56]. In between the finest
and most coarse-grained replica, hybrid models are used that are partially atomistic and
united for the rest. With this scheme the acceptance rate of exchanges becomes reasonably
high (from 0.09% to >2%). To tackle the same issue, Liu et al. used configurational-bias
Monte Carlo (CBMC) to reconstruct the nascent degrees of freedom [57]. The
position of the next interacting site is constructed using a look-ahead algorithm. A
set of trial positions is generated and each is assigned a weight $w_i = \exp(-\beta U_i)$.
The coordinates are selected based on the Rosenbluth factor, $w_i / \sum_i w_i$, and the
process is iterated until the last site is generated.
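The selection step for a single site can be sketched in Python as below. This is only an illustration of choosing among trial positions with probability $w_i / \sum_i w_i$; generating the trial positions and the acceptance test for the complete regrown chain are omitted, and the names are hypothetical.

```python
import math
import random


def select_trial(energies, beta, rng=random.random):
    """Pick one trial position with probability w_i / sum_j w_j.

    energies: potential energies U_i of the trial positions.
    Returns (chosen index, list of selection probabilities).
    """
    e_min = min(energies)
    # Shifting by e_min avoids overflow and cancels in the normalized ratio.
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    threshold = rng() * total
    cumulative = 0.0
    for index, w in enumerate(weights):
        cumulative += w
        if threshold < cumulative:
            return index, probabilities
    return len(weights) - 1, probabilities  # guard against round-off
```

Low-energy trial positions are favored, so the regrown fine-grained coordinates rarely clash with the rest of the structure.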
We have proposed to overcome the problem of vanishing acceptance in resolution
exchange simulations by utilizing our new “Replica-Exchange-with-Tunneling”
approach. For this purpose, we describe our system by a potential energy made of
three terms:

$$E_{pot} = E_{FG} + E_{CG} + \lambda E_\lambda. \tag{53}$$

The first term is the energy $E_{FG}$ of the system described by an all-atom model.
The second term $E_{CG}$ describes our system by a suitable coarse-grained model. The
fine-grained and coarse-grained models are coupled by a model-specific penalty term
$E_\lambda$, proposed in Ref. [58], that measures their similarity. The strength with which the two
models are coupled is set by a parameter λ that differs for each replica.
With the above set-up one can now build a ladder of replicas, starting with one
where λ = 0 and the fine-grained and coarse-grained models are independent, followed
by replicas with increasing values of λ, i.e., growing coupling between the two models.
While the energy of a replica is given by the joint expression of Eq. 53, replicas are
exchanged with a probability that depends only on the coupling term $E_\lambda$, i.e., Eq. 52
simplifies to the familiar-looking expression:

$$w(A \to B) = \min\left(1, e^{\beta (\Delta\lambda)(\Delta E_\lambda)}\right), \tag{54}$$

where $\Delta\lambda = \lambda_B - \lambda_A$ and $\Delta E_\lambda = E_\lambda(B) - E_\lambda(A)$. Exchange of the above-defined
multi-scale system already alleviates the problem of steric clashes in resolution
exchange, but the acceptance rates depend strongly on the spacing of the λ-parameters.
In order to avoid a prohibitively large number of replicas, we again use
the RET move to “tunnel” through an unfavorable “transition state” generated
by the exchange move. This procedure leads to increased acceptance rates, enhancing
in this way the flow of information between the fine-grained and coarse-grained models.
Note, however, that for analysis and generation of equilibrium configurations only
the λ = 0 replica is used, which is the one where the two models are not coupled.
We have tested the above ideas in preliminary simulations where we combined
Replica-Exchange-with-Tunneling with exchange moves between “physical” models
and those relying on Go-type force fields that bias toward distinct configurational
states of a protein. The degree of bias by a Go-model varies in our simulations with the
replica (usually by multiplying this contribution with a parameter λ), and the measurements
are made solely at a “physical” replica which has no contributions from the
Go-model (λ = 0). The problem here is again vanishing acceptance rates, especially
for exchanges with the λ = 0 replica unless a large number of replicas is concentrated
around it. This problem can be avoided by the RET move where now the velocities
are rescaled according to

$$v_A'' = v_A' \sqrt{\frac{E_2 - E_{phys}(q_A') - E_{Go}(q_A') - \lambda_2 E_\lambda(q_A')}{E_{kin}(v_A')}}$$

$$v_B'' = v_B' \sqrt{\frac{E_1 - E_{phys}(q_B') - E_{Go}(q_B') - \lambda_1 E_\lambda(q_B')}{E_{kin}(v_B')}}, \tag{55}$$

and RET moves are accepted with probability

$$\exp\left\{-\beta_1 \left(\Delta E_{phys}^{(1)} + \Delta E_{Go}^{(1)} + \lambda_1 \Delta E_\lambda^{(1)}\right) - \beta_2 \left(\Delta E_{phys}^{(2)} + \Delta E_{Go}^{(2)} + \lambda_2 \Delta E_\lambda^{(2)}\right)\right\}, \tag{56}$$

where $\Delta E_{phys}^{(1)} = E_{phys}(\hat{q}_B) - E_{phys}(q_A)$ and $\Delta E_{phys}^{(2)} = E_{phys}(\hat{q}_A) - E_{phys}(q_B)$;
$\Delta E_{Go}^{(i)}$ and $\Delta E_\lambda^{(i)}$ are defined accordingly. First examples showing the usefulness
of this approach can be found in Refs. [59, 60].

3 Recent Applications

Our group has a long-standing interest in mis-folding and aggregation of proteins. A
class of proteins where one would expect an increased danger of mis-folding are proteins
with an end-to-end β-sheet. This is because the N-terminal β-strand is synthesized
early on, but it cannot bind to the C-terminus before the chain is fully synthesized.
During this time there is a danger that the β-strand at the N-terminus interacts with
nearby molecules, leading to potentially harmful aggregates of incompletely folded
proteins. Using our advanced generalized-ensemble techniques we have recently
shown [61, 62] that the 49-residue C-terminal fragment CFr of the artificially designed Top7
[63, 64] avoids this risk by a “caching” mechanism that relies on chameleon behavior
of one of the terminal β-strands to facilitate folding. In the early phases of folding
the N-terminal residues are “cached” as part of the subsequent α-helix. Only after
the other parts of the molecule have folded into the correct structure do the
N-terminal residues unfold and refold to a strand that then forms, together with a C-terminal
hairpin, a three-stranded β-sheet. While “caching” is not in contradiction to the
funnel picture, it implies a rather complex energy landscape. We have shown further
that mutations which increase the propensity of forming strands and decrease that
of forming helices still lead to the same native structure, but by interfering with the
caching mechanism lead to reduced folding rates [65].
Another example is the possible mechanism by which the A629P (alanine to
proline) mutant of ATP7A causes Menkes disease (a hereditary copper deficiency
disease that in most cases leads to death in early childhood). The mutation is located in
the fourth (and C-terminal) strand of the β-sheet in the sixth domain. The isolated
domain consists of 75 residues, with the mutation at position 69, and exists in solution
as a monomer. As such it has been characterized by NMR for wild type and
mutant, both in the apo and the copper-binding form. Structural differences between
wild type and mutant are around 3 Å in root-mean-square deviation (rmsd), and
within the variations of the respective NMR ensembles. Hence, the question arises
by what mechanism the mutation leads to the outbreak of Menkes disease. Our
results indicate that the mutation does not have appreciable effects on the stability of
copper-bound states but rather destabilizes the characteristic end-to-end β-sheet [66].
The resulting transient unfolding leads to partial exposure of hydrophobic residues
that makes the mutant prone to degradation. In turn this leads to the low effective
concentration of the copper-transporting protein that is responsible for the pathology
of Menkes disease. We further show that the differences in the binding affinities
between the two terminal strands alter the folding mechanism for the mutant: the
secondary structure elements form contacts between each other in a different order
than in the wild type [67].
Recent applications of Replica-Exchange-with-Tunneling (RET) include our
investigations into the folding of the A and B domains of protein G. Both proteins fold
in a two-state way without detectable intermediates, similar to CI2. They share no
significant sequence homology and have different folds: GA is a three-helix bundle,
and GB an α-helix on top of a 4-stranded β-sheet. The group of Bryan and Orban
(University of Maryland) has systematically studied mutations of these two proteins
that increased the homology of the two proteins while preserving structure
and function [68]. The final mutants GA98 and GB98 differ by a single residue that
switches between the two folds. Our assumption is that the two proteins and their
mutants have both structures as local minima, with the sequence determining their
relative weight. We conjecture that the sequence of a protein encodes not only the
native fold but also other forms that either are important to the folding process (as
in the case of the caching mechanism in CFr) and the protein functions (changes
of protein structure upon binding), or reflect an evolutionary history (or future):
mutations can accumulate without changing structure and function of a protein until
a single mutation finally switches the fold. In the case of GA and GB this process
can be studied systematically by comparing the free-energy landscapes of the various
mutants. We have probed this assumption first with all-atom Go-model simulations
of both the GA and GB wild types and the GA98 and GB98 mutants [69], but recently
extended these investigations by using all-atom RET simulations [59]. Unlike previous
physics-based all-atom simulations, which failed to reproduce these differences, we
find for the proteins very different landscapes consistent with the experiments. This is
all the more astonishing as our simulations approximate the protein-solvent interaction
by an implicit solvent model. This suggests that the previous difficulties in simulating
these two proteins reported in recent papers are not so much due to insufficient
accuracy of the force fields (as was claimed) but to incomplete sampling.
In another application of replica exchange with tunneling (RET) we could simulate
formation and interconversion between fibril-like and barrel-like assemblies of the
amyloid-forming cylindrin peptide [60]. This success was possible because the RET
move leads to a faster walk between replicas where the system is biased toward fibril
assemblies and those where it is biased toward barrel-like aggregates. The net effect is
a more effective sampling of independent configurations at the replica where λ = 0,
i.e., where the physical model is not biased. We further increased the efficiency of our
approach by including information from all replicas. Hence, while at replicas with
λ ≠ 0 the physical model is biased toward either fibril or barrel structure, this bias is
accounted for and corrected through re-weighting to the λ = 0 replica. Both effects
allowed a detailed exploration of the free energy landscape of cylindrin assemblies,
which let us propose a mechanism for formation and interconversion of the various
assemblies. Its main element is that the transition between the two polymorphs does
not involve unfolding of the chains but only their dissociation and re-association.
Crucial for formation of the barrel-like oligomer is the salt bridge between K3 and D7,
which guides the association of the peptides into this form instead of the energetically
more favorable fibril.

4 Conclusion

Progress in the development of algorithms over the last three decades has extended the
size of peptides and proteins that are accessible in all-atom simulations, and has also
made it possible to pinpoint the remaining difficulties. The most important open problem
in present generalized-ensemble techniques is that they require careful tuning of
parameters. Unfortunately, there are no simple and universal rules for this tuning
toward optimal sampling. As the described techniques can only reduce the sampling
difficulties from an exponential scaling to a power law, it is necessary to have software
that is highly adapted to massively parallel computers and modern architectures such
as GPUs and cell processors. Further advancements in hardware and algorithms
may overcome the remaining sampling problems and establish the use of computer
simulations as a “microscope” to the point where whole cells can be explored in
silico.

Acknowledgements This article is an updated version of a review published in the first edition
of this book, adding new algorithmic developments and applications. We thank Nathan Bernhardt,
Yanjie Wei, Huilin Zang, Wei Wang, Wenhui Xi and Fatih Yasar for their contributions to work now
also reviewed here. Support by the National Science Foundation (research grants CHE-998174,
0313618, 0809002, 1266256) and the National Institutes of Health (GM62838) is acknowledged.

References

1. Lindorff-Larsen, K., Piana, S., Dror, R.O., Shaw, D.E.: How fast-folding proteins fold. Science
334, 517–520 (2011)
2. Chen, Y., Ding, F., Nie, H., Serohijos, A.W., Sharma, S., Wilcox, K.C., Yin, S., Dokholyan,
N.V.: Protein folding: then and now. Arch. Biochem. Biophys. 469, 4–19 (2007)
3. Daggett, V., Fersht, A.: Is there a unifying mechanism for protein folding? Trends Biochem.
Sci. 28, 18–25 (2003)
4. Daggett, V.: Molecular dynamics simulations of the protein unfolding/folding reaction. Acc.
Chem. Res. 35, 422–429 (2002)
5. Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195,
216–221 (1987)
6. Brass, A., Pendleton, B.J., Chen, Y., Robson, B.: Hybrid Monte Carlo simulation theory and
initial comparison with molecular dynamics. Biopolymers 33, 1307–1315 (1993)
7. Berg, B.A.: Metropolis importance sampling for rugged dynamical variables. Phys. Rev. Lett.
90, 180601 (2003)
8. Hansmann, U.H.E., Wille, L.: Global optimization by energy landscape paving. Phys. Rev.
Lett. 88, 068105 (2002)
9. Schug, A., Wenzel, W., Hansmann, U.H.E.: Energy landscape paving simulations of the
trp-cage protein. J. Chem. Phys. 122, 194711 (2005)
10. Hansmann, U.H.E., Okamoto, Y.: The generalized-ensemble approach for protein folding
simulations. In: Stauffer, D. (ed.) Annual Reviews in Computational Physics, pp. 129–157. World
Scientific, Singapore (1998)
11. Kumar, S., Payne, P., Vásquez, M.: Method for free-energy calculations using iterative
techniques. J. Comp. Chem. 17, 1269–1275 (1996)
12. Torrie, G.M., Valleau, J.P.: Nonphysical sampling distributions in Monte Carlo free-energy
estimation: umbrella sampling. J. Comp. Phys. 23, 187–199 (1977)
13. Berg, B.A., Neuhaus, T.: Multicanonical algorithms for first order phase transitions. Phys. Lett.
B 267, 249–253 (1991)
14. Hansmann, U.H.E., Okamoto, Y.: Prediction of peptide conformation by multicanonical
algorithm: a new approach to the multiple-minima problem. J. Comp. Chem. 14, 1333–1338 (1993)
15. Hansmann, U.H.E., Okamoto, Y., Eisenmenger, F.: Molecular dynamics, Langevin and hybrid
Monte Carlo simulations in a multicanonical ensemble. Chem. Phys. Lett. 259, 321–330 (1996)
16. Ferrenberg, A.M., Swendsen, R.H.: New Monte Carlo technique for studying phase transitions.
Phys. Rev. Lett. 61, 2635–2638 (1988). Optimized Monte Carlo data analysis. Phys. Rev. Lett.
63, 1195–1198 (1989)
17. Berg, B.A.: Markov chain Monte Carlo simulations and their statistical analysis. World Scien-
tific, Singapore (2004)
18. Hansmann, U.H.E., Okamoto, Y.: Comparative study of multicanonical and simulated anneal-
ing algorithms in the protein folding problem. Physica A 212, 415–437 (1994)
19. Wang, F.G., Landau, D.P.: Efficient, multiple-range random walk algorithm to calculate the
density of states. Phys. Rev. Lett. 86, 2050–2053 (2001)
278 W. Berhanu et al.

20. Hansmann, U.H.E., Okamoto, Y.: Finite-size scaling of helix-coil transitions in poly-alanine
studied by multicanonical simulations. J. Chem. Phys. 110, 1267–1276 (1999)
21. Hansmann, U.H.E., Okamoto, Y.: New Monte Carlo algorithms for protein folding. Curr. Opin.
Struct. Biol. 9, 177–184 (1999)
22. Curado, E.M.F., Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Phys. A:
Math. Gen. 27, 3663 (1994)
23. Wenzel, W., Hamacher, K.: Stochastic tunneling approach for global minimization of complex
potential energy landscapes. Phys. Rev. Lett. 82, 3003 (1999)
24. Hansmann, U.H.E.: Protein folding simulations in a deformed energy landscape. Eur. Phy. J.
B 12, 607–612 (1999)
25. Laio, A., Parrinello, M.: Escaping free-energy minima. Proc. Natl. Acad. Sci. USA 99, 12562–
12566 (2002)
26. Lyubartsev, A.P., Martinovski, A.A., Shevkunov, S.V., Vorontsov-Velyaminov, P.N.: New
approach to Monte Carlo calculations of the free energy: method of expanded ensembles.
J. Chem. Phys. 96, 1776–1783 (1992). Marinari, E., Parisi, G.: Simulated tempering: a new
Monte Carlo Scheme. Europhys. Lett. 19, 451–458 (1992)
27. Hukushima, K., Nemoto, K.: Exchange Monte Carlo method and applications to spin glass sim-
ulations. J. Phys. Soc. (Japan) 65, 1604–1608 (1996); Geyer, G.J., Thompson, E.A.: Annealing
Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assn. 90,
909–920 (1995)
28. Hansmann, U.H.E.: Parallel tempering algorithm for conformational studies of biological
molecules. Chem. Phys. Lett. 281, 140–150 (1997)
29. Periole, X., Mark, A.E.: Convergence and sampling efficiency of replica-exchange molecular
dynamic simulations of peptide folding in explicit solvent. J. Chem. Phys. 126, 014903 (2007)
30. Abraham, M.J., Gready, J.E.: Ensuring mixing efficiency of replica-exchange molecular
dynamics simulations. J. Chem. Theor. Comput. 4, 1119–1128 (2008)
31. Sindhikara, D.J., Emerson, D.J., Roitberg, A.E.: Exchange often and properly in replica
exchange molecular dynamics. J. Chem. Theor. Comput. 6, 2804–2808 (2010)
32. Sindhikara, D.J., Emerson, D.J., Roitberg, A.E.: Exchange frequency in replica exchange
molecular dynamics. J. Chem. Phys. 128, 10 (2008)
33. Rhee, Y.M., Pande, V.S.: Multiplexed-replica exchange molecular dynamics method for protein
folding simulation. Biophys. J. 84, 755–786 (2003)
34. Wallace, J.A., Shen, J.K.: Continuous constant pH molecular dynamics in explicit solvent with
pH-based replica exchange. J. Chem. Theor. Comput. 7, 2617–2629 (2011)
35. Kwak, W., Hansmann, U.H.E.: Efficient sampling of protein structures by model hopping.
Phys. Rev. Lett. 95, 138102 (2005)
36. Fukunishi, H., Watanabe, O., Takada, S.: On the Hamiltonian replica exchange method for
efficient sampling, of biomolecular systems: application to protein structure prediction. J.
Chem. Phys. 116, 9058–9067 (2002)
37. Sugita, Y., Kitao, A., Okamoto, Y.: Multidimensional replica-exchange method for free-energy
calculations. J. Chem. Phys. 113, 6042–6051 (2000)
38. Gront, D., Kolinski, A., Hansmann, U.H.E.: Exploring protein energy landscape with hierar-
chical clustering. Int. J. Quant. Chem. 105, 826 (2005)
39. Williamson, T.E., Vitalis, A., Crick, S.L., Pappu, R.V.: Modulation of polyglutamine confor-
mations and dimer formation by the N-terminus of huntingtin. J. Mol. Biol. 396, 1295–1309
(2010)
40. Vitalis, A., Pappu, R.V.: Assessing the contribution of heterogeneous distributions of oligomers
to aggregation mechanisms of polyglutamine peptides. Biophys. Chem. 159, 14–33 (2011)
41. Nadler, W., Hansmann, U.H.E.: Generalized ensemble and tempering simulations: a unified
view. Phys. Rev. E 75, 026109 (2007)
42. Nadler, W., Hansmann, U.H.E.: Optimized explicit-solvent replica-exchange molecular dynam-
ics from scratch. J. Phys. Chem. B 112, 10386 (2008)
43. Trebst, S., Troyer, M., Hansmann, U.H.E.: Optimized parallel tempering simulations of pro-
teins. J. Chem. Phys. 124, 174903 (2006)
Enhanced Sampling for Biomolecular Simulations 279

44. Nadler, W., Meinke, J.A., Hansmann, U.H.E.: Folding proteins by first-passage-times optimized
replica exchange. Phys. Rev. E 78, 061905 (2008)
45. Gallicchio, E., Levy, R.M., Parashar, M.: Asynchronous replica exchange for molecular sim-
ulations. J. Comput. Chem. 29, 788–794 (2008)
46. Sugita, Y., Okamoto, Y.: Replica-exchange molecular dynamics method for protein folding.
Chem. Phys. Lett. 314, 141–151 (1999)
47. Nadler, W., Hansmann, U.H.E.: Optimizing replica exchange moves for molecular dynamics.
Phys. Rev. E 76, 057102 (2007)
48. Kar, P., Nadler, W., Hansmann, U.H.E.: Microcanonical replica exchange molecular dynamics
simulation of proteins. Phys. Rev. E 80, 056703 (2009)
49. Kim, B., Hagen, M., Liu, P., Friesner, R.A., Berne, B.J.: Serial replica exchange. J. Phys. Chem.
B. 111, 1416–1423 (2007)
50. Lee, M., Olson, M.: Comparison of two adaptive temperature-based replica exchange methods
applied to a sharp phase transition of protein unfolding-folding. J. Chem. Phys. 134, 244111
(2011)
51. Okur, A., Wickstrom, L., Layten, M., Geney, R., Song, K., Hornak, V., Simmerling, C.:
Improved efficiency of replica exchange simulations through use of a hybrid explicit/implicit
solvation model. J. Chem. Theor. Comput. 2, 420–433 (2006)
52. Huang, X., Hagen, M., Kim, B., Friesner, R.A., Zhou, R., Berne, B.J.: Replica exchange with
solute tempering: efficiency in large scale systems. J. Phys. Chem. B 111, 5405–5410 (2007)
53. Wang, J., Zhu, W., Li, G., Hansmann, U.H.E.: Velocity-scaling for replica exchange simulations
of proteins in explicit solvent. J. Chem. Phys. 135, 084115 (2011)
54. Yaşar, F., Bernhardt, N.A., Hansmann, U.H.E.: Replica-exchange-with-tunneling for fast explo-
ration of protein landscapes. J. Chem. Phys. 143, 224102 (2015)
55. Lyman, E., Ytreberg, F.M., Zuckerman, D.M.: Resolution exchange simulation. Phys. Rev.
Lett. 96, 028105 (2006)
56. Lyman, E., Zuckerman, D.M.: Resolution exchange simulation with incremental coarsening.
J. Chem. Theor. Comput. 2, 656–666 (2006)
57. Liu, P., Shi, Q., Lyman, E., Both, G.A.: Reconstructing atomistic detail for coarse-grained
models with resolution exchange. J. Chem. Phys. 129, 114103 (2008)
58. Moritsugu, K., Terada, T., Kidera, A.: Scalable free energy calculation of proteins via multiscale
essential sampling. J. Chem. Phys. 133, 224105 (2010)
59. Bernhardt, N.A., Xi, W., Wang, W., Hansmann, U.H.E.: Simulating protein fold switching
by replica-exchange-with-tunneling. J. Chem. Theor. Comput. 12, 5656–5666 (2016); 13 393
(2017)
60. Zhang, H., Xi, W., Hansmann, U.H.E., Wei, Y.: Fibril-barrel transitions in cylindrin amyloids.
J. Chem. Theor. Comput. 13, 3936–3944 (2017)
61. Mohanty, S., Meinke, J.H., Zimmermann, O., Hansmann, U.H.E.: Simulation of top7-CFr: a
transient helix extension guides folding. Proc. Natl. Acad. Sci. U.S.A. 105, 8004–8007 (2008)
62. Mohanty, S., Hansmann, U.H.E.: Caching of a chameleon segment facilitates folding of a
protein with end-to-end β -sheet. J. Phys. Chem. B 112, 15134 (2008)
63. Kuhlman, B., Dantas, G., Ireton, G.C., Varani, G., Stoddard, B.L., Baker, D.: Design of a novel
globular protein fold with atomic level accuracy. Science 302, 1364–1368 (2003)
64. Dantas, G., Watters, A.L., Lunde, B.M., Eletr, Z.M., Isern, N.G., Roseman, T., Lipfert, J.,
Doniach, S., Tompa, M., Kuhlman, B., Stoddard, B.L., Varani, G., Baker, D.: Mis-translation
of a computationally designed protein yields an exceptionally stable homodimer: implications
for protein engineering and evolution. J. Mol. Biol. 362, 1004–1024 (2006)
65. Gaye, M.L., Hardwick, C., Kouza, M., Hansmann, U.H.E.: Chamelonicity and folding of the
C-fragment of TOP7. Eur. Phys. Let. 97, 68003 (2012)
66. Kouza, M., Gowtham, S., Seel, M., Hansmann, U.H.E.: A numerical investigation into possible
mechanisms by that the A629P mutant of ATP7A causes Menkes Disease. Phys. Chem. Chem.
Phys. 12, 11390–11397 (2010)
67. Jiang, P., Hansmann, U.H.E.: Modeling structural flexibility of proteins with Go-models. J.
Chem. Theor. Comput. 8, 2127–2133 (2012)
280 W. Berhanu et al.

68. Alexander, P., He, Y., Chen, Y., Orban, J., Bryan, P.: A minimal sequence code for switching
protein structure and function. Proc. Natl. Acad. Sci U.S.A. 106, 21149–21154 (2009)
69. Kouza, M., Hansmann, U.H.E.: Folding simulations of the A and B domains of protein G. J.
Phys. Chem. B. 116, 6645–6653 (2012)
Determination of Kinetics
and Thermodynamics of Biomolecular
Processes with Trajectory Fragments

Alfredo E. Cardenas

Abstract Trajectory fragments algorithms are a set of methods that partition the
relevant trajectory space between reactants and products into smaller regions of
phase space. Many short trajectories are launched to evaluate transition probabili-
ties between these regions. Each of the methods processes this short-trajectory data
with different kinetic models and as a result long-time kinetic and thermodynamic
information for the overall molecular event can be extracted. This chapter focuses
on Milestoning, providing detailed analysis of the approximations involved in the
algorithm and its computational implementation. Two other trajectory fragments
methods (Partial Path Transition Interface Sampling and Markov State Models) are
briefly discussed as well. Finally, two recent applications of trajectory fragments
methods are described.

1 Introduction

Molecular Dynamics (MD) is a widely used computational tool in many condensed-phase
studies, making it possible to understand molecular mechanisms at the microscopic
level and to compare simulations to experiments. Empirical measurements of
equilibrium and non-equilibrium properties are defined and computed as ensemble
or time averages in statistical mechanics. Therefore, to connect MD simulations to
experiments, it is necessary to sample multiple configurations (for equilibrium) and
multiple trajectories (for kinetics). Methods for equilibrium sampling have been
extensively studied (see Chap. 8 in this book). The focus of this chapter will be on
methods to compute kinetic information.
Consider a molecular system with two metastable states A and B. A key question of
kinetics is: What is the probability of the system to reach state B at time t for the first
time if it started at state A at time zero? The answer provided by one trajectory with
a single transition event is zero or one. An ensemble of many trajectories, initiated
according to a given distribution at A, provides a statistically more meaningful answer
between zero and one that better reflects the kinetics of macroscopic observations
in a lab. In principle a single long trajectory going forward and backward from A
to B, sampling the transition event many times under equilibrium conditions, can be
used for the same purpose. The computational cost of such a long trajectory is at
least as high as the computational cost of an ensemble of trajectories initiated at A
and propagated until they either hit state B for the first time, or return to A and are
terminated.

A. E. Cardenas (B)
Institute for Computational Engineering and Sciences, University of Texas,
Austin, TX 78712, USA
e-mail: alfredo@ices.utexas.edu

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_9

Calculations of kinetics present a significant computational challenge. Not only
do the molecular systems studied by simulations keep growing in size and complexity,
but the time scales of kinetic processes of interest are also long, and therefore
multiple long trajectories are required. Transition State Theory [1] can be used to
effectively investigate the kinetics if an identifiable and dominant free energy barrier
is present in the system. If a dominant barrier is present but its location is unknown,
we expect transitional trajectories to be short, since they avoid spending time at the
barrier. Transition Path Sampling and related approaches are appropriate for those
cases [2–4]. If a highly significant barrier is absent, which is the case in numerous
biophysical events, individual trajectories will not be short. That poses a major chal-
lenge to approaches that compute trajectories explicitly. A discussion on theoretical
approaches to bridge the time scale gap between experiments and simulations for a
broad range of cases is the focus of this chapter. Before describing the trajectory frag-
ment approach [2, 5–7] to long time processes, we briefly review other approaches
to long time dynamics of complex molecular systems.
Consider first a straightforward MD calculation of a long time trajectory. In the
best-case scenario the computational complexity of a trajectory grows linearly with
the system size N. Similarly, the complexity scales linearly with the length of the
trajectory L. The number of degrees of freedom in simulations of a typical biophysical
system is $\sim 10^{5}$. The typical number of steps accessible in current simulations is $\sim 10^{8}$.
Overall simulation times of tens or hundreds of nanoseconds are becoming common.
The combined complexity of N × L can be tackled with theory, numerical algorithms,
and improved hardware. Some approaches are focused on decreasing the time of
computation of the N-factor (e.g., by volume decomposition of the simulation box).
However, in the last decade significant theoretical advances have enabled efficiency
gains that tackle the L factor as well. A combination of approaches that speed up
the calculation per time step and techniques that make it possible to compute kinetic
observables with a significantly smaller number of steps are of particular interest.
Computational speedups due to hardware have been mostly based on paralleliza-
tion. Common approaches reduce the clock time required to compute a single inte-
gration step using Initial Value Solvers (IVS) algorithms. IVS integrate Newton’s
equations of motion in small time steps. It takes one force evaluation to generate a
step of an IVS trajectory. The number of force evaluations is a useful measure of the
computational cost, which is roughly proportional to $N \times L$. Parallelization of IVS
code can reduce the cost in the best-case scenario to $[(N - n)/P + n] \times L$, where $P$
is the number of processes that run in parallel, and $n$ is the code segment that is not
parallelized. In practice, parallelization speedups rarely exceed 100 on commodity
clusters.
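The best-case cost estimate above, $[(N - n)/P + n] \times L$, can be explored numerically. The following is a minimal sketch and not code from any MD package; the function names and the numbers in the usage loop (N = 100,000 with a 0.5% serial part) are hypothetical choices made only for illustration.

```python
def ivs_cost(n_dof, n_serial, n_steps, n_procs):
    """Best-case cost [(N - n)/P + n] * L of a step-parallel IVS run.

    n_dof    -- N, work per step (proportional to system size)
    n_serial -- n, part of that work that is not parallelized
    n_steps  -- L, number of integration steps
    n_procs  -- P, number of parallel processes
    """
    if n_procs < 1 or not 0 <= n_serial <= n_dof:
        raise ValueError("need P >= 1 and 0 <= n <= N")
    return ((n_dof - n_serial) / n_procs + n_serial) * n_steps


def ivs_speedup(n_dof, n_serial, n_procs):
    """Serial cost N * L divided by the parallel cost; L cancels out."""
    return ivs_cost(n_dof, n_serial, 1, 1) / ivs_cost(n_dof, n_serial, 1, n_procs)


if __name__ == "__main__":
    # Hypothetical numbers: N = 1e5 and a 0.5% serial part (n = 500).
    for procs in (1, 10, 100, 1000, 10000):
        print(procs, round(ivs_speedup(100_000, 500, procs), 1))
```

In this model the speedup can never exceed N/n (200 for the hypothetical numbers above), no matter how many processes are used, and the trajectory length L is not reduced at all.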
An alternative approach to the parallelization of the step load is the computation
of multiple time steps simultaneously, or the parallelization of time. Algorithms to
parallelize directly multiple time slices (the L factor) are available, and are based
on the optimization of a functional (action). Optimization of an action, a Boundary
Value Formulation (BVF) of classical mechanics [8], and of stochastic processes [9,
10] generates classical trajectories [11–13] or solutions to the Langevin equation
[14] between two boundary states, in contrast to IVS that only need the initial state.
These approaches are very different from step parallelization and are of considerable
interest. However, their computational cost should be noted. The parallel optimiza-
tion of an action can be conducted simultaneously on different time frames. With a
sufficient number of independent processors, every time slice could be optimized on
a different processor. Since processes communicate only between neighboring time
slices (to estimate velocities and accelerations from coordinates), the communication
time is negligible compared to other calculations. The complexity of a single
calculation of the action is proportional to $N \times L / P$. Since $L$ can be large ($\sim 10^{7}$
for a 1 fs time step and a ten nanosecond simulation), P is only limited by the number
of available processors. This is in contrast to IVS in which the number of useful
processes is bound by a fraction of the system size N or the size of the non-bonded
list. Unfortunately, optimization of actions is more expensive than IVS calculations
[15, 16] (if the same level of accuracy is desired) since all the times of a trajectory
are considered simultaneously in a single optimization step. It takes many steps to
optimize the action and to generate an optimal BVF path. The number of degrees of
freedom of an action in BVF formulation is $N \times L$. For a quadratic Hamiltonian,
the maximum number of conjugate-gradient steps to find the global minimum of
the action is $N^{2} \times L^{2}$. This is an upper bound, and heuristic optimizations can be
used [17]; however, the calculation remains costly. BVF algorithms are only effective
if the focus is on calculations of approximate trajectories with large time steps
(small L) [11, 18] or exact calculation of short and rare trajectories [17] between two
metastable states that are difficult to sample using IVS. In those cases, BVF tech-
niques generate stable, approximate solutions that provide qualitative insight into
molecular mechanisms. The use of large steps, illustrated in [15, 18–20], cannot be
done in typical IVS, which lose their stability with steps bigger than ~5 fs. These BVF
solutions filter out high frequency motions from the trajectories [11]. The removal
of high frequency motions adds to the stability of the calculation but makes the esti-
mates of the statistical weights of the trajectories approximate and heuristic. While
some estimates of weights were promising [19] it is difficult to estimate errors in the
general case. Hence for the generation of ensembles of trajectories to be used for
calculations of rates and thermodynamic properties, other approaches are required.
Another type of method designed to speed up the calculations is multi-time-stepping
algorithms (such as RESPA [21]), in which slow forces are integrated less frequently.
In RESPA a larger time step is assigned to the integration of slowly varying forces.
For example, while the fast degrees of freedom are integrated with a time
step of ~1 fs, the slower long-range interactions can be evaluated every 4 to 6 fs
[22]. The overall computational gain from multi-time-stepping algorithms is modest
(about a factor of two).
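The idea of integrating slow forces less frequently can be sketched with a reversible two-level splitting in the spirit of RESPA: the slow force gives a half "kick" with the large step, the fast force is integrated with several small velocity-Verlet steps, and the slow force kicks again. This is a one-dimensional toy under stated assumptions, not the implementation of any MD package; the function name and all parameters are hypothetical.

```python
def respa_trajectory(x, v, mass, fast_force, slow_force,
                     dt_outer, n_inner, n_steps):
    """Two-level multiple-time-step integration of a single coordinate.

    The slow force is evaluated once per outer step dt_outer, the fast
    force once per inner step dt_outer / n_inner.
    """
    dt_inner = dt_outer / n_inner
    f_fast = fast_force(x)
    f_slow = slow_force(x)
    for _outer in range(n_steps):
        v += 0.5 * dt_outer * f_slow / mass        # half kick, slow force
        for _inner in range(n_inner):              # velocity Verlet, fast force
            v += 0.5 * dt_inner * f_fast / mass
            x += dt_inner * v
            f_fast = fast_force(x)
            v += 0.5 * dt_inner * f_fast / mass
        f_slow = slow_force(x)                     # one slow evaluation per outer step
        v += 0.5 * dt_outer * f_slow / mass        # half kick, slow force
    return x, v


if __name__ == "__main__":
    # Toy system: a stiff spring (fast) plus a soft spring (slow), unit mass.
    k_fast, k_slow = 100.0, 1.0
    x, v = respa_trajectory(1.0, 0.0, 1.0,
                            lambda q: -k_fast * q, lambda q: -k_slow * q,
                            dt_outer=0.05, n_inner=5, n_steps=1000)
    print(0.5 * v * v + 0.5 * (k_fast + k_slow) * x * x)  # compare with 50.5
```

With n_inner = 5 the slow force is evaluated five times less often than the fast one, which mirrors the 1 fs versus 4 to 6 fs example in the text. The saving is limited because the fast forces still have to be evaluated at every small step.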
Expensive special-purpose machines for MD such as Anton focus only on reducing the
factors proportional to N [23, 24]. While this hardware is strikingly successful in
producing a few millisecond trajectories, the problem of kinetics at biophysical time
scales (milliseconds) remains prohibitively costly due to the requirement of an ensemble
of trajectories. Furthermore, it is desirable to make the calculations of long-time
dynamics available at a single-researcher laboratory setting.
Most of the success in speeding up the calculations has come from reducing the
N-factor contribution. Therefore, the most significant remaining barrier for routine
calculations of kinetic and thermodynamic properties of molecular systems is the L
factor—the trajectory length.
To recapitulate, let us not forget why these trajectories are computed, how they
are used, and whether there are ways of avoiding the expensive straightforward calculations
discussed so far. In the case of thermodynamic calculations, configurations can be
generated by MD simulations to average the values of observables. Averaging using
straightforward trajectories is correct for ergodic systems, but correct does not mean
efficient. Enhanced sampling techniques have been used for a long time in statisti-
cal mechanics calculations of thermodynamic variables. For example the method of
umbrella sampling [25] is widely used to probe and estimate probabilities of infre-
quent events in phase space. Straightforward MD trajectories should not be used to
compute thermodynamic properties that can be estimated much faster with enhanced
sampling techniques.
We can make similar arguments for the evaluation of kinetics. While straight-
forward calculations of an ensemble of trajectories from A to B provide the exact
answer, it is not the only way of obtaining the correct result. The cost of calculations
of kinetics is even higher than that of simulations of equilibrium due to the need for many
trajectories. Alternative approaches can provide the desired statistics and overcome
the time scale barriers, or a large value of L. It is the reduction of the lengths of
the trajectories, breaking them into fragments, running these fragments in different
processes, and still computing observables of long time dynamics, which is the main
topic of the present chapter.

2 Trajectory Fragments Methodologies

The development of trajectory fragment methodologies is one of the theoretical
advances that have enhanced simulation capabilities in the last decade. The fragments
are trajectory pieces defined between portions of phase space. These portions
are called cells [26] or states [7]. The hyper-surfaces that divide these cells are called
milestones [6] or interfaces [2, 3]. We will use both sets of names interchangeably. In
Fig. 1 we show a schematic drawing of interfaces in two dimensions. While the interfaces
considered in the figure are along a single reaction coordinate, generalizations
for higher dimensions have been made [26, 27]. Transitions between cell interfaces
are marked as passage events and generate trajectory fragments. The lengths of the
trajectory fragments are much shorter than the expected length of an exact first passage
trajectory connecting A to B (with number of integration steps L). What are the
reasons for this efficiency gain?

Fig. 1 Five milestones are used to separate the relevant trajectory space between states A (reactant)
and B (product). Trajectory fragments are short segments of trajectories connecting neighboring
milestones. For example, trajectories started from milestone 2 (the three blue trajectories) are run
until they hit milestones 1 or 3. Before they hit any of the two neighboring milestones we say that
the system “belongs” to milestone 2. A reaction coordinate connecting A and B is shown in orange

Consider first a diffusive process. Diffusive motion is the typical dynamics found
in biomolecular motions beyond tens of picoseconds. Let the reactant and product
be separated by a distance R. The time scale for free diffusion along one dimension
is roughly $t \sim R^{2}$. If we consider $M - 1$ cells between the end interfaces then the time
scale for diffusion between a pair of divisions is of order of $(R/M)^{2}$. In order to
complete a trajectory we need to select M pieces of the fragments and hence the time
scale using fragments is $t_M \sim M \cdot (R/M)^{2} = R^{2}/M$. The analysis suggests a speed
up by a factor of M with respect to a straightforward trajectory.
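The quadratic scaling behind this estimate can be checked on a lattice model. The sketch below assumes a symmetric random walk with unit steps as a stand-in for free diffusion; it computes the mean first passage time exactly from the standard recurrence rather than by simulation. The function names are hypothetical.

```python
def mean_first_passage_steps(length):
    """Exact mean number of unit steps for a symmetric 1D random walk that
    starts at a reflecting wall (site 0) to first reach site `length`.

    With T_k the mean time from site k, T_k = 1 + (T_{k-1} + T_{k+1}) / 2
    gives gaps T_k - T_{k+1} = 2k + 1, so the sum below equals length**2.
    """
    total, gap = 0, 1
    for _ in range(length):
        total += gap
        gap += 2
    return total


def fragment_steps(length, n_segments):
    """Summed mean length of n_segments fragments of span length / n_segments."""
    if length % n_segments:
        raise ValueError("length must be a multiple of n_segments")
    return n_segments * mean_first_passage_steps(length // n_segments)


if __name__ == "__main__":
    R, M = 1000, 10  # hypothetical separation (in lattice sites) and number of pieces
    direct = mean_first_passage_steps(R)
    pieces = fragment_steps(R, M)
    print(direct, pieces, direct // pieces)
```

By symmetry, a walker released on a milestone and stopped at either neighbor a distance s away has the same mean exit time, s squared, as the reflecting-wall case used in the code. For R = 1000 and M = 10 the direct estimate is a million steps and the summed fragments take a hundred thousand, a ratio equal to M.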
What is the origin of this saving? Diffusive trajectories are going back and forth
many times. In contrast, the fragments are computed without explicitly simulating
back and forth transitions. In the Milestoning picture we first generate a bank of
transitional trajectory fragments, say from cell i to cell j and from cell j to i. We found
by experience that adequate sampling of trajectory fragments to estimate transition
probabilities can be achieved using hundreds or thousands of trajectories for most
molecular systems [26, 28, 29]. The sampling intends to estimate the transition
probability between the interfaces and not necessarily to provide a comprehensive
picture of the dynamics within the cell. For example, a transition probability of
10 percent can be estimated quite accurately using 100 trajectory fragments per
transition event in Milestoning. Milestoning is designed to provide uniform sampling
of events as the reaction progresses or returns. If the trajectory goes back there is no
need to re-compute trajectory fragments since we re-sample from the prepared pool.
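The statistical uncertainty of a transition probability estimated from a pool of fragments can be gauged with a binomial model. The sketch below assumes that the fragments launched from one milestone are statistically independent, so that the number ending at a given neighbor is binomially distributed; the function name is hypothetical.

```python
import math


def transition_probability(n_hits, n_fragments):
    """Estimate a transition probability and its binomial standard error.

    n_hits      -- number of fragments that ended at the target milestone
    n_fragments -- total number of fragments launched from the milestone
    """
    if not 0 <= n_hits <= n_fragments or n_fragments == 0:
        raise ValueError("need 0 <= n_hits <= n_fragments and n_fragments > 0")
    p = n_hits / n_fragments
    std_err = math.sqrt(p * (1.0 - p) / n_fragments)
    return p, std_err


if __name__ == "__main__":
    for n in (100, 400, 1600):
        p, err = transition_probability(n // 10, n)
        print(n, p, round(err, 4))
```

For 10 hits out of 100 fragments the estimate is 0.10 with a standard error of 0.03 in this model; quadrupling the number of fragments halves the error.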
We will obtain similar statistics if we are at a minimum or at the top of the free
energy barrier. That is in contrast to straightforward MD simulations in which we
usually get a lot more statistics near the minima, using our limited computational
resources inefficiently. This brings us to another advantage of the trajectory fragments:
overcoming a barrier is more efficient compared to a complete calculation of a trajectory
moving from one side of the barrier to the other side. Consider climbing a
barrier of height V. In the canonical ensemble the time to reach the top of the barrier
is proportional to $\exp(\beta V)$, where $\beta = 1/(k_B T)$ is the inverse temperature. Imagine
that the barrier is broken into M cells. Each milestoning transition climbs up with an
intermediate time proportional to $\exp(\beta V / M)$. This time is exponentially shorter than
$\exp(\beta V)$. Adding up M milestones has a small impact on the overall time
$t_M \sim M \exp(\beta V / M)$ in this case, keeping the rate significantly faster than that of a
single trajectory. In practice the speedup easily exceeds a factor of millions for these
activated processes. For
example, in the simulation of the recovery stroke in myosin [30] the actual accumu-
lated length of all the trajectory fragments was of the order of 100 ns. The predicted
mean first passage time of the process (fraction of a millisecond) was within a factor
of 10 from the experimental result [31] and is a million times longer than the simu-
lated time. Hence the use of trajectory fragments dramatically reduces the collective
length of the computed trajectories and increases the computational efficiency.
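The barrier argument can be put into numbers by comparing the two time estimates, exp(βV) for a direct crossing and M exp(βV/M) for M milestoning transitions. The sketch below ignores all prefactors, so only the ratio is meaningful; the barrier height of 20 in units of the thermal energy and the choice M = 10 are hypothetical.

```python
import math


def direct_time(beta_v):
    """Time scale exp(beta * V) to climb the full barrier in one trajectory."""
    return math.exp(beta_v)


def fragment_time(beta_v, n_milestones):
    """Time scale M * exp(beta * V / M) when the barrier is split into M pieces."""
    return n_milestones * math.exp(beta_v / n_milestones)


def barrier_speedup(beta_v, n_milestones):
    """Ratio of the direct estimate to the fragment estimate."""
    return direct_time(beta_v) / fragment_time(beta_v, n_milestones)


if __name__ == "__main__":
    # Hypothetical barrier of 20 k_B T split into 10 pieces.
    print(f"{barrier_speedup(20.0, 10):.3g}")
```

For these hypothetical values the ratio is exp(18)/10, a number of order several million, which is consistent with the factor of millions quoted for activated processes.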
It is important to point out that an adequate use of short trajectories to evaluate
thermodynamic and kinetic properties depends on a thorough sampling of the conformational
space of the system, such that the calculations do not miss important
regions of the conformational space. Also, the interfaces used to partition the space
should be close enough that short trajectories can correctly sample their transition
times, but far enough apart to eliminate any bias from the initial conformations [32].
A few long trajectories or other sampling techniques can be used to explore the space
before any trajectory fragment technique is used.
In the following we will describe three different trajectory fragment methodolo-
gies: Milestoning, Partial Path Transition Interface Sampling (PPTIS) and Markov
State Models (MSM). Other trajectory fragment techniques have been developed
and applied successfully to the study of rare processes such as transition interface
sampling (TIS) [3], forward flux sampling (FFS) [33, 34], weighted ensemble [35],
and boxed molecular dynamics [36]. There are similarities between these algorithms
so for the purposes of this review we are describing only three of them. We will
provide more theoretical and algorithmic details for Milestoning. Recent reviews
provide additional descriptions of other methods [37–41].

2.1 Milestoning

In the following we will introduce the basic objects and definitions of Milestoning,
provide some details of its implementation, and describe the equations to determine
kinetic and thermodynamic properties.

2.1.1 Definition of Milestones

Milestoning is a method that enables the estimation of fluxes at interfaces with
trajectory fragments. The fluxes are used as input for a non-Markovian theory to
extract the kinetics and thermodynamics of the system. The theory assumes that the
system is close to equilibrium and that fragments of trajectories are uncorrelated.
These assumptions help in the derivation of compact and coarse-grained equations for
the dynamics, but must be tested carefully as described elsewhere [28]. If milestones
are placed spatially close to each other (to increase computational efficiency) the
dynamics at the interfaces may be correlated and the accuracy of the results will be
suspect.
A trajectory fragment of Milestoning starts from a dividing hypersurface and terminates
the first time it “touches” another surface (Fig. 1). The location of the dividing
surface is determined with the help of a set of anchors $X_i$, $i = 1, \ldots, K$, $X \in \mathbb{R}^{6N}$,
and coarse variables $Q_\alpha(X_i)$. An anchor is, in principle, a point in phase space. However,
in all the applications performed so far it is reduced to a coordinate vector. The
set of anchors provides a rough sampling of the most relevant part of conformation
space. The anchors are used to assist in the determination of relevant interfaces in the
space of the coarse variables. The set of anchors can change or expand as the sam-
pling of trajectories is conducted. For example, if some of the trajectory fragments
are found at phase space domains far from any current anchor, a new anchor could be
added to cover the just found set of conformations. Of course the choice of anchors
must be made carefully since they need to capture the overall direction of the process.
In the past anchors were chosen along a numerically computed reaction coordinate
(Fig. 1) [30, 42]; a formulation that was recently extended to higher dimensions in
the approaches of Markovian Milestoning with Voronoi Tessellation (MMVT) [27]
and Directional Milestoning (DiM) [26].
Once a set of anchors is defined, the coarse variables are selected. A coarse
variable can be simple and include a few atoms, such as an intramolecular distance or
an internal torsion. It can be more complex and include a larger set of atoms, such as
a root-mean-square deviation [43], the steepest descent path [30, 43] or a minimum
free energy coordinate [29]. The requirement on the set of coarse variables is that
it is sufficient to make the anchors distinguishable. Defining the distance from a point
$X_i$ to another point $X_j$ in the space of the coarse variables as

$$ d(X_i, X_j) = \sqrt{\sum_l \left[ Q_l(X_i) - Q_l(X_j) \right]^2} \qquad (1) $$

we require for all $i, j$ with $i \neq j$ that $d(X_i, X_j) > \varepsilon$, where
$\varepsilon$ is a minimal separation between the anchors. If this criterion is not
satisfied it implies that more coarse variables are needed to capture the differences
between the anchors, or perhaps that the anchors were placed too close to each other and
some of them can be removed.
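The separation criterion can be checked mechanically once the coarse variables of each anchor are tabulated. The following is a minimal sketch, assuming the distance of Eq. (1) is the Euclidean distance in coarse-variable space; the function names `coarse_distance` and `anchors_too_close` are hypothetical and not taken from any Milestoning code.

```python
import math
from itertools import combinations


def coarse_distance(q_i, q_j):
    """Euclidean distance between two anchors in coarse-variable space (Eq. 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q_i, q_j)))


def anchors_too_close(anchors_q, eps):
    """Return the index pairs (i, j), i < j, that violate d(X_i, X_j) > eps.

    anchors_q[i] holds the coarse variables Q_l(X_i) of anchor i.
    """
    return [
        (i, j)
        for (i, q_i), (j, q_j) in combinations(enumerate(anchors_q), 2)
        if coarse_distance(q_i, q_j) <= eps
    ]


# Illustrative anchors described by two coarse variables each (made-up values).
anchors = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.05)]
violations = anchors_too_close(anchors, eps=0.1)
```

A non-empty result points to anchor pairs that either need additional coarse variables to be told apart or are candidates for removal.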
Having determined anchors and coarse variables we define milestones, hypersurfaces
that divide the relevant phase space for the transition into cells. A milestone in
DiM has a sense of direction (which explains the name). It is defined as the following
set of points

$$ M(i \to j) = \left\{ X \;\middle|\; d(X, X_i)^2 = d(X, X_j)^2 + \Delta^2 \ \text{and} \ \forall k \ d(X, X_k) > d(X, X_j) \right\} \qquad (2) $$

The points $X$ that satisfy the equality and therefore define the interface are closer to
the final state $j$, hence the sense of directionality. The parameter $\Delta$ determines
the extent of asymmetry between the two end states. The term $\Delta$ is added to minimize
the possibility of rapid termination of trajectories between milestones that cross
each other.
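To make the definition concrete, the membership test of Eq. (2) can be sketched as follows. This is only an illustration: the function name `on_milestone` and the numerical tolerance `tol` are assumptions, and the condition over $k$ is applied here to anchors other than $i$ and $j$ (for $k = j$ the strict inequality cannot hold, and for $k = i$ it follows from the equality when $\Delta > 0$).

```python
def on_milestone(q_x, anchors_q, i, j, delta, tol=1e-6):
    """Check whether a point with coarse variables q_x lies on M(i -> j) (Eq. 2).

    anchors_q[k] holds the coarse variables of anchor k. `tol` is a hypothetical
    numerical tolerance on the equality condition.
    """
    # Squared coarse-space distances from the point to every anchor.
    d2 = [sum((a - b) ** 2 for a, b in zip(q_x, q_k)) for q_k in anchors_q]
    on_surface = abs(d2[i] - d2[j] - delta ** 2) <= tol
    # Assumption: the "for all k" condition is tested for k other than i and j.
    j_is_closest = all(
        d2[k] > d2[j] for k in range(len(anchors_q)) if k not in (i, j)
    )
    return on_surface and j_is_closest


# One coarse variable, three anchors (made-up values).
anchors = [(0.0,), (2.0,), (10.0,)]
symmetric = on_milestone((1.0,), anchors, i=0, j=1, delta=0.0)
shifted = on_milestone((1.25,), anchors, i=0, j=1, delta=1.0)
```

With `delta=0.0` the interface is the midplane between the two anchors; a positive `delta` shifts it toward anchor `j`, which is the directionality described above.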
As we discuss further below, the physical assumption of Milestoning theory is
memory loss between hypersurfaces. The coarse variables of the individual trajectories
or trajectory fragments, in accord with a statistical mechanics view of dynamics,
suffer numerous collision events with other degrees of freedom and their motion is
overall diffusive. After a typical time period the coarse variables decorrelate and it
is not possible to trace them back to their point of origin. A formal statement of
this approximation and the profound simplifications it suggests for the calculations
of kinetics and thermodynamics are given in Sect. 2.1.3. In the next subsection we
continue to describe the algorithm of fragment generation that uses this assumption
and how these fragments can be used to compute the relevant transition kernel.

2.1.2 Fragment Generation and Transition Kernel

To define a trajectory fragment we need starting and terminating conditions and
obviously an algorithm to propagate the phase space point as a function of time. The
choice of propagator in Milestoning is arbitrary (Langevin, Newtonian dynamics, etc.).
However, to reflect true microscopic time scales plain Newtonian mechanics is preferred.
In principle, a termination point of one fragment should be the starting point of another
fragment (this is what is done in FFS [33], where trajectories can be traced back to
the reactant state). In Milestoning, we choose the termination point to be the first
hitting point of a trajectory on a milestone different from the milestone it started on
(Fig. 2a). A starting point (which is not necessarily on the milestone of the reactants)
must therefore be of the same kind: a first hitting point of a trajectory that reached
the present milestone from yet another milestone. Since we sample plausible starting
points at the interface directly (see below) we need to verify that these points indeed
represent a first hitting distribution. We therefore integrate each point back in time
and verify that it crosses another milestone before re-crossing the milestone it started
from. If it re-crosses the milestone of initiation, then it is not a first hitting point,
and this phase space point is removed from the sample set.

Fig. 2 Trajectory fragments computed in Milestoning. Three milestones (i, j, k) are represented as
vertical lines. In (a), backward trajectories are launched starting from configurations in milestone j.
The configurations (and velocities) belong to the first hitting point distribution of j (black points)
if the backward trajectory hits a neighboring milestone (i or k) without re-crossing j (solid lines).
If they re-cross j (dashed lines), the originating points (in grey) are not saved for the next step.
In (b), forward trajectories are launched from the first hitting points discovered in (a). The forward
trajectories are shown as solid lines. The backward trajectories [from (a)] are shown as dotted lines.
Notice that the forward trajectories are allowed to re-cross the originating milestone j
In summary, the generation of the trajectory fragments uses the following steps:
1. Generate a canonical sample of configurations at a milestone. This is achieved
either with constant temperature MD while restraining the system to the hypersurface
[26], similar to what is done in umbrella sampling [25], or with constrained
dynamics implemented with Lagrange multipliers [28]. The set of selected
configurations is distributed in the interface with weights of $\exp(-\beta U(X))$, where
$\beta = 1/(k_B T)$ is the inverse temperature and $U(X)$ the potential energy.
2. Examine if the phase space points sampled in step 1 are first hitting points. Since
the sampling in step 1 is of configurations only, first sample atomic velocities
from the Maxwell distribution conditioned on the overall velocity directed backward
from the hypersurface. Each point is integrated backward in time using
Newtonian mechanics (constant energy) until it hits and terminates on a milestone
(Fig. 2a). The use of the NVE ensemble is important for the calculations
of dynamics. Other ensembles provide only phenomenological parameterization
of time dependent properties. If the terminating milestone is different from the
interface we started from, accept this initial configuration and velocity as a first
hitting point. If not, reject the point and try another phase point from step 1.
3. Integrate the first hitting points from step 2 forward in time. The trajectory
fragment is terminated when it hits for the first time a milestone different from the
milestone it was initiated on. Note the important difference between the backward
and the forward integrations. During the forward integration we do not
terminate trajectories that re-cross the initial milestone. We continue the forward
trajectories until they find a new milestone to terminate on (Fig. 2b). All the
forward trajectories count, and the removal of some of the sampled phase points
at the interface occurs only in step 2.
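Steps 2 and 3 can be illustrated on a deliberately simple model. The sketch below assumes a single particle moving in one dimension, milestones that are points on the line, and a velocity Verlet integrator; step 1 (canonical sampling on the milestone) is not shown, and the velocity draw ignores the directional conditioning mentioned in step 2. All function names are hypothetical.

```python
import math
import random


def integrate_until_hit(x, v, force, mass, dt, left, right,
                        stop_on_origin, max_steps=1_000_000):
    """Velocity Verlet (NVE) in 1D until a neighboring milestone is reached.

    Returns (label, elapsed_time) with label "left" or "right", or "origin"
    when stop_on_origin is True and the trajectory returns to its start.
    """
    origin = x
    a = force(x) / mass
    for step in range(1, max_steps + 1):
        v_half = v + 0.5 * dt * a
        x_new = x + dt * v_half
        a = force(x_new) / mass
        v = v_half + 0.5 * dt * a
        if x_new <= left:
            return "left", step * dt
        if x_new >= right:
            return "right", step * dt
        # Skip step 1, where the previous position is the origin itself.
        if stop_on_origin and step > 1 and (x - origin) * (x_new - origin) <= 0.0:
            return "origin", step * dt
        x = x_new
    raise RuntimeError("no milestone reached within max_steps")


def sample_velocity(mass, kT, rng):
    """Draw a 1D velocity from the Maxwell distribution (no conditioning)."""
    return rng.gauss(0.0, math.sqrt(kT / mass))


def generate_fragment(x0, v0, force, mass, dt, left, right):
    """Apply steps 2 and 3 to one phase-space point (x0, v0) on a milestone."""
    # Step 2: integrating backward in time equals integrating forward with -v0.
    came_from, _ = integrate_until_hit(x0, -v0, force, mass, dt, left, right,
                                       stop_on_origin=True)
    if came_from == "origin":
        return None  # re-crossed the initial milestone: not a first hitting point
    # Step 3: forward fragment; re-crossing the initial milestone is allowed.
    target, t_hit = integrate_until_hit(x0, v0, force, mass, dt, left, right,
                                        stop_on_origin=False)
    return came_from, target, t_hit


# Harmonic well centered on the middle milestone (illustrative parameters).
harmonic = lambda x: -x
rejected = generate_fragment(0.0, 0.5, harmonic, 1.0, 1e-3, -1.0, 1.0)
accepted = generate_fragment(0.0, 2.0, harmonic, 1.0, 1e-3, -1.0, 1.0)
v_random = sample_velocity(1.0, 1.0, random.Random(0))
```

In this toy case the slow point oscillates back through the initial milestone and is rejected, while the fast point is traced back to the left milestone and then propagated forward to the right one. The asymmetry between the two calls to `integrate_until_hit` mirrors the difference between the backward and forward integrations stressed in step 3.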
What do we do with the sampled fragments? The Milestoning theory is built
around a kernel or a transition operator, which we denote by $K_{\alpha\beta}(t)$. It is the
probability density that a trajectory fragment initiated at interface $\alpha$ will hit
interface $\beta$ at time $t$. This probability density is normalized:
$\sum_{\beta \in \bar{\alpha}} \int_0^\infty K_{\alpha\beta}(t)\, dt = 1$. The normalization
states that at infinite time the trajectory must terminate on one of the nearby
milestones $\beta$. The symbol $\bar{\alpha}$ means milestones that can be reached from
$\alpha$ without crossing other milestones along the way.
How do we use the trajectory fragments to estimate the value of the kernel (or
time moments of it)? We compute the kernel (or moments of it) by binning. For
example, let the number of first hitting point trajectories initiated at hypersurface
$\alpha$ be $n_\alpha$. Let the number of trajectories that hit a neighboring milestone
$\beta$ between time $t$ and time $t + \Delta t$ be $n_{\alpha\beta}(t)$. The kernel element
$K_{\alpha\beta}(t)$ is therefore estimated as $n_{\alpha\beta}(t) / (n_\alpha \Delta t)$. We will
be mostly interested in the moments of the kernel. For example, the probability that a
trajectory fragment will make it from $\alpha$ to $\beta$ (at any time) is the zero moment
(in time) of $K_{\alpha\beta}(t)$,
$p_{\alpha\beta} = \int_0^\infty K_{\alpha\beta}(t)\, dt \approx \sum_i n_{\alpha\beta}(t_i) / n_\alpha$.
Computing the moments is more stable statistically since less sampling is required
to compute them compared to accurate estimates of many bins of $K$.
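The binning procedure can be sketched in a few lines. The example assumes that the fragments launched from one milestone $\alpha$ are available as a list of (terminating milestone, termination time) pairs; the function name `kernel_estimates` and the data layout are hypothetical.

```python
from collections import Counter, defaultdict


def kernel_estimates(fragments, dt_bin):
    """Estimate the kernel and its low moments for one initial milestone alpha.

    fragments: list of (beta, t) pairs, one per forward trajectory fragment.
    Returns (kernel, p, mean_time) where
      kernel[beta][b] ~ K_alpha_beta(t) in time bin b, n_ab(t) / (n_a * dt_bin),
      p[beta]         ~ zero moment, the transition probability alpha -> beta,
      mean_time       ~ first moment summed over beta (mean termination time).
    """
    n_alpha = len(fragments)
    hist = defaultdict(Counter)  # hist[beta][bin index] = n_alpha_beta(t_i)
    for beta, t in fragments:
        hist[beta][int(t // dt_bin)] += 1
    kernel = {
        beta: {b: n / (n_alpha * dt_bin) for b, n in bins.items()}
        for beta, bins in hist.items()
    }
    p = {beta: sum(bins.values()) / n_alpha for beta, bins in hist.items()}
    mean_time = sum(t for _, t in fragments) / n_alpha
    return kernel, p, mean_time


# Four made-up fragments started on the same milestone.
fragments = [("b", 1.2), ("b", 1.7), ("c", 0.4), ("b", 2.1)]
kernel, p, mean_time = kernel_estimates(fragments, dt_bin=1.0)
```

The moments `p` and `mean_time` use every fragment, whereas each bin of `kernel` is populated by only a few, which is the statistical argument made above for working with moments.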

2.1.3 Stationary Flux, Stationary Probabilities and Mean First Passage Time

Assuming that we have computed the ensemble of trajectory fragments, and then
estimated the kernel $K_{\alpha\beta}(t)$, how do we proceed to obtain kinetics and
thermodynamics? At the core of the Milestoning theory one finds an equation for the flux
through milestones. A flux is defined as the number of trajectory fragments that
pass through a milestone at time $t$. We write a general and exact equation for the flux
(irrespective of the dynamics used to generate the trajectory fragments):

$$ q_\alpha(t, X_\alpha) = p_\alpha(0, X_\alpha)\, \delta(t^+) + \sum_{\beta \in \bar{\alpha}} \int_0^t \!\! \int q_\beta(t', X_\beta)\, K_{\beta\alpha}(t - t', X_\beta, X_\alpha)\, dt'\, dX_\beta, \qquad (3) $$

where the indices $\alpha, \beta$ are used to denote milestones, and $p_\alpha(t, X_\alpha)$ is
the probability that the last milestone that was crossed at time $t$ is $\alpha$. The
coordinate vectors $X_\alpha$ and $X_\beta$ are at the interfaces, and $q_\alpha(t, X_\alpha)$ is
the flux through the milestone point $X_\alpha$ at time $t$. Equation (3) is difficult to
solve as it is. The flux is a function of the position in the hypersurface, which means a
function of $N - k$ dimensions (where $N$ is the number of degrees of freedom, and $k$ the
number of coarse variables).

The kernel itself depends on position vectors in two Milestones. This exact equation
is therefore not useful for simulation of large molecular systems with a number of
coarse variables that could easily exceed hundreds. To make progress we use the
memory loss assumption mentioned in the previous section. In the kernel language it
means that a trajectory fragment depends only on the label of the milestone it started
from, but is independent of the exact location within the milestone. Hence

$$ K_{\alpha\beta}(t, X_\alpha, X_\beta) \sim K_{\alpha\beta}(t, X_\beta) \qquad (4) $$

The approximation in Eq. (4) is what makes Milestoning different from (and
computationally more efficient than) other trajectory fragment techniques. For example,
FFS continues a trajectory from the current interface using a prior trajectory that hit
the interface before; it therefore produces an exact path. Milestoning uses independent
fragments to estimate the kernel. With the approximation of Eq. (4) at hand we
define

$$ K_{\alpha\beta}(t) = \int K_{\alpha\beta}(t, X_\beta)\, dX_\beta $$
$$ q_\alpha(t) = \int q_\alpha(t, X_\alpha)\, dX_\alpha $$
$$ p_\alpha(t) = \int p_\alpha(t, X_\alpha)\, dX_\alpha \qquad (5) $$

Integrating Eq. (3) with respect to $X_\alpha$ (and also integrating over $X_\beta$ on the
right hand side of the equation) we obtain the basic formula of the Milestoning theory [6]

$$ q_\alpha(t) = p_\alpha(0)\, \delta(t^+) + \sum_{\beta \in \bar{\alpha}} \int_0^t q_\beta(t')\, K_{\beta\alpha}(t - t')\, dt' \qquad (6) $$

Equation (6) can be solved analytically using Laplace transforms to provide the
stationary distribution, $p_\alpha(t \to \infty)$, and the mean first passage time $\tau$ (and
higher moments of it) as was shown in a number of publications [28, 29, 44]. In the
absence of external forces and (or) fluxes in and out of the system, $p_\alpha(t \to \infty)$
is the equilibrium distribution. The overall mean first passage time, $\tau$, is computed
for a system with an absorbing boundary at the product state. Every trajectory that makes
it to the product state is terminated. The final expressions for the stationary flux and
distribution are

$$ \mathbf{q}_{stat} (\mathbf{I} - \mathbf{K}) = 0 $$
$$ p_{\alpha,stat} = q_{\alpha,stat} \cdot \langle t_\alpha \rangle \qquad (7) $$

The vector $\mathbf{q}$ is of length $L$, the number of milestones, with
$(\mathbf{q})_{\alpha,stat} = q_\alpha(t \to \infty)$. Similarly $\mathbf{K}$ is a matrix such that
$(\mathbf{K})_{\alpha\beta} = \int_0^\infty K_{\alpha\beta}(t)\, dt$ and $\mathbf{I}$ is the
identity matrix. The average $\langle t_\alpha \rangle$ is the lifetime of milestone $\alpha$,
i.e., the average time that it takes a trajectory fragment initiated at milestone $\alpha$ to
terminate on any other milestone. It is given by

$$ \langle t_\alpha \rangle = \sum_{\beta \in \bar{\alpha}} \int_0^\infty t \cdot K_{\alpha\beta}(t)\, dt \qquad (8) $$

From the first line of Eq. (7) we realize that $\mathbf{q}$ is a (left) eigenvector of the
matrix $(\mathbf{I} - \mathbf{K})$ with an eigenvalue of zero, a straightforward problem in
linear algebra.
The calculation of the Mean First Passage Time (MFPT) follows another analytical
expression

$$ \tau = \mathbf{p} \cdot (\mathbf{I} - \mathbf{K})^{-1} \langle \mathbf{t} \rangle \qquad (9) $$

where $\mathbf{p}$ is the vector of initial conditions, $(\mathbf{p})_\alpha = p_\alpha(0)$, and
$\langle \mathbf{t} \rangle$ is a vector with components
$\langle \mathbf{t} \rangle_\alpha \equiv \langle t_\alpha \rangle$. Higher moments of the first
passage time can be computed as well using moments of the kernel [28, 44].
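Equations (7) and (9) reduce to small linear-algebra problems once the matrix $\mathbf{K}$ and the lifetimes $\langle t_\alpha \rangle$ are known. The sketch below is one possible way to evaluate them, assuming a dense matrix stored as nested lists and a plain Gaussian elimination solver; the function names are hypothetical and the three-milestone chain is made-up test data. The absorbing boundary at the product is imposed by zeroing the product row of $\mathbf{K}$ and its lifetime.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-14:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def stationary_flux(K):
    """Solve q (I - K) = 0 with sum(q) = 1 (first line of Eq. 7)."""
    n = len(K)
    # Row a encodes q_a - sum_b q_b K[b][a] = 0; the last row is the normalization.
    a = [[(1.0 if r == c else 0.0) - K[c][r] for c in range(n)] for r in range(n)]
    a[-1] = [1.0] * n
    return solve_linear(a, [0.0] * (n - 1) + [1.0])


def stationary_probability(q, lifetimes):
    """p_alpha proportional to q_alpha <t_alpha> (second line of Eq. 7), normalized."""
    w = [qi * ti for qi, ti in zip(q, lifetimes)]
    total = sum(w)
    return [wi / total for wi in w]


def mean_first_passage_time(K, lifetimes, p0, product):
    """tau = p0 . (I - K)^-1 <t> with an absorbing product milestone (Eq. 9)."""
    n = len(K)
    a = [[(1.0 if r == c else 0.0) - (0.0 if r == product else K[r][c])
          for c in range(n)] for r in range(n)]
    t = [0.0 if r == product else lifetimes[r] for r in range(n)]
    x = solve_linear(a, t)
    return sum(pi * xi for pi, xi in zip(p0, x))


# Three milestones on a line, reflecting at both ends (illustrative numbers).
K = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
q = stationary_flux(K)
p_stat = stationary_probability(q, [1.0, 2.0, 1.0])
tau = mean_first_passage_time(K, [1.0, 1.0, 1.0], [1.0, 0.0, 0.0], product=2)
```

For this symmetric chain the sketch gives fluxes of 0.25, 0.5 and 0.25 and an MFPT of four lifetime units from the first to the last milestone, the value expected for an unbiased walk over two intervals. For large milestone networks a sparse solver would be the natural replacement for `solve_linear`.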

2.2 Partial Path Transition Interface Sampling

Partial Path Transition Interface Sampling (PPTIS) [2] is a method similar to
Milestoning in that it computes only trajectory fragments between interfaces. The
fundamental principles used in both methods are similar but the practical implementation
to extract the required probabilities is slightly different.
PPTIS is a variation of the Transition Interface Sampling (TIS) method [3]. In
TIS paths are computed from the interfaces until they reach states A or B. Therefore,
this method is not particularly useful when considering diffusive barriers. This
limitation led to the development of PPTIS, which uses shorter paths between neighboring
interfaces, similar to Milestoning. The theoretical framework of PPTIS starts with
the conditional crossing probability depending on the location of any four interfaces
$i, j, l, m$:

$$ P\left( {}^{l}_{m} \middle| {}^{i}_{j} \right) \equiv \frac{\left\langle \phi_{ij}(x)\, h^{f}_{lm}(x) \right\rangle}{\left\langle \phi_{ij}(x) \right\rangle} \qquad (10) $$

which is the conditional probability of reaching interface $l$ before $m$, after crossing
$i$ while coming directly from $j$ in the past. The functions $h^{f}_{lm}(x)$ are
two-interface theta functions indicating if the forward trajectories starting from $x$
reach interface $l$ before $m$, and $\phi_{ij}(x)$ is the flux of trajectories at $i$ at time
zero coming from $j$. If $i < j < k$ then the following flux relation holds:

$$ \left\langle \phi_{ki} \right\rangle = \left\langle \phi_{ji} \right\rangle P\left( {}^{k}_{i} \middle| {}^{j}_{i} \right) \qquad (11) $$

This equation states that the flux at $k$ coming from $i$ is the product of the flux at
$j < k$ coming from $i$ times the conditional probability of reaching $k$ before $i$ given
that the system crosses $j$ coming directly from $i$. Applying this flux relation twice
between neighboring interfaces, the following probabilistic relation among four
interfaces $i < j < k < l$ is obtained

$$ P\left( {}^{l}_{i} \middle| {}^{j}_{i} \right) = P\left( {}^{l}_{i} \middle| {}^{k}_{i} \right) P\left( {}^{k}_{i} \middle| {}^{j}_{i} \right) \qquad (12) $$

These last two equations are exact because they keep track of the starting interface,
in this case interface $i$.
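Before writing down the expressions for the rates, PPTIS introduces two additional
sets of crossing probabilities. The single interface crossing probabilities are defined as:

$$ p_i^{\pm} \equiv P\left( {}^{i+1}_{i-1} \middle| {}^{i}_{i-1} \right), \qquad p_i^{\mp} \equiv P\left( {}^{i-1}_{i+1} \middle| {}^{i}_{i+1} \right), $$
$$ p_i^{=} \equiv P\left( {}^{i-1}_{i+1} \middle| {}^{i}_{i-1} \right), \qquad p_i^{\ddagger} \equiv P\left( {}^{i+1}_{i-1} \middle| {}^{i}_{i+1} \right), \qquad (13) $$

and the long distance crossing probabilities as:

$$ P_i^{+} \equiv P\left( {}^{i}_{0} \middle| {}^{1}_{0} \right), \qquad P_i^{-} \equiv P\left( {}^{0}_{i} \middle| {}^{i-1}_{i} \right), \qquad (14) $$

where 0 is the interface closest to the initial state A. For example, $P_3^{+}$ is the
probability that a trajectory that crosses interface 1 while coming directly from
interface 0 reaches interface 3 before returning to interface 0. From
these definitions the corresponding rate constants for a one-dimensional reaction
coordinate can be written as:

$$ k_{AB} = \frac{\left\langle \phi_{1,0} \right\rangle}{\left\langle h_{\mathcal{A}} \right\rangle} P_n^{+}, \qquad k_{BA} = \frac{\left\langle \phi_{n-1,n} \right\rangle}{\left\langle h_{\mathcal{B}} \right\rangle} P_n^{-} \qquad (15) $$

with

$$ P_j^{+} \approx \frac{p_{j-1}^{\pm} P_{j-1}^{+}}{p_{j-1}^{\pm} + p_{j-1}^{=} P_{j-1}^{-}}, \qquad P_j^{-} \approx \frac{p_{j-1}^{\mp} P_{j-1}^{-}}{p_{j-1}^{\pm} + p_{j-1}^{=} P_{j-1}^{-}} \qquad (16) $$

where $n$ is the interface closest to the product state B, and $\mathcal{A}$ and
$\mathcal{B}$ represent overall states. For example, overall state $\mathcal{A}$ consists of
stable state A and all phase space points coming directly from state A in the past. A
similar definition applies to overall state $\mathcal{B}$.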

The recursive expressions for the long-distance crossing probabilities are
approximate. The approximation in PPTIS is that trajectories lose their memory over a
distance shorter than the separation between interfaces. This Markovian assumption
is basically the same one used in Milestoning. Starting with the initial condition
$P_1^{+} = P_1^{-} = 1$ one can iteratively solve for $(P_j^{+}, P_j^{-})$ from
$j = 2, \ldots, n$.
The evaluation of the single interface local probabilities $p_i^{\pm}$, $p_i^{=}$,
$p_i^{\mp}$, and $p_i^{\ddagger}$ entails the generation of all possible paths starting
from interfaces $i - 1$ and $i + 1$ that cross interface $i$ at least once.
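The recursion of Eq. (16) is short enough to write out directly. The sketch below assumes that the single interface probabilities have already been estimated from path sampling and are supplied as dictionaries keyed by the interface index; the function name `long_distance_crossing` and the symmetric test values are hypothetical.

```python
def long_distance_crossing(p_pm, p_eq, p_mp, n):
    """Iterate Eq. (16) from j = 2 to n, starting from P_1^+ = P_1^- = 1.

    p_pm[i], p_eq[i], p_mp[i] are the single interface probabilities
    p_i^(+-), p_i^(=) and p_i^(-+) of Eq. (13) for interfaces i = 1 .. n-1.
    Returns two dictionaries with P_j^+ and P_j^- for j = 1 .. n.
    """
    P_plus, P_minus = {1: 1.0}, {1: 1.0}
    for j in range(2, n + 1):
        denom = p_pm[j - 1] + p_eq[j - 1] * P_minus[j - 1]
        P_plus[j] = p_pm[j - 1] * P_plus[j - 1] / denom
        P_minus[j] = p_mp[j - 1] * P_minus[j - 1] / denom
    return P_plus, P_minus


# Flat landscape: every local probability equals 1/2 (made-up input).
n = 4
half = {i: 0.5 for i in range(1, n)}
P_plus, P_minus = long_distance_crossing(half, half, half, n)
```

For this unbiased case the recursion yields $P_j^{+} = 1/j$, the familiar probability that a symmetric random walk started one interface away from the origin reaches interface $j$ before returning. Multiplying `P_plus[n]` by the effective flux out of state A then gives the forward rate constant of Eq. (15).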
PPTIS was developed for transitions that can be described by a single
reaction coordinate. For those cases, PPTIS is similar to Milestoning. However,
Milestoning is a more general formulation that enables computations of kinetics
without the need to know the reaction coordinate a priori. In the DiM implementation
of Milestoning, anchors labeled with multiple coarse variables are used to partition
the relevant phase space for the process under study.

2.3 Markov State Models

In the last 10 years the use of Markov State Models (MSM) has become quite popular
for analyzing large sets of simulation data [7, 38, 45–51]. These techniques usually
start from a few long MD simulations or many short trajectories in which a molecular
system undergoes conformational transitions. Typical examples of applications are protein
folding and conformational changes associated with ligand binding. Very often the
amount of data generated from these simulations is too large and analysis tools are
required to extract from them the relevant structural and dynamical information. This
reduction of the original high-dimensional molecular simulation data often entails the
partitioning of the relevant conformational space of the system into discrete states.
Kinetic information can be obtained by extracting transition probabilities between
these discrete states. MSM assume that the transitions between states are Markovian,
i.e., the jumps between these states are memoryless. Specifically, let us assume that
$x(t)$ describes the positions and momenta of a long trajectory for a molecular system
of interest. This trajectory can be discretized into a set of states $\{S_1, \ldots, S_n\}$.
The time evolution of the system between the states can be described by the transition
matrix $\mathbf{T}(\tau) \in \mathbb{R}^{n \times n}$, where $T_{ij}(\tau)$ is the
steady-state probability to find the system in state $j$ at time $t + \tau$ given it was in
state $i$ at time $t$. The transition matrix elements can be computed by evaluating
correlation functions:

$$ T_{ij}(\tau) = \frac{c^{corr}_{ij}(\tau)}{\pi_i} \qquad (17) $$

where $\pi_i = \int_{x \in S_i} dx\, \mu(x)$ is the stationary probability to be in state
$S_i$ and $\mu(x)$ is the Boltzmann distribution. These correlation functions are
normalized, $\sum_{i,j} c^{corr}_{ij}(\tau) = 1$, and if the dynamics satisfies detailed
balance they are symmetric, $c^{corr}_{ij}(\tau) = c^{corr}_{ji}(\tau)$. If the number of
transitions between the different states is counted in the long trajectory $x(t)$ and
stored in a count matrix $c_{ij}(\tau)$ then the correlation functions are easy to obtain
because $c^{corr}_{ij}(\tau) \propto c_{ij}(\tau)$.
Let us denote by $\mathbf{p}(t) \in \mathbb{R}^n$ the population of the system at the
different states $\{S_1, \ldots, S_n\}$ at time $t$. After a time $\tau$, the state
populations change according to:


$$ p_j(t + \tau) = \sum_{i=1}^{n} p_i(t)\, T_{ij}(\tau), \qquad (18) $$

or in matrix notation as

$$ \mathbf{p}^T(t + \tau) \approx \mathbf{p}^T(t)\, \mathbf{T}(\tau) \qquad (19) $$

In practice, it is not possible to obtain the transition probabilities exactly due to
limited statistics. The best that can be done is to estimate an approximate transition
matrix $\hat{\mathbf{T}}$, whose maximum likelihood estimate is:

$$ \hat{T}_{ij} = \frac{c_{ij}}{c_i} \qquad (20) $$

where $c_i = \sum_k c_{ik}$ is the total number of times the trajectory is in state $i$.
For a very long trajectory this approximate transition matrix will converge to the exact
result:

$$ \lim_{N \to \infty} \hat{T}_{ij} = T_{ij} \qquad (21) $$

Due to the limited statistics the approximate transition matrix does not satisfy the
detailed balance condition; in general we have
$\pi_i \hat{T}_{ij} \neq \pi_j \hat{T}_{ji}$ [52]. This can be partially corrected using
a maximum likelihood estimator that enforces the detailed balance equations.
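The estimator of Eq. (20) amounts to counting transitions at the chosen lag time and normalizing each row. The sketch below assumes the trajectory has already been discretized into a list of state indices. The symmetrized variant is a simple stand-in that enforces detailed balance by averaging forward and backward counts; it is not the reversible maximum likelihood estimator referred to in the text. All function names are hypothetical.

```python
def count_matrix(dtraj, n_states, lag):
    """Count transitions i -> j separated by `lag` frames (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i][j] += 1
    return counts


def mle_transition_matrix(counts):
    """Row-normalize the counts: T_ij = c_ij / c_i (Eq. 20)."""
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([c / total for c in row] if total else [0.0] * len(row))
    return rows


def symmetrized_estimate(counts):
    """Simple detailed-balance estimate from c_ij + c_ji (an approximation).

    Returns (T, pi), where pi is the stationary distribution implied by the
    symmetrized counts.
    """
    n = len(counts)
    sym = [[counts[i][j] + counts[j][i] for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in sym]
    total = sum(row_sums)
    pi = [s / total for s in row_sums]
    return mle_transition_matrix(sym), pi


# Short made-up discrete trajectory over two states.
dtraj = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
counts = count_matrix(dtraj, n_states=2, lag=1)
T_hat = mle_transition_matrix(counts)
T_rev, pi = symmetrized_estimate([[8, 2], [4, 6]])
```

With the symmetrized counts, `pi[i] * T_rev[i][j]` equals `pi[j] * T_rev[j][i]` by construction, which is the detailed balance condition that the plain estimate generally violates for finite data.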
The number of states used in MSM varies depending on the complexity of the
system. For example, for protein folding simulations the number of partitions can
easily reach tens of thousands [39]. Conventional structural clustering techniques
such as k-means or k-centers are often used initially to create states that group
structures from the available simulation data. A kinetic clustering is done later by
constructing the corresponding transition matrix and lumping together states that
interconvert faster than a chosen lag time (typically less than 10 ns). In practice, this
is done by computing and analyzing the eigenvalues and eigenvectors of the current
transition matrix to identify states that are kinetically similar.
It often happens that the initial simulation data are not enough to sample the relevant
state transitions adequately. In that case, adaptive techniques [53] can be used
to efficiently sample, with additional short simulations, the transitions that contribute
most to the uncertainties in the transition probability matrix.
Once the MSM is constructed it should be validated for self-consistency with
respect to the input data used in its construction. Several approaches have been
suggested in the literature [48]. Once this validation is passed the model can be used
to make kinetic predictions that could be compared to experiment.
Markov State Models provide only approximate kinetics, mostly for two
reasons. First, in practical applications MSM can provide only approximate transition
probabilities due to limited sampling. This is a limitation that is present in any
trajectory fragment algorithm. The second reason is that by discretizing the dynamical
process $x(t)$ (that is Markovian in the most commonly used algorithms of molecular
dynamics) into a set of states the exact location of the system is lost, and the jump
process between states is no longer Markovian. For example, when the system is in a
region of state $i$ closer to $j$ it will have a larger probability to jump to $j$ than
systems that are close to the center of state $i$. The state space discretization
introduces a systematic error in the prediction of long-time kinetics:

$$ \mathbf{p}^T(t + k\tau) \approx \mathbf{p}^T(t)\, \mathbf{T}^k(\tau) \qquad (22) $$

Accurate evaluation of this expression is essential to predict long-time dynamics
(with large values of the integer $k$) using short trajectories of time length $\tau$. It
has been found that increasing the lag time is the best way to improve the accuracy of the
results, but if $\tau$ is too large the time resolution of the model will be limited. In
the end, after performing an MSM analysis of trajectory data, tests should always be done
to determine if the model is consistent with the data set within statistical uncertainties
[48].
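One common form of such a consistency test compares the model propagated for $k$ steps, as in Eq. (22), with a transition matrix estimated directly from the data at the longer lag time $k\tau$. The sketch below is a minimal version of this idea, assuming a discretized trajectory and the row-normalized count estimator of Eq. (20); the function names are hypothetical and no statistical error bars are computed.

```python
def transition_matrix(dtraj, n_states, lag):
    """Row-normalized transition counts at the given lag (Eq. 20)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i][j] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(t, k):
    """T^k by repeated multiplication, starting from the identity."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, t)
    return result


def propagate(p, t, k):
    """Populations after k lag times: p^T(t + k tau) = p^T(t) T^k (Eq. 22)."""
    tk = matrix_power(t, k)
    return [sum(p[i] * tk[i][j] for i in range(len(p))) for j in range(len(p))]


def chapman_kolmogorov_error(dtraj, n_states, lag, k):
    """Largest element-wise difference between T(lag)^k and T(k * lag)."""
    predicted = matrix_power(transition_matrix(dtraj, n_states, lag), k)
    observed = transition_matrix(dtraj, n_states, k * lag)
    return max(abs(predicted[i][j] - observed[i][j])
               for i in range(n_states) for j in range(n_states))


# Made-up two-state model and a strictly alternating test trajectory.
p2 = propagate([1.0, 0.0], [[0.9, 0.1], [0.2, 0.8]], k=2)
err = chapman_kolmogorov_error([0, 1] * 10, n_states=2, lag=1, k=2)
```

A deviation that stays within the statistical uncertainty of the estimates for several values of `k` is taken as evidence that the chosen lag time is long enough; a systematic deviation points to the discretization error discussed above.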
Although Markov State Models started as an analysis tool of trajectory data to
identify metastable states, they have evolved to include additional use of trajectory
fragments to sample the transitions between states. In spirit MSM is similar to
Milestoning or PPTIS. However, many implementation details are different. Milestoning,
for example, samples conformations in phase space hypersurfaces while in MSM
states are regions in phase space. If milestones are appropriately separated (as in
the DiM implementation) the computed trajectory fragments will be long enough
for decorrelation to occur, and these trajectories can be used directly to estimate the
transition kernel and mean first passage time for the process. For MSM, once the
states are defined, a lag time still must be chosen so that the Markovian assumption
is appropriate to describe the original dynamics. Recently, approaches that combine
MSM with Milestoning for systems with dominant metastable states have appeared to be
more efficient at minimizing the discretization error than conventional MSM [54].

3 Applications of Trajectory Fragments

Instead of a survey of the many applications of trajectory fragments, here we focus
on describing two recent applications that analyzed very different biophysical
phenomena.
A novel application of Milestoning showed that trajectory fragments can be used to
determine permeation rates of solutes through lipid membranes [55]. Specifically, the
permeation process of a blocked tryptophan through a DOPC bilayer was considered.
For this small solute it is reasonable to assume that its depth inside the membrane (z
axis) is a good reaction coordinate to describe the permeation process. Therefore, xy
planes perpendicular to the z axis were used as milestones. First, umbrella sampling
simulations were used to restrain the solute to sample conformations in each of these
planar milestones. Then two different unconstrained molecular dynamics simulations
were initiated from these in-plane conformations. The first simulations computed
backward trajectories to determine if the set of positions and velocities corresponds
to a point in the first hitting distribution at the corresponding milestone (Fig. 2a). If
the first hitting distribution test was passed then a forward trajectory was launched
until a neighboring planar milestone was hit (Fig. 2b). This was done for all milestone
planes to estimate the transition kernel and the lifetime in each milestone. Those are
the only two quantities needed to determine the MFPT for the process (Eq. 9).

Table 1 Mean first passage time for blocked tryptophan permeation through a DOPC lipid
membrane

Method                 Average (h)   Individual layers (h)
Milestoning            3.8           7.5, 0.05
Solubility-diffusion   0.23          0.41, 0.05
Experiment             8

The second column shows the average permeation time for the two lipid layers and the third
column the permeation time computed for the individual layers
Table 1 shows the MFPT for the permeation process from the aqueous phase to the
center of the bilayer estimated with Milestoning. The Milestoning average (3.8 h) is
within about a factor of two of the experimental value (8 h), whereas the
solubility-diffusion model (which is based on the determination of the potential
of mean force along the z axis) gives 0.23 h, more than an order of magnitude shorter
than experiment. For larger solutes it is expected that the solubility-diffusion model
will fail because other degrees of freedom (rotational and internal
coordinates) will become as important as the z coordinate to describe their membrane
permeation. In such cases, Milestoning will be appropriate because the method can
be used with several coarse variables describing the relevant phase space for the
permeation. The computed hour-long timescale for permeation was 9 orders of magnitude
longer than the microsecond-scale simulation time used to generate the trajectory
fragments. This shows the tremendous efficiency of Milestoning in generating kinetic
information on activated processes.
The free energy profile shows a large barrier at the center of the bilayer (Fig. 3).
The blocked tryptophan has seven atoms that can form hydrogen bonds with surrounding
molecules. However, at the bilayer center hydrogen bonding is not possible.
The barrier is less pronounced for the side chain of tryptophan because this molecule
has only one atom capable of forming hydrogen bonds.
Fig. 3 Free energy profiles for the permeation of blocked tryptophan obtained with
Milestoning and the solubility-diffusion model. The free energy profile of the tryptophan
side chain is also shown. Reproduced with permission from Ref. [55]

Markov State Models have been used successfully to analyze the folding kinetics of
proteins up to 80 residues long at time scales of microseconds to milliseconds using
massively parallel simulations. Recently [56], MSM was used to study the folding kinetics
and mechanism of a WW domain, FiP35, using as input two 100 microsecond
simulations obtained with Anton [23]. From the $10^6$ molecular snapshots saved from
the MD simulations, an MSM with 26,104 states was constructed using a relatively
long lag time of 100 ns. After constructing and validating the model by evaluation of
autocorrelation functions of several observables and comparing with the original MD
data, the folding time was estimated by modeling a temperature-jump experiment.
This was done by random perturbations of the equilibrium population of the states
in the model and observing how the overall system relaxed back to equilibrium over
time. A double exponential fitted the effect of the perturbation with time scales of 5.0
µs and 100 ns. These results agreed well with the two time scales found in the T-jump
experiment, 11 µs and 150 ns. An analysis of the eigenvalues and eigenvectors
of the MSM transition matrix showed that the 5.0 µs time scale corresponds to the
folding process and the 100 ns time scale to transitions between unfolded states.
To elucidate the folding mechanism, transition path theory was used to determine
the most traveled pathways [57, 58]. The results of this analysis showed complex,
heterogeneous, parallel pathways to the native structure (Fig. 4). This result
contrasted with the conclusions from the original MD data, which suggested that a single
dominant folding pathway was present [23]. Evidently, MSM allows for a more
unbiased and general description of the process compared to simple visualization and
use of intuition to analyze trajectory data. The MSM analysis also indicated that the
states identified as the native conformations are highly connected and interconvert
rapidly (hundreds of ns) while transitions between non-native states are slower (tens
of µs).

Fig. 4 Folding of FiP35. On the left, a folding flux network showing the top 12 folding pathways
obtained with transition path theory. Arrow widths are proportional to flux and node size is
proportional to state populations. The conformations closest to the native are depicted at the
bottom. On the right, examples of conformations while the folding progresses from
$P_{fold} = 0.1$ to $P_{fold} = 0.9$ are shown. Reproduced with permission from Ref. [56]

4 Conclusions and Outlook

The last 10 years have brought new algorithmic advances such as trajectory fragments
that are starting to bridge the gap between the short-time limits of molecular dynamics
simulations and the long-time duration of many biomolecular processes. Methods
such as Milestoning and PPTIS focus on the computation of trajectories to directly
determine properties such as transition probabilities and milestone lifetimes that can
be used to compute network fluxes and rates using a kinetic theory. MSM methods
have been used effectively as analysis tools to compute kinetic networks extracted
from many short trajectories or a few long trajectories. Rates are computed by solving
master equations between the states and pathway fluxes are obtained by using kinetic
approaches such as transition path theory. Applications of these trajectory fragment
methods have shown their efficiency and accuracy in the determination of rates
and provided richer insights into the mechanisms of biomolecular processes and
interpretation of experimental data.
Despite those advances and impressive applications, these methods are used by a
limited number of groups in the theoretical biophysical community. One reason for
this is that the theory can be rather intimidating at first and its algorithmic
implementation involves many steps. Another reason is that the hardware
needed to perform the required calculations (hundreds to thousands of computers) is
not always available to many groups. The second reason is more difficult to tackle, but
to alleviate the first problem a more automated procedure could be designed
to provide assistance in setting up the calculations, given a few input parameters
and error tolerance levels. For MSM some tools have been designed to address this
automation [47, 59] but not for Milestoning. Algorithmic challenges still remain
in the design of general procedures and in the choice of simulation parameters
that will provide accurate results in most general cases.

References

1. Truhlar, D.G., Garrett, B.C., Klippenstein, S.J.: Current status of transition-state theory. J. Phys.
Chem. 100(31), 12771–12800 (1996)
2. Moroni, D., Bolhuis, P.G., van Erp, T.S.: Rate constants for diffusive processes by partial path
sampling. J. Chem. Phys. 120(9), 4055–4065 (2004). https://doi.org/10.1063/1.1644537
3. van Erp, T.S., Moroni, D., Bolhuis, P.G.: A novel path sampling method for the calculation of
rate constants. J. Chem. Phys. 118(17), 7762–7774 (2003)
4. Bolhuis, P.G., Chandler, D., Dellago, C., Geissler, P.L.: Transition path sampling: throwing
ropes over rough mountain passes, in the dark. Ann. Rev. Phys. Chem. 53, 291–318 (2002).
https://doi.org/10.1146/annurev.physchem.53.082301.113146
5. Allen, R.J., Warren, P.B., ten Wolde, P.R.: Sampling rare switching events in biochemical net-
works. Phys. Rev. Lett. 94(1), 018104 (2005). https://doi.org/10.1103/PhysRevLett.94.018104
6. Faradjian, A.K., Elber, R.: Computing time scales from reaction coordinates by milestoning.
J. Chem. Phys. 120(23), 10880–10889 (2004)
7. Chodera, J.D., Swope, W.C., Pitera, J.W., Dill, K.A.: Long-time protein folding dynamics from
short-time molecular dynamics simulations. Multiscale Model. Simul. 5(4), 1214–1226 (2006)
8. Landau, L.D., Lifshitz, E.M.: Mechanics, vol. 1. Course of Theoretical Physics. Pergamon,
Oxford (1976)
9. Machlup, S., Onsager, L.: Fluctuations and irreversible processes. II system with kinetic energy.
Phys. Rev. 91, 1512–1515 (1953)
10. Onsager, L., Machlup, S.: Fluctuations and irreversible processes. Phys. Rev. 91, 1505–1512
(1953)
11. Olender, R., Elber, R.: Calculation of classical trajectories with a very large time step: formalism
and numerical examples. J. Chem. Phys. 105(20), 9299–9315 (1996)
12. Elber, R., Ghosh, A., Cardenas, A.: Long time dynamics of complex systems. Acc. Chem. Res.
35(6), 396–403 (2002)
13. Elber, R., Cardenas, A., Ghosh, A., Stern, H.A.: Bridging the gap between long time trajectories
and reaction pathways. In: Prigogine, I., Rice, S.A. (eds.) Advances in Chemical Physics, vol.
126, pp. 93–129. Wiley & Sons Inc, NJ (2003)
14. Faccioli, P., Sega, M., Pederiva, F., Orland, H.: Dominant pathways in protein folding. Phys.
Rev. Lett. 97(10), 108101 (2006). https://doi.org/10.1103/PhysRevLett.97.108101
15. Cardenas, A.E., Elber, R.: Kinetics of cytochrome C folding: atomically detailed simulations.
Proteins Struct. Funct. Bioinf. 51(2), 245–257 (2003)
16. Cardenas, A.E., Elber, R.: Atomically detailed Simulations of helix formation with the stochas-
tic difference equation. Biophys. J. 85(5), 2919–2939 (2003)
17. Bai, D., Elber, R.: Calculation of point-to-point short-time and rare trajectories with boundary
value formulation. J. Chem. Theory Comput. 2(3), 484–494 (2006)
18. Elber, R., Meller, J., Olender, R.: Stochastic path approach to compute atomically detailed
trajectories: application to the folding of C peptide. J. Phys. Chem. B 103(6), 899–911 (1999)
19. Siva, K., Elber, R.: Ion permeation through the gramicidin channel: atomically detailed model-
ing by the Stochastic Difference Equation. Proteins Struct. Funct. Bioinf. 50(1), 63–80 (2003)
20. Ghosh, A., Elber, R., Scheraga, H.A.: An atomically detailed study of the folding pathways
of protein A with the stochastic difference equation. Proc. Natl. Acad. Sci. U. S. A. 99(16),
10394–10398 (2002)
21. Tuckerman, M., Berne, B.J., Martyna, G.J.: Reversible multiple time scale molecular-dynamics.
J. Chem. Phys. 97(3), 1990–2001 (1992)
22. Morrone, J.A., Zhou, R.H., Berne, B.J.: Molecular dynamics with multiple time scales: how
to avoid pitfalls. J. Chem. Theory Comput. 6(6), 1798–1804 (2010). https://doi.org/10.1021/
ct100054k
23. Shaw, D.E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R.O., Eastwood, M.P., Bank,
J.A., Jumper, J.M., Salmon, J.K., Shan, Y.B., Wriggers, W.: Atomic-level characterization of
the structural dynamics of proteins. Science 330(6002), 341–346 (2010). https://doi.org/10.
1126/science.1187409
24. Shaw, D.E., Deneroff, M.M., Dror, R.O., Kuskin, J.S., Larson, R.H., Salmon, J.K., Young, C.,
Batson, B., Bowers, K.J., Chao, J.C., Eastwood, M.P., Gagliardo, J., Grossman, J.P., Ho, C.R.,
Ierardi, D.J., Kolossvary, I., Klepeis, J.L., Layman, T., McLeavey, C., Moraes, M.A., Mueller,
R., Priest, E.C., Shan, Y.B., Spengler, J., Theobald, M., Towles, B., Wang, S.C.: Anton, a
special-purpose machine for molecular dynamics simulation. Commun. ACM 51(7), 91–97
(2008). https://doi.org/10.1145/1364782.1364802
25. Valleau, J.: Monte Carlo: changing the rules for fun and profit. In: Berne, B.J., Ciccotti, G.,
Coker, D.F. (eds.) Classical and quantum dynamics in condensed phase simulations. World
Scientific, Singapore (1998)
26. Majek, P., Elber, R.: Milestoning without a reaction coordinate. J. Chem. Theory Comput. 6(6),
1805–1817 (2010). https://doi.org/10.1021/ct100114j
27. Vanden-Eijnden, E., Venturoli, M.: Markovian milestoning with Voronoi tessellations. J. Chem.
Phys. 130(19), 194101 (2009). https://doi.org/10.1063/1.3129843
28. West, A.M.A., Elber, R., Shalloway, D.: Extending molecular dynamics time scales with mile-
stoning: Example of complex kinetics in a solvated peptide. J. Chem. Phys. 126(14), 145104
(2007)
29. Kirmizialtin, S., Elber, R.: Revisiting and computing reaction coordinates with directional
milestoning. J. Phys. Chem. A 115(23), 6137–6148 (2011)
30. Elber, R., West, A.: Atomically detailed simulation of the recovery stroke in myosin by Mile-
stoning. Proc. Natl. Acad. Sci. U. S. A. 107, 5001–5005 (2010)
31. Malnasi-Csizmadia, A., Toth, J., Pearson, D.S., Hetenyi, C., Nyitray, L., Geeves, M.A.,
Bagshaw, C.R., Kovacs, M.: Selective perturbation of the myosin recovery stroke by point
mutations at the base of the lever arm affects ATP hydrolysis and phosphate release. J. Biol.
Chem. 282(24), 17658–17664 (2007)
32. Monticelli, L., Sorin, E.J., Tieleman, D.P., Pande, V.S., Colombo, G.: Molecular simulation
of multistate peptide dynamics: a comparison between microsecond timescale sampling and
multiple shorter trajectories. J. Comput. Chem. 29, 1740–1752 (2008)
33. Allen, R.J., Frenkel, D., ten Wolde, P.R.: Forward flux sampling-type schemes for simulating
rare events: Efficiency analysis. J. Chem. Phys. 124(19), 194111 (2006). https://doi.org/10.
1063/1.2198827
34. Allen, R.J., Valeriani, C., ten Wolde, P.R.: Forward flux sampling for rare event simulations.
J. Phys.: Condens. Matter. 21(46), 463102 (2009). https://doi.org/10.1088/0953-8984/21/46/
463102
35. Zhang, B.W., Jasnow, D., Zuckerman, D.M.: The “weighted ensemble” path sampling method
is statistically exact for a broad class of stochastic processes and binning procedures. J. Chem.
Phys. 132(5), 054107 (2010). https://doi.org/10.1063/1.3306345
36. Glowacki, D.R., Paci, E., Shalashilin, D.V.: Boxed molecular dynamics: a simple and general
technique for accelerating rare event kinetics and mapping free energy in large molecular
systems. J. Phys. Chem. B 113(52), 16603–16611 (2009)
37. Van Erp, T.S.: Dynamical rare event simulation techniques for equilibrium and nonequilibrium
systems. In: Nicolis, G., Maes, D. (eds.) Kinetics and Thermodynamics of Multistep Nucleation
and Self-Assembly in Nanoscale Materials: Advances in Chemical Physics, vol. 151. Wiley &
Sons Inc, Hoboken (2012)
38. Prinz, J.-H., Keller, B., Noe, F.: Probing molecular kinetics with Markov models: metastable
states, transition pathways and spectroscopic observables. Phys. Chem. Chem. Phys. 13,
16912–16927 (2011)
39. Pande, V.S., Beauchamp, K., Bowman, G.R.: Everything you wanted to know about Markov
State Models but were afraid to ask. Methods 52, 99–105 (2010)
40. Bolhuis, P.G., Dellago, C.: Trajectory-based rare event simulations. In: Lipkowitz, K.B. (ed.)
Reviews in Computational Chemistry, vol. 27. John Wiley & Sons Inc, Hoboken (2010)
41. Cardenas, A.E., Elber, R.: Enhancing the capacity of molecular dynamics simulations with tra-
jectory fragments. In: Schlick, T. (ed.) Innovations in Biomolecular Modeling and Simulations,
vol. 1. RSC Biomolecular Sciences. The Royal Society of Chemistry, Cambridge (2012)
42. Elber, R.: A milestoning study of the kinetics of an allosteric transition: atomically detailed
simulations of deoxy Scapharca hemoglobin. Biophys. J. 92(9), L85–L87 (2007)
43. Kuczera, K., Jas, G.S., Elber, R.: Kinetics of helix unfolding: molecular dynamics simula-
tions with milestoning. J. Phys. Chem. A 113(26), 7461–7473 (2009). https://doi.org/10.1021/
jp900407w
44. Shalloway, D., Faradjian, A.K.: Efficient computation of the first passage time distribution
of the generalized master equation by steady-state relaxation. J. Chem. Phys. 124(5), 054112
(2006)
45. Noe, F., Schutte, C., Vanden-Eijnden, E., Reich, L., Weikl, T.R.: Constructing the equilibrium
ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci.
U. S. A. 106(45), 19011–19016 (2009). https://doi.org/10.1073/pnas.0905466106
46. Swope, W.C., Pitera, J.W.: Describing protein folding kinetics by molecular dynamics simu-
lations. 1. Theory. J. Phys. Chem. B 108(21), 6571–6581 (2004)
47. Chodera, J.D., Singhal, N., Pande, V.S., Dill, K.A., Swope, W.C.: Automatic discovery of
metastable states for the construction of Markov models of macromolecular conformational
dynamics. J. Chem. Phys. 126(15), 155101 (2007)
48. Prinz, J.-H., Wu, H., Sarich, M., Keller, B., Senne, M., Held, M., Chodera, J.D., Schutte,
C., Noe, F.: Markov models of molecular kinetics: generation and validation. J. Chem. Phys.
134(17), 174105 (2011)
49. Noe, F., Horenko, I., Schutte, C., Smith, J.C.: Hierarchical analysis of conformational dynamics
in biomolecules: transition networks of metastable states. J. Chem. Phys. 126(15), 155102
(2007)
50. Buch, I., Giorgino, T., De Fabritiis, G.: Complete reconstruction of an enzyme-inhibitor bind-
ing process by molecular dynamics simulations. Proc. Natl. Acad. Sci. U. S. A. 108(25),
10184–10189 (2011)
51. Voelz, V.A., Bowman, G.R., Beauchamp, K., Pande, V.S.: Molecular simulation of ab initio
protein folding for a millisecond folder NTL9(1-39). J. Am. Chem. Soc. 132(5), 1526–1528
(2010)
52. Scalco, R., Caflisch, A.: Equilibrium distribution from distributed computing (Simulations of
protein Folding). J. Phys. Chem. B 115(19), 6358–6365 (2011)
53. Singhal, N., Pande, V.S.: Error analysis and efficient sampling in Markovian state models for
molecular dynamics. J. Chem. Phys. 123(20), 204909 (2005)
54. Schutte, C., Noe, F., Lu, J.F., Sarich, M., Vanden-Eijnden, E.: Markov state models based on
milestoning. J. Chem. Phys. 134(20), 204105 (2011). https://doi.org/10.1063/1.3590108
55. Cardenas, A.E., Jas, G.S., DeLeon, K.Y., Hegefeld, W.A., Kuczera, K., Elber, R.: Unassisted
transport of N-Acetyl-L-tryptophanamide through membrane: experiment and simulation of
kinetics. J. Phys. Chem. B 116, 2739–2750 (2012)
56. Lane, T.J., Bowman, G.R., Beauchamp, K., Voelz, V.A., Pande, V.S.: Markov State Model
reveals folding and functional dynamics in ultra-long MD trajectories. J. Am. Chem. Soc. 133,
18413–18419 (2011)
57. Berezhkovskii, A., Hummer, G., Szabo, A.: Reactive flux and folding pathways in network
models of coarse-grained protein dynamics. J. Chem. Phys. 130(20), 205102 (2009). https://
doi.org/10.1063/1.3139063
58. Metzner, P., Schutte, C., Vanden Eijnden, E.: Transition path theory for Markov jump processes.
Multiscale Model. Simul. 7, 1192–1219 (2009)
59. Bowman, G.R., Beauchamp, K., Boxer, G., Pande, V.S.: Progress and challenges in the auto-
mated construction of Markov state models for full protein systems. J. Chem. Phys. 131(12),
124101 (2009)
Part III
Molecular Simulations: Applications
Mechanostability of Virus Capsids
and Their Proteins in Structure-Based
Coarse-Grained Models

Marek Cieplak

Abstract We outline a simple coarse-grained molecular dynamics model of proteins
which is based on the knowledge of their native structures. We apply the model
to study properties of selected proteins that are found in virus capsids, such as in
CCMV and its mutant. We characterize their folding kinetics and force-displacement
curves obtained during stretching. The stretching curves are shown to be sensitive
to the mutations. We briefly review the mechanical clamps (motifs
that are most resistant to stretching) that have been found in large-scale surveys of
mechanostability with the use of the model. We then discuss stretching of multimeric
complexes of such proteins and demonstrate a strong dependence of the
force-displacement curves on the selection of the pair of termini involved in stretching.
Finally, we consider nanoindentation processes in several virus capsids. We show
that the values of the characteristic forces at which the capsids collapse are not correlated
with the mechanostabilities of the constituent proteins. We also show that the response
to nanoindentation is sensitive to single point mutations in the proteins, but
not in the initial stages of the process.

1 Introduction

Recent advances in nanotechnology have provided new experimental tools to study
biological processes at the molecular level [1]. Instead of monitoring biochemical
reactions involving macroscopic numbers of molecules, one can now observe the
behavior of individual molecules by techniques of single-molecule optical and force
spectroscopy. Optical spectroscopy has been used primarily for identification
of stages in protein folding [2–5]. Force spectroscopy, on the other hand, has
usually been applied to establish the degree of mechanical stability through stretching either
at constant speed or at constant force to induce unfolding [6]. However, monitoring

M. Cieplak (B)
Institute of Physics, Polish Academy of Sciences,
Aleja Lotników 32/46, 02-668 Warsaw, Poland

© Springer Nature Switzerland AG 2019 307


A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_10
308 M. Cieplak

of the subsequent refolding events in a mechanically controlled environment has also
been accomplished [7, 8].
In this chapter, we focus on mechanical stability of proteins and their complexes,
including virus capsids. Understanding mechanical stability of proteins is important
for many biological processes which involve force-induced unfolding such as
muscle extension, cell-cell adhesion, protein translocation, sensing, etc. [9–12]. It
also appears to be relevant for identification of proteins that cause neurodegenerative
diseases [13]. There are two primary instruments for mechanical
manipulation of proteins (and also of nucleic acids, polysaccharides, and other
biomolecules): the atomic force microscope (AFM) and optical tweezers. The pioneering
AFM-based work by Rief et al. [14] involved constant speed stretching of titin,
the giant sarcomeric protein of striated muscle. This protein consists of many globular
domains and the unfolding process generates a reversible sawtooth-like force (F)
versus extension (d) pattern. Each individual tooth corresponds to the unraveling of one
globule. For other proteins, unraveling of one unit may lead to more force peaks. The
tertiary structure of one of these wild type titin globules, denoted as I27, is known
[15] and chains of identical I27 globules can be formed through genetic engineering.
Studies of such chains have yielded a characteristic force peak size, denoted here as
Fmax, of 210 ± 27 pN for pulling speeds of 0.3–0.5 nm/ms [16]. They provide a
benchmark for subsequent studies of mechanostability of proteins.
The experimentally derived F − d patterns for proteins require interpretation, and
theoretical modeling helps in this respect. In particular, all-atom modeling
[17, 18] has identified simultaneous shearing of six hydrogen bonds between the near
terminal β-strands as being responsible for the substantial resistance to stretching
in titin. In contrast, in an unzipping process, the resistance would be much smaller
because just one bond is broken at a time. For instance, separation of strands in
double-stranded DNA through unzipping yields a resistance of order 14 pN only [19].
All-atom simulations are challenging when applied to the large conformational changes
that one encounters during protein stretching, as realistic simulations are usually
restricted to physical time scales not exceeding 100 ns or so. This circumstance
requires analyzing stretching at pulling rates which are 6–7 orders of magnitude
higher than in experiments. Perhaps more importantly, even then they can be used
to study only a handful of systems and trajectories. They thus provide only limited
guidance for selecting the objectives of new experimental studies. For these reasons,
we have developed coarse-grained molecular dynamics models which are based on
the knowledge of the native structure and involve an implicit solvent [20–24], and
applied them first to folding and then, primarily, to stretching.
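The gap between simulated and experimental pulling rates can be illustrated with a rough estimate. The sketch below assumes, purely for illustration, that a chain has to be extended by about 30 nm within a 100 ns simulation; the 30 nm value and the function names are assumptions, whereas the 0.3–0.5 nm/ms speeds are the experimental ones quoted for I27.

```python
import math

NS_PER_MS = 1.0e6  # 1 ms = 10^6 ns


def required_speed_nm_per_ms(extension_nm, sim_time_ns):
    """Constant pulling speed needed to reach extension_nm within sim_time_ns."""
    return extension_nm / sim_time_ns * NS_PER_MS


def orders_of_magnitude(sim_speed, exp_speed):
    """Decimal orders of magnitude separating two pulling speeds."""
    return math.log10(sim_speed / exp_speed)


# Hypothetical all-atom run: 30 nm of extension in 100 ns (illustrative values).
sim_speed = required_speed_nm_per_ms(30.0, 100.0)
for exp_speed in (0.3, 0.5):  # experimental speeds in nm/ms
    gap = orders_of_magnitude(sim_speed, exp_speed)
    print(f"experiment {exp_speed} nm/ms: gap of {gap:.1f} orders of magnitude")
```

With these assumed numbers the gap comes out at roughly six orders of magnitude; shorter simulations or larger extensions push it higher.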
The (native) structure-based models are empirical in nature. Other than their ease
of use, their advantage is that they relate to the native conformation of a specific
protein directly. It should be noted that the force fields of all-atom models are con-
structed to reproduce the measured energy levels of the alanine and glycine dipeptide
and conformational preferences for alanine polypeptides in water [25]. They are thus
not based on the native structure of a protein. Coarse-grained structure-based mod-
els miss many details of description that are present in all-atom models. It is likely,
Mechanostability of Virus Capsids and Their Proteins … 309

however, that these details are not relevant at the orders-of-magnitude longer
timescales that are probed in the simplified models.
In Sect. 2, we describe the version of the model used here. In Sect. 3 the results
of surveys of mechanostability of thousands of proteins are outlined and types of
mechanical clamps are identified. In the following sections we shall discuss folding
and stretching taking place in proteins that are found in virus capsids and manip-
ulation of complexes involving such proteins. We shall focus here on the role of
single-point mutations.
When discussing specific examples, we shall focus on the cowpea chlorotic mottle
virus (CCMV) as it has been studied experimentally the most. Its capsid is composed
of 60 complexes, known as capsomers, which comprise three sequentially identical
proteins, or chains, known also as subunits. The chains will be denoted as 1CWP:A,
1CWP:B, and 1CWP:C, where 1CWP is the Protein Data Bank (PDB) code of the
complex structure. This complex is shown in Fig. 1. Even though the complex is,
generally, a part of the CCMV capsid, it is likely to exist as a physical entity during the
self-assembly stage of the virus. Here, however, it will serve a didactic purpose as
we shall discuss how to analyse mechanostability of multimeric systems, where there
is a much greater variety of possible ways of stretching compared to monomeric
systems. In the last section, we shall discuss several spherical virus capsids and
demonstrate existence of a large variety of their responses to nanoindentation.
This chapter presents a review of the merits of using the coarse-grained structure-
based model and it also shows two new results: (1) Stretching and nanoindentation

Fig. 1 The three chains of capsomer 1CWP that forms the structural unit of the CCMV capsid.
The three pairs of termini are indicated
are sensitive to single point mutations in the sequences, (2) The strength of the elastic
response of a capsid to indentation is not related to mechanostability of its constitutive
proteins as assessed through stretching.

2 The Coarse-Grained Go-Like Model

There are many possible variants of structure-based models as there are many ways
to realize Go’s idea [26] to describe conformational changes of a protein in terms of
its native interactions. The first implementations were set on a lattice [27–29].
However, dynamics are better defined in a continuum space where Newton’s equations
apply and forces derive from the potentials. We have considered 62 specific molecular
dynamics realizations [30, 31], some of them proposed previously by other authors
[32, 33], and compared them to the experimental data on stretching. We have also
checked their folding properties. Only some of the realizations led to good folding
and were consistent with the stretching data. We have identified four optimal choices.
One of them is the simplest version that does not distinguish between the chemical
identities of the amino acids. The results discussed in this Chapter have been obtained
using this very model.
In this simplest model one assigns the same depth, ε, to the potential wells associated
with each pair of amino acids that form a native contact. (Relevant attractive
non-native contacts can also be built in, if information about them is available [34,
35]). The contact interactions effectively correspond to hydrogen bonds and ionic
bridges. The disulfide bridges between cysteines are covalent in nature and are represented
by harmonic potentials. The contact map is determined by checking for
atomic overlaps [22, 36, 37]. If an overlap exists, the two amino acids become represented
merely by their Cα atoms (adding the Cβ atoms does lead to improvement in the
workings of the model [24, 30]) that form a potential well. Otherwise there is a soft-core
repulsion between the Cα atoms. Alternative schemes to derive contact maps
are discussed in Ref. [38]. The specific choice of the well potential has turned out
to be of minor importance compared to the proper definition of the contact map
and we usually work with the Lennard-Jones potential. The length parameter in this
potential is chosen so that the location of its minimum agrees with the experimentally
determined distance between the Cα atoms, which is measured in water. Another way in which
the solvent enters the description is through the velocity-dependent (over)damping
term and the Langevin noise, which is controlled by the temperature, T. Still another is
through the characteristic time scale, τ, which is of order 1 ns instead of the 1 ps that usually
characterizes all-atom models [39, 40]. The reason is that the motion of the model
Cα atoms in the implicit solvent is diffusive instead of ballistic.
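The choice of the length parameter can be made concrete with a minimal sketch. It assumes the standard 12-6 form of the Lennard-Jones potential, V(r) = 4ε[(σ/r)^12 − (σ/r)^6], whose minimum lies at r = 2^(1/6) σ; the function names and the 6 Å example distance are illustrative and are not taken from the chapter.

```python
def lj_sigma(d_native):
    """Length parameter sigma that puts the 12-6 minimum at the native distance."""
    return d_native / 2.0 ** (1.0 / 6.0)


def lj_energy(r, d_native, eps=1.0):
    """Contact energy (in units of eps) for two C-alpha atoms at distance r."""
    x = (lj_sigma(d_native) / r) ** 6
    return 4.0 * eps * (x * x - x)


def lj_force(r, d_native, eps=1.0):
    """Radial force -dV/dr; positive values push the two atoms apart."""
    x = (lj_sigma(d_native) / r) ** 6
    return 24.0 * eps * (2.0 * x * x - x) / r


# Hypothetical native contact with a C-alpha separation of 6 Angstrom.
d0 = 6.0
print(lj_energy(d0, d0))   # well depth: -eps at the native distance
print(lj_force(d0, d0))    # force vanishes at the minimum
print(lj_force(7.0, d0))   # negative: attraction when the contact is stretched
```

In this form every native contact has the same depth ε and differs only in the position of its minimum, in line with the simplest variant described above.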
The model has good predictive properties. For instance, our simulations [24] have
predicted large mechanostability of two cellulosome-related cohesin proteins, c7A
(the PDB structure code 1aoh) and c1C (the structure code 1g1k), which was confirmed
experimentally [41]. In the case of c7A, the calculated value of Fmax is 470 pN and
the measured one is 480 pN. Comparisons of this sort are based on calibration of the energy
parameter ε. Our latest estimate [42] is that ε = 110 pN Å ± 30 pN Å. This result
is based on calculating Fmax as a function of the pulling speed and on extrapolating
it to the speeds at which the actual measurements were made. The calibration has
been obtained by considering 38 proteins. The estimated value of ε means that
room temperature is close to 0.35 ε/kB, where kB denotes the Boltzmann constant.
In practice, most of our calculations have been performed at 0.3 ε/kB as this is the
temperature corresponding to the fastest folding in most cases. Furthermore, stretching
at 0.35 ε/kB is almost the same as at 0.3 ε/kB.
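The conversion between reduced and physical units implied by this calibration can be sketched as follows. The only inputs are the Boltzmann constant and ε = 110 pN Å; taking room temperature to be 300 K and the function names are assumptions made for illustration.

```python
# Boltzmann constant expressed in pN * Angstrom per kelvin
# (1 pN * Angstrom = 1e-12 N * 1e-10 m = 1e-22 J).
K_B = 1.380649e-23 / 1.0e-22

EPS = 110.0  # calibrated contact energy in pN * Angstrom


def reduced_temperature(t_kelvin, eps=EPS):
    """Temperature in units of eps/kB."""
    return K_B * t_kelvin / eps


def force_in_pn(f_reduced, eps=EPS):
    """Convert a force given in eps/Angstrom to piconewtons."""
    return f_reduced * eps


print(round(reduced_temperature(300.0), 3))  # about 0.377 eps/kB at 300 K
print(round(force_in_pn(5.1)))               # 5.1 eps/Angstrom is about 561 pN
print(round(force_in_pn(1.4)))               # 1.4 eps/Angstrom is about 154 pN
```

Given the ±30 pN Å uncertainty in ε, the reduced room temperature obtained this way is itself uncertain by more than 25%, which is consistent with quoting it only approximately.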
It should be noted that the backbone stiffness contributes to the potential energy
of the system and thus affects what is meant by the room temperature – it should be
located in the temperature region corresponding to optimal folding [43]. The results
discussed here are based on the backbone stiffness being represented by harmonic
terms that favor establishment of the native values of the local chirality [24].

3 Surveys of Mechanostability of Proteins and Types of Mechanical Clamps

We have made several surveys of mechanostability of single protein chains. The
last two of them have addressed 17,134 proteins with up to 250 residues [42] and
318 proteins with up to 1021 residues [44]. Almost all proteins in the first set are
single-domain whereas all proteins in the second set are multi-domain. The
pulling process has been implemented by attaching springs to the termini. One of
the springs is anchored and the other is moving at a constant speed. The results have
been deposited in the BSDB database described in Ref. [45] and are available at
info.ifpan.edu.pl/BSDB/. The later surveys differ from the very first one [24] (for
7510 proteins with up to 150 residues) in the definition of the contact map used. In
our later surveys we eliminate from the contact map native contacts between residues
separated by just one residue. Such contacts usually correspond to weak dispersive
interactions. We kept such contacts in Ref. [24], hence the somewhat different
values of Fmax for proteins considered again in Ref. [42]. The main purposes of
the surveys are to: (a) rank order model proteins according to their value of Fmax,
(b) find proteins which are particularly stable mechanically, (c) identify mechanical
clamps, which are structural regions responsible for the emergence of the force peaks.
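The change in the contact map definition amounts to a simple filter on sequence separation. The sketch below assumes that a contact map is stored as a list of residue index pairs (i, j) and that pairs with |j − i| < 3 are dropped, i.e. the i, i+2 contacts mentioned above together with any bonded neighbours; the data layout and the function name are hypothetical.

```python
def filter_contacts(contacts, min_separation=3):
    """Keep native contacts whose sequence separation |j - i| is at least min_separation."""
    return [(i, j) for i, j in contacts if abs(j - i) >= min_separation]


# Hypothetical contact list: (1, 3) and (10, 12) are i, i+2 pairs and are removed.
native_contacts = [(1, 3), (1, 4), (5, 20), (10, 12)]
print(filter_contacts(native_contacts))  # [(1, 4), (5, 20)]
```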
In most cases, the mechanical clamps arise due to shearing between various ele-
ments of tertiary structure: parallel β-strands, antiparallel β-strands, two α-helices,
unstructured loops, etc. Typically, shearing between two parallel β-strands provides
substantially bigger resistance to stretching than in the other cases [24]. The longer
the strands, the larger the shearing effect. Furthermore, the stability can be enhanced
by other strands in the immediate vicinity of the primary mechanical clamp as in the
case of the streptokinase β-domain with the PDB code 1C4P [24, 42] for which we
have predicted Fmax of 5.1 ε/Å or about 560 pN.
Despite the prevalence of the shearing effects in producing resistance to pulling,
other types of mechanical clamps can also be encountered. One of them is the tensile
clamp observed for chain X in a model γD-crystallin [44]. The F − d pattern for
this two-domain system has multiple force peaks, as shown in Fig. 2.
The first of these has a height of about 1.4 ε/Å and it corresponds to unraveling of
the tensile mechanical clamp shown in the left panel of Fig. 3. This is the stage
at which contacts between the two domains get ruptured, which makes the domains
swing apart because the cohesion within the domains is stronger than between the
domains. At a later stage, around d = 425 Å, there is another peak of almost the
same height. It is due to shearing between three antiparallel β-strands. This shearing
mechanical clamp is shown in the right panel of Fig. 3. All of the remaining force
peaks are due to shear in other parts of the structure. The mechanical stability of 1.4
ε/Å, or about 150 pN, is quite typical across thousands of proteins [42], whereas the
crystallins’ resistance to ultraviolet photodamage is exceptional [46]. The crystallins
are also endowed with good thermodynamic stability [47].

Fig. 2 The F − d pattern for γD-crystallin. Displacements at which two distinct mechanical
clamps are operational are indicated. The two lines correspond to two different trajectories

Fig. 3 Two kinds of mechanical clamps in γD-crystallin. The left panel shows the tensile clamp
and the right panel shows the shearing clamp
A very different kind of mechanical clamp was discovered in the survey of
2008 [42]. It is topological in nature and we have dubbed it the cystine slipknot (see
Fig. 4). It can arise in proteins containing the cystine knot motif [48–51] in the native
state. The motif involves three disulfide bonds. Two of them connect two segments
of the backbone in a way that forms an effective ring made of, typically, eight amino
acids. The third bond connects other segments of the backbone across the ring. On
pulling, this third bond may drag one of these segments across the ring and thus form
a slipknot conformation. The related movement generates an isolated force peak with
high values of Fmax, in the range of 1000 pN. In fact, the 13 strongest proteins in the set
of 17,134 are those which are endowed with the cystine slipknot mechanism. The
workings of this mechanism have been elucidated in all-atom simulations [52] but
experimental verification is still missing.
The 2008 survey [42] was applied to single chains. If several chains were
listed under the same PDB code, only the first one was considered.
Thus if a structure code corresponds to a protein complex then the value of Fmax
applies to one of its components. In most cases, this yields a reasonable estimate of
mechanostability of the whole complex. Many cystine knot proteins, however, form
Fig. 4 Dimeric 1FZV protein with two cystine knots. It is shown at a stretching stage when a
slipknot mechanical clamp is formed: the slipknot is dragged through the lower ring
dimers which are linked covalently and analyzing their mechanostability requires
more care. For instance, in the case of the placenta growth factor-1 with the structure
coded as 1FZV the monomers are linked by two disulfide bonds. Each of the
monomers contains the cystine knot motif and between zero and two force peaks
related to the formation of the cystine slipknot may arise on stretching, depending
on which termini are used in the process [53, 54]. If the termini in one monomer are
denoted as N and C, and in the other as N′ and C′, then one can implement distinct
stretching patterns using pairs N-N′, N-C, N-C′, and C-C′. For instance, if the N and
C termini are involved in stretching, only one slipknot forms, as illustrated in Fig.
4. Once the slipknot arises in the ring which is shown in the lower part of the figure,
the disulfide bond intersecting the upper ring gets aligned in a way that blocks the
second dragging movement in the upper ring [54]. The values of Fmax in the dimeric
cases may be either smaller or bigger compared to Fmax obtained for the single chain,
depending on the protein and the way of pulling. However, whenever a force peak
arises, it comes with a high value of Fmax.
Recently, we have discovered a related version of the cystine slipknot mechanical
clamp: the cystine plug [53]. We have found it in only one protein (human transforming
growth factor-β2 with the PDB code 1TFG). It involves dragging of an
N-terminal ring of 10 residues through the ring of the cystine knot. The N-terminal
ring is closed by still another disulfide bond. The corresponding Fmax could be of
order 1500 pN.

4 Folding Properties of Selected Proteins

We now turn to the discussion of proteins involved in formation of virus capsids.


The three chains of 1CWP that form a capsomer of CCMV are identical sequen-
tially. However, there are some differences in their structures in the complex. For an
illustration of the folding behavior, we consider 1CWP:A. The dependence of the
folding time, t f old , on T is shown in Fig. 5. The folding time has been determined
by considering four-five batches of 101 trajectories each. The trajectories start from
an extended state and differ by the thermal noise applied. For each batch, the median
time needed to establish all native contacts for the first time (within a criterion based
on the distance within a native contact) is determined. The data points shown in Fig.
5 correspond to the average over the batches. The fastest folding is seen to take place
between 0.3 and 0.325 ε/k B . This fact indicates that the model shows proper kinetic
behavior at the temperatures which are somewhat smaller than the 0.35 – 0.375 ε/k B
corresponding to the room temperature with the callibration of ε = 110 pN Å. For
1CWP:A a compromise choice of 0.325 ε/Å may define an effective “room temper-
ature” value of T but when making stretching surveys we stayed with the fixed 0.3
ε/k B . It should be noted that the range of optimal folding for 1CWP:A is issen to
be much narrower than for other proteins that have been studied within the same or
closely related models [21, 22, 55]. For instance, for the I27 domain of titin (code
1TIT) folding is optimal and T-independent between 0.175 and 0.5 ε/k_B [21]. At this
moment, it is not clear whether the folding behavior of 1CWP:A also characterises
other capsidic proteins.
Mechanostability of Virus Capsids and Their Proteins … 315

Fig. 5 Folding time in 1CWP:A as a function of temperature

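
The batch procedure for the folding time described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the cutoff time t_max and the convention of counting unfolded trajectories as t_max are hypothetical and are not taken from the chapter.

```python
from statistics import mean, median


def batch_median_folding_time(first_passage_times, t_max):
    """Median folding time of one batch of trajectories.

    first_passage_times holds one entry per trajectory: the time (in units
    of tau) at which all native contacts were present for the first time,
    or None if this did not happen before the cutoff t_max (an assumed
    simulation length). Unfolded runs are counted as t_max, so the median
    is meaningful only if more than half of the batch has folded.
    """
    folded = sum(t is not None for t in first_passage_times)
    if 2 * folded <= len(first_passage_times):
        raise ValueError("fewer than half of the trajectories folded")
    times = [t_max if t is None else t for t in first_passage_times]
    return median(times)


def folding_time(batches, t_max):
    """Average of the batch medians, i.e. one data point of a t_fold(T) plot."""
    return mean(batch_median_folding_time(b, t_max) for b in batches)
```

Using the median within a batch makes the estimate insensitive to the few trajectories that fold very late or not at all, which is presumably why it is preferred to a plain mean.
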
Figure 6 represents what we refer to as a scenario diagram. It shows the time order
in which various native contacts are established for the first time, on average, as
determined at T = 0.3 ε/k_B. Note that the folding time is defined through all native
contacts being established simultaneously so the scenario diagrams are focused on
the early stages of collapse to the globular form. The contacts are labeled by their
sequential distance | j − i|. This labeling system does not identify a contact uniquely,
as several contacts may be between pairs of sites separated by the same sequential
distance. However, it indicates the role of this distance in the folding process. There
is a fairly monotonic average dependence, meaning that contacts between residues which
are close sequentially tend to be established earlier than those between residues which
are sequentially far apart. This tendency has been encapsulated by introduction of the relative contact
order parameter, CO, [56, 57] which is argued to correlate well with the experimental
folding times. However, we observe many deviations from the average dependence in
our model and, in particular, the longest ranged contact (between sites 49 and 179) is
first established around 3500 τ whereas the last contact (between sites 56 and 172) is
first established around 4800 τ . In other words, closing the formation of the globular
structure need not involve regions which are most distant sequentially, even though
the initial stages are dominated by formation of the short range contacts. There are
many examples of such deviations in our simulations and some of them are discussed
fully in Ref. [34]. Even though our model is based on the native geometry, we do not
observe t_fold to depend on the geometrically conceived parameter CO [22].
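
A minimal sketch of how the data behind a scenario diagram and the relative contact order could be computed is given below. The input layout (one dictionary of first-formation times per trajectory) and the function names are assumptions made for illustration. The CO formula is the standard definition of Refs. [56, 57]: the mean sequential separation of the native contacts divided by the chain length.

```python
from collections import defaultdict


def scenario_points(first_times_per_trajectory):
    """Average first-formation time of each native contact.

    first_times_per_trajectory is a list of dicts, one per trajectory,
    mapping a contact (i, j) to the time at which it was first established.
    Returns a list of (|j - i|, average time, (i, j)) sorted by time.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in first_times_per_trajectory:
        for contact, t in trajectory.items():
            sums[contact] += t
            counts[contact] += 1
    points = [
        (abs(j - i), sums[(i, j)] / counts[(i, j)], (i, j)) for (i, j) in sums
    ]
    return sorted(points, key=lambda p: p[1])


def relative_contact_order(contacts, n_residues):
    """Mean sequential separation of the contacts divided by the chain length."""
    total = sum(abs(j - i) for i, j in contacts)
    return total / (len(contacts) * n_residues)
```

Plotting the first element of each point against the second gives the diagram. Deviations from a monotonic trend show up as long-ranged contacts that appear before the last points of the sorted list.
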
316 M. Cieplak

Fig. 6 The scenario of folding events for 1CWP:A, as described in the text. The
pairs of numbers indicate several long-ranged contacts

5 Stretching of Single Capsomeric Proteins

We now consider stretching of chains A and B (chain C behaves very much like
chain B). The F − d plots at the pulling speed of 0.005 Å/τ are shown in the top
panels of Figs. 7 and 8. The corresponding scenario diagram for unfolding for chain
A at T = 0.3 ε/k_B is shown in Fig. 9. The diagram indicates pulling distances (which
are proportional to the duration of pulling) at which a given contact breaks down
for good (initially the distance between the residues involved may cross a
cutoff distance multiple times due to thermal fluctuations). The unfolding diagram has
some reverse properties relative to the folding diagram in the sense that long ranged
contacts tend to be unraveled in the initial stages and short ranged contacts – in
the later stages of the process. However, there is an important difference: there is
a significantly more pronounced discretization as a function of time (or pulling
distance) as various groups of contacts get ruptured around common values of d.
These aggregations of rupture events correspond to emergence of force peaks.
Some of the rupture events are significant dynamically and some are just necessary
byproducts of the significant events. Therefore, a given group of contacts that are
torn around a value of d involves contacts of various sequential contact ranges.
The significant rupture events define the corresponding mechanical clamps. One
can test the level of significance of a group of contacts by removing them from the
contact map and by checking the effect of this action on the height of the force
peak [21, 24]: a substantial decrease indicates a major contribution of these contacts
to mechanostability. The groups of such important contacts are indicated in Fig. 9.
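
The notion of a contact breaking "for good" and the clustering of rupture events around common values of d can be illustrated with the sketch below. The frame-based input format, the function names and the simple binning by a fixed width are assumptions for illustration, not the procedure used to produce Fig. 9.

```python
def rupture_distance(pull_distances, contact_distances, cutoff):
    """Pulling distance d at which a contact breaks for good.

    pull_distances[k] is the pulling distance at frame k and
    contact_distances[k] is the separation of the two residues at that
    frame. Thermal fluctuations may carry the separation across the cutoff
    several times, so the last frame with an intact contact is located
    first. Returns d at the following frame, or None if the contact is
    still intact at the end of the trajectory.
    """
    last_intact = -1
    for k, r in enumerate(contact_distances):
        if r <= cutoff:
            last_intact = k
    if last_intact == len(contact_distances) - 1:
        return None
    return pull_distances[last_intact + 1]


def group_ruptures(rupture_d, bin_width):
    """Group contacts whose rupture distances fall into the same bin of d."""
    groups = {}
    for contact, d in rupture_d.items():
        if d is None:
            continue
        groups.setdefault(int(d // bin_width), []).append(contact)
    return groups
```

Bins that collect many contacts are the candidates for force peaks. Whether a given group is dynamically significant still has to be tested, for example by removing it from the contact map as described in the text.
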

Fig. 7 Top panel: stretching curves for chain A in the CCMV capsomer. Bottom panel:
stretching curves for chain A in the CCMV mutant. In each panel, the two solid lines
(one thicker, one thinner) correspond to T = 0.3 ε/k_B. The dotted line corresponds to T = 0

Fig. 8 Similar to Fig. 7 but for chain B

Fig. 9 Unfolding scenario averaged over the two solid trajectories shown in Fig. 7.
The darker symbols are for the CCMV protein whereas the lighter ones are for its mutant

The first force peak is due to shearing between two antiparallel strands A (residues
50–60) and G (residues 166–178). The second force peak is due to shear between
antiparallel strands B (67–70) and F (154–160) as well as between C (88–99) and G.
The third force peak is due to shear between antiparallel strands C and E (136–139) as well as
between antiparallel strands D (105–111) and F. The final smaller peak is due to
shear between antiparallel strands D and E. The second force peak is the largest
and the corresponding Fmax is equal to 1.75 ± 0.1 ε/Å. A similar F − d pattern is
observed for chains B and C with Fmax = 1.6 ± 0.1 ε/Å. The values of Fmax are listed
in Table 1.
It is interesting to consider the sensitivity of the F − d patterns to single
point mutations. A structure coded 1ZA7 corresponds to a K42R mutation (at site
42 lysine is replaced by arginine, both positively charged) on 1CWP. In chain A, the
mutation is implemented on the first residue (as counted from the N terminus) for
which the structure is available. The known structure for chain B starts at residue
27. The bottom panels in Figs. 7 and 8 show the F − d patterns for chains A and
B in 1ZA7 respectively. The patterns for the mutant chains look similar to those for
the wild type chains. However, the force peaks are taller. The values of Fmax are
2.0 ± 0.1 ε/Å and 1.9 ± 0.1 ε/Å for chains A and B respectively – a shift of about
0.3 ε/Å compared to the wild type chains. The differences grow bigger on decreasing
the temperature. In particular, we show the F − d curves at T = 0, i.e. when all
thermal fluctuations are ignored. The curves are different not only in terms of the
peak heights but also in terms of the details in the patterns. We have observed a similar
sensitivity to mutations for the T4 lysozymes [58]. The wild type of the lysozyme
has the structure denoted by 102L and the mutant by 1B6I. In the mutant, threonine
and lysine in locations 21 and 124 are both replaced by cysteines. The experimental
studies on stretching of this mutant are described in Ref. [59]. The sensitivity of
the F − d patterns to mutations decreases with a growing T as thermal fluctuations
become increasingly important compared to the terms in the potentials.

Table 1 Characteristics of selected T1, T3, and T3p virus capsids that are discussed in this chapter.
The first column shows the acronym used, the second the PDB structure code, the third the
common name together with the symmetry type, and the fourth the number N of Cα atoms describing
the model capsid. R̄ denotes the average radius of the capsid. The last three columns give results
obtained through the molecular dynamics simulations at k_B T = 0.3 ε: the spring constant k, the
characteristic force Fc associated with the capsid, and the values of Fmax obtained for individual
chains in the corresponding capsomer

Acronym  PDB   Name and symmetry                     N       R̄ [Å]   k [ε/Å²]  Fc [ε/Å]  Fmax,i [ε/Å]
MVM      1MVM  Parvovirus minute virus of mice, T1   32,940  110.54  0.217     8.7       2.2
FPV      1C8E  Feline panleukopenia virus, T1        32,040  109.69  0.280     13        2.7
SPMV     1STM  Satellite panicum mosaic virus, T1    8,460   69.55   0.174     11        –
CCMV     1CWP  Cowpea chlorotic mottle virus, T3     28,620  119.56  0.050     5.5       1.75, 1.6, 1.6
1ZA7     1ZA7  K42R mutant of CCMV, T3               28,860  118.41  0.050     6.7       2.0, 1.9, 1.9
NV       1IHM  Norwalk virus, T3                     89,700  159.62  0.190     12        1.9, 1.8, 1.6
CPMV     1NY7  Cowpea mosaic virus, T3p              33,480  124.29  0.500     15        –
HRV      1AYN  Human rhinovirus 16, T3p              48,240  131.60  0.443     32        1.5, 2.1, 1.6

6 Stretching of Proteinic Complexes

We now consider the three-protein capsomeric complex shown in Fig. 1. The com-
plex is connected through interchain contacts. Even though the complex also forms
contacts with neighboring capsomers in the CCMV capsid, it is instructive to consider
stretching by various combinations of pairs of the six termini. The termini
will be denoted by N and C for the first chain, N′ and C′ for the second chain, and
N″ and C″ for the third chain. The F − d curves corresponding to the various ways
of pulling are shown in Fig. 10 and the values of Fmax are summarized in Fig. 11.

Fig. 10 Stretching curves for the CCMV capsomer for various ways of pulling as
indicated. In the lower three panels, one line is selected as being representative and the
symbols for the three remaining lines are listed away from the lines for better visibility

Fig. 11 The values of Fmax, in units of ε/Å, for the CCMV capsomer as derived for
various choices of the termini that are involved in stretching. The diagonal entries are
highlighted; in this case the two termini belong to one chain in the capsomer

The modes of pulling can be divided into “diagonal” and “off-diagonal”. The
former refer to a situation in which pulling is implemented by attaching to the termini
of a single chain. The latter refer to a situation in which the termini belong to different chains. The
diagonal F − d curves look qualitatively similar to those of the isolated chains.
However, the force peaks are higher due to additional stabilization provided by other
chains in the complex. For chains A and B the increase in Fmax is just by 0.1 ε/Å,
but for chain C – by 0.4 ε/Å so that Fmax is equal to 2 ε/Å. The off-diagonal stability
is weaker: the corresponding values of Fmax vary between 0.45 and 1 ε/Å.
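
The division of the pulling modes into diagonal and off-diagonal ones can be enumerated explicitly. The sketch below assumes that every unordered pair of the six termini defines one mode. The chain labels and the function name are illustrative only.

```python
from itertools import combinations


def pulling_modes(chains=("A", "B", "C")):
    """Split all pairs of termini into diagonal and off-diagonal modes.

    Each terminus is a (chain, end) tuple with end equal to "N" or "C".
    A mode is diagonal when both termini belong to the same chain.
    """
    termini = [(chain, end) for chain in chains for end in ("N", "C")]
    diagonal, off_diagonal = [], []
    for first, second in combinations(termini, 2):
        if first[0] == second[0]:
            diagonal.append((first, second))
        else:
            off_diagonal.append((first, second))
    return diagonal, off_diagonal
```

For the three-chain capsomer this gives 3 diagonal and 12 off-diagonal modes, 15 in total.
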
For other complexes, the off-diagonal values of Fmax may be larger than the
diagonal ones. This happens, for instance, in some dimers containing the cystine
knots [54] and in the 3D domain-swapped amyloid-prone cystatin C [60]. We have
predicted [44] that this dimer should be able to withstand mechanical stress of about
7 ε/Å or 770 pN if stretched using termini N and N′ (the N-termini of the two chains)
compared to 4.4 ε/Å when using termini N and C. These values are listed in Fig. 12. This system would thus
provide one of the strongest known shear-based mechanical clamps. The reason for
this behavior is that the two cystatin chains are intertwined in a way in which two
long β-strands of one chain are parallel to two long β-strands of another chain. This
arrangement generates many inter-chain contacts which require a big force to be
sheared if pulled by N and N′. For the N-C pulling, shearing involves a smaller
number of contacts between intrachain strands.
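
The conversion between the reduced units of the model and physical units, as in "7 ε/Å or 770 pN" above, follows from the calibration ε = 110 pN Å quoted in this chapter. The helper below is a minimal sketch of that arithmetic. The function names are illustrative, and the Boltzmann constant is the exact SI value.

```python
EPSILON_PN_ANGSTROM = 110.0  # calibration quoted in the text: epsilon = 110 pN Å
K_B_JOULE_PER_KELVIN = 1.380649e-23  # Boltzmann constant (exact SI value)


def force_to_pn(force_reduced):
    """Convert a force given in units of epsilon/Å to piconewtons."""
    return force_reduced * EPSILON_PN_ANGSTROM


def temperature_to_kelvin(temperature_reduced):
    """Convert a temperature given in units of epsilon/k_B to kelvin."""
    # 1 pN = 1e-12 N and 1 Å = 1e-10 m, so epsilon is expressed in joules here.
    epsilon_joule = EPSILON_PN_ANGSTROM * 1e-12 * 1e-10
    return temperature_reduced * epsilon_joule / K_B_JOULE_PER_KELVIN
```

With this calibration, force_to_pn(7.0) reproduces the 770 pN quoted for the cystatin dimer, and a reduced temperature of 0.375 ε/k_B comes out close to 300 K, in line with the room temperature range stated earlier.
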
We have found [53] a behavior similar to that of the cystatin in a bacterial dimeric
protein with the PDB code of 2B1Y. When pulled along the C-C direction, Fmax is
close to 9 ε/Å, but along N-C, merely 1.5 ε/Å. This protein would then exhibit an
even stronger mechanostability than cystatin provided stretching is performed along
the C-C direction.

Fig. 12 Similar to Fig. 11 but for the dimeric 3D domain-swapped γ cystatin C. The
individual chains are identical and hence the larger symmetry of the matrix compared
to that found in Fig. 11

7 Nanoindentation of Virus Capsids

Virus capsids are proteinic shells that protect strands, often quite short, of nucleic
acids. The volumes of these shells can be estimated by a novel algorithm presented in
Ref. [61]. A class of capsids are quasispherical and have icosahedral symmetry. Their
structures have been explained in terms of the Caspar and Klug sphere triangulation
theory [62]. Symmetries of possible structures are enumerated by the triangulation
number Tk (the subscript in the symbol is meant to distinguish this number from
the symbol used for temperature). In simple cases, Tk coincides with the number,
n, of chains in a capsomer where n = 1,2,3, etc. If this happens then the number of
proteins in the whole capsid is equal to 60Tk . If Tk is 1, then the 60 proteins form
12 pentameric units. If Tk is larger than 1 then the 12 pentamers are embedded in a
matrix of 10 (Tk -1) hexamers. The short hand notation for such capsids here is T1,
T2, T3, etc. Some capsids are called Tk -pseudo capsids when the number of chains
in a capsomer is larger than Tk but the additional chains act as physical extensions of
the nominal number of chains or if the chains are not identical sequentially. CPMV
(cowpea mosaic virus) is an example of a T3p capsid in which a protein is shared by
two capsomers.
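
The counting rules stated above for the simple case in which Tk coincides with the number of chains in a capsomer can be written down directly. The function name and the returned dictionary are illustrative choices.

```python
def capsid_composition(t_number):
    """Protein and capsomer counts of an icosahedral capsid.

    Implements the rules quoted in the text: 60*Tk proteins arranged into
    12 pentamers and 10*(Tk - 1) hexamers.
    """
    if t_number < 1:
        raise ValueError("the triangulation number must be at least 1")
    composition = {
        "proteins": 60 * t_number,
        "pentamers": 12,
        "hexamers": 10 * (t_number - 1),
    }
    # Consistency check: 5 proteins per pentamer and 6 per hexamer.
    assert 5 * composition["pentamers"] + 6 * composition["hexamers"] == composition["proteins"]
    return composition
```

For a T3 capsid such as CCMV this gives 180 proteins in 12 pentamers and 20 hexamers.
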
The mechanostability of capsids has been studied through nanoindentation [63].
The method has been applied to fewer than 10 capsids, including CCMV [64, 65]
and MVM (parvovirus minute virus of mice) [66, 67]. The latter is a T1 capsid. We
have applied the coarse grained model described here to 35 empty capsids [68, 69]
for which the full native structure is known and deposited in the VIPERdb database
[70]. The nanoindentation has been implemented by placing a capsid between two
flat repulsive planes and by reducing their separation, s, at a constant rate of 0.005
Å/τ which is equal to the pulling speed used in our theoretical stretching studies.
(Introducing curvature to the squeezing objects, such as the tip of the AFM, yields
similar results [69].) Figure 13 shows two trajectories corresponding to T = 0.3 ε/k_B
for CCMV and two trajectories for its mutant 1ZA7. Both structures have the same
initial elasticity as defined by the slope of the F(s) curve at the largest values of s.
However, their yield point forces, Fc , at which the F(s) curves dip down are distinct:
they differ by 1.7 ε/Å as summarised in Table 1. At the yield point, the quasispherical
structure collapses into a pancake-like object. The collapse is irreversible within short
time scales and retraction of the planes does not retrace the curve [68]. A schematic
representation of a squeezed conformation of CCMV just past the yield point is shown
in Fig. 14 where it is compared to a similar representation of the native state. The
squeezing process is seen to affect primarily the regions near the indenting planes,
as discussed further in Ref. [68]. The retracing on retraction does take place in the
initial elastic regime. The retracing is approximate due to the presence of thermal
fluctuations. The mutation is seen to affect only the later stages of nanoindentation,
but its effect should be observable experimentally.
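
The two quantities extracted from an F(s) curve, the initial elasticity and the yield point force Fc, can be estimated as sketched below. The least-squares fit over the first few samples and the criterion of a 20% drop below the running maximum are assumptions made for illustration. With noisy data, some smoothing of F(s) would be needed first.

```python
def initial_spring_constant(separations, forces, n_points):
    """Least-squares slope of F versus compression at the start of squeezing.

    The samples are assumed to be ordered in time, i.e. from the largest
    plane separation s downwards. The compression is s[0] - s, so a capsid
    that resists squeezing gives a positive slope.
    """
    if n_points < 2:
        raise ValueError("at least two points are needed for a slope")
    xs = [separations[0] - s for s in separations[:n_points]]
    ys = list(forces[:n_points])
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def yield_point(separations, forces, drop_fraction=0.2):
    """Return (s, Fc) at the first peak followed by a sufficiently deep dip.

    The running maximum of F is tracked. The yield point is reported as
    soon as F falls below (1 - drop_fraction) times that maximum. Returns
    None if no such dip occurs.
    """
    peak = 0
    for k in range(1, len(forces)):
        if forces[k] > forces[peak]:
            peak = k
        elif forces[k] < (1.0 - drop_fraction) * forces[peak]:
            return separations[peak], forces[peak]
    return None
```

Comparing the outputs for the wild type and the mutant would then amount to checking that the slopes agree while the values of Fc differ.
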
The behavior of the F(s) curve is consistent with the experimental value of Fc,
whereas the effective spring constant is smaller by a factor of 3 [68] because of an
“emptier” representation of the structure: a residue is represented just by its Cα
atom. It is also consistent with the continuum shell-like model [71, 72]. However,
the strain field in the molecular model is different [68]. In particular, the molecular
model predicts no bulging out of the capsid at the “equator”, i.e. half-way between
the squeezing planes.

Fig. 13 Force of resistance to nanoindentation of the CCMV capsid as a function of separation
between the squeezing planes. The two solid lines show two trajectories corresponding to the wild
type capsid and the dashed lines to its mutant

Fig. 14 A coarse grained representation of the CCMV capsid in the native state (left panel) and
when the separation between the squeezing planes is equal to 164 Å (right panel). The planes are
not shown but they are placed one above and another below the capsid. The figure shows two panels
taken from Fig. 11 in Ref. [68] which also shows four additional stages in the indentation process

Our simulations of 35 capsids of various symmetries [69] and comprising up to
135 780 residues have yielded a variety of behaviors: qualitatively different F(s)
plots (for instance, with multiple humps), spring constants varying across a factor of
20, and a broad range of the values of Fc. CCMV is among the weakest of the capsids
studied: Fc ∼ 5.5 ε/Å (at the selected rate of squeezing) whereas HRV (human
rhinovirus) is among the strongest: Fc ∼ 32 ε/Å i.e. of order 4 nN. HRV is a T3p
virus. Two squeezing trajectories for HRV are shown in Fig. 15. The figure also
shows examples of trajectories for CPMV (Fc ∼ 15 ε/Å) and MVM (Fc ∼ 8.7 ε/Å).
The question we ask now is whether the values of Fc relate to the values of Fmax
derived for the single chains. The data collected in Table 1 suggest that they do
not. For instance, the single chain results for HRV come with the smallest values
of Fmax whereas Fc is found to be the biggest. The largest value of Fmax in Table 1
is for FPV (with capsomers made of one chain) but its Fc is median. We find also
no correlations between Fc and the effective spring constants of the capsids. One
might expect that resistance of capsids to squeezing should grow with the growing
mechanostability of their building blocks – proteins or capsomers – but this is not what
we observe. The reason is that nanoindentation appears to be probing different elastic
modes within proteins than those involved in stretching. Elucidating the exact nature
of these differences remains to be done.
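
The absence of a clear relation between Fmax and Fc can be checked against the entries of Table 1 with a simple Pearson correlation coefficient. The sketch below is an illustration only: representing each capsid by the largest of its single-chain Fmax values is an assumption (the mean or the smallest value could be used instead), and the two capsids without Fmax entries are left out.

```python
from math import sqrt

# (acronym, largest single-chain Fmax, Fc), both in units of epsilon/Å,
# copied from Table 1; SPMV and CPMV have no Fmax entries and are omitted.
TABLE_1 = [
    ("MVM", 2.2, 8.7),
    ("FPV", 2.7, 13.0),
    ("CCMV", 1.75, 5.5),
    ("1ZA7", 2.0, 6.7),
    ("NV", 1.9, 12.0),
    ("HRV", 2.1, 32.0),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fmax_fc_correlation(table=TABLE_1):
    """Correlation between single-chain Fmax and capsid Fc."""
    return pearson([row[1] for row in table], [row[2] for row in table])
```

For these six capsids the coefficient comes out small (well below 0.5 in magnitude), which is consistent with the statement that the two kinds of strength are not correlated. Six data points are of course too few for a firm statistical conclusion.
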

Fig. 15 The F(s) curves for three virus capsids as indicated. One trajectory is
shown for CPMV and two trajectories are shown for both HRV and MVM

8 Self-assembly of Model Proteins into Virus Capsids

We now consider how proteins combine into virus capsids. This problem, so far, has
been studied by using models involving some rigid objects, typically full capsomers,
with some creatively invented directional couplings that could bind them [73–80].
None of these models considers the capsid as being made explicitly of proteins –
proteins that keep changing their shape and are endowed with intra- and inter-protein
interactions. Currently, only the all-atom models take the protein perspective into
account, but they have never been used in the context of aggregation. The structure-
based model of proteins we have described here is probably the simplest system that
allows for studies of the capsid disassembly and reassembly at the molecular level
and by the methods of molecular dynamics instead of Monte Carlo usually associated
with the rigid objects.
We have initiated this program of research for single capsids of SPMV and CCMV
[81]. We have considered two cases: the empty capsids and with the molecules of
RNA inside. In our approach, a capsid is dissociated by an application of a high
temperature for a variable period and then encouraged to reassemble by restoring
the room temperature. The reassembly of the capsid proceeds to various extents,
depending on the nature of the dissociated state, but is rarely complete because there
is misfolding and, in addition, some proteins depart too far unless the process takes
place in a confined space.
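
The degree of dissociation or reassembly can be monitored through the fraction of native inter-protein contacts that are disrupted. The sketch below assumes that a contact counts as disrupted when the distance between its two Cα atoms exceeds 1.5 times the native distance. This threshold, the data layout and the function name are illustrative assumptions rather than the criterion used in Ref. [81].

```python
from math import dist


def disrupted_fraction(coords, inter_contacts, cutoff_factor=1.5):
    """Fraction of native inter-protein contacts that are currently broken.

    coords maps a residue identifier, e.g. (chain, index), to its (x, y, z)
    position in Å. inter_contacts is a list of (residue_a, residue_b,
    native_distance) with the two residues on different chains.
    """
    if not inter_contacts:
        raise ValueError("no inter-protein contacts given")
    broken = sum(
        1
        for a, b, native_distance in inter_contacts
        if dist(coords[a], coords[b]) > cutoff_factor * native_distance
    )
    return broken / len(inter_contacts)
```

Evaluating this quantity at the end of the high-temperature stage and again after the evolution at room temperature gives percentages of the kind used to characterise denatured and reassembled states.
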
Figure 16 illustrates the reassembly process in an open space for two starting
denatured states of the empty CCMV. A fuller discussion of the process, for various
starting conformations, can be found in Ref. [81]. Further studies should allow for
a number of capsids (not just one). In addition, the space should be constrained so
that one is able to observe more completely assembled structures.
In this chapter, we have explained the workings of the coarse-grained model of
proteins based on the knowledge of their native structures. The model may provide
a first description of a system of interest that allows for identification of its most
important features. The model may then serve as a scaffolding for more elaborate
approaches. We have focused on proteins that are parts of virus capsids and showed
that mutations in these proteins would yield different patterns of the stretching curves.
The values of Fmax of the capsidic proteins are seen not to be correlated with the
strength of resistance to nanoindentation of the capsids.
The structure-based model can be empirically generalized to consider the behavior
of proteins under the conditions of the solvent flow [40] or at the air-water and oil-
water interfaces [82, 83]. The former requires adding a flow-related term to the
drag force. Inclusion of the hydrodynamic interactions requires adding the diffusion
tensor to the equations of motion as done in Ref. [55] that shows that the interactions
accelerate folding. Studying proteins at the interfaces involves adding interface-
related forces that couple to the hydropathy indices of residues. These forces deform
the proteins and pin them to the interface. One application of this approach is to
explain stabilization of the foam in beer [83]: the barley protein LTP1 and its isoform
LTP1b, that contains a ligand, provide a coating of the bubbles.

Fig. 16 Examples of the empty CCMV capsid assembly after thermal denaturation at temperature
0.94 ε/k_B. The top-left structure resulted from denaturation lasting for 2000 τ; 69% of the
inter-protein contacts are disrupted in this structure. The bottom-left structure was obtained through
denaturation lasting for 4000 τ, which disrupted 89% of the inter-protein contacts. The corresponding
structures on the right are obtained by a subsequent evolution of 8000 τ at the room temperature.
In the state shown in the upper-right panel, 3% of the inter-protein contacts are disrupted; in the
lower-right panel, 29%

Acknowledgements M. Cieplak is grateful to M. Chwastyk, P. Cieplak, K. Modro, M. Sikora, and
T. Włodarski for discussions and help with some figures and data. The computer resources were
financed by the European Regional Development Fund under the Operational Programme Innovative
Economy NanoFun POIG.02.02.00-00-025/09. The research on the revised version of this chapter
has been supported by the Polish National Science Centre Grant No. 2014/15/B/ST3/01905.

References

1. Neuman, K.C., Nagy, A.: Single-molecule force spectroscopy: optical tweezers, magnetic
tweezers and atomic force microscopy. Nat. Methods 5, 491–505 (2008)
2. Weiss, S.: Fluorescence spectroscopy of single biomolecules. Science 283, 1676–1683 (1999)
3. Schuler, B., Lipman, E.A., Eaton, W.A.: Probing the free-energy surface for protein folding
with single-molecule fluorescence spectroscopy. Nature 419, 743–747 (2002)
4. Yang, H., Luo, G.B., Karnchanaphanurach, P., Louie, T.M., Rech, I., Cova, S., Xun, L.Y., Xie,
X.S.: Protein conformational dynamics probed by single-molecule electron transfer. Science
302, 262–266 (2003)
5. Borgia, M.B., Borgia, A., Best, R.B., Steward, A., Nettels, D., Wunderlich, B., Schuler, B.,
Clarke, J.: Single-molecule fluorescence reveals sequence-specific misfolding in multidomain
proteins. Nature 474, 662–665 (2011)
6. Carrion-Vazquez, M., Oberhauser, A.F., Fowler, S.B., Marszalek, P.E., Broedel, P.E.: Mechanical
and chemical unfolding of a single protein: a comparison. Proc. Natl. Acad. Sci. USA 96,
3694–3699 (1999)
7. Fernandez, J.M., Li, H.B.: Force-clamp spectroscopy monitors the folding trajectory of a single
protein. Science 303, 1674–1678 (2004)
8. Cecconi, C., Shank, E.A., Bustamante, C., Marqusee, S.: Direct observation of the three-state
folding of a single protein molecule. Science 309, 2057–2060 (2005)
9. Carrion-Vazquez, M., Cieplak, M., Oberhauser, A.F.: Protein mechanics at the single-molecule
level. In: Meyers R.A. (ed.) Encyclopedia of Complexity and Systems Science, pp. 7026–7050.
Springer, New York (2009)
10. Crampton, N., Brockwell, D.J.: Unravelling the design principles for single protein mechanical
strength. Curr. Opin. Struct. Biol. 20, 508–517 (2010)
11. Del Rio, A., Perez-Jimenez, R., Liu, R.C., Roca-Cusachs, P., Fernandez, J.M., Sheetz, M.P.:
Stretching single talin rod molecules activates vinculin binding. Science 323, 638–641 (2009)
12. Vogel, V.: Mechanotransduction involving multimodular proteins: converting force into bio-
chemical signals. Annu. Rev. Biophys. Biomol. Struct. 35, 459–488 (2006)
13. Hervas, R., Oroz, J., Galera-Prat, A., Goni, O., Valbuena, A., Vera, A.M., Gomez-Sicilia, A.,
Losada-Urzaiz, F., Uversky, V.N., Menendez, M., Laurents, D.V., Bruix, M., Carrion-Vazquez,
M.: Common features at the start of the neurodegeneration cascade. PLoS Biol. 10, e1001335
(2012)
14. Rief, M., Gautel, M., Oesterhelt, F., Fernandez, J.M., Gaub, H.E.: Reversible unfolding of
individual titin immunoglobulin domains by AFM. Science 276, 1109–1112 (1997)
15. Improta, S., Politou, A.S., Pastore, A.: Immunoglobulin-like modules from titin I-band: extensible
components of muscle elasticity. Structure 4, 323–337 (1996)
16. Marszalek, P.E., Lu, H., Li, H.B., Carrion-Vazquez, M., Oberhauser, A.F., Schulten, K., Fernan-
dez, J.M.: Mechanical unfolding intermediates in titin modules. Nature 402, 100–103 (1999)
17. Lu, H., Schulten, K.: Steered molecular dynamics simulation of conformational changes of
immunoglobulin domain I27 interprete atomic force microscopy observations. Chem. Phys.
247, 141–153 (1999)
18. Paci, E., Karplus, M.: Unfolding proteins by external forces and temperature: the importance
of topology and energetics. Proc. Natl. Acad. Sci. USA 97, 6521–6526 (2000)
19. Bockelmann, U., Essevaz-Roulet, B., Heslot, F.: Molecular stick-slip motion revealed by open-
ing DNA with piconewton forces. Phys. Rev. Lett. 79, 4489–4492 (1997)
20. Hoang, T.X., Cieplak, M.: Molecular dynamics of folding of secondary structures in Go-like
models of proteins. J. Chem. Phys. 112, 6851–6862 (2000)
21. Cieplak, M., Hoang, T.X., Robbins, M.O.: Folding and stretching in a Go-like model of titin.
Proteins: Struct. Funct. Genet. 49, 114–124 (2002)
22. Cieplak, M., Hoang, T.X.: Universality classes in folding times of proteins. Biophys. J. 84,
475–488 (2003)
23. Cieplak, M., Hoang, T.X., Robbins, M.O.: Thermal effects in stretching of Go-like models of
titin and secondary structures. Proteins: Struct. Funct. Bio. 56, 285–297 (2004)
24. Sułkowska, J.I., Cieplak, M.: Mechanical stretching of proteins—a theoretical survey of the
Protein Data Bank. J. Phys.: Cond. Mat. 19, 283201 (2007)
25. Yang, L.J., Tan, C.H., Hsieh, M.J., Wang, J.M., Duan, Y., Cieplak, P., Caldwell, J., Kollman,
P.A., Luo, R.: New-generation amber united-atom force field. J. Phys. Chem. B 110, 13166–
13176 (2006)
26. Go, N.: Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12, 183–210 (1983)
27. Abe, H., Go, N.: Noninteracting local-structure model of folding and unfolding transition in
globular proteins. II. Application to two-dimensional lattice proteins. Biopolymers 20, 1013–
1031 (1981)
28. Sali, A., Shakhnovich, E., Karplus, M.: How does a protein fold. Nature 369, 248–251 (1994)
29. Shrivastava, I., Vishveshwara, S., Cieplak, M., Maritan, A., Banavar, J.R.: Lattice model for
rapidly folding protein-like heteropolymers. Proc. Natl. Acad. Sci. USA 92, 9206–9209 (1995)
30. Sułkowska, J.I., Cieplak, M.: Selection of optimal variants of Go-like models of proteins
through studies of stretching. Biophys. J. 95, 3174–3191 (2008)
31. Cieplak, M., Sułkowska, J.I.: Structure-based models of biomolecules: stretching of proteins,
dynamics of knots, hydrodynamic effects, and indentation of virus capsids. In: Koliński, A.
(ed.) Chapter 8 in Multiscale Approaches to Protein Modeling: Structure Prediction, Dynamics,
Thermodynamics and Macromolecular Assemblies, pp. 179–208. Springer, New York (2010)
32. Clementi, C., Nymeyer, H., Onuchic, J.N.: Topological and energetic factors: what determines
the structural details of the transition state ensemble and "en-route" intermediates for protein
folding? An investigation for small globular proteins. J. Mol. Biol. 298, 937–953 (2000)
33. Karanicolas, J., Brooks III, C.L.: The origins of asymmetry in the folding transition states of
protein L and protein G. Protein Sci. 11, 2351–2361 (2002)
34. Cieplak, M.: Cooperativity and contact order in protein folding. Phys. Rev. E 69, 031907 (2004)
35. Wallin, S., Zeldovich, K.B., Shakhnovich, E.I.: Folding mechanics of a knotted protein. J. Mol.
Biol. 368, 884–893 (2007)
36. Tsai, J., Taylor, R., Chothia, C., Gerstein, M.: The packing density in proteins: Standard radii
and volumes. J. Mol. Biol. 290, 253–266 (1999)
37. Settanni, G., Hoang, T.X., Micheletti, C., Maritan, A.: Folding pathways of prion and doppel.
Biophys. J. 83, 3533–3541 (2002)
38. Wołek, K., Gómez-Sicilia, Á., Cieplak, M.: Determination of contact maps in proteins: a
combination of structural and chemical approaches. J. Chem. Phys. 143, 243105 (2015)
39. Veitshans, T., Klimov, D., Thirumalai, D.: Protein folding kinetics: timescales, pathways and
energy landscapes in terms of sequence dependent properties. Fold. Des. 2, 1–22 (1997)
40. Szymczak, P., Cieplak, M.: Stretching of proteins in a uniform flow. J. Chem. Phys. 125, 164903
(2006)
41. Valbuena, A., Oroz, J., Hervas, R., Vera, A.M., Rodriguez, D., Menendez, M., Sułkowska, J.I.,
Cieplak, M., Carrion-Vazquez, M.: On the remarkable mechanostability of scaffoldins and the
mechanical clamp motif. Proc. Natl. Acad. Sci. USA 106, 13791–13796 (2009)
42. Sikora, M., Sułkowska, J.I., Cieplak, M.: Mechanical strength of 17 132 model proteins and
cysteine slipknots. PLoS Comp. Biol. 5, e1000547 (2008)
43. Wołek, K., Cieplak, M.: Criteria for folding in structure-based models of proteins. J. Chem.
Phys. 144, 185102 (2016)
44. Sikora, M., Cieplak, M.: Mechanical stability of multidomain proteins and novel mechanical
clamps. Proteins: Struct. Funct. Bioinf. 79, 1786–1799 (2011)
45. Sikora, M., Sułkowska, J.I., Witkowski, B.S., Cieplak, M.: BSDB: the biomolecule stretching
database. Nucl. Acid. Res. 39, D443–D450 (2011)
46. Chen, J., Callis, P.R., King, J.: Mechanism of the very efficient quenching of tryptophan fluo-
rescence in human γ D- and γ S-crystallins: the γ -crystallin fold may have evolved to protect
tryptophan residues from ultraviolet photodamage. Biochemistry 48, 3708–3716 (2009)
47. Flaugh, S.L., Kosinski-Collins, M.S., King, J.: Interdomain side-chain interactions in human
γ D-crystallin influencing folding and stability. Prot. Sci. 14, 2030–2043 (2005)
48. McDonald, N.Q., Lapatto, R., Murray-Rust, J., Gunning, J., Wlodawer, A., Blundell, T.L.: New
protein fold revealed by a 2.3-A resolution crystal structure of nerve growth factor. Nature 354,
411–414 (1991)
49. Murray-Rust, J., McDonald, N.Q., Blundell, T.L., Hosang, M., Oefner, C., Winkler, F., Brad-
shaw, R.A.: Topological similarities in TGF-beta 2, PDGF-BB and NGF define a superfamily
of polypeptide growth factors. Structure 1, 153–159 (1993)
50. Sun, P.D., Davies, D.R.: The cystine-knot growth-factor superfamily. Annu. Rev. Biophys.
Biomol. Struct. 24, 269–291 (1995)
51. Iyer, S., Acharya, K.R.: The cystine signature and molecular-recognition processes of the
vascular endothelial growth factor family of angiogenic cytokines. FEBS J. 278, 4304–4322
(2011)
52. Peplowski, L., Sikora, M., Nowak, W., Cieplak, M.: Molecular jamming—the cysteine slipknot
mechanical clamp in all-atom simulations. J. Chem. Phys. 134, 085102 (2011)
53. Sikora, M., Cieplak, M.: Cystine plug and other novel mechanisms of large mechanical stability
in dimeric proteins. Phys. Rev. Lett. 109, 208101 (2012)
54. Sikora, M., Cieplak, M.: Formation of cystine slipknots in dimeric proteins. PLoS ONE 8,
e57443 (2013)
55. Niewieczerzał, S., Cieplak, M.: Hydrodynamic interactions in protein folding. J. Chem. Phys.
21, 124905 (2009)
56. Plaxco, K.W., Simons, K.T., Baker, D.: Contact order, transition state placement and the refold-
ing rates of single domain proteins. J. Mol. Biol. 277, 985–994 (1998)
57. Plaxco, K.W., Simons, K.T., Ruczinski, I., Baker, D.: Topology, stability, sequence, and length:
defining the determinants of two-state protein folding kinetics. Biochemistry 39, 11177–11183
(2000)
58. Cieplak, M., Hoang, T.X., Robbins, M.O.: Stretching of proteins in the entropic limit. Phys.
Rev. E 69, 011912 (2004)
59. Yang, G., Cecconi, C., Baase, W.A., Vetter, I.R., Breyer, W.A., Haack, J.A., Matthews, B.W.,
Dahlquist, F.W., Bustamante, C.: Solid-state synthesis and mechanical unfolding of polymers
of T4 lysozyme. Proc. Natl. Acad. Sci. USA 97, 139–144 (2000)
60. Janowski, R., Kozak, M., Jankowska, E., Grzonka, Z., Grubb, A., Abrahamson, M., Jaskólski,
M.: Human cystatin C, an amyloidogenic protein dimerizes through three-dimensional domain
swapping. Nature Struct. Biol. 8, 316–320 (2001)
61. Chwastyk, M., Jaskólski, M., Cieplak, M.: The volume of cavities in proteins and virus capsids.
Proteins 84, 1275–1286 (2016)
62. Caspar, D., Klug, A.: Physical principles in the construction of regular viruses. Cold Spring
Harbor Symp. Quant. Biol. 27, 1–24 (1962)
63. Roos, W.H., Bruisma, R., Wuite, G.J.L.: Physical virology. Nat. Phys. 6, 733–743 (2010)
64. Michel, J.P., Ivanovska, I.L., Gibbons, M.M., Klug, W.S., Knobler, C.M., Wuite, G.J.L.,
Schmidt, C.F.: Nanoindentation studies of full and empty viral capsids and the effects of cap-
sid protein mutations on elasticity and strength. Proc. Natl. Acad. Sci. USA 103, 6184–6189
(2006)
65. Klug, W.S., Bruinsma, R.F., Michel, J.-P., Knobler, C.M., Ivanovska, I.L., Schmidt, C.F., Wuite,
G.J.L.: Failure of viral shells. Phys. Rev. Lett. 97, 228101 (2006)
66. Carrasco, C., Carreira, A., Schaap, I.A.T., Serena, P.A., Gomez-Herrero, J., Mateu, M.G., de
Pablo, P.J.: DNA-mediated anisotropic mechanical reinforcement of a virus. Proc. Natl. Acad.
Sci. USA 103, 13706–13711 (2006)
67. Carrasco, C., Castellanos, M., de Pablo, P.J., Mateu, M.G.: Manipulation of the mechanical
properties of a virus by protein engineering. Proc. Natl. Acad. Sci. USA 105, 4150–4155 (2008)
68. Cieplak, M., Robbins, M.O.: Nanoindentation of virus capsids in a molecular model. J. Chem.
Phys. 132, 015101 (2010)
69. Cieplak, M., Robbins, M.O.: Nnaoindentation of 35 virus capsids in a molecular model: relating
mechanical properties to structure. PLoS ONE 8, e63640 (2013)
70. Carrillo-Tripp, M., Shepherd, C.M., Borelli, I.A., Venkataraman, S., Lander, G., Natarajan, P.,
Johnson, J.E., Brooks III, C.L., Reddy, V.S.: VIPERdb2: and enhanced and web API enabled
relational database for structural virology. Nucl. Acids Res. 37, D436–D442 (2009). http://
viperdb.scripps.edu/
71. Gibbons, M.M., Klug, W.S.: Nonlinear finite-element analysis of nanoindentation of viral
capsids. Phys. Rev. E 75, 031901 (2007)
72. Gibbons, M.M., Klug, W.S.: Influence of nonuniform geometry on nanoindentation of viral
capsids. Biophys. J. 95, 3640–3649 (2008)
330 M. Cieplak

73. Endres, D., Zlotnick, A.: Model-based analysis of assembly kinetics for virus capsids or other
spherical polymers Biophys. J. 83, 1217–1230 (2002)
74. Wales, D.J.: The energy landscape as a unifying theme in molecular science. Phil. Trans. R.
Soc. 363, 357–377 (2005)
75. Johnston, I.G., Louis, A.A., Doye, J.P.K.: Modelling the self-assembly of virus capsids. J.
Phys.: Cond. Matter 22, 104101 (2010)
76. Elrad, O.M., Hagan, M.F.: Mechanisms of size control and polymorphism in viral capsid
assembly. Nano Lett. 8, 3850–3857 (2008)
77. Elrad, O.M., Hagan, M.F.: Encapsulation of a polumer by an icosahedral virus. Phys. Biol. 7,
045003 (2010)
78. Rapaport, D.C.: Role of reversibility in viral capsid growth: a paradigm for self-assembly. Phys.
Rev. Lett. 101, 186101 (2008)
79. Zlotnick, A., Porterfield, J.Z., Wang, J.C.-Y.: To build a virus on a nucleic acid substrate.
Biophys. J. 104, 1595–1604 (2013)
80. Garmann, R.F., Comas-Garcia, M., Gopal, A., Knobler, C.M., Gelbart, W.M.: The assembly
pathway of an icosahedral single-stranded RNA virus depends on the strength of inter-subunit
attractions. J. Mol. Biol. 426, 1050–1060 (2014)
81. Wołek, K., Cieplak, M.: Self-assembly of model proteins into virus capsids. J. Phys. Cond.
Matter 47, 474003 (2017)
82. Cieplak, M., Allen, D.B., Leheny, R.L., Reich, D.H.: Proteins at air-water interfaces: a coarse-
grained approach. Langmuir 30, 12888–96 (2014)
83. Zhao, Y., Cieplak, M.: Structural changes in barley protein LTP1 isoforms at air-water inter-
faces. Langmuir 33, 4769–4780 (2017)
Computer Modelling of the Lipid Matrix
of Biomembranes

Marta Pasenkiewicz-Gierula and Michał Markiewicz

Abstract The best-recognised functions of biomembranes are to separate and protect
the cell or the organelle from the environment and to enable communication
and transport between their interior and exterior. The main structural element of any
biomembrane is its lipid matrix, which, in most cases, is a lipid bilayer. The lipid matrix
is a supramolecular dynamic structure in which molecules undergo a broad range of
motions. Such structures are difficult to study experimentally; in contrast, classical
molecular modelling methods are well suited for this purpose. In this chapter
we present computational approaches, based on classical molecular modelling with
atomic resolution, to the study of lipid bilayers, together with the limitations of these
approaches, the bilayer models studied, and the results obtained using these methods.
The necessity of model validation is stressed.

M. Pasenkiewicz-Gierula (B) · M. Markiewicz
Department of Computational Biophysics and Bioinformatics,
Faculty of Biochemistry, Biophysics, and Biotechnology,
Jagiellonian University, ul. Gronostajowa 7, 30-387 Krakow, Poland
e-mail: marta.pasenkiewicz-gierula@uj.edu.pl

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_11

1 Introduction (Functions of Biomembranes, Molecular Composition, Lipid Matrix as the Basic Structural Element)

Biomembranes are omnipresent in the living world. Each cell is bounded by a
cell (plasma) membrane. Also, sub-cellular structures (organelles and nucleus) are
enclosed in internal membranes (Fig. 1). Biomembranes are thin lamellar structures
consisting of a great number of molecules of several chemical types, among them
proteins, peptides, phospholipids, glycolipids, sterols, and terpenoids. The main function of a
biomembrane is to separate and protect the cell or the organelle from the environment
and to enable communication and transport between their interior and exterior. The
properties of biomembranes are largely determined by two components: (1) membrane
proteins, and (2) the lipid matrix, which is the main structural element of any
biomembrane. The matrix determines the bulk membrane properties and provides a
proper dynamic and active milieu for membrane proteins such that they can perform
their biological functions, among which are inter-compartmental communication
and controlled transport of various types of molecules. The matrix also constitutes a
protective barrier that prevents the uncontrolled flow of larger polar molecules and
ions from or to the cytoplasm, although small molecules such as oxygen and
carbon dioxide and, to a smaller extent, water readily diffuse through membranes. The
integrity of the lipid matrix is assured by weak intermolecular interactions, mainly
hydrogen bonding, dispersion and electrostatic interactions.

Fig. 1 Schematic picture of animal cell membranes. The plasma membrane and internal membranes
are indicated

In most cases, the lipid matrix is a phospholipid bilayer whose molecular composition
varies among cell types within the same organism and depends on the cell
function [216]. The composition may change with time and environmental factors,
but it is strictly controlled [135]. Usually, changes in the lipid composition
result in alteration of the physical properties of the membrane, which in turn
affects the function of proteins immersed in the lipid bilayer [42]. A general feature
of the matrix is heterogeneity not only in lipid composition but also
in the lateral distribution of the lipids. Cholesterol, which is a natural
component of the animal cell plasma membrane, enhances the inhomogeneous lateral
distribution of membrane lipids by stimulating the formation of transient membrane
domains enriched in cholesterol. Moreover, cholesterol locally modulates the physical
properties of the bilayer. Both effects are crucial for the biological activity of membrane
proteins and peptides, which depends on the lipid composition and physical state of
their local surroundings (domain) in the membrane.
A lipid bilayer is a supramolecular soft liquid-crystalline material of certain struc-
tural features and physical properties that are key to the biological functions of
biomembranes. Bilayer properties follow directly from the structural characteristics
of lipids, the main bilayer building-blocks, and of water. Lipid molecules are amphi-
pathic and in water spontaneously form bilayers or other ordered aggregates. This
chapter is devoted to the computer modelling of lipid bilayers predominantly com-
posed of phospholipids, mainly phosphatidylcholine (PC) (Fig. 2), and of cholesterol
(Chol) (Figs. 2 and 3), which model the lipid matrices of animal cell membranes using
molecular modelling methodology with atomic resolution. Excellent reviews of
the earlier and the later stages of the computer modelling of biologically relevant
lipid systems can be found in Refs. [10, 47, 187, 206] and Refs. [104, 113,
217], respectively.

2 Computer Models of Biomembranes

2.1 Particular Features of the Lipid Matrix (Lamellar Structure, Disorder of Hydrocarbon Chains, Hydration of Lipid Head Groups, Multi-scale Dynamics)

The lipid matrix of a biologically active biomembrane is in the liquid-crystalline
phase. In this bilayer phase the constituting phospholipid molecules undergo a broad
range of motions. The fastest of these, excluding the vibrations of covalent bonds
and valence angles, is intramolecular trans-gauche isomerisation causing constant
conformational changes in the lipid acyl chains. On the slower side there are trans-
lational (lateral and transversal) and rotational (about parallel and perpendicular
axes) diffusions of the whole lipid molecule or a fragment of it as well as collective
motions of groups of lipid molecules. The internal flexibility and motional freedom
of phospholipids lead to the conformational disorder of the acyl chains of the
matrix phospholipids (Fig. 4) and superposition of motions occurring on different
time scales, which, together with the inhomogeneous lateral distribution of lipids,
results in dynamic heterogeneity of the matrix. This dynamic heterogeneity was fully
recognised in one of the first rigorous structural studies of a fluid lipid bilayer [225,
226].
For the reasons stated above, even though a lipid matrix constitutes
only one element of the biomembrane, it is not easy to study experimentally.
Thus, in most cases, biophysical experimental studies of membranes are carried
out on so-called model membranes, which are generally bilayers of a single lipid type,
or at most of a few lipid types, arranged in uni- or multi-lamellar liposomes or flat
membranes.

Fig. 2 Chemical structures of the main fragments of commonly occurring phospholipids and
cholesterol. On the left-hand side are phosphatidylcholine (PC), phosphatidylethanolamine (PE),
phosphatidylserine (PS), phosphatidylglycerol (PG) heads; in the middle are glycerol (GLY) and
sphingosine (SPH) skeleton; on the right-hand side are myristoyl (M), palmitoyl (P), stearoyl (S),
oleoyl (O) acyl chains. The atoms in the PC head, glycerol skeleton, and myristoyl chain have been
numbered in accordance with Sundaralingam [200]. At the bottom there are monogalactosyldiacyl-
glycerol (MGDG) head and cholesterol (Chol) with atoms numbered in accordance with the IUPAC
convention. The chemical symbols for carbon atoms, C, and hydrogen atoms in the CH₃, CH₂ and
CH groups have been omitted

Fig. 3 A space-filling representation of the cholesterol molecule. The smooth α-face (Alpha) and
rough β-face (Beta) of cholesterol are apparent

Fig. 4 Examples of various conformations of a PC molecule. The molecules were arbitrarily chosen
from a liquid-crystalline POPC bilayer simulated for 70 ns [111]. PC molecules are in the united
atom representation and atoms are represented in standard colours

However, due to the existence of distinct horizontal regions within the bilayer [10]
with contrasting properties (water phase, interfacial region, hydrophobic core) (Fig. 5),
the conformational disorder of phospholipid acyl chains (Fig. 4) and the motional freedom of
lipid molecules, even model membranes create experimental difficulties. In effect,
experimental methods provide detailed information on global bilayer parameters
such as the membrane width and average surface area per lipid, e.g. [95, 132, 133,
225], the thickness of the hydration shell, e.g. [132, 133, 165], and the phase state, e.g.
[96, 103]. However, they provide only averaged conformational and motional
characteristics of bilayer lipids, where the averaging is strictly related to the time
window of the experimental method used, e.g. [118].
As has already been stressed, the main characteristic of a lipid bilayer is the
dynamics of the constituent lipid molecules. A single molecule contributes to the
global properties of the bilayer, but its actual conformational state does not have
much significance as it changes over a short time scale. Nevertheless, to understand
the supramolecular, extended, integral, and flexible structure of a lipid bilayer, the
details of the dynamical behaviour of individual lipid molecules in the bilayer must
be well understood.

Fig. 5 Snapshot of a liquid-crystalline POPC bilayer at the end of 70-ns MD simulation [111].
The POPC molecules are in the united atom representation and atoms are represented in standard
colours. Water molecules are blue

Detailed information about the dynamic structure of the model membrane and
of each lipid molecule as well as the motional events that occur over time scales
up to microseconds can be obtained using classical molecular modelling methods.
In principle, this methodology has atomic spatial resolution and a time resolution
on the femtosecond time scale; thus, it is particularly well suited for studying such
disordered and dynamic structures as lipid bilayers. Nevertheless, models generated
with molecular modelling methodology have to be validated against a range of
experimentally obtained properties, e.g. [7, 160, 161].
Amphipathic phospholipid molecules can form a lamellar structure (such as the lipid
bilayer) only in the presence of water, and this is a spontaneous self-assembly process.
In addition to the phospholipid shape (the ratio of the cross-section of the head
group to that of the acyl chains), water, ions, and temperature determine the lyotropic
phase state (e.g. lamellar, hexagonal, micellar, cubic) of the assembly. Above the main
thermotropic phase transition temperature, when the phospholipid acyl chains are in
a melted state (disordered), PC bilayers are in the lamellar phase when the system
composition is ~40 wt% water, e.g. [132, 133, 165, 223]. Other phospholipids like
phosphatidylethanolamine (PE), phosphatidylserine (PS), sphingomyelin (SM), and
phosphatidylglycerol (PG) etc. (Fig. 2) require different amounts of water, depend-
ing on the charge and volume of their polar head groups, their capacity as H-bond
donors, and also on the length and degree of unsaturation (the number of double
C=C bonds) of their hydrocarbon chains [24, 119, 165, 227]. In multi-lamellar lipo-
somes, the equilibrium number of water molecules that hydrate a saturated PC bilayer
is ~30/PC [50, 132, 207, 211] of which, on average, ~5 water molecules are strongly
bound by a PC [50, 132].
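The two hydration figures quoted above (~30 water molecules per PC and ~40 wt% water) can be cross-checked with simple arithmetic. The following is a minimal sketch only: the function names are hypothetical, and the molar mass of DPPC (approximately 734 g/mol, from its chemical formula) is an assumption that does not come from this chapter.

```python
# Illustrative helpers; the names are hypothetical and not part of any package.
WATER_MOLAR_MASS = 18.015  # g/mol
DPPC_MOLAR_MASS = 734.0    # g/mol, approximate; assumption, not from the chapter


def water_weight_percent(n_water_per_lipid, lipid_molar_mass=DPPC_MOLAR_MASS):
    """Weight percent of water for a given water/lipid molar ratio."""
    m_water = n_water_per_lipid * WATER_MOLAR_MASS
    return 100.0 * m_water / (m_water + lipid_molar_mass)


def waters_per_lipid(weight_percent, lipid_molar_mass=DPPC_MOLAR_MASS):
    """Inverse relation: water molecules per lipid at a given wt% of water."""
    fraction = weight_percent / 100.0
    return fraction / (1.0 - fraction) * lipid_molar_mass / WATER_MOLAR_MASS


if __name__ == "__main__":
    print(f"30 waters/PC -> {water_weight_percent(30):.1f} wt% water")
    print(f"40 wt% water -> {waters_per_lipid(40):.1f} waters/PC")
```

Under these assumptions, ~30 waters/PC corresponds to slightly more than 40 wt% water, so the two figures given in the text are mutually consistent for a saturated PC of this size.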

2.2 Starting Configuration of the Computer Model and Commonly Used Force Fields

One of the first monographs to provide practical information and some theoretical
background on building realistic and reliable computer models of a lipid bilayer,
together with the related problems and limitations, is Ref. [151]. In those early days, the starting
configuration of the bilayer was created from spatially ordered phospholipid molecules
with acyl chains in the extended all-trans conformation, and, thus, the initial structure
corresponded to the crystal state, e.g. Refs. [30, 146]. However, the biologically
active lipid matrix is in the liquid-crystalline phase, where phospholipid acyl chains
are in a melted state. This means that, on average, a certain percentage of torsion
angles in a chain (~25%) are in the gauche conformation and the probability of the
gauche conformation changes little along the chain, except for the last torsion angle,
where the probability of gauche is higher [92, 97]. MD simulation of a lipid bilayer,
which was initially in the crystal state, required a long equilibration time and part of
the equilibration process was often carried out at an elevated temperature to speed
up breaking of the crystal order, e.g. Refs. [142, 146]. It thus seems more rational
to start the simulation from a random initial configuration of lipid molecules in the
bilayer, with disordered acyl chains (gauche conformations distributed randomly over
the torsion angles along the chains in accordance with the equilibrium population),
as has been done e.g. in Refs. [11, 31].
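The idea of seeding acyl chains with randomly distributed gauche conformations can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the cited works: the function name is hypothetical, the dihedral convention (trans = 180°, gauche = ±60°) is the usual one, the default gauche probability of 0.25 follows the ~25% quoted above, and the higher value for the last torsion (0.35) is a placeholder chosen only to reflect the qualitative statement in the text.

```python
import random

TRANS = 180.0         # degrees
GAUCHE_PLUS = 60.0    # degrees
GAUCHE_MINUS = -60.0  # degrees


def random_chain_torsions(n_torsions, p_gauche=0.25, p_gauche_last=0.35, rng=None):
    """Draw one torsion state per C-C-C-C dihedral of an acyl chain.

    p_gauche_last is a hypothetical placeholder for the higher gauche
    probability of the terminal torsion; it is not a value from the chapter.
    """
    rng = rng if rng is not None else random.Random()
    torsions = []
    for k in range(n_torsions):
        is_last = (k == n_torsions - 1)
        p = p_gauche_last if is_last else p_gauche
        if rng.random() < p:
            # gauche+ and gauche- are taken as equally likely
            torsions.append(rng.choice((GAUCHE_PLUS, GAUCHE_MINUS)))
        else:
            torsions.append(TRANS)
    return torsions


if __name__ == "__main__":
    print(random_chain_torsions(14, rng=random.Random(1)))
```

A real bilayer builder would additionally have to reject or relax conformations that produce steric clashes within a chain or between neighbouring lipids; the sketch covers only the sampling of torsion states.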
In classical molecular modelling, atoms move in a conservative potential on the
potential energy surface that is calculated in the framework of the force field description
[9]. A force field comprises a functional form and a set of parameters, and the two should
be considered as a single entity [89]. The most widely used functional form has three terms
describing bonded interactions (bond stretching, angle bending, and bond rotations) and two
terms describing non-bonded interactions (van der Waals and Coulomb), and sometimes
also improper torsion and 1–4 interaction terms [89]. The potential energy
of the molecular system is an analytical function of the positions of the atoms in
the system [94]. The force field parameters are necessary to compute the value of
the total energy of the molecular system and the forces acting on each atom. A force
field can contain parameters for all atoms in the system (all-atom force field) or
parameters in which some groups of atoms, typically methyl and methylene groups,
are treated as single interaction units (united-atom force field). The most commonly used
force fields in the molecular modelling of lipid bilayers are OPLS (optimized poten-
tials for liquid simulations) [76, 77], CHARMM (chemistry at Harvard molecular
mechanics) [106], AMBER (assisted model building with energy refinement) [11,
29] and GROMOS (Groningen molecular simulation) [214]. A search of the PubMed
Central database indicates that of 196 papers on molecular modelling of the POPC
bilayer published after the year 2010, 79 used CHARMM, 40 Berger, 35 GROMOS,
20 OPLS, 10 AMBER/Lipid and 5 Slipids [68, 69] force fields. All these force fields
have similar functional forms (Eq. 1), but their parameters were adjusted to reproduce
different physico-chemical quantities of the molecular system and thus should
not be interchanged. These force fields also use different ways of assigning atom
types to atoms in the system. One should always keep in mind that due to the way in
which the parameters were derived, the force field can be used to predict only certain
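properties of a molecular system.

$$
E(\vec{R}) = \sum_{b} K_b (b - b_0)^2 + \sum_{\theta} K_\theta (\theta - \theta_0)^2
+ \sum_{\phi} \frac{V_n}{2} \left[ 1 + \cos(n\phi - \phi_0) \right]
+ \sum_{i<j} \varepsilon \left[ \left( \frac{r^*}{r_{ij}} \right)^{12} - 2 \left( \frac{r^*}{r_{ij}} \right)^{6} \right]
+ \sum_{i,j} \frac{q_i q_j}{r_{ij}} \tag{1}
$$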

The first three summations in Eq. (1) (bonded interactions) are over bonds (1–2
interactions), angles (1–3 interactions), and torsions (1–4 interactions). The last two
summations (non-bonded interactions) over pairs of atoms i and j exclude 1–2 and
1–3 interactions and often use separate parameters for 1–4 interactions as compared
with those used for atoms separated by more than three covalent bonds. Non-bonded
interactions include the “van der Waals” term (dispersion and repulsion) represented
by a Lennard-Jones 6–12 potential, and the electrostatic term, where partial charges
$q_i$ of atoms interact via Coulomb’s law. $b_0$, $\theta_0$, $K_b$, $K_\theta$, $V_n$, $\phi_0$, $\varepsilon$, $r^*$, and $q_i$ are the
potential function parameters. $\vec{R}$ represents the coordinates of the atoms present in the
molecular system [162].
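The individual terms of Eq. (1) can be written out as small functions, which makes the role of each parameter explicit. This is a minimal sketch rather than the implementation of any particular force field or simulation package: all function names are hypothetical, angles are assumed to be in radians, and the Coulomb prefactor `k_e` (which depends on the unit system and does not appear in Eq. 1) is an added assumption with a default of 1.

```python
import math


def bond_term(b, b0, k_b):
    """Harmonic bond stretching: K_b (b - b0)^2."""
    return k_b * (b - b0) ** 2


def angle_term(theta, theta0, k_theta):
    """Harmonic angle bending: K_theta (theta - theta0)^2, angles in radians."""
    return k_theta * (theta - theta0) ** 2


def torsion_term(phi, v_n, n, phi0):
    """Periodic torsion: (V_n / 2) [1 + cos(n phi - phi0)], angles in radians."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - phi0))


def lennard_jones_term(r_ij, epsilon, r_star):
    """Lennard-Jones 6-12 in the r* form of Eq. (1); minimum of -epsilon at r_ij = r*."""
    ratio = r_star / r_ij
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def coulomb_term(r_ij, q_i, q_j, k_e=1.0):
    """Coulomb interaction of two partial charges; k_e is a unit-dependent prefactor (assumption)."""
    return k_e * q_i * q_j / r_ij


def total_energy(bonds, angles, torsions, lj_pairs, charge_pairs):
    """Sum all terms; each argument is a list of argument tuples for the matching term."""
    return (
        sum(bond_term(*t) for t in bonds)
        + sum(angle_term(*t) for t in angles)
        + sum(torsion_term(*t) for t in torsions)
        + sum(lennard_jones_term(*t) for t in lj_pairs)
        + sum(coulomb_term(*t) for t in charge_pairs)
    )
```

In this form of the Lennard-Jones term, $r^*$ is the separation at the energy minimum and $\varepsilon$ is the well depth, which is why the attractive part carries the factor 2. Production codes additionally apply exclusions and scaling for 1–2, 1–3 and 1–4 pairs, as described above, which this sketch leaves to the caller.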
The functional form and parameters of a given force field are transferable, which
means that molecules of similar atom types can be modelled using the same set of
parameters and energy function [9, 89]. OPLS, CHARMM, GROMOS, and AMBER
force fields are used to model large molecular systems, and therefore their functional
forms are kept simple as a compromise between accuracy and computational
efficiency. Newer versions of the lipid force field parameters can be found in a number
of papers, namely for OPLS [84, 105, 191], CHARMM [81, 152], AMBER [183, 205], and
GROMOS [185].
The set of parameters used to model water or aqueous solutions (force field for
water, called a water model) should be compatible with that for the biomolecules.
The most common water models used in MD simulations of lipid bilayers hydrated
with explicit water are TIP3P (transferable intermolecular potential three-point) [75]
with further modifications for simulations with Ewald summation [164], and SPC
(simple point charge) [8]. These water models are rigid and have three interaction
sites (three-point models, where point charges are centred on each of the three water
atoms). TIP3P and SPC have no Lennard-Jones parameters on the hydrogen atoms
and this makes the models compatible with most classical force fields, although they
perform differently with different force fields [109]. For use with the CHARMM
force field, the TIP3P water model was slightly modified and Lennard-Jones terms
on the hydrogen atoms were included [107, 109, 134]. A rigid water model with four
interaction sites (TIP4P) [75] has also been used in MD simulations of lipid bilayers
[33, 63, 190], although it is less common due to its additional computational expense.
All force fields listed above use fixed point charges. In order to allow the electron
density to respond to the local electric fields, a polarizable force field for lipid
molecules based on the Drude oscillator [87, 88] was developed [27, 98]. Polarizable
force fields reproduce electrostatic interactions better and, at the price of additional
computational complexity [74], provide a more accurate representation of
a molecular system.

2.3 Limitations of a Computer Model (Size and Time)

A typical mammalian cell has a diameter of ~10 × 10⁻⁶ m (10 μm) and thus a surface
area of ~10⁻¹⁰ m². Estimating that the cross-sectional area of a lipid molecule
is ~100 × 10⁻²⁰ m² and assuming that lipids occupy only 10% of the membrane surface
(the rest being occupied by proteins), one can roughly estimate that one leaflet of the lipid
matrix is built of ~10⁷ lipid molecules. A computer model cannot be built of so
many molecules because of the computational cost. In classical molecular modelling,
the atoms that constitute the model interact through a many-body potential. This
potential explicitly depends upon the atoms’ positions. As many-body interactions
are an intractable problem to solve, the non-bonded interactions are in most cases
approximated by the sum of pairwise interactions. For N atoms in the model, there are
approximately N² interactions (the complexity of the algorithm is denoted O(N²)). In
effect, the time required to compute non-bonded interactions without further approximations
is proportional to N². Thus, the first limitation of the computer model with
atomic resolution is related to the number of its atoms. The model of the membrane
matrix must thus be a patch of the lipid bilayer that, by applying two-dimensional
periodic boundary conditions, is algorithmically made horizontally continuous and,
by applying three-dimensional periodic boundary conditions, is additionally made vertically
periodic.
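The order-of-magnitude estimate and the pair-count argument above can be reproduced in a few lines. This is a sketch under the assumptions stated in the text (a spherical cell, 10% lipid coverage); the function names are hypothetical. The last function illustrates the minimum-image convention, a standard way of evaluating distances under periodic boundary conditions that is added here for illustration and is not described in the text itself.

```python
import math


def lipids_per_leaflet(cell_diameter_m=10e-6, lipid_area_m2=100e-20, lipid_fraction=0.1):
    """Rough number of lipids in one leaflet of a spherical cell membrane."""
    surface_m2 = math.pi * cell_diameter_m ** 2  # sphere surface, pi * d^2
    return lipid_fraction * surface_m2 / lipid_area_m2


def n_pairs(n_atoms):
    """Number of distinct atom pairs, N (N - 1) / 2, which grows as N^2."""
    return n_atoms * (n_atoms - 1) // 2


def minimum_image(dx, box_length):
    """Shortest displacement along one axis between periodic images."""
    return dx - box_length * round(dx / box_length)


if __name__ == "__main__":
    print(f"lipids per leaflet: {lipids_per_leaflet():.1e}")
    for n in (10_000, 100_000, 1_000_000):
        print(f"N = {n:>9,d}  pairs = {n_pairs(n):.2e}")
```

With the default values the estimate comes out at a few times 10⁷, that is, of the order quoted in the text, and a tenfold increase in the number of atoms raises the number of pairs roughly a hundredfold.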
The second main limitation of the computer model is the time scale of dynamical
processes that can be simulated. In classical molecular dynamics (MD) simulations,
the movements of atoms are governed by the classical equation of motion, which
in most cases is Newton’s equation. The position of each atom is obtained by the
numerical solution of the equation at successive discrete time points, every time step.
When the time step is constant, its value is determined by the fastest movements in
the model, which are bond vibrations. The fastest vibrations in the molecule are those
of covalent bonds that link hydrogen atoms, and their time constant is ~10 fs. To probe
this motion faithfully, the time step should be less than 1 fs (10⁻¹⁵ s). When these
vibrations are eliminated (for example, by constraining the lengths of bonds involving
hydrogen atoms), the time step can be extended to 2 fs. Thus, to evaluate
the dynamical characteristics of a lipid bilayer at equilibrium, often 10⁹ or even more
time steps are required. When designing a computer model of the bilayer, one should
thus consider the number of atoms in the model and the number of integration time
steps as well as the computational power available in order to estimate the total
elapsed real time of the simulation. If the calculations are not likely to finish within
a reasonable time period, one has to compromise on the size of the model or the length
of the time scale investigated. To extend the time and length scales of these systems
beyond what is feasible with atomic models, coarse-grained (CG) models for lipid
aggregates can be employed. A very successful and widely used CG lipid model
is the MARTINI force field [115]. How lipids and cholesterol are mapped to the
MARTINI CG representation is shown in Fig. 6 of the chapter Modeling of Membrane
Proteins.
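The planning step described above (number of time steps versus available computing power) reduces to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the throughput figure used in the example (100 ns of simulated time per day) is an assumed value, since real throughput depends on the system size, the software and the hardware.

```python
def n_steps(sim_time_ns, dt_fs):
    """Number of integration steps needed to cover sim_time_ns with a step of dt_fs."""
    return round(sim_time_ns * 1e6 / dt_fs)  # 1 ns = 1e6 fs


def wall_clock_days(sim_time_ns, throughput_ns_per_day):
    """Elapsed real time, given a measured or assumed throughput in ns/day."""
    return sim_time_ns / throughput_ns_per_day


if __name__ == "__main__":
    # 2 microseconds at a 2 fs time step
    print("steps:", n_steps(2000, 2.0))
    # hypothetical throughput of 100 ns/day
    print("days :", wall_clock_days(2000, 100.0))
```

For instance, 2 μs of simulated time at a 2 fs time step requires 10⁹ steps, which matches the figure quoted in the text; at the assumed throughput this would take 20 days of wall-clock time.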

2.4 Spontaneous Self-assembly of Lipids to Form Bilayers

Phospholipids and other lipids spontaneously aggregate into ordered structures by
self-assembly. This process takes place when a large number of lipid molecules are
mixed with water. The spontaneous self-assembly of lipids in water is a consequence
of the structural characteristics of lipid molecules, which are amphipathic, and water,
which is polar and has a unique capacity to form hydrogen bonds. In liquid water,
water molecules form a dynamic network of hydrogen bonds that is perturbed by
nonpolar hydrocarbon chains of lipids. The hydrophobic effect is a driving force
for the aggregation of hydrocarbon chains that minimises their contact with water.
At the same time, the polar lipid heads interact favourably with water. In effect,
lipids form, by self-organization, supramolecular aggregates of sizes and shapes that
depend on a number of parameters, particularly on the charge and relative size of the
lipid head group, the number of acyl chains in the lipid molecule and their flexibility,
the water/lipid molar ratio, and the presence of counter ions.
A reassuring test of the correctness of the force field description and of the applicability
and usefulness of classical molecular modelling in the study of lipid bilayers was provided by
MD simulations that reproduced the self-assembly of PC molecules in water,
performed by Marrink et al. [114]. The initial structure in each of these simulations
was a random solution of PC in water with over 45 water molecules/lipid. In the course
of much less than 100 ns of MD simulation with a 5 fs time step, a bilayer was
formed, with properties matching the experimental data. Simulations were performed
for dipalmitoylPC (DPPC), palmitoyloleoylPC (POPC), and dioleoylPC (DOPC),
which differ in the number of mono-unsaturated chains, as well as for dioleoylPE
(DOPE), using the GROMACS package [114]. A similar computer experiment was carried
out on a binary 1:1 mixture of DOPC and DOPE, which self-assembled into a bilayer
within ~25 ns [34].

Fig. 6 Examples of (a) water molecules H-bonded to phosphate oxygen atoms, (b) a water molecule
bridging phosphate oxygen atoms of two PC molecules (intermolecular water bridge), (c) a water
molecule anchoring clathrate around the choline moiety and a phosphate oxygen atom (intramolecular
water anchor), (d) charge pairs between two methyl groups of a choline moiety and a phosphate
oxygen atom, (e) Na⁺ coordinated by four PC molecules

2.5 Single-Type Lipid Computer Models of Lipid Bilayers

Due to limited computational power, the first computer models of hydrated phospholipid
bilayers with atomic resolution described in the literature consisted of
lipids of a single type. These computer models comprised from 36 to 200 PC or PE
molecules, e.g. Refs. [30, 32, 38, 61, 169, 196], and in most cases their MD
simulation times were far below 1 ns [30, 32, 38, 61, 169], or ~2 ns [196]. The aim of
these simulations was mainly to assess the reliability of computer models by comparing
the results of simulations with experimental results, and to improve the
methodology. For these reasons, the computer models comprised predominantly those
phospholipids for which experimental data were available, mainly the
saturated PCs DPPC and DMPC [6, 38, 112, 196], but also DLPE [32] and
mono-unsaturated POPC [61]. Nevertheless, even these short simulations provided a wealth
of information about lipid bilayers, particularly about the dynamics of lipids and their
interactions with water. A significant extension of both the spatial and temporal scales
of bilayer MD simulations was made by Lindahl and Edholm [99], who carried out
simulations of a fully hydrated bilayer consisting of 1024 DPPC molecules for 10 ns.
With technological advances, particularly advances in the development of algorithms
[7], much longer time scales are accessible to simulations these days, and a 100-ns MD
simulation of a lipid bilayer is now standard. At present, single-type lipid bilayers
are mainly used in computer modelling studies of membrane proteins or peptides, e.g.
[78, 80, 136] (this subject is broadly discussed in the chapter Modeling of Membrane
Proteins), of the collective behaviour of lipids in the bilayer, e.g. [45, 47, 168], of
membrane permeation, e.g. [154, 159, 184, 203], or of interactions with ions in different
membrane thermotropic phases, e.g. [195]. In an impressive large-scale study
(2.7 million atoms) of a ribosome anchored to a membrane channel embedded in
a single-lipid POPC bilayer, a 50-ns MD simulation was performed by the Schulten
group [54]. Single-lipid bilayers are also used as reference systems in studies of
the effects of a certain membrane component on the main bilayer constituents (see
below).

2.5.1 Single Cis and Trans Unsaturated Phosphatidylcholine Bilayers

The PCs that occur most frequently in nature are those with cis unsaturated acyl
chains. Together with PCs with saturated acyl chains, they are the PCs most commonly
used in model studies. In contrast, PCs with trans unsaturated acyl chains are
rather rare in nature; nevertheless, they have a negative impact on human health.
Although the effect of trans unsaturation of the PC acyl chains has been studied
both experimentally and computationally, e.g. [127, 172, 177, 194, 197, 222], such
studies are scarce. A recent comparative MD simulation study of saturated, cis and
trans mono-unsaturated bilayers by Kulig et al. [85] indicated that trans unsaturated
chains are more flexible than cis unsaturated chains (cf. Sect. 4.1). In effect, the
packing of trans unsaturated chains, and thus their order in the bilayer, is higher than
that of cis unsaturated chains. Also, interactions of cholesterol with trans unsaturated
chains are stronger than those with cis unsaturated chains, which results in a higher ordering
effect of cholesterol in trans unsaturated bilayers.

2.6 Mixed Lipid Bilayers (Phospholipid-Phospholipid, Phospholipid-Cholesterol; Membrane Functions of Cholesterol)

2.6.1 Binary Mixed Phospholipid Bilayers

The lipid matrix of a cell membrane contains different kinds of lipids [215]. A
mixed-lipid bilayer is thus a more realistic model of the lipid matrix of biomem-
branes, although it is more difficult to analyse than a single-lipid bilayer. As lipid
molecules of the same kind tend to cluster together [62], and mix nonideally with
lipids of other kinds [62, 166], the lateral distribution of lipids in the bilayer is often
inhomogeneous and the bilayer has compositionally distinct microdomains. A recent
comprehensive review of the molecular modelling of bilayers of heterogeneous composition
is given in Ref. [144]. The first atomistic computer models of mixed-lipid bilayers
consisted of two kinds of phospholipid. The Berkowitz group carried out MD sim-
ulations of bilayers comprising DPPC and DPPS at a ratio of 5:1 [142] and the
simulations provided detailed information about lipid-lipid interactions and showed
that ions strongly affect them. More exotic binary bilayers of DMPC and dimyris-
toyltrimethylammonium propane (DMTAP, a cationic lipid with no phosphate group)
at a varying mole ratio were constructed and MD simulated by the Vattulainen group
[55]. There, the effect of the lipid composition on the structure and electrostatic
properties of the bilayer was investigated. Bilayers composed of DOPC and DOPE
at a varying mole ratio were simulated by the Marrink group [34]. They found that
the equilibrium properties of these bilayers depend nonlinearly on their PC/PE
composition. They found no indication of domain formation, but suggested that
only MD simulations on the microsecond time scale might reveal whether this
process really takes place. Yet another binary bilayer made of POPE and POPG
in the proportion 3:1 was MD simulated by the Pasenkiewicz-Gierula group [128].
There, the organization of the bilayer interfacial region was analysed in detail. Other
computer simulations of binary phospholipid bilayers followed e.g. [91, 229, 231].
As was mentioned above, at present, atomistic MD simulations cannot be used to
model the process of micro-domain formation in binary lipid bilayers due to the
timescale of the process. However, using CG MD simulation, the Voth group
[188] observed phase separation of a mixed 1:1 DPPC/DPPE bilayer.

2.6.2 Binary Mixed Phospholipid-Cholesterol Bilayers

The binary lipid bilayers that have been most studied using molecular modelling
methodology are composed of PC or SM and Chol. This is because SM, PC, and Chol
(Figs. 2 and 3) constitute three major classes of lipids in the outer leaflet of the animal
cell membrane. The cholesterol content of cell membranes is usually 20–50 mol%
of the total lipids [124] but in ocular lens membranes, the Chol content often exceeds
that of the phospholipids [208]. Chol has numerous functions in biomembranes.
From a biophysical perspective the main membrane function of Chol is to modulate
the physical properties of the lipid matrix, for example to regulate its fluidity and
the phase behaviour [125, 220], to increase its mechanical strength [15, 40], and to
increase its hydrophobic barrier [72, 199]. The first MD simulation of a fully hydrated
PC-Chol bilayer was carried out by Robinson et al. [170]. This simulation was short;
nevertheless it provided an interesting insight into the cholesterol ordering effect
and showed the formation of hydrogen bonds (H-bonds) between Chol and PC. This
simulation was followed by a much longer one by Tu et al. [209], which demonstrated
that Chol has a significant influence on the subnanosecond time scale PC dynamics.
Computer simulation studies on bilayers containing cholesterol published before
2009 are reviewed and summarised in Refs. [12, 181].
The molecular level membrane effects of cholesterol which were identified ear-
lier are the so-called ordering [138] and condensing [116] effects. The ordering
effect describes the ability of Chol molecules to increase the order of acyl chains in
phospholipid-Chol bilayers in the liquid-crystalline phase. A measure of the chain
order is the molecular order parameter, Smol, or deuterium order parameter, SCD. An
effect which is closely related to the Chol ordering effect is the condensing effect
that denotes that Chol induces an increase in the membrane surface density or, in
other words, decreases the surface area occupied by phospholipid molecules in bilayers
containing Chol. Both effects are easily detected in MD simulations, but the basic
atomic-level mechanisms responsible for them are not easy to identify, so they
have not been fully explained yet.
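As an illustration of how such an order parameter can be evaluated from simulation data, the minimal Python sketch below averages the second-rank Legendre polynomial, (3cos²θ − 1)/2, over a set of vectors, where θ is the angle between each vector and the bilayer normal. For SCD the vectors would be C–D (C–H) bond vectors, and for Smol the chain segment vectors. The function name, the choice of the z axis as the bilayer normal, and the input vectors are assumptions made for this example only and are not taken from the studies cited above.

```python
import math


def order_parameter(vectors, normal=(0.0, 0.0, 1.0)):
    """Return the average of (3*cos^2(theta) - 1)/2 over the given vectors.

    theta is the angle between each vector and the bilayer normal
    (assumed here to be the z axis).
    """
    nx, ny, nz = normal
    n_len = math.sqrt(nx * nx + ny * ny + nz * nz)
    total = 0.0
    for vx, vy, vz in vectors:
        v_len = math.sqrt(vx * vx + vy * vy + vz * vz)
        cos_theta = (vx * nx + vy * ny + vz * nz) / (v_len * n_len)
        total += 0.5 * (3.0 * cos_theta ** 2 - 1.0)
    return total / len(vectors)


# Synthetic test vectors (not simulation data).
parallel = [(0.0, 0.0, 1.0), (0.0, 0.0, -2.0)]  # along the normal
in_plane = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]   # perpendicular to the normal

s_parallel = order_parameter(parallel)
s_in_plane = order_parameter(in_plane)
print(s_parallel, s_in_plane)
```

With this definition, vectors aligned with the normal give 1, vectors lying in the bilayer plane give −0.5, and an isotropic distribution averages to zero. In an analysis of a trajectory the average would additionally run over lipids and time frames.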
The atomic and molecular level mechanisms behind the cholesterol effects on the
membrane are reviewed in Ref. [181]. In short, as in the case of most biomolecules,
there is a direct relationship between the Chol structure that has been optimised over
the long process of natural evolution, and its biological function [122]. Chol consists
of three structural elements, namely the rigid steroid ring, the polar 3β-hydroxyl
group, and a short hydrocarbon chain attached to the ring at position 17 (cf. Fig. 2).
In addition, two methyl substituents, called C19 and C18 for short, are attached to the
ring at positions 10 and 13, respectively (Figs. 2 and 3). They make the cholesterol ring
asymmetric: one of its sides is flat (α-face), the other is rough (β-face). Any modification of
these structural elements decreases the effects of Chol on lipid bilayers. A systematic
MD simulation study of the effect of modifying the chemical structure of Chol on the
ability of Chol to affect the properties of the bilayer was carried out by Róg et al. [163,
176, 180, 182, 213]. The first modification involved a change of the β-configuration
of the Chol hydroxyl group to α [176]. This epimeric form of cholesterol (epicholes-
terol, Echol) is rare in nature. MD simulations of the DMPC-Echol bilayer confirmed
the experimental results of Dufourc et al. [36], and Demel et al. [35] that Echol has
weaker ordering and condensing effects on bilayers than Chol. The second modifica-
tion deprived Chol of the ability to be an H-bond donor by substituting the Chol OH
group with a ketone group [182]. Ketosterone is an artificial steroid as the 3-ketone
group is not present in sterols. The interactions of PC polar groups as well as water
with the ketone group are much weaker than those with the Chol OH group. Thus,
unlike Chol, ketosterone is not firmly anchored in the bilayer interfacial region, and
its ordering and condensing effects are much weaker. Moreover, MD simulations
showed that ketosterone is able to undergo flip-flops between the bilayer leaflets in
a relatively short time of ~50 ns, whereas Chol does not flip-flop even on a much
longer time scale. The third modification deprived Chol of two methyl groups (C18
and C19) from the rough, β-face [180]. This made the cholesterol ring symmetric
and both its faces flat. Contrary to expectations, the effects of such a modified sterol
on the membrane order and condensation were weaker than those of cholesterol. To
obtain a better understanding of the functional significance of each methyl group of
Chol, one or two methyl groups were sequentially removed from the Chol molecule
[163]. This “chemical” experiment clearly showed that the removal of a single C18
methyl group or simultaneous removal of the other two methyl groups (C19 and
C21, the latter attached to C20 in the acyl chain) strongly affects the Chol ordering
effect. Desmosterol, which is a direct precursor of Chol and differs from Chol only
by one double bond in the sterol acyl chain, influences a saturated bilayer less than
cholesterol [213]. Smondyrev and Berkowitz carried out MD simulation studies of
other chemically modified structures of Chol and showed that an additional ketone
group at position 6 [193] as well as replacing the Chol OH group with an SO4 group
(cholesterol sulphate) [192] decreases the Chol effect on the lipid bilayer.
Detailed analyses of the results of studies on the ordering and condensing effects
of various sterols allowed Aittoniemi et al. [1] to find a strong correlation between
the tilt of the sterol ring (the angle between the ring plane and the bilayer normal) and
the sterol ordering and condensing abilities—the smaller the tilt, the more ordered
and condensed the bilayer is. This correlation arises from basic interactions between
Chol and lipids, and, as was shown in the studies of “chemical” modifications of
the Chol molecule as well as those with Chol precursors, all structural elements
of the cholesterol molecule are important and effective in these interactions. In all
binary lipid bilayers that contain sterol molecules investigated in the MD simulations
cited above, Chol had the smallest tilt and the strongest effect on the bilayer of
all these sterols [1]. A more recent MD simulation study of the Chol condensing
effect confirmed a correlation between an average tilt angle of the Chol ring and the
magnitude of the Chol condensing effect [3].
The PC-Chol bilayers discussed in this section so far contained no more than
50 mol% Chol and modelled a “typical” animal cell membrane [124]. However,
there are natural cell membranes that contain more than 50 mol% Chol. An example
of such membranes is the fibre cell membrane of the eye lens [208] where Chol
not only saturates the membrane but also causes pure Chol domains to form within
the membrane [66]. Model studies on PC-Chol bilayers with an increasing Chol
content allowed the Subczynski group to extend the phase diagrams of the Chol/PC
mixture to the region where PC bilayers are saturated and oversaturated with
Chol [108].
The biological purpose of the oversaturating amount of Chol in the membranes of
the eye lens cells was puzzling. Computer modelling studies on the PC-Chol bilayer
revealed that at saturating Chol content, cholesterol suppresses vertical fluctuations
of atoms in a bilayer [158, 224] which smooths the bilayer surface. As one of the
principal properties of the lens is transparency and light-scattering is one of the
factors compromising the transparency, cholesterol-induced smoothing of the surface
of the eye lens membranes helps to maintain lens transparency by decreasing light-
scattering [158, 224]. A very recent MD simulation study [159] strongly supported
the hypothesis that pure Chol domains present in the lipid matrix of the eye lens cell
membranes provide barriers for oxygen transport to the lens centre, and thus protect
the lens against cataract development [198].

2.6.3 Ternary Mixed Phospholipid-Cholesterol Bilayers

As was stressed above, the lateral distribution of molecular components in membranes
is heterogeneous, and this often leads to the formation of compositionally
distinct microdomains. One of the most debated lateral microdomains in biological
membranes is called a functional lipid (membrane) raft and is enriched with sphin-
gomyelin (Fig. 2) and cholesterol e.g. [58, 67, 102, 137]. A provisional definition of
the lipid raft was introduced in 2006: “Membrane rafts are small (10–200 nm), heterogeneous,
highly dynamic, sterol- and sphingolipid-enriched domains that compartmentalize
cellular processes…” [157]. Raft-like domains can and do form in
model membranes, e.g. [219], composed of saturated and unsaturated phospholipids and
cholesterol. Computer simulations of the process of a raft-like domain forming spontaneously
in ternary lipid mixtures were performed by, amongst others, Pandit et al.
[143] using MD simulation with atomic resolution, and by Risselada and Marrink
using CG MD simulation with the MARTINI force field [167]. In the former case, during
the 200-ns MD simulation, the onset of spontaneous phase separation and domain
formation in the ternary mixture of DOPC/SM/Chol with a 1:1:1 composition was
observed. In the latter case, a ternary mixture of saturated and unsaturated PC and
Chol completely phase-separated into two domains of which one was the raft-like
domain, on a submicrosecond time scale. These CG MD simulations were carried
out on an initially random mixture of lipids arranged both as a flat bilayer and as a
small unilamellar vesicle.
An MD simulation study of Hall et al. [56] indicated that glycosphingolipids
affect the biophysical properties of lipid rafts; in particular they slow down lateral
diffusion of the raft lipids.

2.6.4 Asymmetric Bilayers

The distribution of lipids in the lipid matrix of a biomembrane is not only laterally
inhomogeneous but also asymmetric across the matrix. The latter means that the lipid
composition of the two bilayer leaflets is different. In animal cell membranes the outer
leaflet is enriched with SM, PC, and Chol, and the inner leaflet with PE and the anionic
lipids PS and phosphatidylinositol (PI). In the first computer model of an asymmetric bilayer
found in the literature [23], one leaflet consisted of DPPC and the other of randomly
distributed DPPC and DPPS. An MD simulation of this bilayer did not show any effect
of the mixed-lipid leaflet on the single-lipid leaflet. An asymmetric bilayer consisting
of four lipid species, PC, SM, PE, and PS was constructed and MD simulated by
Vacha et al. [212]. In that study a realistic model of the inner and outer bilayer
leaflets was created as the system comprised two parallel asymmetric bilayers. The
inner leaflets of both bilayers, separated by the “interior” water layer, consisted of
PS and PE, the outer leaflets consisted of PC and SM. The number of added Na+ and
K+ ions exceeded the number needed to neutralize the negative charge on PS. The
simulations indicated that phospholipid head groups preferentially bind sodium over
potassium ions, and also that some water molecules are able to permeate across the
bilayers on a 100 ns timescale. An asymmetric bilayer containing Chol and SM in
one leaflet and Chol and PS in the other, was MD simulated by Bhide et al. [14]. The
authors observed practically no interaction between the two leaflets but observed a
more extended network of interactions between SM and Chol than between PS and
Chol. This might suggest that SM is more effective in the formation of domains than
PS.
The Marrink group [65] carried out large-scale CG 40-μs MD simulations of
a multicomponent bilayer consisting of 63 different lipid species asymmetrically
distributed across the two leaflets, to make a realistic model of the lipid matrix
of a mammalian plasma membrane. This model showed the formation of transient
domains with a liquid-ordered character in both leaflets, although in each leaflet
they consisted of different lipids. The domains were coupled across the two bilayer
leaflets. The latter result might seem at variance with the experimental results obtained
for a much simpler bilayer which did not reveal evidence of transbilayer coupling
between the leaflets [39].

3 Intermolecular Interactions in Hydrated Lipid Bilayers

3.1 Bilayer Interface

As has already been stressed in the Introduction, the lamellar structure and properties of
lipid bilayers follow directly from the structural characteristics of lipids and water.
Phospholipid bilayers form spontaneously in water and do not exist on their own in
the absence of water. Water must thus play a significant role not only in the formation
but also in the stability of the bilayer. The hydrophobic effect causes the lipid acyl
chains to assemble together in order to minimise their contact with water. At the
same time, the lipid head groups stay in contact with water—polar phosphate and
carbonyl groups can form hydrogen bonds with water molecules but the non-polar
choline group cannot form such bonds. The formation of H-bonds between PC and
water is evident in MD simulations of hydrated phospholipid bilayers (Fig. 6a). The
first thorough analysis of interactions between water and polar groups of PC in an
MD simulated bilayer was carried out by Alper et al. [2]. Also, they and Damodaran
and Merz [30] were the first to identify clathrate-like structures of water around
choline groups in PC bilayers.
A careful analysis of the interfacial water of an MD simulated PC bilayer showed
that water molecules can simultaneously form H-bonds with two PC polar groups
[149]. These bifurcated H-bonds were named “water bridges” (Fig. 6b). The earlier
quantum mechanical calculations of Frischleder et al. [48] showed that the binding
energy of a water bridge between two phosphate oxygen atoms is significantly higher
than that of a single H-bond. Thus, water bridges linking two or more PC molecules
lower the system’s energy and stabilize the bilayer structure. Such water-mediated
interactions between PC oxygen atoms have been postulated previously e.g. [16,
131] but only recently has their existence been shown experimentally [221]. Water
molecules can also bridge choline and phosphate or carbonyl groups by simultane-
ously belonging to a clathrate around the choline group and being H-bonded to one
of the polar groups. Such water molecules were evidenced in MD simulations of a PC
bilayer hydrated by normal and heavy water [181]; to distinguish them from water
bridges they were named “anchoring water” (Fig. 6c). Intermolecular water anchors
can also be expected to contribute to the stabilization of the bilayer structure.
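The geometric detection of H-bonds and water bridges in a single simulation frame can be sketched as follows. The cutoff values (a donor-acceptor distance of at most 0.325 nm and an angle between the O–H bond and the O···O vector of at most 35°) are assumed here for illustration, because the criteria are not stated in this section. The function and variable names are hypothetical, the coordinates are synthetic, and periodic boundary conditions are ignored for brevity.

```python
import math

MAX_DIST_NM = 0.325    # assumed donor-acceptor distance cutoff
MAX_ANGLE_DEG = 35.0   # assumed cutoff for the angle between O-H and O...O


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _norm(v):
    return math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)


def is_hbond(donor_o, donor_h, acceptor_o):
    """Geometric H-bond test for one donor O-H group and one acceptor oxygen."""
    d_oo = _sub(acceptor_o, donor_o)
    d_oh = _sub(donor_h, donor_o)
    dist = _norm(d_oo)
    if dist > MAX_DIST_NM:
        return False
    cos_a = sum(x * y for x, y in zip(d_oo, d_oh)) / (dist * _norm(d_oh))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    return angle <= MAX_ANGLE_DEG


def bridged_lipids(water, acceptors):
    """Return the sorted ids of lipids H-bonded to one water molecule.

    water: dict with coordinates under the keys "O", "H1" and "H2" (nm).
    acceptors: list of (lipid_id, oxygen_coordinate) pairs.
    """
    found = set()
    for lipid_id, coord in acceptors:
        for h in ("H1", "H2"):
            if is_hbond(water["O"], water[h], coord):
                found.add(lipid_id)
    return sorted(found)


# Synthetic example: one water molecule and oxygen atoms of three lipids.
hoh = math.radians(104.5)
water = {
    "O": (0.0, 0.0, 0.0),
    "H1": (0.1, 0.0, 0.0),
    "H2": (0.1 * math.cos(hoh), 0.1 * math.sin(hoh), 0.0),
}
acceptors = [
    (1, (0.28, 0.0, 0.0)),                                   # in line with H1
    (2, (0.28 * math.cos(hoh), 0.28 * math.sin(hoh), 0.0)),  # in line with H2
    (3, (0.0, 0.0, 0.5)),                                    # too far away
]
partners = bridged_lipids(water, acceptors)
is_water_bridge = len(partners) >= 2   # bridge: H-bonds to two different lipids
print(partners, is_water_bridge)
```

In this example the water molecule donates one H-bond to lipid 1 and one to lipid 2, so it counts as a water bridge between them, whereas lipid 3 lies beyond the distance cutoff.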
PC molecules cannot form direct H-bonds among themselves as they are only H-
bond acceptors but, as was discussed above, in the hydrated PC bilayer they may be
linked indirectly, via water bridges and anchors. PC molecules can, however, interact
directly via Coulomb interactions as they contain groups that are positively (choline
moiety) and negatively (phosphate and carbonyl oxygen atoms) charged, whereas
their net electrostatic charge is zero. These charge-charge interactions were named
“charge pairs” (Fig. 6d) and they certainly contribute to the bilayer stability [150].
Detailed analyses of water bridges and charge pairs formed at the PC bilayer/water
interface in the POPC, palmitoylelaidoylPC (PEPC), and DMPC bilayers revealed
that these interactions make up an extended network that links PC molecules; this
network involves a large majority (more than 96%) of the bilayer lipid molecules
at any instant [127]. The analysis of the inter-lipid network discussed above did not
include water anchors. Murzyn et al. [127] found a strong correlation between the
cross-sectional surface area available to a PC head group, either average or individual,
in the bilayer and the number of H-bonds, water bridges and charge pairs a given PC
molecule makes—the larger the area the greater the number of PC-water H-bonds
but the smaller the number of short distance PC-PC interactions; the latter results in
a less branched inter-lipid network in bilayers with a larger average surface area per
lipid.
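The extent of such an inter-lipid network can be quantified with a simple graph analysis. The sketch below, given only as an illustration, treats each lipid as a node and each water bridge or charge pair as an edge, and uses a union-find structure to obtain the fraction of lipids that take part in at least one link and the size of the largest connected cluster. The function name and the list of links are hypothetical; in practice the links would come from a geometric analysis of a simulation frame.

```python
def network_stats(n_lipids, links):
    """Return (fraction of lipids with at least one link, largest cluster size).

    links: iterable of (i, j) pairs of lipid indices joined by a water
    bridge or a charge pair.
    """
    parent = list(range(n_lipids))

    def find(i):
        # Find the cluster representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    linked = set()
    for a, b in links:
        linked.update((a, b))
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    sizes = {}
    for i in range(n_lipids):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return len(linked) / n_lipids, max(sizes.values())


# Hypothetical frame with six lipids: 0-1-2 form a chain, 3-4 a pair, 5 is free.
example_links = [(0, 1), (1, 2), (3, 4)]
fraction_linked, largest_cluster = network_stats(6, example_links)
print(fraction_linked, largest_cluster)
```

Here five of the six lipids are involved in the network and the largest cluster contains three lipids. A more branched network would show up as a larger number of links per lipid for the same fraction of linked lipids.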
A lipid bilayer has a strong effect on the properties of the water near its sur-
faces. The results of earlier studies of the effect of the phospholipid bilayer on the
properties of the hydrating water are summarized in Refs. [13, 130, 173]. In a recent
comparative MD simulation study [110] the effect of the DOPC and monogalactosyl-
diacylglycerol, MGDG, bilayers on the properties of the surface water was analysed
in detail. The study showed that the ordering of the water dipoles by the PC head groups
extended further into the water phase than that by the galactolipid head groups,
whereas inside the bilayer/water interface the ordering was higher in the galactolipid
than the PC bilayer. The study also showed that near the surface of both bilayers the
net orientation of water dipoles was close to horizontal.
In the PC bilayer containing Chol, the repertoire of short-distance inter-lipid interactions
is greater than in the pure PC bilayer [147]. Chol is both an H-bond donor
and acceptor and the OH group of Chol can form direct H-bonds with phosphate and
carbonyl oxygen atoms of PC. Also, a Chol OH group and a choline moiety of PC
can form a charge pair. Such a charge pair was identified by Chiu et al. as a weak PC-
Chol hydrogen bond [25]. Unfortunately, high level quantum chemistry calculations
have not been performed yet to establish how to classify this short-distance PC-Chol
interaction. In the DMPC bilayer containing Chol [147], a network of inter-lipid
interactions forms as in the bilayer without Chol, and it involves a large majority of
DMPC and Chol molecules, although it is less branched than in the DMPC bilayer
without Chol [150].
Several phospholipids, in particular PE, PS, PG, and SM, unlike PC (Fig. 2), are
both H-bond donors and acceptors, thus they are able to make direct inter-lipid H-
bonds. Short-distance interactions between these phospholipids at the bilayer/water
interface in the absence or presence of PC have been analysed e.g. in Refs. [32, 34, 41,
91, 126, 128, 141, 230]. A comparative MD simulation study of DPPE and DPPC
bilayers [91] showed that these direct inter-lipid H-bonds at the bilayer interface
result in a smaller cross-sectional surface area per lipid, and a higher acyl chain
order, and are responsible for the higher main phase transition temperature of
the PE bilayer than of the PC bilayer. In binary PC-PE bilayers, with increasing PE content, the
average surface area per lipid noticeably decreases and the chain order increases [34,
91].
At the water/bilayer interface, ions also interact with phospholipids. One of the
first bilayer simulations that included ions was carried out on a PS bilayer by Pandit
and Berkowitz [141]. PS is a donor and acceptor of H-bonds but is also negatively
charged. The authors [141] observed that, once the negative charge of the PS serine
group (cf. Fig. 2) is compensated by Na+ counterions, the PS molecule becomes
analogous to the PE molecule, and a PS bilayer in the presence of Na+ has similar
properties to a PE bilayer. They also showed that Na+ ions are generally coordinated
by both serine carboxyl and phosphate groups. In a much longer MD simulation of a
PS bilayer, Mukhopadhyay et al. [126] observed that Na+ ions penetrate deeper into
the bilayer/water interface and are mainly coordinated by carbonyl oxygen atoms.
The disparity between the results of Pandit and Berkowitz [141] and Mukhopadhyay
et al. [126] was, most likely, due to the slow penetration of the bilayer/water interface
by Na+; to reach a stable distribution of Na+ ions the bilayer has to be equilibrated
for at least 10 ns [126]. The MD simulations of Mukhopadhyay et al. additionally
showed ion-mediated inter-lipid interactions, where an Na+ ion was coordinated to
oxygen atoms belonging to different PS molecules. PG is also negatively charged,
and in MD simulated PG bilayers Na+ ions were also preferentially located close to
the carbonyl groups and were coordinated to oxygen atoms belonging to different
PG molecules thus forming ion-bridges [41, 230]. As Zhao et al. [230] showed, these
ion-bridges formed an extended and stable network of ion-mediated inter-lipid links.
In a POPC-Chol bilayer simulated for 60 ns, Na+ ions were bound preferentially
to phosphate and also to carbonyl oxygen atoms [111] and formed ion bridges by
coordinating up to four PC molecules (Fig. 6e).
A systematic MD simulation study of the effects of different mono-, di- and
trivalent cations on a PC bilayer was carried out by Cordomi et al. [28]. They showed
that the effect of ions on the properties of a lipid bilayer depends on the specific
characteristics of each of the ions, i.e., radius, charge, and coordination properties.
On average, one cation has 2–3 PC molecules in its first coordination shell and it
preferentially binds to carbonyl and phosphate oxygen atoms, except K+, which does
not bind stably to any of the PC oxygen atoms. The binding of cations also depends
on the thermotropic phase of the bilayer; an MD simulation study by Stepniewski
et al. [195] showed that in a PC bilayer in the gel phase there are no Na+ ions in
the carbonyl groups region, whereas in the liquid-crystalline phase Na+ ions locate
preferentially in this bilayer region.
Most salt solutions used in bilayer studies are chlorides, thus the effect of Cl−
anions on the bilayer has also been studied using MD simulation e.g., [13] and
citations therein; [28]. These studies showed that Cl− ions only weakly associate with
choline groups and the maximum in their density distribution is shifted 0.7–1.2 nm
towards the water phase relative to the maximum in the cation density distribution
[28]. As was shown by Mukhopadhyay et al. [126], Cl− ions have little effect on the
bilayer properties.
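The shift between the maxima of the anion and cation density distributions can be read from histograms of the ion positions along the bilayer normal. The following sketch shows one way to do this; the function name, the bin settings and the ion coordinates are assumptions made for the example and are synthetic, not data from the cited simulations.

```python
def density_peak(z_values, z_min, z_max, n_bins):
    """Return the centre of the most populated bin of a histogram along z (nm)."""
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for z in z_values:
        if z_min <= z < z_max:
            # Guard against rounding that could push the index to n_bins.
            k = min(int((z - z_min) / width), n_bins - 1)
            counts[k] += 1
    best = max(range(n_bins), key=counts.__getitem__)
    return z_min + (best + 0.5) * width


# Synthetic z coordinates (nm, measured from the bilayer centre).
cation_z = [1.75, 1.85, 1.85, 1.85, 1.95]
anion_z = [2.65, 2.85, 2.85, 2.85, 3.05]

cation_peak = density_peak(cation_z, 0.0, 4.0, 40)
anion_peak = density_peak(anion_z, 0.0, 4.0, 40)
shift = anion_peak - cation_peak   # positive: anions lie further into the water
print(round(cation_peak, 2), round(anion_peak, 2), round(shift, 2))
```

For these synthetic positions the anion maximum lies about 1.0 nm further towards the water phase than the cation maximum. With real trajectories the histograms would be accumulated over many frames and normalised by the bin volume.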
A comprehensive review of the structural organisation of the bilayer/water inter-
face, as well as intermolecular interactions and their dynamics at the interface is
given in Ref. [145].

3.2 Bilayer Hydrophobic Core

Once a bilayer is formed, in its hydrophobic core there is a balance between attrac-
tive van der Waals interactions among adjacent acyl chains and inter-chain entropic
repulsion. The extent of the attractive interaction depends on the phospholipid chain
length and the degree of unsaturation. Longer saturated chains attract one another
more strongly than shorter chains. They are therefore more densely packed in the
bilayer core. In consequence, their mobility is decreased and the main phase tran-
sition temperature of the bilayer is increased. A cis-double bond located near the
middle of the chain, which is typical for mono-unsaturated chains of PCs in animal
cell membranes, interferes with the chain packing. In effect, cis-unsaturated chains
are less densely packed and have considerable motional freedom in the bilayer core.
These decrease the cooperativity of the chain interactions and cause a decline in
the main phase transition temperature of cis-unsaturated compared with that of sat-
urated bilayers. It is interesting to note that bilayers made of phospholipids with
trans-double bonds have a significantly higher main phase transition temperature
than those made of corresponding cis-unsaturated phospholipids and, in general,
their properties are more similar to those of bilayers made of saturated than cis-
unsaturated phospholipids [83]. MD simulations of Róg et al. [172] and Kulig et al.
[85] provided a plausible explanation of these similarities (cf. Sect. 2.5.1 and 4.1).
As has already been discussed above (Sect. 2.6.2), in binary PC-Chol bilayers,
Chol both induces a higher order of PC acyl chains (ordering effect) [138] and
makes their packing denser (condensing effect) [116], although the atomic-level
mechanisms that are responsible for the effects are not easy to identify precisely.
Thus, there is still no general consensus regarding the molecular basis of both effects.
Many researchers claim that phospholipid acyl chains strongly interact with steroid
rings and this makes the chains more straight and ordered—this concept was first put
forward by Levine and Wilkins [93]—and the attractive character of the interaction
increases the packing of atoms in the bilayer. There are two ways to increase the
chain order as measured by one of the order parameters. One of them is to reduce the
number of gauche rotamers along acyl chains, and the other is to reduce the tilting
of acyl chains; the tilt is, by definition, the angle (θ) between the chain vector (linking
the carbon atom next to the carbonyl group with the last carbon atom in the chain) and
the bilayer normal (Fig. 7). However, such a definition of the chain tilt might be
ambiguous. In the liquid-crystalline bilayer, there is no collective tilt of chains. To
show this, one has to consider both the azimuthal, φ, and the polar, θ, chain angles
(Fig. 7). Generally speaking, no collective tilt means that, due to the axial symmetry of
the bilayer, for a given θ angle all φ angles in the range from 0 to 2π are equally
probable; this means that the signed tilt, averaged over the whole range of angles, is
zero. In the liquid-crystalline bilayer the chains are randomly tilted relative to the
normal within the confines of a cone [86, 156] with some distribution. In the tilt
analysis, one is interested only in the absolute value of θ angles. Due to the internal
flexibility of phospholipid acyl chains, the chain tilt in the liquid-crystalline bilayer
cannot be measured in spectroscopic experiments. But for a rigid molecule such as
Chol, spectroscopic methods can provide an average tilt of the molecule from the
average cosine square of θ [148]. In
MD simulations, the distributions of both θ and φ angles can be determined e.g. [1,
146, 202]. MD simulations clearly show that Chol increases Smol along the whole
chain, whether saturated e.g., [44] or mono-unsaturated e.g., [158, 179], by decreasing
the average chain tilt and narrowing the tilt angle distribution e.g., [158, 174].
However, it has a relatively mild effect on the probability of the trans conformation
of torsion angles along the chain, particularly in the case of mono-unsaturated chains
[158, 179].
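The distinction between the tilt of individual chains and a collective tilt can be made concrete with a short calculation. The Python sketch below computes the polar angle θ and the azimuthal angle φ of a chain vector and then builds a synthetic set of chains that all have the same tilt of 30° but azimuthal angles spread evenly over the full circle. The function name, the use of the z axis as the bilayer normal, and the synthetic chain vectors are assumptions made for the example.

```python
import math


def tilt_and_azimuth(chain_vector):
    """Return (theta, phi) in degrees for a chain vector.

    theta is the polar (tilt) angle relative to the bilayer normal, assumed
    to be the +z axis; phi is the azimuthal angle in the bilayer plane.
    """
    x, y, z = chain_vector
    length = math.sqrt(x * x + y * y + z * z)
    cos_theta = max(-1.0, min(1.0, z / length))
    theta = math.degrees(math.acos(cos_theta))
    phi = math.degrees(math.atan2(y, x)) % 360.0
    return theta, phi


# Synthetic chains: identical tilt, azimuth in steps of 1 degree.
tilt = math.radians(30.0)
chains = []
for k in range(360):
    phi_k = math.radians(k)
    chains.append((math.sin(tilt) * math.cos(phi_k),
                   math.sin(tilt) * math.sin(phi_k),
                   math.cos(tilt)))

n = len(chains)
mean_tilt = sum(tilt_and_azimuth(v)[0] for v in chains) / n
mean_x = sum(v[0] for v in chains) / n
mean_y = sum(v[1] for v in chains) / n
print(round(mean_tilt, 6), round(mean_x, 6), round(mean_y, 6))
```

The average tilt of the individual chains remains 30°, while the in-plane components of the average chain vector vanish, so the average vector points along the bilayer normal. This corresponds to the absence of a collective tilt despite a non-zero tilt of every single chain.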
Fig. 7 Definition of the tilt angle θ (polar) and the azimuthal angle φ

Based on the analysis of the radial distribution function of carbon atoms in the
hydrophobic bilayer core, Róg and Pasenkiewicz-Gierula [174, 175, 177] postulated
that an increased packing of atoms in the bilayer (Chol condensing effect) originates
from interactions between the chains, and not between the chains and the Chol
rings. This explanation of the Chol condensing effect is in line with the experimental
hypothesis postulated by Hyslop et al. [64], i.e. that Chol induces an increase in the
van der Waals interactions of acyl chains, while its van der Waals interactions with
the chains are less favourable [64]. Also, the free energy calculations of Zhang et al.
showed favourable changes in lipid–lipid interactions near cholesterol molecules
[228]. In binary PC-Chol bilayers, the Chol induced condensing effect is limited
only to that fragment of each chain that penetrates the bilayer core to the same depth
as the cholesterol ring [3, 175]. A more recent MD simulation study [117] reveals
that in the PC-Chol bilayer Chol molecules avoid direct Chol-Chol contacts, and at a
higher Chol content form a three-fold symmetric arrangement with the nearest Chol
molecules. This induces a particular relative orientation of Chol-adjacent PC acyl
chains and their ordering. The main conclusion of this study was that Chol molecules
act collectively in the lipid bilayer [117].

4 Dynamics of Lipids in the Bilayer (Internal Dynamics, Translational and Rotational Diffusion, Anomalous Lateral Diffusion)

4.1 Trans-Gauche Isomerization

The fastest motion having a direct influence on the bilayer properties is trans-gauche
isomerisation. This causes constant conformational changes in lipid acyl chains and,
together with the vibrations of the covalent bonds and valence angles, makes lipid
molecules internally flexible. This gives rise to the liquid-like (fluid) character of
the bilayer. In saturated acyl chains, there are three low energy conformations: trans
(t, torsion angle 180°), gauche-plus (g+, torsion angle 60°) and gauche-minus (g−,
torsion angle −60°). The trans conformation has the lowest torsional energy, thus it is
the most probable and has the longest lifetime of the three conformations. In naturally
occurring mono-unsaturated acyl chains the torsion angle associated with the double
bond is mainly in cis conformation. This conformation is stable (has a much longer
lifetime than those for single bonds) because the rotation around the double bond
is restricted. The rigidity of the double bond obviously affects the rotational states
of the single bonds connected directly to the double bond. The effect of the double
bond on the conformation of the adjacent single bonds was first observed in MD
simulations described in Refs. [129, 172], even though the torsional parameters for
these bonds there were not fully correct as the rotation around these single bonds
was unrestricted (no barriers for rotation). The parameterisation for the single bonds
derived in a rigorous way [4] takes into account that the most probable conformations
around each of the single bonds next to the double bond are skew-plus (s+, torsion
angle 120°) and skew-minus (s−, torsion angle −120°). The profiles of probabilities
and lifetimes for t, g± and s± along saturated and mono-unsaturated chains of POPC
in pure POPC and POPC-Chol 1:1 bilayers were calculated by Plesnar et al. [158].
These results are in overall agreement with the experimental data of Tuchtenhagen
et al. [210]. The most recent calculations, which led to revised parameters for the
single bonds next to the trans double bond, determined that in addition to their most
probable s+ and s− conformations, mentioned above, the cis conformation is also
highly probable, as are, to a lesser extent, all other conformations of these single
bonds [85]. This is due to the relatively low barriers for rotation around the single
bonds next to the trans double bond.
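
The assignment of torsion angles to conformational states described above can be illustrated with a minimal Python sketch. It assumes that a torsion angle belongs to a state when it lies within ±30° of the nominal value of that state; this window width and the function names are assumptions made for the illustration and are not taken from the cited studies. The lifetime is estimated simply as the mean length of uninterrupted visits to a state.

```python
# Nominal torsion angles (degrees) of the states named in the text.
CENTRES = {"t": 180.0, "g+": 60.0, "g-": -60.0, "s+": 120.0, "s-": -120.0}


def classify_torsion(angle_deg, half_width=30.0, states=("t", "g+", "g-")):
    """Return the label of the first state whose window contains the angle, else None.

    The +/-30 degree half-width is an assumption of this sketch.
    """
    for label in states:
        # Smallest signed angular difference, wrapped into [-180, 180).
        diff = (angle_deg - CENTRES[label] + 180.0) % 360.0 - 180.0
        if abs(diff) <= half_width:
            return label
    return None


def state_statistics(angles_deg, dt_ps, **kwargs):
    """Probability and mean lifetime (ps) of each state in a torsion time series."""
    labels = [classify_torsion(a, **kwargs) for a in angles_deg]
    stats = {}
    for state in set(labels) - {None}:
        visits, run = [], 0
        for lab in labels:
            if lab == state:
                run += 1          # still inside the same visit
            elif run:
                visits.append(run)  # visit ended, store its length in frames
                run = 0
        if run:
            visits.append(run)
        stats[state] = {
            "probability": labels.count(state) / len(labels),
            "mean_lifetime_ps": dt_ps * sum(visits) / len(visits),
        }
    return stats
```

For the single bonds next to a double bond, the same function might be called with states=("t", "s+", "s-"), so that the skew states are counted instead of the gauche ones.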

4.2 Rotational Diffusion

Acyl chains of phospholipids in a liquid-crystalline bilayer possess considerable
intra-molecular flexibility (trans-gauche isomerisation of torsion angles corresponding
to single bonds), thus over a short time scale their rotational motion cannot be
treated as a rigid rod motion. However, over a timescale much longer than trans-
gauche isomerisation, the overall effect of the isomerisation along the acyl chain
might be approximated by a fast rotation of the chain around its long axis, which
would give the chain an apparent cylindrical shape. As has already been mentioned
in Sect. 3.2, chains are randomly tilted relative to the normal within the confines of
a cone; this tilting results from chain rotation around the axis perpendicular to the
bilayer normal and restrictions from a relatively dense environment of other acyl
chains [86, 101, 156]. However, it is not easy to indicate whether the perpendicu-
lar axis is associated with one particular or several covalent bonds or whether it is
the axis of rotation of the whole phospholipid molecule. As the timescale of this
perpendicular rotation is much longer than that of isomerisation [156], in the first
approximation it might be acceptable to assume that over a longer time scale the
rotational motion of acyl chains can indeed be approximated by a rigid rod rotational
diffusion. This rotation takes place in a restoring potential [86, 101, 120, 151] that
acts to align the chains along the bilayer normal. A thorough analysis of the nuclear
magnetic resonance (NMR) spectra of PC bilayers provided correlation times for
trans-gauche isomerisation of the order of 10⁻¹⁰ s (~0.1 ns) and for chain reorientation
of the order of 10⁻⁸–10⁻⁷ s (10–100 ns) [46, 156]. These times generally agree
well with those obtained in MD simulations of lipid bilayers, e.g. [43, 101, 123,
146, 158]. The lifetimes of trans and gauche rotamers do not change significantly
along the PC chain and fall within the ranges of 150–300 ps and ~50–80 ps,
respectively, e.g. [146, 151, 158]. The rotational motion
of a PC molecule or of its fragments was analysed in e.g. Refs. [43, 101,
123, 146, 151]. In each of these papers a different approach was used to calculate
the motional parameters. Pasenkiewicz-Gierula and Róg [146] assessed rotational
correlation times from the rotational autocorrelation function (RAF) for Legendre
polynomials P₁(cos θ) and P₂(cos θ), where θ is the angle between the chosen vector
at time t₀ and at time t₀ + nΔt. RAFs were calculated from a 2-ns MD trajectory
for three fragments of the DMPC molecule: P-N vector, O21-C1 (shoulder) vector,
and the chain vector defined as a vector linking a carbon atom next to the carbonyl
group with the centre of gravity of the chain. The RAF as a function of time was
then fitted to the sum of exponentials, although each decay curve was practically
a single-exponential function. This analysis clearly indicated that each of the three
fragments of the DMPC molecule rotates with a different correlation time and the
rotation of the acyl chain is the slowest. The rotational correlation times estimated
from RAFs for P₁(cos θ) are ~4–6 × 10⁻⁸ s for the chain vector, ~2 × 10⁻⁸ s for the
shoulder vector, and ~0.7 × 10⁻⁸ s for the P-N vector [146]. A qualitatively similar
result was obtained by Moore et al. [123], who calculated the rotational diffusion
coefficients for the rotation of certain DMPC vectors relative to the molecular-fixed
reference frame, from an angular mean square displacement (MSD) function. It is
not possible to obtain, in general, the rotational correlation time from the diffusion
coefficient for restricted rotation in a restoring potential, so a numerical comparison
of the results of both papers is not possible. Nevertheless, both papers demonstrated
that different lipid fragments rotate to a large extent independently of one another.
However, of the fragments, chain rotation was the fastest in Ref. [123] and the slowest
in Ref. [146]. Essmann and Berkowitz [43] derived rotational diffusion coefficients
from time correlation functions for Wigner rotation matrices, first assuming a free
rotor model for the DPPC molecule rotating within a pre-defined reference frame,
and roughly estimated that rotation around the long molecular axis is one order of
magnitude faster than that around the perpendicular axis.
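
The rotational autocorrelation function mentioned above might be computed along the following lines. This is a minimal sketch rather than the procedure of Ref. [146]: it assumes that the vector of interest is available for every frame as an array of shape (n_frames, 3), averages over all time origins, and estimates a single correlation time from a straight-line fit to ln C(t). The function names and the cut-off (floor) used to select the fitting range are assumptions of the sketch.

```python
import numpy as np


def rotational_acf(vectors, order=1):
    """Return <P_order(cos theta)> for every lag, averaged over all time origins.

    vectors: array of shape (n_frames, 3), one molecular vector per frame.
    """
    v = np.asarray(vectors, dtype=float)
    u = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit vectors
    n_frames = len(u)
    acf = np.empty(n_frames)
    for lag in range(n_frames):
        # cos(theta) between the vector at t0 and at t0 + lag*dt, for all t0
        cos_theta = np.sum(u[: n_frames - lag] * u[lag:], axis=1)
        if order == 1:
            acf[lag] = cos_theta.mean()
        elif order == 2:
            acf[lag] = (1.5 * cos_theta**2 - 0.5).mean()
        else:
            raise ValueError("only P1 and P2 are implemented in this sketch")
    return acf


def correlation_time(acf, dt, floor=0.1):
    """Single-exponential estimate: tau = -1 / slope of ln C(t) while C(t) > floor."""
    acf = np.asarray(acf, dtype=float)
    below = np.nonzero(acf <= floor)[0]
    stop = int(below[0]) if below.size else len(acf)
    if stop < 2:
        raise ValueError("need at least two points above the floor")
    t = dt * np.arange(stop)
    slope = np.polyfit(t, np.log(acf[:stop]), 1)[0]
    return -1.0 / slope
```

Because the rotation takes place in a restoring potential, a RAF obtained from a real trajectory need not decay to zero, so a single-exponential decay to zero should be regarded as a simplification; statistics at long lags are also poor when the trajectory is short compared with the correlation time.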
The results of Pasenkiewicz-Gierula and Róg [146] and Moore et al. [123] indicate
that a PC molecule in the bilayer does not rotate as a rigid rod and actually each of the
PC chains rotates independently. As the azimuthal angle φ of an acyl chain vector
(Fig. 7) is not restricted and covers the whole range of angles 0–360° with equal
probability, there is certainly not a single axis of the perpendicular chain rotation. So
what is the origin of the PC acyl chain rotation? Using NMR and X-ray diffraction,
Hauser et al. [59] determined that the glycerol backbone of a PC molecule is not
rigid and that there are two conformations about the C2–C3 bond (cf. Fig. 2) that rapidly
interconvert on the NMR time scale (estimated as 10¹⁰ conversions per s). This
interconversion destroys the parallel alignment of the PC acyl chains. To compensate
for the effect of this interconversion and maintain the parallel alignment of the PC
acyl chains, the first four torsion angles in each of the chains must synchronously and
appropriately change [59]. However, in the liquid-crystalline bilayer, the chains are
not aligned parallel to each other, and the transient tilt of one chain is independent
of that of the other chain. On the basis of the analyses of Hauser et al. [59] one could
conclude that a transition between low energy conformations of any of the first four
torsion angles can bring about changes in the tilt of the acyl chain even though all other
torsions are trans. In a simple test (unpublished results), all torsion angles in the
acyl chains of a well-equilibrated PC bilayer that had been MD simulated for 200 ns [158]
were manually set to the trans conformation, whereas those in the glycerol backbone
(torsions for the C2–C3, C2–O21, C3–O31, and O31–C31 bonds, cf. Fig. 2) were left
unchanged. The resulting structure showed a broad distribution of tilt angles
of PC acyl chains. This indicates that the chain tilting is to a large extent governed
by conformational states about the bonds in the glycerol backbone and that the chain
perpendicular rotation involves a combination of torsional events in the backbone.
In addition to this, the third torsion angle in each PC chain (corresponding to the
C31–C32 and C21–C22 bond, respectively, cf. Fig. 2) has markedly low barriers for
rotation [105], and thus can rapidly change its value triggering fast local changes
in the orientation of the associated acyl chain fragment; this change can propagate
along the chain.
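
The tilt and azimuthal angles of a chain vector (Fig. 7) can be obtained from its Cartesian components, as in the short Python sketch below. The sketch assumes that the bilayer normal coincides with the z axis of the simulation box and that θ and φ follow the usual spherical-coordinate convention; the exact construction used in Fig. 7 may differ. The function name and the normal_sign argument, intended for flipping the normal in the opposite leaflet, are hypothetical.

```python
import math


def tilt_and_azimuth(vector, normal_sign=1.0):
    """Return (theta, phi) in degrees for a chain vector.

    theta: polar (tilt) angle between the vector and the bilayer normal (+z or -z).
    phi:   azimuthal angle of the projection on the xy plane, in [0, 360).
    """
    x, y, z = vector
    length = math.sqrt(x * x + y * y + z * z)
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_theta = max(-1.0, min(1.0, normal_sign * z / length))
    theta = math.degrees(math.acos(cos_theta))
    phi = math.degrees(math.atan2(y, x)) % 360.0
    return theta, phi
```

Collecting φ over all chains and frames into a histogram would show whether the azimuthal angle is indeed uniformly distributed, whereas the histogram of θ gives the tilt-angle distribution discussed above.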

4.3 Translational Diffusion

A phospholipid bilayer in the liquid-crystalline phase is a quasi-two-dimensional
system, as the molecules can translocate laterally, whereas their vertical moves are
significantly restricted. The lateral displacement of lipids in the bilayer as a func-
tion of time can be determined from the mean square displacement function, which,
in the framework of a two dimensional random walk model is related to the lateral
self-diffusion coefficient. Experimental methods provide a range of translational
self-diffusion coefficients for molecules in lipid bilayers, whose extremes differ by two
orders of magnitude, depending on the time window of the applied method, e.g. [49,
189, 201]. To reconcile this discrepancy, a model of lipid diffusion was proposed
[218], where, over a short time scale, lipids “rattle-about” in a vacant space, and
over a longer time scale undergo discrete jumps whose lengths are about the same as
the diameter of a lipid molecule. Pastor and Feller [151] estimated that in a bilayer
consisting of 72 lipid molecules it would take ~170 ns for all molecules to jump
once. MD simulations of 100 and 40 ns of PC bilayers consisting, respectively, of 128
and 1152 lipid molecules [45] gave no evidence for a jump-diffusion model of lipid
molecules in the bilayer or for two clearly distinct regimes, rattling and jumps.
Instead, they showed that the lateral motion of neighbouring lipids is strongly
correlated and that lipids move as loosely defined transient local clusters. Moreover, these
clusters undergo concerted motions over much longer time scales. Thus, the motion
of lipids is correlated over tens of nanometres and shows two-dimensional collective
flows [45]. Theoretical analyses of the lateral diffusion of lipids on a hundred ns time
scale carried out in the framework of the generalized Langevin equation [82] showed
that diffusion displays a clear signature of subdiffusion, with fractional diffusion con-
stants that are compatible with the experimental results obtained using fluorescence
correlation spectroscopy [186]. This anomalous diffusion (subdiffusion) of lipids in
the bilayer is consistent with the collective flow patterns in the lateral motions of
lipids observed by Falck et al. [45]. The results of Kneller et al. [82] indicate that
the lateral displacement of lipids in the bilayer over a hundred ns time scale cannot
be analysed in the framework of the normal Brownian diffusion model. A similar
conclusion was drawn from the MD simulation study of the Vattulainen group [70,
73], who additionally demonstrated the effect of membrane crowding on the lipid
lateral diffusion.
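
The relation between the lateral mean square displacement and the diffusion coefficient can be made concrete with the Python sketch below. For a two-dimensional random walk the MSD grows as 4Dt; to allow for subdiffusion, the sketch fits the more general form MSD = 4·D_α·t^α on a log-log scale, where α = 1 recovers normal diffusion and α < 1 indicates subdiffusion. The prefactor convention 4·D_α is an assumption of this sketch (other conventions include a Gamma-function factor), as are the function names and the requirement that the input coordinates are unwrapped, i.e. not folded back into the periodic box.

```python
import numpy as np


def lateral_msd(xy, max_lag):
    """Lateral MSD for lags 1..max_lag frames.

    xy: array of shape (n_frames, n_lipids, 2) with unwrapped x, y positions.
    The average runs over all lipids and all time origins.
    """
    xy = np.asarray(xy, dtype=float)
    msd = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        disp = xy[lag:] - xy[:-lag]  # displacements over `lag` frames
        msd[lag - 1] = np.mean(np.sum(disp**2, axis=-1))
    return msd


def diffusion_fit(msd, dt):
    """Fit MSD = 4 * D_alpha * t**alpha on a log-log scale; return (alpha, D_alpha)."""
    msd = np.asarray(msd, dtype=float)
    t = dt * np.arange(1, len(msd) + 1)
    alpha, log_prefactor = np.polyfit(np.log(t), np.log(msd), 1)
    return alpha, np.exp(log_prefactor) / 4.0
```

In practice the fitted exponent depends on the range of lag times included in the fit, which is consistent with the observation above that the apparent diffusion coefficient depends on the time window of the method.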

5 Mechanical Properties of a Lipid Bilayer (Rigidity, Pressure Profile Across the Bilayer, Curvature)

The thickness of the lipid matrix of a biomembrane is a few nanometres, as it is
the sum of the lengths of two phospholipid molecules. At the same time, the matrix
covers the surface of the cell, which, in the case of an animal cell of approximately
10 μm in diameter, exceeds 100 μm². This clearly shows that the lipid matrix has
to have outstanding mechanical properties. As a material, the lipid bilayer is elastic
(deformable), durable and volumetrically incompressible [57, 90]. However, con-
sidering the mechanical properties of an animal cell biomembrane, it should be
appreciated that beneath the lipid bilayer there is a membrane skeleton, an inter-
nal network of protein filaments that is coupled to the biomembrane and both these
supramolecular structures respond to mechanical deformation of the cell. The dis-
proportion between the thickness of the bilayer and its lateral dimension attracts the
attention of researchers to the study of the bilayer surface and mechanical proper-
ties. These properties can also be studied using MD simulation methodology. The
mechanical properties of the DOPC and MGDG bilayers are compared in Ref. [5].
The calculated bending rigidity modulus is higher for the MGDG than for the DOPC bilayer
due to the higher number of inter-lipid interactions at the MGDG bilayer surface. This results
in a smaller surface area per molecule and thus in an increased rigidity of the MGDG
bilayer compared to the DOPC bilayer.
One of the basic surface properties is its curvature. Unfortunately, limitations on
the spatial and temporal scales of current atomistic MD simulations, as well as the
use of periodic boundary conditions, make direct observation and calculation of the
lipid bilayer curvature a non-trivial task. One of the methods for determining the
curvature involves calculating the depth-dependent distribution of intra-membrane
pressures, the lateral pressure profile. To calculate the profile, the bilayer is divided
into thin slices parallel to the interface plane and then the pressure tensor is calculated
for each slice [100, 139]. On the basis of Helfrich’s theory [60], one can calculate
the spontaneous curvature and Gaussian curvature modulus by integrating the lateral
pressure profile [140].
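
The slicing procedure described above can be sketched in Python as follows. The sketch assumes that the diagonal components of the pressure tensor have already been obtained for each slice by the simulation software; it forms the lateral pressure profile as the difference between the lateral and normal components and evaluates its zeroth, first and second moments by trapezoidal integration. The zeroth moment equals minus the surface tension. In Helfrich-based treatments the first moment is related to the product of the bending modulus and the spontaneous curvature, and the second moment to the Gaussian curvature modulus, but the signs, the reference plane for z and the integration limits (monolayer or bilayer) differ between papers and are therefore left open here. The function names are hypothetical.

```python
import numpy as np


def lateral_pressure_profile(p_xx, p_yy, p_zz):
    """pi(z) = (P_xx + P_yy) / 2 - P_zz for every slice (bilayer normal along z)."""
    p_lateral = 0.5 * (np.asarray(p_xx, dtype=float) + np.asarray(p_yy, dtype=float))
    return p_lateral - np.asarray(p_zz, dtype=float)


def _trapezoid(y, x):
    """Trapezoidal rule, written out to avoid depending on a specific NumPy version."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def profile_moments(z, pi_z):
    """Return the zeroth, first and second moments of pi(z) with respect to z."""
    z = np.asarray(z, dtype=float)
    pi_z = np.asarray(pi_z, dtype=float)
    return tuple(_trapezoid(z**k * pi_z, z) for k in range(3))
```

The moments are sensitive to the choice of the origin of z and to noise in the tails of the profile, so in practice the profile is usually averaged over a long trajectory before integration.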
The lateral pressure profile model is a valuable analytical tool for explaining pro-
cesses such as membrane protein activation. It was shown that changes in the lateral
pressure profile may result in biologically significant changes in protein conforma-
tions [17, 18, 20, 21, 121, 155]. Along these lines is the lateral pressure hypothesis of
the anaesthetic mode of action. Recent computer simulation studies on the influence
of anaesthetics such as ethanol [204] or 1-alkanols [52] on the lateral pressure profile
of a membrane seem to confirm the mechanically driven mechanism of anaesthesia
[17, 19]. Since the lipid composition of cell membranes strongly affects the activity
of membrane proteins, the effects of the phospholipid head group, acyl chain length,
unsaturation, cholesterol content, and surface area per lipid on the pressure profile
across the bilayer were studied using MD simulation methods e.g. in Refs. [22, 53,
139, 153]. It was shown in those studies that all these factors have a considerable
effect on the lateral pressure profile.

6 Simple Models of Specific Biomembranes

Some of the lipid bilayers discussed in this chapter may be viewed as models of lipid
matrices of specific biomembranes. Binary POPC-Chol [26, 179] or SM-Chol [143,
178] bilayers may serve as simple models of a “generic” animal cell membrane,
particularly of its outer leaflet. A binary bilayer made of PE and PG at a 3:1 molar
ratio [128, 231] can serve as a model for the inner bacterial membrane. Binary
bilayers of mono- and digalactosyldiacylglycerol with polyunsaturated acyl chains
are good models of a photosynthetic membrane. More realistic models of an animal
cell membrane are discussed in Sect. 2.6.4 on asymmetric bilayers where two leaflets
of the bilayer have a different but relevant lipid composition. POPC and Chol are the
main lipid species found in human and pig gastric mucus, thus a POPC-Chol bilayer
can also serve as a model for the gastric mucosal cell membrane [111]. A mixture of
DPPC and DPPG at a 7:3 molar ratio in the form of a monolayer might be used as a
model for the lung surfactant [79]. A ternary mixed cardiolipin, PC, and PE bilayer
may constitute a model for the inner mitochondrial membrane [171]. As was already
discussed in Sect. 2.6.3, a ternary mixed bilayer composed of saturated, unsaturated
phospholipids and cholesterol can model a raft-like domain in the bulk membrane
[143].
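
When such a model is set up for simulation, the molar ratios quoted above have to be converted into whole numbers of molecules. The Python sketch below shows one way of doing this, assuming that the number of lipids per leaflet is chosen by the user (the value of 64 used in the comment is purely illustrative) and that rounding is resolved by the largest-remainder rule. The function name is hypothetical.

```python
from fractions import Fraction


def lipid_counts(parts, n_total):
    """Split n_total lipids among species according to molar-ratio parts.

    parts: mapping such as {"PE": 3, "PG": 1} for a 3:1 molar ratio.
    Any lipids left over after rounding down are given to the species
    with the largest fractional remainders (an assumption of this sketch).
    """
    total_parts = sum(parts.values())
    exact = {name: Fraction(p * n_total, total_parts) for name, p in parts.items()}
    counts = {name: int(value) for name, value in exact.items()}  # round down
    leftover = n_total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Example: a PE-PG leaflet at a 3:1 molar ratio with 64 lipids (illustrative size).
pe_pg = lipid_counts({"PE": 3, "PG": 1}, 64)
```

When the leaflet size is not a multiple of the sum of the ratio parts, the realised composition deviates slightly from the nominal one, which may matter for small systems.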

7 Concluding Remarks

Biomembranes, which are supramolecular structures of great structural and dynamical
complexity, are experimentally challenging. These assemblies comprise a very
large number of molecules of different types and among them only a few, mainly
transmembrane proteins, have well-defined conformations. In contrast, the main
structural element of any biomembrane, the lipid matrix, consists predominantly
of molecules that rapidly change their conformations, so only the lamellar structure
of the matrix is well preserved. However, even this lyotropic phase may in certain
environmental conditions locally change to a non-lamellar phase to activate some
membrane proteins e.g., [51]. A very particular feature of any lipid matrix is that
its molecular organisation, necessary for biological functionality, is controlled by
basic physical principles and relies mainly on weak physical interactions between
molecules whose key characteristics are amphipathicity (lipids) and polarity (water).
Detailed atomic-level information about the dynamical structure and processes
that take place in the lipid bilayer can be obtained using an MD simulation method
with atomic resolution. At present, the method allows one to observe the processes
that take place over a 10⁻⁶ s time scale for systems containing over 200,000 atoms
e.g. [37, 71]. MD simulations complement experimental studies that, in general, have
a worse spatial and temporal resolution than computer modelling, but do not have
similar restrictions on the size of the system studied, and, in many cases, have a much
longer observation window than molecular processes. In addition to complementing,
MD simulations stimulate experimental studies. They are also of great assistance in
explaining experimental results by indicating basic mechanisms that are responsible
for them. This positive feedback between experiment and MD simulation leads to a
better understanding of the properties of lipid bilayers and their role in biomembrane
functioning and also helps us to improve those models necessary for the interpretation
of experimental data.
Computer models need to be validated by experimental results. Among the plenitude
of data provided by MD simulations there are some that can also be obtained
experimentally, and they are used to validate the computer model, e.g. [7, 160, 161].
When validation is positive, one can trust those results of MD simulation that are not
accessible to current experimental methods.
The topics discussed in this chapter are necessarily biased towards the research
interests of the authors and their colleagues, such as the bilayer/water interface,
short-range intermolecular interactions that stabilise the bilayer, the effect of cholesterol,
and lipid dynamics in the bilayer; therefore, certain important issues relating to
lipid bilayers are not addressed here. Excellent reviews on a broader range of topics
were cited at the end of the Introduction, and more specific topics are discussed in
papers cited throughout this chapter.

Acknowledgements MPG is grateful to Akihiro Kusumi and W. K. Subczynski for numerous
discussions. The Polish National Science Centre is acknowledged for the financial support (grants
no. N301 472638; N301 02131/0553; 2011/01/B/NZ1/00081; 2016/22/M/NZ1/0187). Faculty of
Biochemistry, Biophysics and Biotechnology of Jagiellonian University is a partner of the Leading
National Research Centre (KNOW) supported by the Ministry of Science and Higher Education.

References

1. Aittoniemi, J., Rog, T., Niemela, P., Pasenkiewicz-Gierula, M., Karttunen, M., Vattulainen,
I.: Tilt: major factor in sterols’ ordering capability in membranes. J. Phys. Chem. B 110(51),
25562–25564 (2006)
2. Alper, H.E., Bassolino-Klimas, D., Stouch, T.R.: The limiting behavior of water hydrating a
phospholipid monolayer: a computer-simulation study. J. Chem. Phys. 99(7), 5547–5559
(1993)
3. Alwarawrah, M., Dai, J.A., Huang, J.Y.: A molecular view of the cholesterol condensing
effect in DOPC lipid bilayers. J. Phys. Chem. B 114(22), 7516–7523 (2010)
4. Bachar, M., Brunelle, P., Tieleman, D.P., Rauk, A.: Molecular dynamics simulation of a
polyunsaturated lipid bilayer susceptible to lipid peroxidation. J. Phys. Chem. B 108(22),
7170–7179 (2004)
5. Baczynski, K., Markiewicz, M., Pasenkiewicz-Gierula, M.: A computer model of a polyun-
saturated monogalactolipid bilayer. Biochimie 118, 129–140 (2015)
6. Bassolino-Klimas, D., Alper, H.E., Stouch, T.R.: Solute diffusion in lipid bilayer-membranes:
an atomic-level study by molecular-dynamics simulation. Biochemistry 32(47),
12624–12637 (1993)
7. Benz, R.W., Castro-Roman, F., Tobias, D.J., White, S.H.: Experimental validation of molec-
ular dynamics simulations of lipid bilayers: a new approach. Biophys. J. 88(2), 805–817
(2005)
8. Berendsen, H., Postma, J., Van Gunsteren, W., Hermans, J.: Interaction Models for Water in
Relation to Protein Hydration. Intermolecular Forces, vol. 331. Reidel, Dordrecht (1981)
9. Berendsen, H.J.C.: Simulating the Physical World, Hierarchical Modeling from Quantum
Mechanics to Fluid Dynamics. Cambridge University Press, Cambridge (2007)
10. Berendsen, H.J.C., Tieleman, D.P.: Molecular dynamics: studies of lipid bilayers. In: Schleyer,
R. (ed.) Encyclopedia of Computational Chemistry, pp. 1639–1650. Wiley and Sons (1998)
11. Berger, O., Edholm, O., Jahnig, F.: Molecular dynamics simulations of a fluid bilayer of
dipalmitoylphosphatidylcholine at full hydration, constant pressure, and constant temperature.
Biophys. J. 72(5), 2002–2013 (1997)
12. Berkowitz, M.L.: Detailed molecular dynamics simulations of model biological membranes
containing cholesterol. Biochim. Biophys. Acta-Biomem. 1788(1), 86–96 (2009)
13. Berkowitz, M.L., Bostick, D.L., Pandit, S.: Aqueous solutions next to phospholipid membrane
surfaces: insights from simulations. Chem. Rev. 106(4), 1527–1539 (2006)
14. Bhide, S.Y., Zhang, Z.C., Berkowitz, M.L.: Molecular dynamics simulations of SOPS and
sphingomyelin bilayers containing cholesterol. Biophys. J. 92(4), 1284–1295 (2007)
15. Bloom, M., Evans, E., Mouritsen, O.G.: Physical-properties of the fluid lipid-bilayer compo-
nent of cell-membranes—a perspective. Q. Rev. Biophys. 24(3), 293–397 (1991)
16. Buldt, G.: The headgroup conformation of phospholipids in membranes. J. Membr. Biol.
58(2), 81–100 (1981)
17. Cantor, R.S.: The lateral pressure profile in membranes: a physical mechanism of general
anesthesia. Biochemistry 36(9), 2339–2344 (1997)
18. Cantor, R.S.: Lateral pressures in cell membranes: a mechanism for modulation of protein
function. J. Phys. Chem. B 101(10), 1723–1725 (1997)
19. Cantor, R.S.: The lateral pressure profile in membranes: a physical mechanism of general
anesthesia. Toxicol. Lett. 101, 451–458 (1998)
20. Cantor, R.S.: The influence of membrane lateral pressures on simple geometric models of
protein conformational equilibria. Chem. Phys. Lipids 101(1), 45–56 (1999)
21. Cantor, R.S.: Lipid composition and the lateral pressure profile in bilayers. Biophys. J. 76(5),
2625–2639 (1999)
22. Carrillo-Tripp, M., Feller, S.E.: Evidence for a mechanism by which ω-3 polyunsaturated
lipids may affect membrane protein function. Biochemistry 44(30), 10164–10169 (2005)
23. Cascales, J.J.L., Otero, T.F., Smith, B.D., Gonzalez, C., Marquez, M.: Model of an asymmetric
DPPC/DPPS membrane: effect of asymmetry on the lipid properties. A molecular dynamics
simulation study. J. Phys. Chem. B 110(5), 2358–2363 (2006)
24. Cevc, G., Watts, A., Marsh, D.: Titration of the phase-transition of phosphatidylserine
bilayer-membranes: effects of pH, surface electrostatics, ion binding, and headgroup hydration.
Biochemistry 20(17), 4955–4965 (1981)
25. Chiu, S.W., Jakobsson, E., Mashl, R.J., Scott, H.L.: Cholesterol-induced modifications in lipid
bilayers: a simulation study. Biophys. J. 83(4), 1842–1853 (2002)
26. Chiu, S.W., Jakobsson, E., Scott, H.L.: Combined Monte Carlo and molecular dynamics sim-
ulation of hydrated lipid-cholesterol lipid bilayers at low cholesterol concentration. Biophys.
J. 80(3), 1104–1114 (2001)
27. Chowdhary, J., Harder, E., Lopes, P.E.M., Huang, L., MacKerell, A.D., Roux, B.: A polarizable
force field of dipalmitoylphosphatidylcholine based on the classical Drude model for molecular
dynamics simulations of lipids. J. Phys. Chem. B 117(31), 9142–9160 (2013)
28. Cordomi, A., Edholm, O., Perez, J.J.: Effect of ions on a dipalmitoyl phosphatidylcholine
bilayer. A molecular dynamics simulation study. J. Phys. Chem. B 112(5), 1397–1408 (2008)
29. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M., Ferguson, D.M., Spellmeyer,
D.C., Fox, T., Caldwell, J.W., Kollman, P.A.: A 2nd generation force-field for the simulation
of proteins, nucleic-acids, and organic-molecules. J. Am. Chem. Soc. 117(19), 5179–5197
(1995)
30. Damodaran, K.V., Merz, K.M.: Head group water interactions in lipid bilayers: a comparison
between DMPC-based and DLPE-based lipid bilayers. Langmuir 9(5), 1179–1183 (1993)
31. Damodaran, K.V., Merz, K.M.: A comparison of DMPC-based and DLPE-based lipid bilayers.
Biophys. J. 66(4), 1076–1087 (1994)
32. Damodaran, K.V., Merz, K.M., Gaber, B.P.: Structure and dynamics of the dilauroylphos-
phatidylethanolamine lipid bilayer. Biochemistry 31(33), 7656–7664 (1992)
33. Davis, J.E., Patel, S.: Charge equilibration force fields for lipid environments: applications to
fully hydrated DPPC bilayers and DMPC-embedded gramicidin A. J. Phys. Chem. B 113(27),
9183–9196 (2009)
34. de Vries, A.H., Mark, A.E., Marrink, S.J.: The binary mixing behavior of phospholipids in a
bilayer: a molecular dynamics study. J. Phys. Chem. B 108(7), 2454–2463 (2004)
35. Demel, R.A., Bruckdorfer, K.R., van Deenen, L.L.: Effect of sterol structure on permeability of
liposomes to glucose, glycerol and Rb+. Biochim. Biophys. Acta 255(1), 321–330 (1972)
36. Dufourc, E.J., Parish, E.J., Chitrakorn, S., Smith, I.C.P.: Structural and dynamical details of
cholesterol-lipid interaction as revealed by deuterium NMR. Biochemistry 23(25), 6062–6071
(1984)
37. Dzieciuch-Rojek, M., Poojari, C., Bednar, J., Bunker, A., Kozik, B., Nowakowska, M., Vat-
tulainen, I., Wydro, P., Kepczynski, M., Rog, T.: Effects of membrane PEGylation on entry
and location of antifungal drug itraconazole and their pharmacological implications. Mol.
Pharmaceut. 14(4), 1057–1070 (2017)
38. Egberts, E., Marrink, S.J., Berendsen, H.J.C.: Molecular-dynamics simulation of a phospho-
lipid membrane. Eur. Biophys. J. Biophy. Let. 22(6), 423–436 (1994)
39. Eicher, B., Heberle, F.A., Marquardt, D., Rechberger, G.N., Katsaras, J., Pabst, G.: Joint
small-angle X-ray and neutron scattering data analysis of asymmetric lipid vesicles. J. Appl.
Crystallogr. 50(Pt 2), 419–429 (2017)
40. El-Sayed, M., Guion, T., Fayer, M.: Effect of cholesterol on viscoelastic properties of dipalmi-
toylphosphatidylcholine multibilayers as measured by a laser-induced ultrasonic probe. Bio-
chemistry 25(17), 4825–4832 (1986)
41. Elmore, D.E.: Molecular dynamics simulation of a phosphatidylglycerol membrane. FEBS
Lett. 580(1), 144–148 (2006)
42. Epand, R.M.: Role of membrane lipids in modulating the activity of membrane-bound
enzymes. In: Yeagle, P.L. (ed.) The Structure of Biological Membranes, pp. 499–509. CRC
Press, Boca Raton (2005)
43. Essmann, U., Berkowitz, M.L.: Dynamical properties of phospholipid bilayers from computer
simulation. Biophys. J. 76(4), 2081–2089 (1999)
44. Falck, E., Patra, M., Karttunen, M., Hyvonen, M.T., Vattulainen, I.: Lessons of slicing mem-
branes: interplay of packing, free area, and lateral diffusion in phospholipid/cholesterol bilay-
ers. Biophys. J. 87(2), 1076–1091 (2004)
45. Falck, E., Rog, T., Karttunen, M., Vattulainen, I.: Lateral diffusion in lipid membranes through
collective flows. J. Am. Chem. Soc. 130(1), 44–45 (2008)
46. Feigenson, G.W., Chan, S.I.: Nuclear magnetic relaxation behavior of lecithin multilayers. J.
Am. Chem. Soc. 96(5), 1312–1319 (1974)
47. Feller, S.E.: Molecular dynamics simulations of lipid bilayers. Curr. Opin. Colloid Interface
Sci. 5(3–4), 217–223 (2000)
48. Frischleder, H., Gleichmann, S., Krahl, R.: Quantum-chemical and empirical calculations on
phospholipids. 3. Hydration of dimethylphosphate anion. Chem. Phys. Lipids 19(2), 144–149
(1977)
49. Galla, H.J., Hartmann, W., Theilen, U., Sackmann, E.: On 2-dimensional passive random-
walk in lipid bilayers and fluid pathways in biomembranes. J. Membr. Biol. 48(3), 215–236
(1979)
50. Gawrisch, K., Arnold, K., Gottwald, T., Klose, G., Volke, F.: D-2 NMR studies of
phosphate-water interaction in dipalmitoyl phosphatidylcholine-water systems. Stud. Biophys.
74, 13–14 (1978)
51. Goss, R., Lohr, M., Latowski, D., Grzyb, J., Vieler, A., Wilhelm, C., Strzalka, K.: Role of
hexagonal structure-forming lipids in diadinoxanthin and violaxanthin solubilization and de-
epoxidation. Biochemistry 44(10), 4028–4036 (2005)
52. Griepernau, B., Bockmann, R.A.: The influence of 1-alkanols and external pressure on the
lateral pressure profiles of lipid bilayers. Biophys. J. 95(12), 5766–5778 (2008)
53. Gullingsrud, J., Schulten, K.: Lipid bilayer pressure profiles and mechanosensitive channel
gating. Biophys. J. 86(6), 3496–3509 (2004)
54. Gumbart, J., Trabuco, L.G., Schreiner, E., Villa, E., Schulten, K.: Regulation of the protein-
conducting channel by a bound ribosome. Structure 17(11), 1453–1464 (2009)
55. Gurtovenko, A.A., Patra, M., Karttunen, M., Vattulainen, I.: Cationic DMPC/DMTAP lipid
bilayers: molecular dynamics study. Biophys. J. 86(6), 3461–3472 (2004)
56. Hall, A., Rog, T., Karttunen, M., Vattulainen, I.: Role of glycolipids in lipid rafts: a view
through atomistic molecular dynamics simulations with galactosylceramide. J. Phys. Chem.
B 114(23), 7797–7807 (2010)
57. Hamill, O.P., Martinac, B.: Molecular basis of mechanotransduction in living cells. Physiol.
Rev. 81(2), 685–740 (2001)
58. Hancock, J.F.: Lipid rafts: contentious only from simplistic standpoints. Nat. Rev. Mol. Cell
Biol. 7(6), 456–462 (2006)
59. Hauser, H., Pascher, I., Sundell, S.: Preferred conformation and dynamics of the glycerol
backbone in phospholipids: an NMR and X-ray single-crystal analysis. Biochemistry 27(26),
9166–9174 (1988)
60. Helfrich, W.: Elastic properties of lipid bilayers: theory and possible experiments.
Z. Naturforsch. C 28(11–1), 693–703 (1973)
61. Heller, H., Schaefer, M., Schulten, K.: Molecular dynamics simulation of a bilayer of 200
lipids in the gel and liquid-crystal phases. J. Phys. Chem. 97, 8343–8360 (1993)
62. Huang, J., Swanson, J.E., Dibble, A.R., Hinderliter, A.K., Feigenson, G.W.: Nonideal mixing
of phosphatidylserine and phosphatidylcholine in the fluid lamellar phase. Biophys. J. 64(2),
413–425 (1993)
63. Hub, J.S., Salditt, T., Rheinstadter, M.C., de Groot, B.L.: Short-range order and collective
dynamics of DMPC bilayers: a comparison between molecular dynamics simulations, X-ray,
and neutron scattering experiments. Biophys. J. 93(9), 3156–3168 (2007)
64. Hyslop, P.A., Morel, B., Sauerheber, R.D.: Organization and interaction of cholesterol and
phosphatidylcholine in model bilayer membranes. Biochemistry 29, 1025–1038 (1990)
65. Ingolfsson, H.I., Melo, M.N., van Eerden, F.J., Arnarez, C., Lopez, C.A., Wassenaar, T.A.,
Periole, X., de Vries, A.H., Tieleman, D.P., Marrink, S.J.: Lipid organization of the plasma
membrane. J. Am. Chem. Soc. 136(41), 14554–14559 (2014)
66. Jacob, R.F., Cenedella, R.J., Mason, R.P.: Direct evidence for immiscible cholesterol domains
in human ocular lens fiber cell plasma membranes. J. Biol. Chem. 274(44), 31613–31618
(1999)
67. Jacobson, K., Mouritsen, O.G., Anderson, R.G.W.: Lipid rafts: at a crossroad between cell
biology and physics. Nat. Cell Biol. 9(1), 7–14 (2007)
68. Jambeck, J.P.M., Lyubartsev, A.P.: Derivation and systematic validation of a refined all-atom
force field for phosphatidylcholine lipids. J. Phys. Chem. B 116(10), 3164–3179 (2012)
69. Jambeck, J.P.M., Lyubartsev, A.P.: An extension and further validation of an all-atomistic
force field for biological membranes. J. Chem. Theory Comput. 8(8), 2938–2948 (2012)
70. Javanainen, M., Hammaren, H., Monticelli, L., Jeon, J.H., Miettinen, M.S., Martinez-Seara,
H., Metzler, R., Vattulainen, I.: Anomalous and normal diffusion of proteins and lipids in
crowded lipid membranes. Faraday Discuss. 161, 397–417 (2013)
71. Javanainen, M., Martinez-Seara, H., Vattulainen, I.: Nanoscale membrane domain formation
driven by cholesterol. Sci. Rep. 7 (2017)
72. Jedlovszky, P., Mezei, M.: Effect of cholesterol on the properties of phospholipid membranes.
2. Free energy profile of small molecules. J. Phys. Chem. B 107(22), 5322–5332 (2003)
73. Jeon, J.H., Javanainen, M., Martinez-Seara, H., Metzler, R., Vattulainen, I.: Protein crowding
in lipid bilayers gives rise to non-gaussian anomalous lateral diffusion of phospholipids and
proteins. Phys. Rev. X 6(2) (2016)
74. Jiang, W., Hardy, D.J., Phillips, J.C., MacKerell Jr., A.D., Schulten, K., Roux, B.:
High-performance scalable molecular dynamics simulations of a polarizable force field based on
classical Drude oscillators in NAMD. J. Phys. Chem. Lett. 2(2), 87–92 (2011)
75. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., Klein, M.L.: Comparison of
simple potential functions for simulating liquid water. J. Chem. Phys. 79(2), 926–935 (1983)
76. Jorgensen, W.L., Maxwell, D.S., Tirado-Rives, J.: Development and testing of the OPLS
all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem.
Soc. 118(45), 11225–11236 (1996)
77. Jorgensen, W.L., Tirado-Rives, J.: The OPLS [optimized potentials for liquid simulations]
potential functions for proteins, energy minimizations for crystals of cyclic peptides and
crambin. J. Am. Chem. Soc. 110(6), 1657–1666 (1988)
78. Kaszuba, K., Rog, T., Bryl, K., Vattulainen, I., Karttunen, M.: Molecular dynamics simulations
reveal fundamental role of water as factor determining affinity of binding of beta-blocker
Nebivolol to beta(2)-adrenergic receptor. J. Phys. Chem. B 114(25), 8374–8386 (2010)
79. Kaznessis, Y.N., Kim, S.T., Larson, R.G.: Simulations of zwitterionic and anionic phospho-
lipid monolayers. Biophys. J. 82(4), 1731–1742 (2002)
80. Kim, T., Im, W.: Revisiting hydrophobic mismatch with free energy simulation studies of
transmembrane helix tilt and rotation. Biophys. J. 99(1), 175–183 (2010)
81. Klauda, J.B., Venable, R.M., Freites, J.A., O’Connor, J.W., Tobias, D.J., Mondragon-Ramirez,
C., Vorobyov, I., MacKerell, A.D., Pastor, R.W.: Update of the CHARMM all-atom additive
force field for lipids: validation on six lipid types. J. Phys. Chem. B 114(23), 7830–7843
(2010)
82. Kneller, G.R., Baczynski, K., Pasenkiewicz-Gierula, M.: Communication: consistent picture
of lateral subdiffusion in lipid bilayers: molecular dynamics simulation and exact results. J.
Chem. Phys. 135(14) (2011)
83. Koynova, R., Caffrey, M.: Phases and phase transitions of the phosphatidylcholines. Biochim.
Biophys. Acta-Rev. Biomem. 1376(1), 91–145 (1998)
84. Kulig, W., Pasenkiewicz-Gierula, M., Rog, T.: Topologies, structures and parameter files for
lipid simulations in GROMACS with the OPLS-aa force field: DPPC, POPC, DOPC, PEPC,
and cholesterol. Data Brief 5, 333–336 (2015)
85. Kulig, W., Pasenkiewicz-Gierula, M., Rog, T.: Cis and trans unsaturated phosphatidylcholine
bilayers: a molecular dynamics simulation study. Chem. Phys. Lipids 195, 12–20 (2016)
86. Kusumi, A., Pasenkiewicz-Gierula, M.: Rotational diffusion of a steroid molecule in phos-
phatidylcholine membranes—effects of alkyl chain-length, unsaturation, and cholesterol as
studied by a spin-label method. Biochemistry 27(12), 4407–4415 (1988)
87. Lamoureux, G., MacKerell, A.D., Roux, B.: A simple polarizable model of water based on
classical Drude oscillators. J. Chem. Phys. 119(10), 5185–5197 (2003)
88. Lamoureux, G., Roux, B.: Modeling induced polarization with classical Drude oscillators:
theory and molecular dynamics simulation algorithm. J. Chem. Phys. 119(6), 3025–3039
(2003)
89. Leach, A.R.: Molecular Modelling, Principles and Applications, 2nd edn. Pearson Education,
Harlow, UK (2001)
90. Lee, A.G.: How to understand lipid–protein interactions in biological membranes. In: Yeagle,
P.L. (ed.) Structure of Biological Membranes. CRC Press, Boca Raton (2012)
91. Leekumjorn, S., Sum, A.K.: Molecular simulation study of structural and dynamic properties
of mixed DPPC/DPPE bilayers. Biophys. J. 90(11), 3951–3965 (2006)
92. Lehnert, R., Eibl, H.-J., Müller, K.: Order and dynamics in lipid bilayers from 1,2-dipalmitoyl-
sn-glycero-phospho-diglycerol as studied by NMR spectroscopy. J. Phys. Chem. B 108,
12141–12150 (2004)
93. Levine, Y.K., Wilkins, M.H.F.: Structure of oriented lipid bilayers. Nat. New Biol. 230(11),
69 (1971)
94. Levitt, M., Hirshberg, M., Sharon, R., Daggett, V.: Potential-energy function and parameters
for simulations of the molecular-dynamics of proteins and nucleic-acids in solution. Comput.
Phys. Commun. 91(1–3), 215–231 (1995)
95. Lewis, B.A., Engelman, D.M.: Lipid bilayer thickness varies linearly with acyl chain-length
in fluid phosphatidylcholine vesicles. J. Mol. Biol. 166(2), 211–217 (1983)
96. Lewis, R.N.A.H., McElhaney, R.N.: Calorimetric and spectroscopic studies of the ther-
motropic phase behavior of lipid bilayer model membranes composed of a homologous series
of linear saturated phosphatidylserines. Biophys. J. 79(4), 2043–2055 (2000)
97. Lewis, R.N.A.H., McElhaney, R.N., Monck, M.A., Cullis, P.R.: Studies of highly asymmetric
mixed-chain diacyl phosphatidylcholines that form mixed-interdigitated gel phases:
Fourier-transform infrared and H-2 NMR spectroscopic studies of hydrocarbon chain conformation and
orientational order in the liquid-crystalline state. Biophys. J. 67(1), 197–207 (1994)
98. Li, H., Chowdhary, J., Huang, L., He, X.B., MacKerell, A.D., Roux, B.: Drude polarizable
force field for molecular dynamics simulations of saturated and unsaturated zwitterionic lipids.
J. Chem. Theory Comput. 13(9), 4535–4552 (2017)
99. Lindahl, E., Edholm, O.: Mesoscopic undulations and thickness fluctuations in lipid bilayers
from molecular dynamics simulations. Biophys. J. 79(1), 426–433 (2000)
100. Lindahl, E., Edholm, O.: Spatial and energetic-entropic decomposition of surface tension
in lipid bilayers from molecular dynamics simulations. J. Chem. Phys. 113(9), 3882–3893
(2000)
101. Lindahl, E., Edholm, O.: Molecular dynamics simulation of NMR relaxation rates and slow
dynamics in lipid bilayers. J. Chem. Phys. 115(10), 4938–4950 (2001)
102. Lingwood, D., Simons, K.: Lipid rafts as a membrane-organizing principle. Science
327(5961), 46–50 (2010)
103. Luzzati, V., Husson, F.: Structure of liquid-crystalline phases of lipid-water systems. J. Cell
Biol. 12(2), 207 (1962)
104. Lyubartsev, A.P., Rabinovich, A.L.: Recent development in computer simulations of lipid
bilayers. Soft Matter 7(1), 25–39 (2011)
105. Maciejewski, A., Pasenkiewicz-Gierula, M., Cramariuc, O., Vattulainen, I., Rog, T.: Refined
OPLS all-atom force field for saturated phosphatidylcholine bilayers at full hydration. J. Phys.
Chem. B 118(17), 4571–4581 (2014)
106. MacKerell Jr., A.D., Brooks, B., Brooks III, C.L., Nilsson, L., Roux, B., Won, Y., Karplus, M.:
CHARMM: the energy function and its parameterization with an overview of the program. In:
von Rague Schleyer, P. (ed.) Encyclopedia of Computational Chemistry, vol. 2, pp. 271–277.
Wiley (1998)
107. MacKerell, A.D., Bashford, D., Bellott, M., Dunbrack, R.L., Evanseck, J.D., Field, M.J.,
Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera, K., Lau,
F.T.K., Mattos, C., Michnick, S., Ngo, T., Nguyen, D.T., Prodhom, B., Reiher, W.E., Roux,
B., Schlenkrich, M., Smith, J.C., Stote, R., Straub, J., Watanabe, M., Wiorkiewicz-Kuczera,
J., Yin, D., Karplus, M.: All-atom empirical potential for molecular modeling and dynamics
studies of proteins. J. Phys. Chem. B 102(18), 3586–3616 (1998)
108. Mainali, L., Raguz, M., Subczynski, W.K.: Formation of cholesterol bilayer domains precedes
formation of cholesterol crystals in cholesterol/dimyristoylphosphatidylcholine membranes:
EPR and DSC studies. J. Phys. Chem. B 117(30), 8994–9003 (2013)
109. Mark, P., Nilsson, L.: Structure and dynamics of the TIP3P, SPC, and SPC/E water models at
298 K. J. Phys. Chem. A 105(43), 9954–9960 (2001)
110. Markiewicz, M., Baczynski, K., Pasenkiewicz-Gierula, M.: Properties of water hydrating the
galactolipid and phospholipid bilayers: a molecular dynamics simulation study. Acta Biochim.
Pol. 62(3), 475–481 (2015)
111. Markiewicz, M., Pasenkiewicz-Gierula, M.: Comparative model studies of gastric toxicity of
nonsteroidal anti-inflammatory drugs. Langmuir 27(11), 6950–6961 (2011)
112. Marrink, S.J., Berkowitz, M., Berendsen, H.J.C.: Molecular dynamics simulation of a mem-
brane/water interface: the ordering of water and its relation to the hydration force. Langmuir
9(11), 3122–3131 (1993)
113. Marrink, S.J., de Vries, A.H., Tieleman, D.P.: Lipids on the move: simulations of membrane
pores, domains, stalks and curves. Biochim. Biophys. Acta-Biomem. 1788(1), 149–168 (2009)
114. Marrink, S.J., Lindahl, E., Edholm, O., Mark, A.E.: Simulation of the spontaneous aggregation
of phospholipids into bilayers. J. Am. Chem. Soc. 123(35), 8638–8639 (2001)
115. Marrink, S.J., Risselada, H.J., Yefimov, S., Tieleman, D.P., de Vries, A.H.: The MARTINI
force field: coarse grained model for biomolecular simulations. J. Phys. Chem. B 111(27),
7812–7824 (2007)
116. Marsh, D., Smith, I.C.P.: Interacting spin label study of fluidizing and condensing effects of
cholesterol on lecithin bilayers. Biochim. Biophys. Acta 298(2), 133–144 (1973)
117. Martinez-Seara, H., Rog, T., Karttunen, M., Vattulainen, I., Reigada, R.: Cholesterol induces
specific spatial and orientational order in cholesterol/phospholipid membranes. PLoS One 5(6)
(2010)
118. McConnell, H.: Molecular motion in biological membranes. In: Berliner, L. (ed.) Spin Label-
ing: Theory and Applications, pp. 525–561. Academic Press, New York (1976)
119. McIntosh, T.J., Simon, S.A.: Area per molecule and distribution of water in fully hydrated
dilauroylphosphatidylethanolamine bilayers. Biochemistry 25(17), 4948–4952 (1986)
120. Meirovitch, E., Igner, D., Igner, E., Moro, G., Freed, J.H.: Electron-spin relaxation and order-
ing in smectic and supercooled nematic liquid-crystals. J. Chem. Phys. 77(8), 3915–3938
(1982)
121. Meyer, G.R., Gullingsrud, J., Schulten, K., Martinac, B.: Molecular dynamics study of MscL
interactions with a curved lipid bilayer. Biophys. J. 91(5), 1630–1637 (2006)
122. Miao, L., Nielsen, M., Thewalt, J., Ipsen, J.H., Bloom, M., Zuckermann, M.J., Mouritsen,
O.G.: From lanosterol to cholesterol: structural evolution and differential effects on lipid
bilayers. Biophys. J. 82(3), 1429–1444 (2002)
123. Moore, P.B., Lopez, C.F., Klein, M.L.: Dynamical properties of a hydrated lipid bilayer from
a multinanosecond molecular dynamics simulation. Biophys. J. 81(5), 2484–2494 (2001)
124. Mouritsen, O.G.: Life—As a Matter of Fat, The Emerging Science of Lipidomics. Springer-
Verlag, Berlin Heidelberg (2005)
125. Mouritsen, O.G., Jorgensen, K.: Dynamical order and disorder in lipid bilayers. Chem. Phys.
Lipids 73(1–2), 3–25 (1994)
126. Mukhopadhyay, P., Monticelli, L., Tieleman, D.P.: Molecular dynamics simulation of a
palmitoyl-oleoyl phosphatidylserine bilayer with Na+ counterions and NaCl. Biophys. J.
86(3), 1601–1609 (2004)
127. Murzyn, K., Rog, T., Jezierski, G., Takaoka, Y., Pasenkiewicz-Gierula, M.: Effects of phospho-
lipid unsaturation on the membrane/water interface: a molecular simulation study. Biophys.
J. 81(1), 170–183 (2001)
128. Murzyn, K., Rog, T., Pasenkiewicz-Gierula, M.: Phosphatidylethanolamine-
phosphatidylglycerol bilayer as a model of the inner bacterial membrane. Biophys. J.
88(2), 1091–1103 (2005)
129. Murzyn, K., Róg, T., Pasenkiewicz-Gierula, M.: Comparison of the conformation and the
dynamics of saturated and monounsaturated hydrocarbon chains of phosphatidylcholines.
Curr. Top. Biophys. 23(1), 87–94 (1999)
130. Murzyn, K., Zhao, W., Karttunen, M., Kurdziel, M., Rog, T.: Dynamics of water at membrane
surfaces: effect of headgroup structure. Biointerphases 1(3), 98–105 (2006)
131. Nagle, J.F.: Theory of lipid monolayer and bilayer phase-transitions—effect of headgroup
interactions. J. Membr. Biol. 27(3), 233–250 (1976)
132. Nagle, J.F.: Area/lipid of bilayers from NMR. Biophys. J. 64(5), 1476–1481 (1993)
133. Nagle, J.F., Tristram-Nagle, S.: Structure of lipid bilayers. Biochim. Biophys. Acta-Rev.
Biomem. 1469(3), 159–195 (2000)
134. Neria, E., Fischer, S., Karplus, M.: Simulation of activation free energies in molecular systems.
J. Chem. Phys. 105(5), 1902–1921 (1996)
135. Neumann, S., van Meer, G.: Sphingolipid management by an orchestra of lipid transfer pro-
teins. Biol. Chem. 389(11), 1349–1360 (2008)
136. Niemela, P.S., Miettinen, M.S., Monticelli, L., Hammaren, H., Bjelkmar, P., Murtola, T.,
Lindahl, E., Vattulainen, I.: Membrane proteins diffuse as dynamic complexes with lipids. J.
Am. Chem. Soc. 132(22), 7574–7575 (2010)
137. Niemela, P.S., Ollila, S., Hyvonen, M.T., Karttunen, M., Vattulainen, I.: Assessing the nature
of lipid raft membranes. PLoS Comput. Biol. 3(2), 304–312 (2007)
138. Oldfield, E., Meadows, M., Rice, D., Jacobs, R.: Spectroscopic studies of specifically deu-
terium labeled membrane systems. Nuclear magnetic resonance investigation of the effects
of cholesterol in model systems. Biochemistry 17(14), 2727–2740 (1978)
139. Ollila, S., Hyvonen, M.T., Vattulainen, I.: Polyunsaturation in lipid membranes: dynamic
properties and lateral pressure profiles. J. Phys. Chem. B 111(12), 3139–3150 (2007)
140. Orsi, M., Michel, J., Essex, J.W.: Coarse-grain modelling of DMPC and DOPC lipid bilayers.
J. Phys. Condens. Mat. 22(15) (2010)
141. Pandit, S.A., Berkowitz, M.L.: Molecular dynamics simulation of dipalmitoylphos-
phatidylserine bilayer with Na+ counterions. Biophys. J. 82(4), 1818–1827 (2002)
142. Pandit, S.A., Bostick, D., Berkowitz, M.L.: Mixed bilayer containing dipalmitoylphos-
phatidylcholine and dipalmitoylphosphatidylserine: lipid complexation, ion binding, and elec-
trostatics. Biophys. J. 85(5), 3120–3131 (2003)
143. Pandit, S.A., Jakobsson, E., Scott, H.L.: Simulation of the early stages of nano-domain forma-
tion in mixed bilayers of sphingomyelin, cholesterol, and dioleylphosphatidylcholine. Bio-
phys. J. 87(5), 3312–3322 (2004)
144. Pandit, S.A., Scott, H.L.: Multiscale simulations of heterogeneous model membranes.
Biochim. Biophys. Acta-Biomem. 1788(1), 136–148 (2009)
145. Pasenkiewicz-Gierula, M., Baczynski, K., Markiewicz, M., Murzyn, K.: Computer modelling
studies of the bilayer/water interface. Biochim. Biophys. Acta-Biomem. 1858(10), 2305–2321
(2016)
146. Pasenkiewicz-Gierula, M., Rog, T.: Conformations, orientations and time scales characterising
dimyristoylphosphatidylcholine bilayer membrane. Molecular dynamics simulation studies.
Acta Biochim. Pol. 44(3), 607–624 (1997)
147. Pasenkiewicz-Gierula, M., Rog, T., Kitamura, K., Kusumi, A.: Cholesterol effects on the
phosphatidylcholine bilayer polar region: a molecular simulation study. Biophys. J. 78(3),
1376–1389 (2000)
148. Pasenkiewicz-Gierula, M., Subczynski, W.K., Kusumi, A.: Rotational diffusion of a steroid
molecule in phosphatidylcholine-cholesterol membranes: fluid-phase microimmiscibility in
unsaturated phosphatidylcholine-cholesterol membranes. Biochemistry 29(17), 4059–4069
(1990)
149. Pasenkiewicz-Gierula, M., Takaoka, Y., Miyagawa, H., Kitamura, K., Kusumi, A.: Hydrogen
bonding of water to phosphatidylcholine in the membrane as studied by a molecular dynamics
simulation: location, geometry, and lipid-lipid bridging via hydrogen-bonded water. J. Phys.
Chem. A 101(20), 3677–3691 (1997)
150. Pasenkiewicz-Gierula, M., Takaoka, Y., Miyagawa, H., Kitamura, K., Kusumi, A.: Charge
pairing of headgroups in phosphatidylcholine membranes: a molecular dynamics simulation
study. Biophys. J. 76(3), 1228–1240 (1999)
151. Pastor, R.W., Feller, S.E.: Time scales of lipid dynamics and molecular dynamics. In: Merz,
K.M., Roux, B. (eds.) Biological Membranes, a Molecular Perspective from Computation
and Experiment, pp. 3–29. Birkhäuser, Boston (1996)
152. Pastor, R.W., MacKerell, A.D.: Development of the CHARMM force field for lipids. J. Phys.
Chem. Lett. 2(13), 1526–1532 (2011)
153. Patra, M.: Lateral pressure profiles in cholesterol-DPPC bilayers. Eur. Biophys. J. Biophys.
Lett. 35(1), 79–88 (2005)
154. Patra, M., Salonen, E., Terama, E., Vattulainen, I., Faller, R., Lee, B.W., Holopainen, J.,
Karttunen, M.: Under the influence of alcohol: the effect of ethanol and methanol on lipid
bilayers. Biophys. J. 90(4), 1121–1135 (2006)
155. Perozo, E., Rees, D.C.: Structure and mechanism in prokaryotic mechanosensitive channels.
Curr. Opin. Struct. Biol. 13(4), 432–442 (2003)
156. Petersen, N.O., Chan, S.I.: More on motional state of lipid bilayer membranes—interpretation
of order parameters obtained from nuclear magnetic-resonance experiments. Biochemistry
16(12), 2657–2667 (1977)
157. Pike, L.J.: Rafts defined: a report on the keystone symposium on lipid rafts and cell function.
J. Lipid Res. 47(7), 1597–1598 (2006)
158. Plesnar, E., Subczynski, W.K., Pasenkiewicz-Gierula, M.: Saturation with cholesterol
increases vertical order and smoothes the surface of the phosphatidylcholine bilayer: a molec-
ular simulation study. Biochim. Biophys. Acta-Biomem. 1818(3), 520–529 (2012)
159. Plesnar, E., Subczynski, W.K., Pasenkiewicz-Gierula, M.: Is the cholesterol bilayer domain
a barrier to oxygen transport into the eye lens? Biochim. Biophys. Acta-Biomem. 1860,
434–441 (2018)
160. Poger, D., Caron, B., Mark, A.E.: Validating lipid force fields against experimental data:
progress, challenges and perspectives. Biochim. Biophys. Acta-Biomem. 1858(7), 1556–1565
(2016)
161. Poger, D., Mark, A.E.: On the validation of molecular dynamics simulations of saturated and
cis-monounsaturated phosphatidylcholine lipid bilayers: a comparison with experiment. J.
Chem. Theory Comput. 6(1), 325–336 (2010)
162. Ponder, J.W., Case, D.A.: Force fields for protein simulations. Adv. Protein Chem. 66, 27–85
(2003)
163. Poyry, S., Rog, T., Karttunen, M., Vattulainen, I.: Significance of cholesterol methyl groups.
J. Phys. Chem. B 112(10), 2922–2929 (2008)
164. Price, D.J., Brooks, C.L.: A modified TIP3P water potential for simulation with Ewald sum-
mation. J. Chem. Phys. 121(20), 10096–10103 (2004)
165. Rand, R.P., Parsegian, V.A.: Hydration forces between phospholipid-bilayers. Biochim. Bio-
phys. Acta 988(3), 351–376 (1989)
166. Reviakine, I., Brisson, A.: Formation of supported phospholipid bilayers from unilamellar
vesicles investigated by atomic force microscopy. Langmuir 16(4), 1806–1815 (2000)
167. Risselada, H.J., Marrink, S.J.: The molecular face of lipid rafts in model membranes. Proc.
Natl. Acad. Sci. USA 105(45), 17367–17372 (2008)
168. Roark, M., Feller, S.E.: Molecular dynamics simulation study of correlated motions in phos-
pholipid bilayer membranes. J. Phys. Chem. B 113(40), 13229–13234 (2009)
169. Robinson, A.J., Richards, W.G., Thomas, P.J., Hann, M.M.: Head group and chain behavior
in biological-membranes—a molecular-dynamics computer-simulation. Biophys. J. 67(6),
2345–2354 (1994)
170. Robinson, A.J., Richards, W.G., Thomas, P.J., Hann, M.M.: Behavior of cholesterol and its
effect on head group and chain conformations in lipid bilayers—a molecular-dynamics study.
Biophys. J. 68(1), 164–170 (1995)
171. Rog, T., Martinez-Seara, H., Munck, N., Oresic, M., Karttunen, M., Vattulainen, I.: Role of
cardiolipins in the inner mitochondrial membrane: insight gained through atom-scale simu-
lations. J. Phys. Chem. B 113(11), 3413–3422 (2009)
172. Rog, T., Murzyn, K., Gurbiel, R., Takaoka, Y., Kusumi, A., Pasenkiewicz-Gierula, M.: Effects
of phospholipid unsaturation on the bilayer nonpolar region: a molecular simulation study. J.
Lipid Res. 45(2), 326–336 (2004)
173. Rog, T., Murzyn, K., Pasenkiewicz-Gierula, M.: The dynamics of water at the phospholipid
bilayer surface: a molecular dynamics simulation study. Chem. Phys. Lett. 352(5–6), 323–327
(2002)
174. Rog, T., Pasenkiewicz-Gierula, M.: Cholesterol effects on the phosphatidylcholine bilayer
nonpolar region: a molecular simulation study. Biophys. J. 81, 2190–2202 (2001)
175. Rog, T., Pasenkiewicz-Gierula, M.: Cholesterol effects on the phospholipid condensation and
packing in the bilayer: a molecular simulation study. FEBS Lett. 502, 68–71 (2001)
176. Rog, T., Pasenkiewicz-Gierula, M.: Effects of epicholesterol on the phosphatidylcholine
bilayer: a molecular simulation study. Biophys. J. 84(3), 1818–1826 (2003)
177. Rog, T., Pasenkiewicz-Gierula, M.: Non-polar interactions between cholesterol and phospho-
lipids: a molecular dynamics simulation study. Biophys. Chem. 107(2), 151–164 (2004)
178. Rog, T., Pasenkiewicz-Gierula, M.: Cholesterol-sphingomyelin interactions: a molecular
dynamics simulation study. Biophys. J. 91(10), 3756–3767 (2006)
179. Rog, T., Pasenkiewicz-Gierula, M.: Cholesterol effects on a mixed-chain phosphatidylcholine
bilayer: a molecular dynamics simulation study. Biochimie 88(5), 449–460 (2006)
180. Rog, T., Pasenkiewicz-Gierula, M., Vattulainen, I., Karttunen, M.: What happens if choles-
terol is made smoother: importance of methyl substituents in cholesterol ring structure on
phosphatidylcholine-sterol interaction. Biophys. J. 92(10), 3346–3357 (2007)
181. Rog, T., Pasenkiewicz-Gierula, M., Vattulainen, I., Karttunen, M.: Ordering effects of choles-
terol and its analogues. Biochim. Biophys. Acta 1788, 97–121 (2009)
182. Rog, T., Stimson, L.M., Pasenkiewicz-Gierula, M., Vattulainen, I., Karttunen, M.: Replacing
the cholesterol hydroxyl group with the ketone group facilitates sterol flip-flop and promotes
membrane fluidity. J. Phys. Chem. B 112(7), 1946–1952 (2008)
183. Rosso, L., Gould, I.R.: Structure and dynamics of phospholipid bilayers using recently devel-
oped general all-atom force fields. J. Comput. Chem. 29(1), 24–37 (2008)
184. Samanta, S., Hezaveh, S., Milano, G., Roccatano, D.: Diffusion of 1,2-Dimethoxyethane and
1,2-dimethoxypropane through phosphatidycholine bilayers: a molecular dynamics study. J.
Phys. Chem. B 116(17), 5141–5151 (2012)
185. Schuler, L.D., Daura, X., Van Gunsteren, W.F.: An improved GROMOS96 force field for
aliphatic hydrocarbons in the condensed phase. J. Comput. Chem. 22(11), 1205–1218 (2001)
186. Schwille, P., Korlach, J., Webb, W.W.: Fluorescence correlation spectroscopy with single-
molecule sensitivity on cell and model membranes. Cytometry 36(3), 176–182 (1999)
187. Scott, H.L.: Modeling the lipid component of membranes. Curr. Opin. Struct. Biol. 12(4),
495–502 (2002)
188. Shi, Q., Voth, G.A.: Multi-scale modeling of phase separation in mixed lipid bilayers. Biophys.
J. 89(4), 2385–2394 (2005)
189. Shin, Y.K., Ewert, U., Budil, D.E., Freed, J.H.: Microscopic versus macroscopic diffusion
in model membranes by electron-spin-resonance spectral-spatial imaging. Biophys. J. 59(4),
950–957 (1991)
190. Shinoda, W., Shimizu, M., Okazaki, S.: Molecular dynamics study on electrostatic properties
of a lipid bilayer: polarization, electrostatic potential, and the effects on structure and dynamics
of water near the interface. J. Phys. Chem. B 102(34), 6647–6654 (1998)
191. Siu, S.W.I., Pluhackova, K., Bockmann, R.A.: Optimization of the OPLS-AA force field for
long hydrocarbons. J. Chem. Theory Comput. 8(4), 1459–1470 (2012)
192. Smondyrev, A.M., Berkowitz, M.L.: Molecular dynamics simulation of dipalmitoylphos-
phatidylcholine membrane with cholesterol sulfate. Biophys. J. 78(4), 1672–1680 (2000)
193. Smondyrev, A.M., Berkowitz, M.L.: Effects of oxygenated sterol on phospholipid bilayer
properties: a molecular dynamics simulation. Chem. Phys. Lipids 112(1), 31–39 (2001)
194. Soni, S.P., Ward, J.A., Sen, S.E., Feller, S.E., Wassall, S.R.: Effect of trans unsaturation on
molecular organization in a phospholipid membrane. Biochemistry 48(46), 11097–11107
(2009)
195. Stepniewski, M., Bunker, A., Pasenkiewicz-Gierula, M., Karttunen, M., Rog, T.: Effects of
the lipid bilayer phase state on the water membrane interface. J. Phys. Chem. B 114(36),
11784–11792 (2010)
196. Stouch, T.R.: Lipid-membrane structure and dynamics studied by all-atom molecular-
dynamics simulations of hydrated phospholipid-bilayers. Mol. Simulat. 10(2–6), 335–362
(1993)
197. Subczynski, W.K., Hyde, J.S., Kusumi, A.: Effect of alkyl chain unsaturation and cholesterol
intercalation on oxygen transport in membranes: a pulse ESR spin labeling study. Biochem-
istry 30(35), 8578–8590 (1991)
198. Subczynski, W.K., Mainali, L., Raguz, M., O’Brien, W.J.: Organization of lipids in fiber-cell
plasma membranes of the eye lens. Exp. Eye Res. 156, 79–86 (2017)
199. Subczynski, W.K., Wisniewska, A., Yin, J.-J., Hyde, J.S., Kusumi, A.: Hydrophobic barriers of
lipid bilayer membranes formed by reduction of water penetration by alkyl chain unsaturation
and cholesterol. Biochemistry 33, 7670–7681 (1994)
200. Sundaralingam, M.: Molecular structures and conformations of the phospholipids and sphin-
gomyelins. Ann. NY Acad. Sci. 195, 324–355 (1972)
201. Tabony, J., Perly, B.: Quasi-elastic neutron-scattering measurements of fast local translational
diffusion of lipid molecules in phospholipid-bilayers. Biochim. Biophys. Acta 1063(1), 67–72
(1991)
202. Takaoka, Y., Pasenkiewicz-Gierula, M., Miyagawa, H., Kitamura, K., Tamura, Y., Kusumi, A.:
Molecular dynamics generation of nonarbitrary membrane models reveals lipid orientational
correlations. Biophys. J. 79(6), 3118–3138 (2000)
203. Tepper, H.L., Voth, G.A.: Mechanisms of passive ion permeation through lipid bilayers:
insights from simulations. J. Phys. Chem. B 110(42), 21327–21337 (2006)
204. Terama, E., Ollila, O.H.S., Salonen, E., Rowat, A.C., Trandum, C., Westh, P., Patra, M.,
Karttunen, M., Vattulainen, I.: Influence of ethanol on lipid membranes: from lateral pressure
profiles to dynamics and partitioning. J. Phys. Chem. B 112(13), 4131–4139 (2008)
205. Tessier, M.B., DeMarco, M.L., Yongye, A.B., Woods, R.J.: Extension of the GLYCAM06
biomolecular force field to lipids, lipid bilayers and glycolipids. Mol. Simulat. 34(4), 349–363
(2008)
206. Tieleman, D.P., Marrink, S.J., Berendsen, H.J.C.: A computer perspective of membranes:
molecular dynamics studies of lipid bilayer systems. Biochim. Biophys. Acta-Rev. Biomem.
1331(3), 235–270 (1997)
207. Tristram-Nagle, S., Nagle, J.F.: Lipid bilayers: thermodynamics, structure, fluctuations, and
interactions. Chem. Phys. Lipids 127(1), 3–14 (2004)
208. Truscott, R.J.: Age-related nuclear cataract: a lens transport problem. Ophthalmic Res. 32,
185–194 (2000)
209. Tu, K.C., Klein, M.L., Tobias, D.J.: Constant-pressure molecular dynamics investigation of
cholesterol effects in a dipalmitoylphosphatidylcholine bilayer. Biophys. J. 75(5), 2147–2156
(1998)
210. Tuchtenhagen, J., Ziegler, W., Blume, A.: Acyl-chain conformational ordering in
liquid-crystalline bilayers: comparative FT-IR and H-2 NMR studies of phospholipids differing in
headgroup structure and chain-length. Eur. Biophys. J. 23(5), 323–335 (1994)
211. Ulrich, A.S., Volke, F., Watts, A.: The dependence of phospholipid headgroup mobility on
hydration as studied by deuterium-NMR spin-lattice relaxation-time measurements. Chem.
Phys. Lipids 55(1), 61–66 (1990)
212. Vacha, R., Berkowitz, M.L., Jungwirth, P.: Molecular model of a cell plasma membrane with
an asymmetric multicomponent composition: water permeation and ion effects. Biophys. J.
96(11), 4493–4501 (2009)
213. Vainio, S., Jansen, M., Koivusalo, M., Rog, T., Karttunen, M., Vattulainen, I., Ikonen, E.:
Significance of sterol structural specificity—desmosterol cannot replace cholesterol in lipid
rafts. J. Biol. Chem. 281(1), 348–355 (2006)
214. van Gunsteren, W.F., Daura, X., Mark, A.E.: GROMOS force field. In: von Rague Schleyer, P.
(ed.) Encyclopedia of Computational Chemistry, vol. 2, pp. 1211–1216. Wiley (1998)
215. van Meer, G.: Cellular lipidomics. EMBO J. 24(18), 3159–3165 (2005)
216. van Meer, G., Voelker, D.R., Feigenson, G.W.: Membrane lipids: where they are and how they
behave. Nat. Rev. Mol. Cell Biol. 9(2), 112–124 (2008)
217. Vattulainen, I., Rog, T.: Lipid simulations: a perspective on lipids in action. Cold Spring
Harbor Perspect. Biol. 3(4) (2011)
218. Vaz, W.L.C., Almeida, P.F.: Microscopic versus macroscopic diffusion in one-component
fluid phase lipid bilayer-membranes. Biophys. J. 60(6), 1553–1554 (1991)
219. Veatch, S.L., Keller, S.L.: Seeing spots: complex phase behavior in simple membranes.
Biochim. Biophys. Acta-Mol. Cell Res. 1746(3), 172–185 (2005)
220. Vist, M.R., Davis, J.H.: Phase equilibria of cholesterol/dipalmitoylphosphatidylcholine
mixtures: H-2 nuclear magnetic resonance and differential scanning calorimetry.
Biochemistry 29(2), 451–464 (1990)
221. Volkov, V.V., Palmer, D.J., Righini, R.: Heterogeneity of water at the phospholipid membrane
interface. J. Phys. Chem. B 111(6), 1377–1383 (2007)
222. Vollhardt, D.: Effect of unsaturation in fatty acids on the main characteristics of Langmuir
monolayers. J. Phys. Chem. C 111(18), 6805–6812 (2007)
223. White, S.H., Jacobs, R.E., King, G.I.: Partial specific volumes of lipid and water in mixtures
of egg lecithin and water. Biophys. J. 52(4), 663–665 (1987)
224. Widomska, J., Raguz, M., Subczynski, W.K.: Oxygen permeability of the lipid bilayer mem-
brane made of calf lens lipids. Biochim. Biophys. Acta-Biomem. 1768(10), 2635–2645 (2007)
225. Wiener, M.C., White, S.H.: Structure of a fluid dioleoylphosphatidylcholine bilayer
determined by joint refinement of X-ray and neutron-diffraction data. 2. Distribution and packing
of terminal methyl-groups. Biophys. J. 61(2), 428–433 (1992)
226. Wiener, M.C., White, S.H.: Structure of a fluid dioleoylphosphatidylcholine bilayer
determined by joint refinement of X-ray and neutron-diffraction data. 3. Complete structure.
Biophys. J. 61(2), 434–447 (1992)
227. Wilkinson, D.A., Nagle, J.F.: Dilatometry and calorimetry of saturated phos-
phatidylethanolamine dispersions. Biochemistry 20(1), 187–192 (1981)
228. Zhang, Z., Lu, L., Berkowitz, M.L.: Energetics of cholesterol transfer between lipid bilayers.
J. Phys. Chem. B 112(12), 3807–3811 (2008)
229. Zhao, W., Gurtovenko, A.A., Vattulainen, I., Karttunen, M.: Cationic dimyristoylphosphatidylcholine
and dioleoyloxytrimethylammonium propane lipid bilayers: atomistic insight
for structure and dynamics. J. Phys. Chem. B 116(1), 269–276 (2012)
230. Zhao, W., Rog, T., Gurtovenko, A.A., Vattulainen, I., Karttunen, M.: Atomic-scale struc-
ture and electrostatics of anionic palmitoyloleoylphosphatidyl-glycerol lipid bilayers with
Na+ counterions. Biophys. J. 92(4), 1114–1124 (2007)
231. Zhao, W., Rog, T., Gurtovenko, A.A., Vattulainen, I., Karttunen, M.: Role of phosphatidyl-
glycerols in the stability of bacterial membranes. Biochimie 90(6), 930–938 (2008)
Modeling of Membrane Proteins

Dorota Latek, Bartosz Trzaskowski, Szymon Niewieczerzał, Przemysław Miszta,
Krzysztof Młynarczyk, Aleksander Dębiński, Wojciech Puławski, Shuguang Yuan,
Agnieszka Sztyler, Urszula Orzeł, Jakub Jakowiecki and Sławomir Filipek

D. Latek · S. Niewieczerzał · P. Miszta · K. Młynarczyk · A. Dębiński · W. Puławski · A. Sztyler ·
J. Jakowiecki · S. Filipek (corresponding author)
Faculty of Chemistry, University of Warsaw, ul. Pasteura 1, 02-093 Warsaw, Poland
e-mail: sfilipek@chem.uw.edu.pl
D. Latek
e-mail: dlatek@chem.uw.edu.pl
S. Niewieczerzał
e-mail: szniew@gmail.com
P. Miszta
e-mail: pmiszta@chem.uw.edu.pl
K. Młynarczyk
e-mail: kmlynarczyk@chem.uw.edu.pl
A. Dębiński
e-mail: adebinski@chem.uw.edu.pl
W. Puławski
e-mail: woj.pul@gmail.com
A. Sztyler
e-mail: agnieszka_sztyler@student.uw.edu.pl
J. Jakowiecki
e-mail: jjakowiecki@chem.uw.edu.pl
B. Trzaskowski
Centre of New Technologies, University of Warsaw, ul. Banacha 2C, 02-097 Warsaw, Poland
e-mail: b.trzaskowski@cent.uw.edu.pl
S. Yuan
Laboratory of Physical Chemistry of Polymers and Membranes, Ecole Polytechnique Federale de
Lausanne (EPFL), 1015 Lausanne, Switzerland
e-mail: yuanshg@cnbc.uw.edu.pl
S. Yuan
Biological and Chemical Research Centre, University of Warsaw, ul. Zwirki i Wigury 101, 02-089
Warsaw, Poland
U. Orzeł
Applications of Physics in Biology and Medicine, Faculty of Physics, University of Warsaw,
02-089 Warsaw, Poland
e-mail: u.orzel@student.uw.edu.pl

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_12

Abstract Membrane proteins are still the “Wild West” of structural biology.
Although more and more membrane protein structures are being determined, their
functioning is still difficult to investigate because they are fully functional only in
membranous environments. Several specific methodologies have been developed to
investigate various aspects of their cellular life, but these proteins remain challenging
for computational methods. In this chapter we summarize the efforts made to elucidate
the structural and dynamical properties of different types of membrane proteins, with
emphasis on those computational methods that were designed and employed specifically
to study membrane proteins, including their interactions in complex membranous
systems. This chapter was updated in all subsections compared to the 1st edition.

1 Introduction

About 30% of the genes in the human genome encode membrane proteins. These
proteins participate in a large number of normal and abnormal cell processes,
including: (1) transport of ions, water and small solutes via pumps and channels;
(2) signaling via receptors; (3) metabolism via membrane enzymes; (4) entry of
pathogens into cells; (5) programmed cell death; and (6) intercellular structural
interactions. This is why greater attention must be paid to the structures of these
proteins and to how they relate to normal and abnormal function. Crystallization
is the method of choice for generating high-resolution structural models. However,
membrane proteins have both hydrophobic and hydrophilic surfaces, a duality that
makes them more difficult to crystallize than water-soluble proteins. Therefore,
relatively few structures of membrane proteins have been solved at atomic resolution
compared to soluble proteins. In addition, high-resolution structures are important
but not sufficient to understand how membrane proteins (and soluble proteins as well)
function. To explore questions of molecular mechanism, protein-protein interactions,
and others, it is necessary to carry out biochemical, biophysical and also
computational studies that are assisted by structural knowledge. Molecular dynamics
simulations will become increasingly valuable for understanding membrane protein
function, as they can reveal dynamic behavior not seen in static structures.
A significant increase in computational power, in synergy with more efficient
computational methodologies, makes it possible to carry out molecular dynamics
simulations of any structurally known membrane protein in its native environment,
covering timescales of up to 0.1 ms in all-atom simulations. At the frontiers of
membrane protein simulations are receptors, ion channels, aquaporins, passive and
active transporters, and bioenergetic proteins. The membrane environment influences
the function of membrane proteins through electrostatic and steric interactions as
well as through the membrane’s internal pressure. Therefore, the environment needs
to be properly taken into account in simulation studies.
This chapter describes the major methodologies that can be employed in research
on membrane protein structure and function. Quantum methods can be used to
investigate the active sites of membrane enzymes, such as membrane proteases, and to
study their mechanisms of action in detail, much as for soluble enzymes. In contrast,
methods for membrane protein structure prediction must be highly specialized to
account for the specific nature of these proteins and the effect of the membrane.
Structure prediction is usually followed by prediction of the protein’s location in
the membrane, including its individual tilt. Factors such as lipid tension and
hydrophobic mismatch must also be taken into account. Steered molecular dynamics
simulations help to investigate the unfolding of membrane proteins. The stable regions
they uncover, which keep the whole protein stable, provide unique insight into the
balance between intra-protein interactions and protein-lipid interactions. Interactions
between proteins in the membrane lead to the formation of homo- and hetero-oligomers.
Such assemblies can be very important for the proper function of the cell, although
the properties of large protein-lipid rafts are still to be discovered because of their
size. Coarse-grained approaches are used to overcome the space and time limitations
of molecular dynamics simulations. Specific coarse-grained force fields are
successfully used to explain the dynamics of large portions of membrane with embedded
proteins. Implicit solvent methods, on the other hand, provide smooth potentials for
investigating processes inside the membrane as well as at the water-membrane border.
As for soluble proteins, docking methods can be used to place ligands such as
agonists, antagonists and inverse agonists in the binding site of membrane receptors.
After binding, ligands change the shape of the receptor and, through the action of
molecular switches linked together by an extended hydrogen bond network, this change
propagates through the receptor to the other side of the lipid bilayer. In this way,
specifically for membrane proteins, the signal is transmitted from the exterior to the
interior of the cell and can be traced to some extent by simulation methods. Ligands
can reach the binding site either from the aqueous side (as in soluble proteins) or
directly from the membrane, provided they are hydrophobic enough. Little is known
about the folding of membrane proteins, which is markedly different from that of
soluble proteins, and unfortunately computational methods in this area are still at a
very early stage.

2 Classifications of Membrane Proteins

In a common classification, membrane proteins are divided into five main types
depending on their localization in the membrane: type I single-pass transmembrane
proteins with cytoplasmic C-termini, type II single-pass transmembrane proteins with
extracellular C-termini, multipass transmembrane proteins, lipid chain-anchored
proteins and GPI-anchored proteins [1]. Anchored membrane proteins do not span the
membrane like integral proteins, but are attached to it on one side through a
covalently bound lipid or glycosylphosphatidylinositol (GPI), a glycolipid attached
to the C-terminus during post-translational modification. Although not discussed
here, one more important type of membrane protein should be mentioned, namely
peripheral proteins, which bind noncovalently to the surface of the membrane or of a
transmembrane protein. The distinct anisotropic environment of the lipid bilayer leads
to a characteristic amino acid composition of membrane proteins that minimizes the
energy of insertion into the hydrophobic core of the membrane. Consequently, the
structural characteristics of membrane proteins became simplified in the course of
evolution. From the structural point of view, integral polytopic (multipass)
transmembrane proteins, which are of most interest in this chapter, can be classified
into two main groups: transmembrane helical (TMH) bundles and β-barrels [2]. Members
of both groups act as molecular channels (e.g. porins and voltage-gated ion channels
such as potassium channels), transporters (e.g. ABC transporters, ATPases), enzymes or
receptors (e.g. G-protein coupled receptors), and can be involved in electron transfer
during photosynthesis and respiration (e.g. bacteriorhodopsin-like proteins,
cytochromes). Detailed information about the classification, function and in some
cases also the structures of membrane proteins can be found in various databases
accessible online (see Table 1). Worth mentioning is the OPM database, which contains
all representative structures of membrane proteins available to date together with the
calculated position and width of the membrane. The method implemented in OPM for
Positioning of a Protein in a Membrane (PPM) is based on optimization of the free
energy of protein transfer from water to the membrane environment [3].
Table 1 Databases of membrane proteins
Database | Website | Data provided | References
Mptopo | http://blanco.biomol.uci.edu/mptopo | 3D structures and topology | [4]
PDBTM | http://pdbtm.enzim.hu | 3D structures | [5, 6]
MPDB | http://www.mpdb.tcd.ie/index.asp | 3D structures and functional annotation, experimental data | [7]
NaVa | http://nava.liacs.nl | Natural sequence variants of GPCRs | [8]
GLIDA | http://pharminfo.pharm.kyoto-u.ac.jp/services/glida | GPCR ligands | [9]
Mpstruc | http://blanco.biomol.uci.edu/mpstruc | 3D structures | [2]
GPCRRD | http://zhanglab.ccmb.med.umich.edu/GPCRRD | Homology models of GPCRs | [10]
OMPdb | http://www.ompdb.org/ | OMP classification and topologies | [11]
GPCRDB | http://gpcrdb.org/ | Classification, functional annotation, 3D structures, topologies and homology models | [12–14]
GPCR-SSFE 2.0 | http://www.ssfa-7tmr.de/ssfe2/ | Homology models of GPCRs | [15, 16]
IUPHAR | http://www.guidetopharmacology.org/ | 3D structures, classification and functional annotation | [17, 18]
OPM | http://opm.phar.umich.edu | Membrane position | [3]
TCDB | http://www.tcdb.org | Classification of membrane transporters | [19, 20]

Despite the well-established classification system described above, assigning a novel
membrane protein to an existing family and finding its cellular localization is in
many cases not trivial. Small monotopic and bitopic membrane proteins are much more
similar to each other than globular proteins are. The lack of significant structural
differences between folds from different families makes the structural space of
membrane proteins continuous. Consequently, many discrepancies between classification
systems have been reported, e.g. between the SCOP and CATH structural databases [21].
Because many membrane proteins lack a structural representative, they can be classified
only on the basis of sequence, which is certainly more difficult. Many classification
methods employ profile-profile alignment techniques [22]. Others use motif databases
such as PROSITE [23] or TOPDOM [24] to detect important functional motifs typical of a
specific membrane protein family, for example the well-studied GxxxG motif [25]. A more
sophisticated way is to use various heuristic algorithms, e.g. fuzzy K-nearest neighbor
[26], support vector machine [27], instance-based learner [28], the least Mahalanobis
distance algorithm [29] or covariant discriminant analysis [1], which perform
classification using as input a feature-based representation of a protein sequence,
including amino acid composition, hydrophobicity, sequence length and physical
properties of amino acids. Nevertheless, all of the above methods for classifying
membrane proteins and differentiating them from globular proteins implement a common
hypothesis, formulated in the 1980s, that TM proteins should have a special amino acid
composition in which a large fraction of hydrophobic residues favors insertion of the
protein into the membrane, which would otherwise be impossible because of the high
energetic cost of burying polar residues in the non-polar environment [30–32]. And
indeed, on average 70% of the amino acids in the membrane cores of proteins are
hydrophobic [33], and a characteristic feature is a belt of mainly aliphatic amino
acids flanked by two “aromatic girdles” composed of Trp and Tyr facing the lipid
headgroups [34]. The average sensitivity of protein type prediction (membrane vs.
globular) for one of the best predictors, Phobius, is quite high at 99% [35], enabling
reliable genome annotation of sequence data.
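The feature-based representation used by the classifiers discussed above can be illustrated with a minimal sketch. It turns a sequence into a vector of amino acid composition, hydrophobic fraction and length. The function name `sequence_features` and the choice of residues counted as hydrophobic are assumptions made for this illustration only; they are not taken from any of the cited methods, which use richer descriptors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption for illustration: residues treated as hydrophobic.
HYDROPHOBIC = set("AVLIMFWC")


def sequence_features(seq):
    """Return [20 composition fractions, hydrophobic fraction, length]."""
    seq = seq.upper()
    # Ignore characters that are not one of the 20 standard residues.
    counts = Counter(aa for aa in seq if aa in AMINO_ACIDS)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    composition = [counts[aa] / n for aa in AMINO_ACIDS]
    hydrophobic_fraction = sum(counts[aa] for aa in HYDROPHOBIC) / n
    return composition + [hydrophobic_fraction, float(n)]
```

Such a vector could then be passed to any of the classifiers named in the text (k-nearest neighbor, SVM and so on); the classifier itself is deliberately left out of the sketch.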

3 Prediction Methods

3.1 Predictions of Topology of Membrane Proteins

3.1.1 α-Helical Bundles

Apart from the general tools for genome annotation, there are also classifiers that
target specific membrane protein families and their division into classes. For
example, several computational methods have been used to classify members of the GPCR
family, namely phylogenetic analysis (the A–F GPCR classification system [36]; GRAFS,
which adds a Hidden Markov Model-based search [37], see Fig. 1), self-organizing
maps [38], neighbor-joining [39], the unweighted pair group method with arithmetic
mean [40] and multidimensional scaling [41]. A useful hierarchical integration of
various alignment-based and alignment-free classification methods was implemented
in the 7TMRmine web server for discovering 7TMRs (seven transmembrane
region-containing receptors) [42]. Several methods were also developed to identify
β-barrel transmembrane proteins and members of the OMP (outer membrane protein) family;
these use machine-learning methods [43–47] combined with analysis of amino acid
composition [48, 49], sequence profiles, alignment of secondary structure blocks
[50], C-terminal pattern identification [51] or empirical scores [52].
Fig. 1 The phylogenetic tree of GPCRs. Image taken with permission from http://gpcr.scripps.edu

Even more important than the classification of a membrane protein is information
about its topology. The correct topology can be predicted for 70% of all membrane
proteins, mostly by predictors based on Hidden Markov Models (HMMs) (see Table 2).
However, accurate prediction of the start and end of a TM segment still represents a
challenge [34]. Most methods for predicting membrane protein topology target either
transmembrane helical (TMH) proteins or transmembrane β-barrel (TMB) proteins, because
slightly different rules apply in these two cases. For TMH proteins, predictors use
the following rules to distinguish them from globular proteins and to find their
topology [53]. Membrane-spanning helices are 20–30 amino acids long and contain a high
fraction of hydrophobic amino acids. However, one issue concerning the detection of TM
helices based on their hydrophobicity has to be mentioned: other motifs are also highly
hydrophobic, such as signal peptides, signal anchors, amphipathic helices and
re-entrant helices, i.e. helices that enter and exit the membrane on the same side,
e.g. in aquaporins [54]. Filtering out such motifs, e.g. with SignalP [55] or TargetP
[56], prior to TMH topology prediction is certainly beneficial. In some TM topology
predictors, detection of signal peptides or re-entrant regions is already implemented,
e.g. in Phobius and PolyPhobius [57, 58], TOP-MOD [59] and OCTOPUS [60]. Globular
regions between transmembrane helices are relatively short, and the charge distribution
in loops follows the “positive-inside” rule, which states that loops that are not
translocated across the membrane are more positively charged (richer in Lys and Arg)
than those that are translocated [61]. Some membrane proteins have an “inside-out”
topology, which means that they consist of a hydrophilic interior and a hydrophobic
exterior exposed to lipids, e.g. bacteriorhodopsin [62]. However, in most cases the
presence of motifs at helix interfaces together with the hydrogen bonding network
turned out to be more crucial for the stability of membrane proteins than the
hydrophobic effect [63, 64].
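The “positive-inside” rule described above lends itself to a short illustration. The sketch below assumes that the TM segments are already known (as half-open index pairs), counts Lys and Arg in the loops on each side of the membrane, and places the more positively charged side in the cytoplasm. The function name `positive_inside_topology` and the tie-breaking choice are hypothetical; real predictors such as TopPred additionally weight loops by length and combine the rule with hydrophobicity analysis.

```python
def positive_inside_topology(seq, tm_segments):
    """Guess whether the N-terminus is 'in' (cytoplasmic) or 'out'.

    tm_segments is a list of (start, end) pairs, 0-based and half-open,
    sorted along the sequence and assumed not to overlap.
    """
    seq = seq.upper()
    # Loop boundaries: sequence start, every TM start/end, sequence end.
    bounds = [0] + [pos for segment in tm_segments for pos in segment] + [len(seq)]
    # Even positions in `bounds` open a loop, the following position closes it.
    loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
    positives = [loop.count("K") + loop.count("R") for loop in loops]
    # Consecutive loops lie on opposite sides of the membrane, so loops with
    # even index share the side of the N-terminus.
    n_terminal_side = sum(positives[0::2])
    other_side = sum(positives[1::2])
    # Ties are resolved arbitrarily in favour of 'in' (an assumption).
    return "in" if n_terminal_side >= other_side else "out"
```

For a single-pass protein whose N-terminal tail is rich in Lys and Arg, the function reports the N-terminus as cytoplasmic; swapping the charged tail to the C-terminal end reverses the answer.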
The above rules for TMH protein topology prediction were implemented in algorithms
that follow either a statistical or a machine-learning approach. Development of the
former was started by Kyte and Doolittle [30] with a simple predictor of
membrane-spanning helical regions based on calculating the average hydrophobicity
index of the amino acids in a window moving along the protein sequence (a sliding
window). If the average hydrophobicity was above a certain threshold, the current
region was proposed to be a TM helix. In addition to hydrophobicity, the commonly
observed amphiphilicity of TM helices was also taken into account [92]. The
“positive-inside” rule mentioned above was incorporated into TM helix prediction by
von Heijne in TopPred [93]. Later approaches to TM region prediction improved the
definition of the hydrophobicity scale [78], e.g. by adding backbone constraints
related to dehydration of the α-helix and salt-bridge formation [94] or by creating
knowledge-based scales derived from a database limited to membrane proteins [95]. Some
methods used scales based on amino acid properties other than hydrophobicity [96, 97],
such as charge, aromaticity, size, conformational properties and electronic properties
[98], by which TM regions can be described. Such amino acid properties were, for
example, estimated from TM proteins with known topologies, as in TMpred [65].
Combining different scales and properties of amino acids also proved successful, as in
the SPLIT predictor [73] or the SOSUI predictor [69], which is based on the
Kyte-Doolittle hydrophobicity scale, amphiphilicity, relative and net charges and
protein length. An interesting approach implemented in PRED-TMR [70] focused on the
propensities of the terminal amino acids of each TM helix. As in other fragmentary
predictions, such as secondary structure or solvent accessibility prediction, the use
of sequence profiles instead of single protein sequences also improved the prediction
of TM regions [66, 99, 100]. Nevertheless, the lack of close homologs for 20–30% of
membrane proteins (e.g. in the GPCR family) [101] still lowered prediction accuracy
and prompted the development of the DAS (dense-alignment surface) method [68], in which
the alignment of the query to non-homologous membrane proteins, used to predict TM
regions, is improved by a special scoring matrix and by so-called low-stringency dot
plots that represent similarities between segments of a certain length rather than
between whole protein sequences. TM regions can be easily identified by the grid-like
patterns they form on such plots.

Table 2 Web servers for transmembrane topology prediction
Web server | Website | Method used | References
TMH proteins
Tmpred | http://www.ch.embnet.org/software/TMPRED_form.html | Sliding window and positive-inside rule | [65]
PHDhtm | https://www.predictprotein.org/ | NN | [66, 67]
DAS | http://tmdas.bioinfo.se/DAS/index.html | Dense-alignment surface | [68]
SOSUI | http://harrier.nagahama-i-bio.ac.jp/sosui/sosui_submit.html | Sliding window and positive-inside rule | [69]
PRED-TMR/PRED-TMR2 | http://athina.biol.uoa.gr/PRED-TMR2/ | Sliding window and edge detection | [70]
CCTOP | http://cctop.enzim.ttk.mta.hu/ | HMM | [71]
TMHMM/prodiv-TMHMM | http://www.cbs.dtu.dk/services/TMHMM/ | HMM | [72]
SPLIT | http://split4.pmfst.hr/split/4/ | Sliding window and positive-inside rule | [73]
TM-Finder | http://tmfinder.research.sickkids.ca/cgi-bin/TMFinderForm.cgi | Sliding window, hydrophobicity and helicity | [74]
Phobius/poly-Phobius | http://phobius.sbc.su.se | HMM | [58]
MEMSAT3 | http://bioinf.cs.ucl.ac.uk/web_servers | Dynamic programming | [75]
SCAMPI2 | http://scampi.cbr.su.se | Positive-inside rule | [35, 76]
OCTOPUS | http://octopus.cbr.su.se | HMM & NN | [60]
SPOCTOPUS | http://octopus.cbr.su.se | HMM & NN | [77]
MPEx | http://blanco.biomol.uci.edu/mpex | Sliding window and hydrophobicity scales | [78]
TOPCONS | http://topcons.cbr.su.se | HMM, consensus method | [79, 80]
MetaTM | http://metatm.sbc.su.se | SVM, consensus method | [81]
HTM-ONE | http://mizuguchilab.org/netasa/htmone/ | NN & integrated 1D predictions | [82]
TMB proteins
B2TMPRED | http://gpcr.biocomp.unibo.it/cgi/predictors/outer/pred_outercgi.cgi | SVM | [83]
HMM-B2TMR | http://gpcr.biocomp.unibo.it/biodec | HMM | [46]
PRED-TMBB | http://biophysics.biol.uoa.gr/PRED-TMBB | HMM | [84]
TBBPred | http://crdd.osdd.net/raghava/tbbpred/ | NN + SVM | [85]
ConBBPRED | http://bioinformatics.biol.uoa.gr/ConBBPRED | Consensus method: HMM & NN & SVM | [86]
ProfTMB | http://www.predictprotein.org | HMM | [87, 67]
TransFold | http://bioinformatics.bc.edu/clotelab/transFold | Statistical potentials | [88]
TMBpro | http://tmbpro.ics.uci.edu | NN | [89]
BOCTOPUS2 | http://boctopus.cbr.su.se | SVM & HMM | [90, 91]
Abbreviations used: k-NN, k-nearest neighbor algorithm; SVM, support vector machines; NN, neural network; HMM, Hidden Markov Model
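The sliding-window idea of Kyte and Doolittle described in the text can be sketched in a few lines. The hydropathy values below are those of the published Kyte-Doolittle scale; the window of 19 residues and the threshold of 1.6 are commonly quoted defaults and are used here as assumptions, as are the function names. Residues outside the 20 standard amino acids are scored as 0, which is a simplification made for this sketch.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every full window along the sequence."""
    values = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Return half-open (start, end) spans covered by windows above threshold."""
    profile = hydropathy_profile(seq, window)
    segments = []
    start = None
    for i, score in enumerate(profile):
        if score > threshold and start is None:
            start = i  # first window of a hydrophobic stretch
        elif score <= threshold and start is not None:
            # Last qualifying window started at i - 1 and covers `window` residues.
            segments.append((start, i - 1 + window))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1 + window))
    return segments
```

Because every residue of a qualifying window is reported, the spans overshoot the true helix ends by a few residues. This is consistent with the remark in the text that locating the exact start and end of a TM segment remains difficult.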
Not only the description of TM regions but also the algorithms themselves were
improved. Kitsas et al. [102] implemented higher-order statistics in their predictor.
The machine-learning approach was started by Rost [66] with PHDhtm, a predictor
employing a neural network. Later, Hidden Markov Models (HMMs) were used for
prediction of TM helices in HMMTOP [71] and TMHMM [72]. Lio and Vannucci [103]
incorporated wavelets in a TM region predictor, and Nugent and Jones used support
vector machines (SVM) in theirs [104]. Ahmed combined SVM and HMMs with commonly used
rules of TM region prediction, Shen and Chou [100] used a K-nearest neighbor method,
and recently Osmanbeyoglu et al. [105] used an active learning approach. Consensus
methods have also proved efficient in TM topology prediction, e.g. TOPCONS [79] merges
results from OCTOPUS, TMHMM and SCAMPI, and MetaTM [81] derives a consensus TM
prediction from TopPred, PHDhtm, HMMTOP, TMHMM, PolyPhobius and Memsat.
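The principle behind consensus predictors can be illustrated with a per-residue majority vote. This is only a sketch of the general idea under stated assumptions: each predictor is assumed to return a string of equal length with one label per residue (here `i` for inside, `M` for membrane, `o` for outside), and the function name `consensus_topology` is hypothetical. The servers cited above combine their input predictions in more elaborate ways than a plain vote.

```python
from collections import Counter


def consensus_topology(predictions):
    """Majority vote over per-residue topology strings of equal length."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        # most_common(1) returns the label with the highest count; with an
        # odd number of predictors and three labels, ties remain possible
        # and are then resolved by order of first appearance.
        label, _count = Counter(column).most_common(1)[0]
        consensus.append(label)
    return "".join(consensus)
```

A vote taken residue by residue can produce implausibly short membrane segments, so in practice a post-processing step that enforces a minimum helix length would be needed.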
An interesting approach to the prediction of one-dimensional structural features of
TMH proteins was presented recently by Ahmad et al. [82] as the HTM-ONE server.
HTM-ONE is based on a neural network that is trained not on a single structural
feature, e.g. TM topology as in most of the predictors described above, but
simultaneously on a number of features: solvent-accessible surface, dihedral angles,
kink angles of TM helices, contacts between helices and PSSMs (position-specific
scoring matrices).

3.1.2 The β-Barrel Proteins

The number of crystallized TM β-barrel proteins is much lower than that of TMH
proteins. Additionally, the membrane-spanning β-strands are shorter and have a less
distinctive amino acid composition than TM helices [34]. Consequently, topology
prediction is more difficult for TMB proteins. Schultz [106] analyzed β-barrel
membrane proteins and formulated several rules describing their topology. The number
of β-strands is always even, with the N- and C-termini at the periplasmic end of the
barrel. The tilt of the β-strands is around 45 degrees and only one of the two possible
tilt directions is energetically favorable. The shear number of a β-barrel is positive
and around n + 2, where n is the number of β-strands in the barrel. The β-strands are
antiparallel and connected through short turns on the periplasmic side and long loops
with high sequence variability on the external side. These features of β-barrels were
implemented in several algorithms for topology prediction that use HMMs [107, 87],
reported to be the most efficient [86], SVM [108], neural networks [89] or statistical
methods [109]. As in the case of TMH prediction, amino acid composition is taken into
account [74], together with sequence profiles [87] and statistical potentials [88].
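The geometric rules listed above (even strand number, shear number around n + 2, tilt near 45 degrees) can be related to each other with a short calculation. The sketch below assumes the idealized barrel relations tan α = S·a/(n·b) and R = b/(2 sin(π/n) cos α), which are usually attributed to Murzin and co-workers, with typical spacings a = 3.3 Å along a strand and b = 4.4 Å between strands. Neither these relations nor the constants come from this chapter, and the function name `barrel_geometry` is hypothetical.

```python
import math


def barrel_geometry(n_strands, shear=None, a=3.3, b=4.4):
    """Return (tilt in degrees, radius in angstroms) of an idealized barrel.

    a: assumed residue spacing along a strand (angstroms)
    b: assumed spacing between neighbouring strands (angstroms)
    """
    if n_strands < 4 or n_strands % 2 != 0:
        raise ValueError("a TM beta-barrel is expected to have an even strand number")
    if shear is None:
        shear = n_strands + 2  # rule of thumb quoted in the text
    tilt = math.atan(shear * a / (n_strands * b))
    radius = b / (2.0 * math.sin(math.pi / n_strands) * math.cos(tilt))
    return math.degrees(tilt), radius
```

For an eight-stranded barrel with a shear number of 10 these relations give a tilt of roughly 43 degrees, in line with the value of about 45 degrees quoted above; for larger barrels with S = n + 2 the computed tilt becomes smaller.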

3.2 Prediction of Solvent/Lipid Accessible Surface

Prediction of the solvent (or lipid) accessible surface, i.e. of which residues are
buried and which are exposed, provides an additional source of information for
determining the TM topology of a protein and may help to design mutagenesis
experiments aimed at identifying catalytically important TM residues [110]. The
accuracy of burial status predictions is relatively high: above 70% [110, 111] for TM
regions of membrane proteins and 58% for entire membrane proteins [112], which is
comparable to the accuracy achieved for globular proteins. The main difference between
buried and exposed residues in globular proteins is their hydrophobicity, but in
membrane proteins this feature is much less distinctive [34]. The few methods developed
to date that specifically target the accessible surface area (ASA) of membrane proteins
are based on sequence conservation patterns, as exposed residues are assumed to evolve
faster than buried residues. Before the burial status prediction is run, such
conservation patterns can be translated e.g. into a knowledge-based surface propensity
scale, which is highly correlated with other propensity scales for membrane proteins
such as hydrophobicity or hydropathy [113]. Like TM topology predictors, burial
predictors use sequence profiles generated by BLAST and PSI-BLAST together with support
vector machines [110, 111]. Different conservation patterns in TM and globular regions
of membrane proteins were taken into account in MPRAP [112], a web server that predicts
buried and exposed residues for entire membrane proteins. This unified prediction is
possible because the SVM was optimized beforehand with information about the location
of residues with respect to the membrane.
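The assumption that exposed residues evolve faster than buried ones can be illustrated with a toy calculation of per-column conservation. The sketch below scores each column of a multiple sequence alignment by its Shannon entropy and labels low-entropy (conserved) columns as buried. The function names and the entropy cutoff of 1.0 bit are hypothetical choices for this illustration; the cited predictors feed sequence profiles into trained SVMs rather than applying a fixed cutoff.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, gaps ignored."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def predict_burial(alignment, cutoff=1.0):
    """Label each column 'buried' (conserved) or 'exposed' (variable).

    alignment: list of equal-length aligned sequences, query first.
    cutoff: assumed entropy threshold in bits, for illustration only.
    """
    entropies = [column_entropy(column) for column in zip(*alignment)]
    return ["buried" if h < cutoff else "exposed" for h in entropies]
```

Note that a column consisting only of gaps receives an entropy of zero and would therefore be labeled as buried; a real implementation would need to treat such columns separately.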

3.3 Kink and Contact Predictions

The lack of reliable algorithms that mimic the folding of membrane proteins in silico,
together with sparse structural information from crystallographic studies, has prompted
the development of methods that extract a more fine-grained description of membrane
proteins than a simple definition of their topology. Several additional features have
been predicted from membrane protein sequences: kinks of TM helices, the location of
re-entrant regions [59] (where a protein fragment enters and exits the membrane on the
same side, a common feature of ion and water channel proteins) and, finally,
interfacial residues in the TM core. The key element in the detection of TMH kinks is
the presence of proline at a particular position of a TM helix, either in the query
sequence or in a significant fraction of its close homologs [114–116]. Recently,
Kneissl et al. [117] reported a new kink predictor that includes ASA predictions and
statistics of Ser and Gly occurrences in kinks. Early methods for contact prediction
were based on correlated mutation analysis (CMA) [118], which assumes that residues
close in space mutate in tandem. Additional information about predicted secondary
structure, solvent accessibility and homologous proteins, together with advanced
machine-learning algorithms, improved the rather weak performance of CMA-based methods
and made it possible to use them not only in large-scale globular protein structure
prediction [119] but also in GPCR structure prediction [120]. In the latter case only a
simple sequence conservation filter was used. This shows that, because of the relative
structural simplicity that the lipid bilayer imposes on membrane proteins compared with
globular ones, contact prediction requires less sophisticated algorithms, e.g. based
only on CMA, which still yield quite high prediction accuracy [121]. Although contact
predictors that specifically target membrane proteins are less common, several attempts
have been made in this field. The methods developed use factors similar to those used
for globular proteins: sequence conservation and CMA [121, 122], packing motifs of TM
helices and β-strands [123], either structural (‘knob-into-hole’ and
‘ridge-into-groove’ [124]) or sequential [123, 125], amino acid propensities [126], and
evolutionary [127] and knowledge-based data [128]. The distinct packing of TM helices
is crucial for interface contact prediction, since such interactions are mainly
accomplished by weakly polar amino acids that create contacts at every fourth residue
of a helix in TM channels, or by large polar amino acids at every 3.5th residue of a
helix in TM receptors and membrane-integral redox proteins. The former type of contact
was named a right-handed interaction, because the interacting residues are placed so
that they form a right-handed curve when viewed along the main axis of the helix; the
latter was correspondingly named a left-handed interaction [129]. Detection of both
right- and left-handed interactions in contact prediction was implemented e.g. in the
RHYTHM server [129, 130].
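The idea behind correlated mutation analysis, that residues close in space tend to mutate together, can be sketched with a simple covariation score. The example below uses mutual information between alignment columns as that score; this is one possible measure chosen for illustration and is not necessarily the correlation used in the cited CMA methods. The function names and the minimum sequence separation of four residues are likewise assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns, gaps skipped."""
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    count_ij = Counter(pairs)
    # sum over observed pairs of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
        for (a, b), c in count_ij.items()
    )


def top_covarying_pairs(alignment, min_separation=4, top=5):
    """Rank column pairs (0-based indices) by mutual information."""
    columns = list(zip(*alignment))
    scored = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
        if j - i >= min_separation
    ]
    scored.sort(reverse=True)
    return [(i, j, score) for score, i, j in scored[:top]]
```

With very few sequences in the alignment, mutual information is dominated by sampling noise and by phylogenetic relatedness, which is one reason why the additional filters mentioned in the text improve on plain covariation scores.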
Prediction of kinks in TM helices, together with prediction of other structural
deformations such as bulges or constrictions, is an important issue in GPCR structure
modeling. Such distinct structural features can be crucial, e.g. for GPCR ligand
selectivity [131]. Two recently updated web services for GPCR structure modeling,
GPCR-SSFE 2.0 [16] and GPCRDB [14], use structural fingerprint features such as kinks
or bulges to search for the best template for model building.

4 3D Structure Predictions and Modeling

Attempts at tertiary structure prediction are even more problematic for membrane
proteins than for globular proteins, since the number of membrane protein structures
deposited in the PDB is substantially smaller. Thus comparative modeling, the most
common approach to structure prediction, is severely hampered for membrane proteins.
On the other hand, de novo methods developed for globular proteins assume a polar
solvent around the protein and thus can hardly be used for proteins embedded in a
specific anisotropic membrane environment. Empirical force fields designed to simulate
the behavior of membrane proteins are used mostly in molecular dynamics, which models
biological systems on time scales much shorter than protein folding and cannot be used
for structure prediction. Coarse-grained force fields combined with a Monte Carlo
algorithm, which have made it possible to predict the folding of at least small
globular proteins [132], are very rare for membrane proteins (Rosetta-membrane [133]
and HBMPs [134] are notable exceptions). For that reason, recent attempts by Ueno et
al. [135] to develop a coarse-grained algorithm for folding TM helices into a shape
derived from a low-resolution electron microscopy image will certainly attract the
interest of the research community. Despite these obvious hindrances to structure
modeling of membrane proteins, several attempts have been made at either template-based
or de novo modeling (see Table 3), as knowledge of the 3D structure is crucial not only
in the drug discovery process but also for reliable classification of members of
membrane protein families [136].

4.1 Comparative Modeling

The first step in comparative modeling is the choice of a template structure (or
structures) and generation of the target-template alignment. Apart from the similarity
between target and template sequences, the biological context should also be taken into
account, e.g., the expected activation state of the modeled structure in the case of
membrane receptors (GPCRs) [145], similar structural fingerprints such as kinks or
bulges [16], and coverage of functionally important sequence motifs [131]. Since
classification of membrane proteins into families is not always straightforward (see
above), an extensive search for close homologs should be performed prior to structure
prediction by comparative modeling [136], e.g., using an algorithm based on Hidden
Markov Models as in SSFE [15]. Standard scoring matrices such as BLOSUM and PAM, used to
align protein sequences, were derived mostly from globular proteins and do not take into
account the different sequence conservation patterns observed in membrane proteins. The
distinct evolutionary divergence of membrane proteins, high in loops and low in TM
regions, was taken into account in new substitution matrices for TM helical proteins,
JTT [75], PHAT [146] and SLIM [147], and also for β-barrels [148]. Using those
membrane-specific substitution matrices improves sequence alignment in many cases
[149, 150]; however, attempts to use them only for TM regions together with, e.g., a
standard BLOSUM matrix for scoring the alignment of loop regions (so-called bipartite
alignments) were not always successful [151]. A simple increase of the gap cost for TM
regions, with those regions aligned separately from the rest of the protein, seems to be
more beneficial even without switching to a membrane-specific matrix, as was first shown
by Shafrir and Guy [152]. Such detection of a TM core, included in the alignment
generation through a more restrictive gap treatment and a membrane-specific substitution
score, was recently implemented in the Medeller software [141]. Another approach to
target-template alignment for membrane proteins is anchored realignment [145], which
preserves important functional motifs of membrane proteins and the integrity of template
TM helices (only one-residue gaps in the alignment are allowed [153]) with only slight
intervention into the original non-anchored alignment. Another interesting solution is
the incorporation of hydropathy profiles into the alignment, as in the AlignME software
[154]. Undoubtedly, a target-template alignment derived from a profile-profile alignment
of homologous sequences is much more accurate, even if no membrane protein-specific
substitution matrix is used, as in GPCRM [142]; it is one of the most efficient methods
used in comparative modeling for various protein families [155].

Table 3 Web servers and stand-alone applications targeting structure prediction of membrane proteins
Name | Website | Method | References
Interface/contact predictors
HelixCorr | http://webclu.bio.wzw.tum.de/helixcorr | Consensus method and CMA | [121]
RHYTHM | http://proteinformatics.charite.de/rhythm | PSSM and secondary structure prediction and sequence conservation | [130]
Full 3D model predictors
Rosetta-membrane and Rosetta Broker | http://www.rosettacommons.org | Fragment-assembly and membrane proteins-based statistical potentials | [133, 137]
BCL::MP-Fold | http://www.meilerlab.org/bclcommons | Fragment-assembly and membrane proteins-based statistical potentials | [138]
FILM3 | http://bioinf.cs.ucl.ac.uk/introduction | Fragment-assembly based on the Fragfold method | [139]
ModWeb | https://modbase.compbio.ucsf.edu/scgi/modweb.cgi | Comparative modeling by Modeller | [140]
Medeller | http://opig.stats.ox.ac.uk/webapps/medeller/ | TM core detection in the alignment generation | [141]
Predictors targeting specific families
GPCRM | http://gpcrm.biomodellab.eu | Comparative modeling by Modeller and Rosetta; multiple templates and profile-profile alignment | [142]
GPCR-SSFE 2.0 | http://www.ssfa-7tmr.de/ssfe2/ | Comparative modeling by Modeller | [15, 16]
GPCR-ModSim | http://open.gpcr-modsim.org/ | Comparative modeling by Modeller with identification of structural fingerprint features | [143]
GPCR-I-TASSER | https://zhanglab.ccmb.med.umich.edu/GPCR-I-TASSER/ | Comparative modeling by I-TASSER threading method | [10]
GoMoDo | http://molsim.sci.univr.it/cgi-bin/cona/begin.php | Comparative modeling by Modeller and docking by Autodock VINA | [144]
Abbreviations used:
SVM Support vector machines
CMA Correlated mutations analysis
OMP Outer membrane proteins
REMC Replica Exchange Monte Carlo
PSSM Position-specific scoring matrix

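The observation that a higher gap cost inside TM regions can help even without a
membrane-specific matrix can be illustrated with a minimal sketch. The code below is a
plain global (Needleman-Wunsch type) alignment with a position-dependent linear gap cost;
it is not the procedure of Shafrir and Guy or of Medeller. The function name, the
match/mismatch scores and the two gap costs are illustrative assumptions.

```python
def tm_aware_align(target, template, tm_mask,
                   match=2, mismatch=-1, gap_loop=-2, gap_tm=-8):
    """Global alignment with a higher gap cost inside template TM helices.

    tm_mask[j] is True when template residue j belongs to a TM helix.
    All scores are hypothetical values chosen for illustration only.
    """
    n, m = len(target), len(template)

    def del_cost(j):
        # Cost of leaving template residue j (0-based) unaligned.
        return gap_tm if tm_mask[j] else gap_loop

    def ins_cost(j):
        # Cost of inserting a target residue between template residues j-1 and j;
        # the insertion is "inside" a helix only if both neighbours are TM.
        inside = 0 < j < m and tm_mask[j - 1] and tm_mask[j]
        return gap_tm if inside else gap_loop

    score = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + del_cost(j - 1)
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + ins_cost(0)
        for j in range(1, m + 1):
            sub = match if target[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i][j - 1] + del_cost(j - 1),
                              score[i - 1][j] + ins_cost(j))

    # Traceback from the bottom-right corner to recover one optimal alignment.
    aln_target, aln_template = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if target[i - 1] == template[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aln_target.append(target[i - 1])
                aln_template.append(template[j - 1])
                i -= 1
                j -= 1
                continue
        if j > 0 and score[i][j] == score[i][j - 1] + del_cost(j - 1):
            aln_target.append("-")
            aln_template.append(template[j - 1])
            j -= 1
        else:
            aln_target.append(target[i - 1])
            aln_template.append("-")
            i -= 1
    return score[n][m], "".join(reversed(aln_target)), "".join(reversed(aln_template))


# Toy case: the sequences alone cannot decide where the single gap belongs,
# but the TM mask pushes it into the two-residue loop (template positions 3-4).
mask = [True] * 3 + [False] * 2 + [True] * 3
print(tm_aware_align("LLLLLLL", "LLLLLLLL", mask))
```

In this toy case every residue is identical, so a uniform gap cost would leave the gap
position arbitrary; the TM-dependent cost is the only information that places it in the
loop.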
The model building step in comparative modeling of transmembrane proteins is usually
performed by Modeller [153], which creates a 3D protein model by satisfying spatial
restraints derived from a template structure and minimizes it in an all-atom force field
based on CHARMM [156]. Nevertheless, other methods can also be used [153], such as
Yasara [157] (e.g., in GPCRDB), Swiss-model [158], Rosetta [133, 159], ITASSER [160] or
the recently published Medeller [141], a program based on Modeller in which the
target-template alignment generation is improved by detection of the transmembrane core.
In general, the abovementioned methods preserve the template structure; however, some
modifications can also be introduced. For example, a GPCR model can be built by joining
helices from different template structures (GPCR-SSFE 2.0, GPCRM, GPCRDB) or as a
sequence similarity-dependent weighted average of a few templates (GPCRM). A large-scale
movement of a selected helix in a given template structure, e.g., to reconstruct an
allosteric binding site, is also possible (Rosetta Broker) [161].
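The idea of a similarity-weighted average of several templates can be sketched as follows.
This is only an illustration, not the GPCRM implementation: it assumes that the templates
are already superimposed and have equivalent atoms in the same order, and that the weights
are simply proportional to the target-template sequence identity. The function name and
the input values are hypothetical.

```python
def weighted_average_coords(templates, identities):
    """Average pre-superimposed template coordinates with identity-based weights.

    templates  -- list of coordinate lists, one (x, y, z) tuple per equivalent atom
    identities -- target-template sequence identities (any positive scale)
    """
    total = float(sum(identities))
    weights = [s / total for s in identities]
    n_atoms = len(templates[0])
    averaged = []
    for k in range(n_atoms):
        averaged.append(tuple(
            sum(w * t[k][axis] for w, t in zip(weights, templates))
            for axis in range(3)))
    return averaged


# Two hypothetical templates with 60 % and 20 % identity to the target:
template_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template_b = [(4.0, 0.0, 0.0), (5.0, 4.0, 0.0)]
print(weighted_average_coords([template_a, template_b], [60, 20]))
```

The closer homolog dominates the averaged coordinates, which is the intended effect of a
similarity-dependent weighting; in practice the averaging would act on spatial restraints
rather than on raw coordinates.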

4.2 Modeling of Loops

Since the model building procedure hardly ever takes into account that the distribution
of amino acid rotamers in the membrane differs from that in the polar environment of
globular proteins, even a short minimization in an implicit or explicit membrane
environment improves the local accuracy of the final protein model [153, 162]. Performing
a molecular dynamics simulation in a membrane at least as long as the protein relaxation
time before, e.g., a docking procedure is undoubtedly more beneficial, but it requires a
significant amount of computational resources and can be skipped in many cases when
experimental data confirm the reliability of the generated models [153, 163].
More crucial than model refinement in a membrane-like environment is a reliable
refinement of loops, especially in the binding site area. The accuracy of such refinement
greatly depends on the position of the loop anchoring residues in a given homology model
[164]. Many methods for membrane protein modeling use the loop-modeling procedure
implemented in Modeller, which includes statistical potentials (a DOPE score) [165] and
can be characterized as a fragment-based method, like the SuperLooper web server, which
is based on a database of protein fragments [166]. Less popular, but of equal performance
[167], is another fragment-based method implemented in Rosetta, i.e., a cyclic coordinate
descent algorithm [168]. The less optimal treatment of disulfide bonds in Rosetta
applications, compared with the efficient disulfide patch in Modeller (based either on
the template’s local geometry or on general rules of stereochemistry and the CHARMM force
field), slightly favors the latter approach [145]. This matters because disulfide bonds
are very common in membrane proteins, e.g., in extracellular loop 2 (EC2) of GPCRs. In
both Modeller and Rosetta, secondary structure predictions can be used during loop
modeling, which improves performance especially for long loops (more than 10 amino
acids). As for de novo methods useful in modeling long loops and the N- or C-termini of
membrane proteins, successful results were obtained with the CABS method [169] in the
case of GPCR models, the Rosetta kinematic closure algorithm [170] and PLOP, a dihedral
angle search procedure with the all-atom OPLS-AA force field energy function and a
Generalized Born implicit solvent model, which was implemented commercially as Prime
(Schrödinger, LLC) [171].
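The principle of cyclic coordinate descent (CCD) mentioned above can be shown on a
deliberately simplified problem. The sketch below closes a planar chain of rigid links
onto a target point by adjusting one joint angle at a time; the Rosetta implementation
works on backbone torsion angles in 3D and is considerably more elaborate. All names,
link lengths and the target position are illustrative assumptions.

```python
import math


def forward(lengths, angles):
    """Joint positions of a planar chain rooted at the origin (relative joint angles)."""
    points = [(0.0, 0.0)]
    heading = 0.0
    for length, angle in zip(lengths, angles):
        heading += angle
        x, y = points[-1]
        points.append((x + length * math.cos(heading),
                       y + length * math.sin(heading)))
    return points


def ccd_close(lengths, angles, target, tol=1e-3, max_sweeps=200):
    """Rotate one joint at a time so that the chain end approaches the target."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for k in reversed(range(len(angles))):
            points = forward(lengths, angles)
            end = points[-1]
            if math.dist(end, target) < tol:
                return angles, math.dist(end, target)
            pivot = points[k]
            # Rotation about joint k that aligns pivot->end with pivot->target;
            # this minimizes the end-to-target distance with respect to angle k.
            a_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            a_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            angles[k] += a_target - a_end
    end = forward(lengths, angles)[-1]
    return angles, math.dist(end, target)


# Hypothetical four-link "loop" that has to reach the anchor at (2.0, 1.5).
closed_angles, gap = ccd_close([1.0] * 4, [0.3] * 4, (2.0, 1.5))
print(gap)
```

Each single-angle update has a closed-form solution, which is what makes CCD fast; the
price is that the method only guarantees closure, not a physically favorable loop, so the
closed conformations still have to be scored.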

4.3 Assessment of Protein Models

As in structure prediction of globular proteins, the selection of the final, most
probable model of a protein is an important step. Yet only a few MQAPs (Model Quality
Assessment Programs) have been developed specifically for membrane proteins: the IQ
method [172], based on the analysis of four types of inter-residue interactions
(hydrophobic interactions, hydrogen bonds, ionic bonds, and disulfide bonds) within the
transmembrane domains, and ProQM [173], which uses support vector machines trained on
structural features of membrane proteins, such as inter-atomic and inter-residue
contacts, solvent-accessible surfaces, secondary structure, topology of the TM region and
a Z-coordinate (describing the position of residues with respect to the membrane center),
combined with evolutionary information (profiles and sequence conservation). MQAPs
developed for globular proteins perform much worse on membrane proteins due to
significant differences in amino acid propensity, packing density, and side-chain rotamer
frequencies between soluble and membrane proteins [174]. As an alternative to MQAPs,
membrane protein models can be assessed successfully by their stability during molecular
dynamics simulations [175] or by scoring functions provided by model building programs,
even those lacking a representation of the lipid bilayer [174], e.g., the Rosetta total
energy [145] or low-resolution energy function [173], or the Modeller DOPE score
[143, 145]. Progress in structure determination of membrane proteins has enabled the use
of statistical potentials for scoring models, e.g., by BCL::Score [176, 177]. The choice
of the most suitable model quality assessment method depends on the purpose. For example,
in GPCR modeling aimed at drug discovery, a ligand-based approach, in which the
interactions with known binding ligands are used in the model assessment, is believed to
be the most beneficial [177–180].
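The Z-coordinate feature mentioned above is simple to compute once the membrane plane is
known. The sketch below is not taken from ProQM; it assumes that the membrane center and
normal are given (e.g., from a pre-oriented structure) and uses an assumed hydrophobic
core half-thickness of 15 Å purely for illustration.

```python
import math


def z_coordinates(coords, membrane_center, membrane_normal):
    """Signed distance of each point from the membrane mid-plane (input units)."""
    length = math.sqrt(sum(c * c for c in membrane_normal))
    unit = [c / length for c in membrane_normal]
    return [sum((p[i] - membrane_center[i]) * unit[i] for i in range(3))
            for p in coords]


def in_hydrophobic_core(z, half_thickness=15.0):
    # half_thickness (in Å) is an assumed value, not a parameter from the text.
    return abs(z) <= half_thickness


# Hypothetical C-alpha positions with the membrane normal along the z axis.
ca_atoms = [(5.0, 5.0, 4.0), (0.0, 0.0, -20.0)]
for z in z_coordinates(ca_atoms, (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)):
    print(z, in_hydrophobic_core(z))
```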

4.4 De Novo Modeling

Since the number of crystal structures of membrane proteins in the PDB is limited,
comparative modeling frequently does not provide protein models that can be confirmed by
experimental data, e.g., in the case of early rhodopsin-based models of GPCRs [181] or
hERG channels [162]. Consequently, de novo methods for membrane protein structure
modeling are of great interest. Methods used for globular proteins can still be used in
some cases for membrane proteins, provided that the solvent-related components of the
force field are adjusted, e.g., in Rosetta-membrane (or Rosetta Broker). Rosetta-membrane
employs statistical potentials derived from the known 3D structures of membrane proteins,
which take into account two types of environment: polar and hydrophobic [133]. The TM
topology predicted by dedicated servers should be supplied during the modeling procedure.
The performance of Rosetta-membrane is comparable with the performance of Rosetta in
de novo modeling of globular proteins as long as the membrane protein is smaller than 150
amino acids [182]. Unfortunately, most membrane proteins of interest are longer than 200
residues, and thus at least a limited set of constraints on the packing of structural
elements has to be incorporated during Rosetta-membrane folding [183]. Tertiary
restraints derived from the template structure are also needed for the CHARMM-based
hierarchical approach using an implicit membrane in the foldGPCR tool [184].
Nevertheless, a few groups have developed their own membrane protein-specific de novo
tools, i.e., GEnSeMBLE [185] and PREDICT [186], both of which target members of the GPCR
family that are 300 or more residues long. The latter approach is based on sampling a
reduced space of TM helices represented as discs on a 2D plane. The former, more
realistic approach is based on the BiHelix algorithm [187] and its ancestor Membstruk
[188], which sample the space of helix orientation angles (a tilt angle θ, a sweep angle
φ and a rotation angle η) in a homology-based starting model. Since the energy
calculation for all possible combinations of 7 TM helices is computationally expensive,
the 7-helix bundle is split into pairs of interacting helices in the first step and
reassembled only from the low-energy conformations [187]. A recently published de novo
algorithm [134] based on the Replica Exchange Monte Carlo (REMC) method also samples TMH
orientation angles, but with a reduced representation of amino acids: C-alpha atoms
joined with united side chains. The lowest-energy model is refined by all-atom molecular
dynamics in the AMBER9 force field. The idea of rotating TM helices with respect to
template structures proved its relevance during the GPCRDock 2010 competition [153], in
which reliable model generation for the chemokine receptor CXCR4 required a ~100°
rotation of a part of TM2 with respect to the template. Such a rotation could also be
obtained by introducing a certain gap into the target-template alignment [145, 153].
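A short calculation shows why splitting the helix bundle into pairs pays off. The numbers
used below are assumptions made for illustration: only the rotation angle η is sampled,
in 30° steps (12 values per helix), and the bundle is assumed to contain 12 pairs of
interacting helices; the actual settings of a given method may differ.

```python
def full_enumeration(n_helices, n_angles):
    """Number of bundle conformations when all helices are rotated simultaneously."""
    return n_angles ** n_helices


def pairwise_enumeration(n_pairs, n_angles):
    """Number of energy evaluations when each interacting pair is sampled alone."""
    return n_pairs * n_angles ** 2


n_angles = 360 // 30                        # 12 rotation values per helix (assumed grid)
print(full_enumeration(7, n_angles))        # 35831808 bundle conformations
print(pairwise_enumeration(12, n_angles))   # 1728 pairwise evaluations
```

The pairwise step reduces the cost by more than four orders of magnitude, and the full
bundle is then rebuilt only from the combinations of low-energy pair conformations.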

4.5 Web Servers for 3D Structure Predictions

Several methods for comparative and de novo modeling of membrane proteins have been
developed to date (see Table 3), some of them in the form of web servers, which is the
most convenient form for researchers. Most of them target the GPCR family, for which only
a few structures are available in the PDB despite the great interest from the
pharmaceutical industry. Apart from the web servers, precomputed 3D models of membrane
proteins with unknown crystal structure can be accessed in various databases, e.g.,
GPCRDB (all human non-olfactory GPCRs in inactive, intermediate and active states, built
using a main template and alternative local templates) [14], GPCRRD (ITASSER-generated
models) [10], ModBase (Modeller-generated comparative models) [140] and GPCR-SSFE 2.0
(Modeller-generated models) [15, 16]. Critical assessment of the available structure
modeling methods targeting membrane proteins is still limited, due to the small number of
membrane protein structures in the PDB and the rarity of new depositions.

4.6 Modeling of a Ligand Binding Site

Membrane protein structure prediction still requires the development of new methods, or
at least the adjustment of methods already developed for globular proteins. Consequently,
the importance of human intervention in the prediction and of the use of consistent
experimental data cannot be overestimated [153]. As the main aim of protein structure
modeling is the development of new drugs, a ligand-guided approach, in which protein
models are built and selected based on docking information for a ligand (or for multiple
ligands, i.e., a pharmacophore) [153], seems a notable solution. Another problem in this
area is computational support for studying the allosteric effect, i.e., the binding of
ligands that cause structural changes at other sites of the protein, which is observed,
e.g., in class C of the GPCR family [189]. Allosteric drugs seem to have fewer side
effects because they bind to non-orthosteric sites in proteins, and consequently they are
of great interest to the pharmaceutical industry [190]. Recent studies combine molecular
dynamics simulations with experimental data to study allostery in GPCRs; however, more
efficient computational methods for sampling loop conformations in the presence of
ligands are undoubtedly still needed [191].

5 Docking Methods

In the field of molecular modeling, docking is a method for predicting the preferred
orientation of one molecule relative to a second when they are bound to each other to
form a stable complex. Knowledge of the preferred orientation may in turn be used to
predict the strength of association, or binding affinity, between the two molecules
using, for example, scoring functions. In the 1970s, complex modeling revolved around
manually identifying features on the surfaces of the interacting molecules and
interpreting the consequences for binding, function and activity. Computer programs were
typically used at the end of the modeling process, to discriminate between the relatively
few orientations that remained after the heuristic constraints had been imposed.
Computers were first employed in a study on hemoglobin interactions in sickle-cell fibers
by Levinthal et al. [192].
Molecular docking can be thought of as a “lock-and-key” problem, in which one is
interested in finding the correct relative orientation of the “key” that will open the
“lock” (and where on the surface of the lock the keyhole is). Here, the protein can be
thought of as the “lock” and the ligand as the “key”. Molecular docking may also be
defined as an optimization problem that describes the “best-fit” orientation of a ligand
binding to a particular protein of interest (Fig. 2).

5.1 Preparations for Docking

Three questions should be answered before a docking experiment. The first question is:
is there an experimental structure of the protein I want to use as the target during
docking? To answer this question, it is necessary to check the PDB (www.pdb.org) and
download the corresponding structure. If no 3D structure of the receptor is available,
extensive structure prediction studies should be performed, preferably followed by
experimental studies confirming the reliability of the obtained protein model.
The second question is: where could my ligand be docked? The binding site can be
determined based on experimental data such as mutagenesis. If a receptor loses its ligand
binding ability after mutation of certain amino acids, those residues are most probably
close to or inside the binding site. If experimental data are lacking, the binding site
can be predicted based on the geometry or electrostatics of the protein surface. Several
binding pocket prediction tools are listed in Table 4.
The third question is: how can a 3D structure of the ligand, including its total and
partial charges, be obtained? To build such a 3D structure one can use many stand-alone
applications as well as databases with online access (see Table 5).

Fig. 2 Formyl peptide fMLF docked to model of FPR1 (Formyl Peptide Receptor 1)

Table 4 The binding pocket finding tools
Tool | Method | Webpage | References
Fuzzy-oil-drop | Distribution of hydrophobicity | www.bioinformatics.cm-uj.krakow.pl/activesite | [193]
PLB | Amino acid composition | – | [194]
LigProf | Transfer ligand annotation from PDB bank | www.cropnet.pl/ligprof | [195]
PLB-SAVE | Based on geometric features | http://save.cs.ntou.edu.tw | [196]
Protemot | Based on features of PDB bank | http://protemot.csbb.ntu.edu.tw | [197]
CASTp | Weighted Delaunay triangulation and the alpha complex for shape measurements | http://sts.bioe.uic.edu/castp/ | [198]
MEDock | Global search | http://medock.csbb.ntu.edu.tw | [199]
PASS | Characterize regions of buried volume | www.ccl.net/cca/software/UNIX/pass/overview.shtml | [200]

5.2 Conformational Search Algorithm

In many cases the lowest-energy conformation of a ligand, downloaded from a database or
produced by standard tools, is not sufficient for docking purposes because the ligand is
flexible and adapts while fitting to the receptor binding site. For that reason, all
docking programs have the following features: (a) an exhaustive conformational search
algorithm, which changes not only the starting conformation of the ligand but sometimes
also that of the receptor and provides candidate 3D structures of the complex; and (b) a
scoring function, which scores all those candidates and ranks them according to the
intermolecular interaction energy (i.e., the more negative this energy is, the higher the
candidate’s score). Despite these common features, docking programs differ from one
another in the search method used, in the level of flexibility of the molecules, and in
the contributions of different types of intermolecular interactions and steric overlaps
that they consider when evaluating ligand binding modes.

Table 5 Ligand structure generating tools
Program | Webpage | References
Chemoffice | www.cambridgesoft.com/software/ChemOffice | –
BIOVIA Draw | accelrys.com/products/collaborative-science/biovia-draw/ | –
Maestro | www.schrodinger.com | –
MOE | www.chemcomp.com | [201]
SYBYL | www.tripos.com | –
Online Databases
ChemPDB | www.ebi.ac.uk/pdbe-srv/pdbechem | [202]
ZINC15 | https://zinc15.docking.org/ | [203, 204]
PUBCHEM | https://pubchem.ncbi.nlm.nih.gov/ | [205, 206]
BindingDB | https://www.bindingdb.org | [207]
ChEMBL | https://www.ebi.ac.uk/chembl/downloads | [208]

The ability to produce a large and diverse set of ligand poses (ligand conformations
that specifically bind to the biological target) is a prerequisite for a docking tool to
be useful [209]. There are two main types of algorithms that allow docking programs to
search the conformational space of the ligand in order to find its poses [210]: (a)
systematic or directed approaches; and (b) random or stochastic methods [210–212].
There are three subtypes of systematic or directed search algorithms: (a) conformational
search methods; (b) fragmentation or incremental construction methods; and (c) database
methods. They all try to explore all the degrees of freedom of the ligand; however, they
differ in the way the search is carried out. Conformational search algorithms try to
obtain all possible ligand conformations by simply rotating all rotatable bonds of the
ligand by a fixed increment. In fragmentation methods, the ligand is incrementally grown
in the binding site: it is divided into several rigid fragments, which are docked and
then joined by flexible segments to rebuild the whole ligand. In an alternative approach,
the ligand is divided into a rigid core, which is docked first, and the remaining
fragments, which are added consecutively. Fragmentation methods are used in docking
programs such as DOCK [213], LUDI [212], FlexX [214], ADAM [215], and eHiTs [209]. The
last subtype of systematic search algorithms comprises the database methods, which use
libraries of pre-generated conformations (so-called conformational ensembles) that are
subsequently subjected to rigid-body docking. This method is employed in Glide
[216, 217] and FRED [218].
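The systematic conformational search described above, in which every rotatable bond is
rotated by a fixed increment, amounts to enumerating a regular grid of torsion angles.
The sketch below only generates the grid and counts its size; building and scoring the
corresponding 3D conformers is left out. The function names and the chosen increments are
illustrative assumptions.

```python
import itertools


def torsion_grid(n_rotatable, increment_deg):
    """Yield every combination of torsion angles (in degrees) on a regular grid."""
    if 360 % increment_deg != 0:
        raise ValueError("increment must divide 360 degrees evenly")
    values = range(0, 360, increment_deg)
    return itertools.product(values, repeat=n_rotatable)


def n_conformers(n_rotatable, increment_deg):
    """Grid size: (360 / increment) raised to the number of rotatable bonds."""
    return (360 // increment_deg) ** n_rotatable


# A ligand with 3 rotatable bonds sampled every 120 degrees gives 27 conformers ...
print(len(list(torsion_grid(3, 120))))
# ... while 6 rotatable bonds sampled every 30 degrees already give about 3 million.
print(n_conformers(6, 30))
```

The exponential growth of the grid with the number of rotatable bonds is the reason why
fragmentation and stochastic methods are preferred for flexible ligands.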
In random or stochastic search algorithms, the conformational space is sampled by
performing a random conformational change of the ligand structure, followed by acceptance
or rejection of the resulting conformer based on a predefined probability function. If
the generated ligand conformation is accepted, it is used as the starting point for a new
random conformational change. Random search methods are divided into three subtypes:
(a) Monte Carlo (MC) methods; (b) Genetic Algorithm (GA) methods; and (c) tabu search
(TS) methods. In MC methods, the position and conformation of the ligand are subjected to
successive random changes, each followed by a minimization step, and the changes are
accepted based on the energy-dependent Metropolis criterion [211]. Docking programs based
on MC include ICM [219], QXP [220], Prodock [221], and MCDOCK [222]. Another subtype of
random methods, GA, uses concepts derived from the theory of biological evolution to
explore the conformational space of the ligand. Unlike MC methods, GAs start from an
initial population of different ligand conformations, each defined by a set of state
variables, or genes, that describe the conformation of the ligand and its translation and
orientation relative to the receptor. GOLD [223], AutoDock [224] and SwissDock [225] are
docking programs in which evolutionary algorithms are implemented. It is worth noting
that Autodock VINA, the newest version of Autodock, employs parallel processing and
accelerates small-molecule ligand docking to such an extent that it can be used for
on-line docking. Namely, Autodock VINA was implemented in the GoMoDo [144] web service
and, recently, in the GUT-DOCK [226] and MTiOpenScreen [227] web services.
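The Metropolis criterion used in MC docking can be written in a few lines. The sketch
below applies it to a hypothetical one-dimensional energy function instead of a real
ligand pose, and it omits the minimization step that docking programs perform after each
random change. The energy function, step size and temperature factor kT are illustrative
assumptions.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def mc_search(energy, x0, steps, step_size, kT, seed=0):
    """Random-walk Monte Carlo search; returns the best state visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)  # random change of the state
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, kT, rng):
            x, e = trial, e_trial                       # accepted: new starting point
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


# Hypothetical energy landscape with a single minimum at x = 2.
best_x, best_e = mc_search(lambda x: (x - 2.0) ** 2, x0=-5.0,
                           steps=5000, step_size=0.5, kT=0.6)
print(best_x, best_e)
```

Occasional acceptance of uphill moves is what allows the search to escape local minima;
lowering kT makes the search greedier, raising it makes the walk more exploratory.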
The last subtype of random search algorithms comprises the tabu search (TS) methods,
which work by imposing restrictions that prevent already explored areas of the ligand
conformational space from being visited again and therefore favor the analysis of new
conformations. To exclude already explored conformations, when a new ligand conformation
is generated, its root-mean-square deviation (RMSD) relative to each previously visited
conformation is computed. The lowest RMSD is compared with a certain threshold value and,
if it is higher, the analyzed conformation of the ligand is accepted and its coordinates
are stored and used to accept or reject new conformations.
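The RMSD test at the heart of tabu search can be sketched as follows. The code assumes
that all poses are expressed in the receptor coordinate frame with atoms in the same
order, so no superposition is performed; the threshold of 0.75 Å and the toy coordinates
are illustrative assumptions.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with matching atom order."""
    squared = sum((p - q) ** 2
                  for atom_a, atom_b in zip(pose_a, pose_b)
                  for p, q in zip(atom_a, atom_b))
    return math.sqrt(squared / len(pose_a))


def tabu_accept(candidate, tabu_list, threshold):
    """Accept a pose only if it is farther than threshold from every stored pose."""
    if tabu_list:
        lowest = min(rmsd(candidate, visited) for visited in tabu_list)
        if lowest <= threshold:
            return False          # region already explored: reject
    tabu_list.append(candidate)   # store the coordinates for later comparisons
    return True


visited = []
pose_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_2 = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]   # shifted by 0.1 Å: too similar
pose_3 = [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]   # shifted by 2 Å: new region
for pose in (pose_1, pose_2, pose_3):
    print(tabu_accept(pose, visited, threshold=0.75))
```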

5.3 Scoring Algorithms

Once the candidate ligand poses have been predicted, their binding affinity for the
receptor must be scored. This is done by means of a scoring function that evaluates the
search results and, ideally, gives the highest score to the correct pose. Indeed, if the
search algorithm finds the correct pose but the scoring function cannot recognize it, the
program will make an invalid and useless suggestion to the scientist. Therefore, the role
of the scoring function is critical in every docking protocol. The scoring functions
commonly used in protein-ligand docking can be divided into four major classes: (a) force
field-based; (b) empirical; (c) knowledge-based; and (d) consensus.
Force field-based scoring functions are similar to empirical functions (see below) in
that both predict the binding free energy of a protein-ligand complex by adding
individual contributions from different types of interactions. However, the interaction
terms of the former are derived from the physical theory underlying molecular mechanics,
as opposed to the experimental affinities used to derive the latter. DOCK [213], created
in the 1980s as the first docking program, is a classic example of a force field-based
tool. Empirical scoring functions are based on the idea that the binding energy can be
obtained by adding several individual, uncorrelated terms. Many of the terms in empirical
scoring functions have equivalents in force-field scoring functions, but they are usually
simpler in form. GlideScore [216, 217], SYBYL/F-score [214], X-score [228] and Chemscore
[229] all belong to the class of empirical scoring methods. Knowledge-based scoring
functions are based on ligand geometry and contact preferences derived, via the Boltzmann
distribution, from databases of known protein-ligand complexes. Examples include
DrugScore [230], SMoG [231, 232], BLEEP [233, 234] and GOLD/ASP [235]. Last but not
least, consensus scoring functions combine the information obtained from different
scoring approaches to compensate for the errors introduced by each of them and thus to
improve the probability of finding the correct solution.
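One simple way to combine several scoring functions into a consensus is to average the
rank that each function assigns to a pose. The sketch below illustrates this
rank-averaging idea only; it is not the scheme of any particular docking program. It
assumes that a lower (more negative) score is better for every function, and all scores
and names are hypothetical.

```python
def consensus_rank(scores_by_function):
    """Average the per-function ranks of each pose (rank 1 = best, lowest score).

    scores_by_function -- dict: scoring function name -> dict: pose name -> score
    Returns a list of (mean_rank, pose) tuples sorted from best to worst.
    """
    poses = list(next(iter(scores_by_function.values())))
    rank_sum = {pose: 0 for pose in poses}
    for scores in scores_by_function.values():
        ordered = sorted(poses, key=lambda pose: scores[pose])
        for rank, pose in enumerate(ordered, start=1):
            rank_sum[pose] += rank
    n_functions = len(scores_by_function)
    return sorted((rank_sum[pose] / n_functions, pose) for pose in poses)


# Hypothetical scores of three poses from three scoring functions.
scores = {
    "force_field": {"pose_A": -8.1, "pose_B": -7.5, "pose_C": -5.0},
    "empirical": {"pose_A": -6.0, "pose_B": -7.2, "pose_C": -4.0},
    "knowledge_based": {"pose_A": -120.0, "pose_B": -150.0, "pose_C": -90.0},
}
for mean_rank, pose in consensus_rank(scores):
    print(pose, round(mean_rank, 2))
```

Ranks are used instead of raw scores because the individual functions report values on
different, non-comparable scales.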

5.4 Induced Fit Docking

If the bond angles, bond lengths and torsion angles of the components are not modified
at any stage of the docking, the procedure is called rigid-body docking. Whether
rigid-body docking is sufficiently good for most studies is a subject of debate. When a
substantial conformational change occurs within the components upon formation of the
protein-ligand complex, rigid-body docking is inadequate. However, scoring all possible
conformational changes is computationally too expensive when both the ligand and the
receptor structure are changed. For that reason, flexible docking procedures, which
permit conformational change, must efficiently select only a small subset of possible
conformational changes for consideration. Flexible docking that includes flexibility of
the receptor side chains is called “Induced Fit Docking”.
The “Induced-Fit Docking” (IFD) module from Schrödinger has been reported to be a robust
and accurate method that accounts for both ligand and receptor flexibility. The average
ligand root-mean-square deviation (RMSD) for traditional rigid-receptor docking in 21
cases was 5.5 Å, while the RMSD obtained with the Schrödinger IFD module was 1.4 Å [236].
Recently, Hanson et al. used the IFD method to dock ligands into the crystal structure of
the lysophospholipid sphingosine-1-phosphate (S1P) G protein-coupled receptor in order to
eliminate the differences between agonists and antagonists, which have different impacts
on the receptor structure [237]. Other programs, such as Gold [223], Autodock [224] and
FlexX [214], can also perform flexible docking.

5.5 Example of Virtual Screening on GPCRs

Structure-based virtual screening involves docking candidate ligands into a protein
target and then applying a scoring function to estimate the likelihood that the ligand
will bind to the protein with high affinity. Since G-protein-coupled receptors (GPCRs)
mediate cellular responses to the majority of hormones and neurotransmitters, they are
attractive targets in drug discovery. GPCRs represent a large family of signaling
proteins (see Fig. 1) that includes many therapeutic targets. However, progress in
identifying new small-molecule drugs by virtual screening has generally been
disappointing. Nevertheless, in the past 4 years we have seen remarkable progress in the
structural biology of GPCRs, raising the possibility of applying structure-based
approaches to GPCR drug discovery efforts. Of the various structure-based approaches that
have been applied to soluble protein targets, such as proteases and kinases, in silico
docking is among the most readily applicable to GPCRs. Early studies suggest that GPCR
binding pockets are well suited to docking, and docking screens have identified potent
and novel compounds for these targets [238].

6 Introduction to the Molecular Dynamics of Membrane Proteins

Molecular dynamics (MD) is a method for simulating the motion of biomolecules. During
the simulation, trajectories of molecules are obtained by solving Newton’s equations of
motion. The quantum part is hidden in the force field, a set of equations and parameters
used to derive the potentials and forces acting on interacting atoms. These parameters
are typically obtained from ab initio calculations as well as from experimental
observations. Force fields used in molecular dynamics are generally based on two types of
terms: bonded terms, for atoms linked by covalent bonds, and non-bonded terms, describing
van der Waals and electrostatic potentials. In all-atom molecular dynamics simulations,
many popular force fields such as CHARMM27/36 or AMBER use a representation in which
every single atom in the system is explicitly represented as a separate object.
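How a trajectory follows from Newton’s equations of motion can be illustrated with the
velocity Verlet integrator applied to a single harmonic bond term. This is a minimal
sketch in reduced units, not a fragment of any MD package: the force constant, mass and
time step are arbitrary assumptions, and a real force field would add angle, torsion,
van der Waals and electrostatic terms.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation m*a = F(x) and return the (x, v) trajectory."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


# Harmonic bond term E = 0.5 * k * (x - x_eq)**2 in reduced units (assumed values).
k, x_eq, mass = 1.0, 0.0, 1.0


def bond_force(x):
    return -k * (x - x_eq)


def total_energy(x, v):
    return 0.5 * mass * v ** 2 + 0.5 * k * (x - x_eq) ** 2


traj = velocity_verlet(x=1.0, v=0.0, force=bond_force, mass=mass, dt=0.01, n_steps=1000)
print(total_energy(*traj[0]), total_energy(*traj[-1]))
```

Near-constant total energy along the trajectory is a basic sanity check of both the
integrator and the chosen time step.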
Setting up the input for an MD simulation of a cytoplasmic protein is quite
straightforward as long as an initial structure is easy to obtain. The situation becomes
more complex with membrane proteins. The instability of membrane proteins in water-like
environments is the main reason for the underrepresentation of this important class of
proteins in the Protein Data Bank. Nevertheless, it is noteworthy that recent years have
brought a certain breakthrough in the field of membrane protein structure determination.
After 2009 the number of resolved structures of unique membrane proteins exceeded 40 per
year and reached an all-time high of 83 new unique structures in 2016. The availability
of many structures of proteins that belong to the same protein family, e.g., GPCRs,
facilitates computational research on close homologs whose structures still remain to be
experimentally determined. For details on comparative modeling of membrane proteins,
please see Sect. 4.1.
In order to run a molecular dynamics simulation of any system, a set of parameters
describing each molecule type in the system is required. While biomolecular force
fields include parameters for the most common residues like amino acids, nucleotides,
water or ions, in an explicit-solvent membrane system one may come across some
residues for which there are no parameters in the standard force fields. That group
includes (a) modified residues, (b) small ligands and (c) membrane lipids.
The most prominent example of the first class of non-standard residues is a retinal
moiety covalently attached to a lysine side chain via its terminal nitrogen atom.
That important residue is found in two of the most extensively studied membrane
proteins: the GPCR rhodopsin and the archaeal proton pump bacteriorhodopsin. Since
simulations of retinal-containing systems are so widespread, retinal has been
parameterized by several authors (for instance in [239]). Therefore, ready-to-use
files may have been published, for example as supplementary material, or may be
available from the authors upon an email request.
Small ligands are usually substrates or modulators of protein activity. The ligand
molecule may be present in a PDB file or placed within a putative binding cavity
by a docking algorithm (see Sect. 5). Since small ligands are mostly studied as part
of drug discovery projects, their parameters are hardly ever known and have to be
derived de novo, which is considered an advanced task. In order to simplify this
task, automated parameter generators compatible with certain force fields were
designed (see Table 6). Nevertheless, a newly derived parameter set still requires
human inspection to avoid obvious errors [240].
Membrane lipids constitute a very special class of non-standard residues: they are
the building blocks of the lipid bilayer in which the protein is immersed. The
membrane, however, is not present in the PDB file, except for single lipid molecules
whose presence and alignment allow conclusions to be drawn on the nature of the
protein-lipid interface [252, 253]. This fact raises specific issues that have to be
addressed: (a) finding the proper type of bilayer, (b) building or finding the
membrane structure and parameters and (c) embedding the protein into the membrane.
The lipid composition of biological membranes is variable and is known to depend
on a number of factors such as (a) the area of the cell membrane, (b) the cell type,
(c) the cell age, (d) the environment, (e) the organelle or (f) the taxon of the
organism. The striking resemblance between the types of structures of membrane
proteins present in various organisms (either a bundle of α-helices or a β-barrel)
contrasts with the variable nature of membrane composition. That suggests that, in
general, membrane proteins are tolerant to a certain extent of differences in bilayer
composition [254]. For instance, GPCRs heterologously expressed in the cells of
evolutionarily distant organisms may retain their activity (a quite recent example
[255]) despite the fact that the bilayer lacks cholesterol, which is thought to be
indispensable for GPCR function (reviewed in

Table 6 Selected topology builder applications and topology databases of small ligands
Application        Website                                        Force fields         References
ATB                http://compbio.biosci.uq.edu.au/atb/           Gromos family        [241]
PRODRG             http://davapc1.bioch.dundee.ac.uk/prodrg/      Gromos 87            [242]
SwissParam         http://swissparam.ch/                          CHARMM               [243]
CGenFF             http://mackerell.umaryland.edu/~kenno/cgenff/  CHARMM               [244]
MKTOP              http://www.aribeiro.net.br/mktop/              AMBER03, OPLS/AA     [245]
Acpype             http://code.google.com/p/acpype/               GAFF                 [246, 247]
AutoSMILES         http://www.yasara.org/autosmilesserver.htm     GAFF                 [248]
Virtual chemistry  http://virtualchemistry.org/                   GAFF, OPLS/AA        [249, 250]
Lipidbook          http://lipidbook.bioch.ox.ac.uk/               Set of force fields  [251]

[253]). Such experimental data justify the use of simple membrane models, consisting
of one or two phospholipid types, in MD simulations. Nevertheless, each protein is a
separate case. Therefore, data concerning the sensitivity of a given protein to the
lipid composition of the membrane should be checked prior to MD simulations, since
in certain cases the composition may influence the results [256, 257]. For a
thorough review of that subject please refer to [254].
Once the lipid composition is established, the next step is to generate an input
file with the pre-equilibrated membrane along with topology files of all molecules
inside that bilayer. There exists an excellent lipid topology repository called
Lipidbook [251], which stores topologies parameterized for commonly used force
fields such as GROMOS43a1/53a6 [258, 259], CHARMM22/27/36 [260–262], GAFF [263],
OPLS/AA [264, 265], Slipids [266], Martini [267] and Bondini [268, 269], which are
implemented in the popular molecular dynamics software packages GROMACS [270–274],
NAMD [275], CHARMM [276] and Amber [277]. If the available packages do not include
the membrane topology needed for a particular study, either because of an unsuitable
size of the required periodic box or an unsuitable composition, the membrane may be
built automatically by CHARMM-GUI [278–280] or VMD [281], which allow for membrane
size adjustments.
The position of the protein in the bilayer is another key factor heavily influencing
the outcome of MD simulations. Since the membrane position is not provided in PDB
files, a number of computational methods have been developed to facilitate the step
of membrane positioning. The key concept at this stage is the hydrophobicity of the
protein that determines the orientation and thickness of the membrane into which the
protein will be inserted. For a comprehensive review of methods for transmembrane
region prediction and related databases please refer to Sect. 3.2.
When the protein of interest is finally positioned with respect to the bilayer,
several lipid molecules have to be deleted so that they do not overlap with the
positioned molecule. A simple and naïve approach to lipid deletion may require
a long equilibration due to the very loose lipid packing around the protein.
Fortunately, there exist more sophisticated methods to perform that step. The tools
developed over the last decade implement several approaches. The inflategro Perl
script [282] implements inflation of the membrane followed by lipid deletion within
a given cutoff and subsequent gradual membrane compression, while the protein
coordinates remain constant during the whole process. Another example, a tool from
the GROMACS suite [270–274] called g_membed [283] (currently included in the code of
the main program mdrun), contracts the protein, deletes lipids within a given cutoff
and gradually decompresses the protein to its initial size, performing one step of
molecular dynamics during every iteration of the decompression stage. The same
approach is implemented in other tools, for instance a Yasara macro called
md_runmembrane.mcr, which was designed to automate the process of membrane
simulation setup [284]. Both methods, g_membed and md_runmembrane.mcr, result in
dense lipid packing around the protein, whereby the equilibration time is reduced.
The advent of multiscale simulations opened a new way in which insertion and
equilibration are performed using a coarse-grained representation. Before running
the production simulation, a transformation to all-atom resolution is carried out.
Tools supporting this route include Insane [285] and Backward [286], which can handle
many types of lipids, thereby allowing for the setup of complex membrane
environments. Both tools use MARTINI force fields.
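The two geometric operations behind the inflation-based insertion, lateral scaling of
the lipid positions and deletion of lipids within a cutoff of the protein, can be
sketched as follows. This is a simplified illustration and not the code of inflategro
or g_membed: the function names and the plain-list coordinate format are assumptions,
the membrane normal is assumed to lie along z, and periodic boundary conditions are
ignored.

```python
import math


def inflate_xy(lipids, factor, center=(0.0, 0.0)):
    """Translate each lipid rigidly so that its xy centroid is scaled by
    `factor` about `center`; z coordinates are left untouched."""
    inflated = []
    for lipid in lipids:
        cx = sum(atom[0] for atom in lipid) / len(lipid)
        cy = sum(atom[1] for atom in lipid) / len(lipid)
        dx = (cx - center[0]) * (factor - 1.0)
        dy = (cy - center[1]) * (factor - 1.0)
        inflated.append([(x + dx, y + dy, z) for x, y, z in lipid])
    return inflated


def lipids_to_delete(protein_atoms, lipids, cutoff):
    """Indices of lipids having any atom closer than `cutoff` to the protein.

    protein_atoms: list of (x, y, z) tuples
    lipids: list of lipids, each a list of (x, y, z) atom coordinates
    """
    doomed = []
    for index, lipid in enumerate(lipids):
        if any(math.dist(atom, p) < cutoff
               for atom in lipid for p in protein_atoms):
            doomed.append(index)
    return doomed
```

After deletion, the inverse scaling would be applied in small steps (a factor
slightly below 1 per iteration), each followed by energy minimization, until the
area per lipid returns to a reasonable value.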
If the system does not contain water layers after the protein insertion, a solvation
step is required. Since the software used for solvation takes into account only the
space criterion and not the properties of the environment, the final system should
be checked for misplaced water molecules. Such misplacement may involve water
molecules inserted into the hydrophobic core of the membrane and into
solvent-inaccessible protein cavities. Although in the former case the water
molecules will diffuse out of the membrane during the equilibration step, it is
reasonable to remove them before starting the simulation, at least for the sake of
saving computational time. The latter type of misplaced water molecules is more
problematic, since running a simulation with water in buried cavities drives the
system into unphysical states, which undermines the conclusions drawn from such a
study. A sudden crash of the simulation may indicate that water molecules are
present in a closed cavity.
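A first screen for water molecules placed in the hydrophobic core can be as simple as
a slab test along the membrane normal. The sketch below assumes that the normal is
aligned with the z axis and that the slab center and half-thickness are supplied by
the user; the function name and the input format are hypothetical.

```python
def waters_in_hydrophobic_core(water_oxygens, z_center, half_thickness):
    """Return indices of water oxygens located inside the hydrophobic slab.

    water_oxygens: list of (x, y, z) coordinates of water oxygen atoms
    z_center: z coordinate of the bilayer midplane
    half_thickness: half of the hydrophobic thickness, same units as z
    """
    return [index for index, (_, _, z) in enumerate(water_oxygens)
            if abs(z - z_center) < half_thickness]
```

The slab test also flags water inside genuine, water-filled protein channels, so the
flagged molecules should be inspected before removal rather than deleted blindly.
Water buried in closed protein cavities is not detected by this criterion and
requires a separate check.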
The last question to consider is how long the equilibration step should last and
how to detect its end, when one may move on to the production run. Obviously, the
preparation of the investigated system should be designed in such a way that at
the beginning of the equilibration step the system is as close to equilibrium as
possible. Several steps that shorten the equilibration time were discussed in the
previous paragraphs. They include the use of a pre-equilibrated membrane, more
sophisticated protein insertion methods and proper solvation of the system. A
reliable protein model is also important, and this is a primary concern of
researchers performing homology modeling. Since the equilibration time depends on
many factors, it is essential to choose reasonable criteria that, once fulfilled,
mark the end of the equilibration process. One of the most commonly used criteria is
the root mean square deviation (RMSD) calculated with respect to the reference
structure. Other criteria to consider include various interaction energies (e.g.
lipid-water or protein-lipid) or the simulation box volume (when pressure coupling
is applied). Once the properties of interest converge to a stable value, the
equilibration is finished.
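The convergence criteria mentioned above can be monitored with very little code. The
following sketch computes the RMSD between two coordinate sets and applies a simple
plateau test to a time series. Both are illustrative assumptions: the RMSD function
expects structures that have already been superimposed, and the window-based test
with a user-chosen tolerance is only one of many possible convergence checks.

```python
import math


def rmsd(coords, reference):
    """RMSD between two pre-superimposed coordinate sets of equal length."""
    if len(coords) != len(reference) or not coords:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords))


def has_converged(series, window, tolerance):
    """True if the means of the last two windows differ by less than tolerance."""
    if len(series) < 2 * window:
        return False  # not enough data to compare two windows
    last = series[-window:]
    previous = series[-2 * window:-window]
    return abs(sum(last) / window - sum(previous) / window) < tolerance
```

The same plateau test can be applied to any monitored property, for example the box
volume or the protein-lipid interaction energy, each with its own tolerance.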
A step-by-step manual setup of a membrane protein system is a labour-intensive
task. Notable progress in the development of tools automating this process has been
made recently. One such pipeline is used by the MemProtMD database [287]. The tool
automatically identifies new membrane proteins in the Protein Data Bank and performs
membrane insertion, system equilibration and resolution transformation, for which it
utilizes the already mentioned Insane and Backward tool duo. The popular CHARMM-GUI
web server gained new features such as Martini Maker [288] and Martini to All-atom
Converter. The latter relies on the same toolset as MemProtMD. A set of Membrane
Builder improvements allows for more efficient construction of even more complex
all-atom membranes [289]. There also exist tools that can be installed and used
locally. QwikMD [290] is a recent addition to the VMD [291] visualization toolkit
and facilitates both setup and analysis of molecular dynamics simulations through a
graphical user interface. It provides workflows for both beginners and more advanced
users. High Throughput Molecular Dynamics (HTMD) is a platform which integrates many
functionalities, from structure manipulation through running calculations on
different resources to trajectory analysis [292]. Its features are available as a
set of Python classes and functions. The popularity of this language in the
scientific community also gives a boost to further community-driven development of
extensions.
This part of the chapter highlighted selected topics regarding the setup of MD
simulations of membrane proteins. While the development of automated tools capable
of simulation setup, running and analysis serves the scientific community, some
systems or steps may yield errors and require detailed inspection. In such cases, a
more detailed knowledge of the underlying procedures is crucial.

6.1 Steered Molecular Dynamics

Many membrane proteins serve as receptors or transporting channels. During
activation they undergo certain conformational changes, for example the movement of
whole TM helices. To understand how and why a protein is activated, it is crucial to
study its dynamic properties and the stability of the ligand-receptor pair. Atomic
Force Microscopy (AFM) methods, especially Single Molecule Force Spectroscopy
(SMFS) and Dynamic Force Spectroscopy, make it possible to record the forces needed
either to rupture interactions within a protein or to unbind a ligand from a
complex. The exact unfolding path or ligand extraction path remains unknown, yet it
is possible to perform molecular dynamics (MD) simulations resembling AFM
experiments, which may reveal the trajectory of the changes in the system. Exemplary
images from the unfolding path of rhodopsin are shown in Fig. 3.

Fig. 3 Exemplary steps of the unfolding pathway of rhodopsin. a Unfolding of helix TM1 (in blue).
b Unfolding of the protein region containing a disulphide bridge

The modified MD, Steered Molecular Dynamics (SMD), is similar to the experimental
method SMFS. Like its experimental counterpart, SMD allows for mechanical unfolding
of proteins or for dragging molecules in a specified direction by applying an
external force to selected atoms, amino acids or even whole molecules (e.g. ligands
in proteins). In experiments, the investigated molecule is attached to the tip of an
AFM cantilever. As the cantilever with the tip is retracted at a constant speed,
the interaction forces between the tip and the attached molecule increase, resulting
in bending of the cantilever. The flexible cantilever obeys Hooke’s law with a
force constant characteristic of the type and model of the cantilever used. In SMD
simulations the external force can be applied in various ways. (1) Since the AFM
cantilever obeys Hooke’s law, its attachment to the sample can be modeled as a
harmonic restraint to a dummy atom (the equivalent of the tip) which moves with a
constant speed. This method is very often used for mechanical unfolding of proteins,
e.g. titin [293] and bacteriorhodopsin [294], and for the investigation of
intermolecular forces between proteins and smaller molecules [295]. Due to the
similarities to SMFS, the results of simulations can be easily compared with the
experimental force-displacement (F–D) plots. (2) Another implementation of SMD
applies not a constant speed but a constant force or torque to selected atoms. Such
a force is added directly to the selected atoms during each step of the MD
simulation, so a dummy atom and a virtual spring are not needed. This implementation
is useful for achieving a nearly equilibrium state during pulling, especially when
the applied force is equal to the resistance forces, so that one can investigate the
internal rearrangement of parts of the protein during ligand unbinding or during a
movement (even a rotational one) of domains [296]. Depending on the applied force,
the obtained displacement can resemble slightly biased thermal movements (very small
forces), molecular diffusion (moderate forces) or drift movement (strong forces)
[297]. (3) The third method involves a frozen dummy atom and a spring that is
relatively weak and initially stretched. During the simulation the force constant of
the spring is gradually increased, so the force increases and enables the movement
of atoms. This method was used to investigate the unbinding of the avidin-biotin
complex [298], but nowadays it is rarely used because the direction of the applied
force cannot be changed.
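The first two pulling schemes can be illustrated with a one-dimensional toy model.
The sketch below is not the implementation used by any MD package: it assumes
overdamped, noise-free dynamics with a friction coefficient gamma, a user-supplied
function force_fn(x) standing in for the molecular forces, and hypothetical function
names. In the constant-velocity variant the recorded spring force plays the role of
the force in an F–D plot.

```python
def pull_constant_velocity(x0, k_spring, v_pull, force_fn, gamma, dt, n_steps):
    """Scheme (1): a dummy atom moves at v_pull and drags the pulled atom
    through a harmonic spring. Returns (dummy displacement, spring force)."""
    x = x0
    curve = []
    for step in range(n_steps):
        dummy = x0 + v_pull * step * dt       # position of the moving dummy atom
        f_spring = k_spring * (dummy - x)     # Hooke's law
        curve.append((dummy - x0, f_spring))
        x += dt * (force_fn(x) + f_spring) / gamma
    return curve


def pull_constant_force(x0, f_ext, force_fn, gamma, dt, n_steps):
    """Scheme (2): a constant external force is added at every step;
    no dummy atom or spring is involved. Returns the positions visited."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        x += dt * (force_fn(x) + f_ext) / gamma
        path.append(x)
    return path
```

For a free particle (force_fn returning zero) the spring force in the
constant-velocity scheme approaches gamma × v_pull, which shows directly why faster
pulling produces higher recorded forces.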
Although the SMD methods are extremely useful in providing details of processes
not available from experiment, they also have some drawbacks. The most important one
is that the pulling speed used in SMD is much larger (by about six orders of
magnitude) than that in experiment, because a single AFM pulling experiment can last
up to a few seconds while the longest SMD simulations reach a microsecond time
scale. Because of this, the forces recorded in SMD simulations are higher than those
in experiment by about one order of magnitude [297]. Nevertheless, since the
obtained F–D curves are very similar to the experimental ones, the mechanisms of
unfolding or unbinding should also be similar, so the results obtained from SMD can
be considered valid. Given the constant increase in computer efficiency, the gap
between theory and experiment will gradually vanish.
SMD simulations have been used successfully in various investigations. Mechanical
unfolding of bacteriorhodopsin (BR) unveiled the sequential unfolding pathway of
that protein and showed that the dominant molecular interactions are networked
hydrogen bonds and van der Waals interactions between nonpolar groups. The
researchers suggested that a similar dynamic interaction network could be a key
factor stabilizing GPCRs and other membrane proteins [294]. A series of fast SMD
simulations of the unfolding of various rhodopsin mutants associated with an
autosomal dominant form of retinitis pigmentosa also confirmed the importance of
the dynamic interaction network. For the selected 20 point mutants all force curves
were very similar to the wild-type rhodopsin curves, indicating that mutation of one
amino acid is not enough to disrupt the rhodopsin structure and stability even if
the protein function is lost [299]. Another SMD study [300] concerned the retinal
extraction pathway from the bacteriorhodopsin binding site into the membrane. It
was assumed that the protein structure remains intact during the extraction, so
the same path could be used for the insertion. Since there is no straight way for
retinal to leave the protein, the time-dependent force SMD protocol was applied. It
was observed that retinal formed stable interactions at the assumed entry/exit site,
suggesting that they may be formed prior to entering the protein cavity [300].
For the modeling of transition processes between two conformations of a system,
a variation of SMD called Targeted Molecular Dynamics (TMD) may be used
successfully. It consists of a series of forced atom movements by which an
appropriate pathway to the final state is reached [301]. In recent years, TMD was
used to study, for example, the behaviour of the C-loop and channel gating in
nicotinic receptors. The TMD protocol was used to displace the C-loop from an “open”
to a “closed” position, in which it covers the active site. This conformational
change resulted in a structural reorganization of the ligand-binding pocket, the
β1-β2 loop, the Cys-loop and the β10 strand, leading to channel widening [302].

6.2 Interactive Molecular Dynamics

SMD requires a predefined direction and value of the applied force, yet it can
be hard to find the ligand access path to the receptor active site. Probing a
complex system with numerous potential solutions would require running a large
number of SMD simulations. Some of the calculations can last very long and are
therefore costly in terms of high performance computing resources. In addition, the
possibility of quickly screening experimental hypotheses may be essential for the
success of the whole project. A practical solution to the above problems is to
combine an efficient MD algorithm with a molecular modeling tool to allow low-cost
simplified simulations with a live interaction option, in other words Interactive
Molecular Dynamics (IMD). In such simulations a researcher can use standard human
interface devices (e.g. a mouse or a special haptic device) to add forces that pull
or restrain particular atoms in the system. A haptic device additionally allows for
bidirectional passing of the force information, so the resistance of the system to
the applied movement can be felt by hand.
The computer times of IMD simulations are much shorter than those of SMD
simulations: up to hours versus up to a few months, respectively. Thus, the forces
applied in IMD have to be high to complete the pulling procedure. As a result, it is
more difficult to extract useful quantitative information from interactive IMD
simulations than from SMD. Nevertheless, IMD may be used to provide initial
conformations for SMD.
The IMD protocol with a haptic device was used to investigate the transition
pathways of arabitol and ribitol through GlpF, a member of the aquaporin family of
membrane proteins. Significant transition states were chosen from the interactive
runs for study in further MD simulations. Moreover, the IMD runs directly revealed
which hydrogen bonds are responsible for the selectivity of the water channel in
aquaporin [303].

6.3 Supervised Molecular Dynamics (SuMD)

Supervised molecular dynamics is a computational method that allows the exploration
of the ligand-receptor recognition pathway on a nanosecond time scale. Molecular
recognition is a crucial issue when aiming to interpret the mechanism of known
active substances as well as to develop novel active candidates. Ligand binding
events can be simulated using classical MD methods; however, such simulations
require very long computation times (a microsecond timescale) and are therefore
computationally expensive and affordable only with high-level computational
capacity. In order to overcome that obstacle, an alternative MD approach was
developed by Sabbadin and Moro [304]. It was named ‘Supervised Molecular Dynamics’
(SuMD, to distinguish it from SMD, steered molecular dynamics) and it was
successfully used for simulations of ligand recognition by G protein-coupled
receptors (GPCRs) within a time scale reduced by up to 3 orders of magnitude
compared to classical MD. SuMD enables the investigation of ligand-receptor binding
events independently of the starting position and chemical structure of the ligand,
and also of its receptor binding affinity.
In the SuMD approach no artificial forces are employed and no movement or spatial
restraints are applied to any of the atoms. Therefore, at every single moment of the
simulation the behavior of the system is spontaneous, with one exception: the
simulation is divided into cycles (typically ~200 ps) and after the end of each
cycle the simulation is either continued or restarted from the last checkpoint. A
special tabu-like supervision algorithm is applied to increase the probability of
producing a ligand-receptor binding event without introducing bias into the
simulation. The distance between certain atoms (or certain groups of atoms) is
monitored. In SuMD simulations of ligand binding, the distance between the ligand
and the receptor binding site is measured. Here we describe the most basic SuMD
algorithm, which was used by us for the simulation of ligand binding to the CB1
cannabinoid receptor (Fig. 4). After the end of each cycle the measured distance is
compared to the one measured at the end of the previous cycle. If the distance
decreased during the last cycle, the simulation is continued without any
intervention and the system coordinates are saved to the checkpoint. If the distance
did not decrease, however, the system coordinates are restored from the previous
checkpoint and atom velocities are reinitialized according to the given temperature.
The tabu-like supervision algorithm is repeated until the ligand-receptor
distance is less than 5 Å.
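The control flow of this basic supervision scheme can be sketched as below. The
sketch is independent of any MD engine: run_cycle and distance are hypothetical
callables supplied by the user, where run_cycle(state) is assumed to run one short
unbiased MD cycle with freshly assigned velocities and return the new state, and
distance(state) is assumed to return the ligand-receptor distance in Å. The
max_cycles guard is an addition for safety and is not part of the description above.

```python
def sumd_basic(run_cycle, start_state, distance, threshold=5.0, max_cycles=10000):
    """Accept a cycle only if it shortened the monitored distance;
    otherwise restart from the last checkpoint.

    Returns (final checkpoint, accepted cycles, total cycles run).
    """
    checkpoint = start_state
    d_checkpoint = distance(checkpoint)
    accepted = 0
    cycles = 0
    while d_checkpoint >= threshold and cycles < max_cycles:
        candidate = run_cycle(checkpoint)     # one unbiased MD cycle
        d_new = distance(candidate)
        cycles += 1
        if d_new < d_checkpoint:
            # distance decreased: continue and save a new checkpoint
            checkpoint, d_checkpoint = candidate, d_new
            accepted += 1
        # otherwise the candidate is discarded and the next cycle
        # starts again from the unchanged checkpoint
    return checkpoint, accepted, cycles
```

For a quick dry run without an MD engine, the state can be a single number standing
for the distance, with run_cycle adding a random displacement to it.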
The supervision algorithm first described by Sabbadin and Moro is slightly more
complex, since an arbitrary number of distance points (n: a, b, c, d, e) per cycle
is collected in real time and a linear function f(x) = m × x is fitted to the
distance points at the end of the cycle. If m < 0, the ligand-receptor distance is
likely to be shortened over the cycle time, and the classic MD simulation is
restarted from the last produced set of coordinates. Otherwise, the set of
coordinates is restored from the previous checkpoint and random velocities are
reassigned to each atom in the system, consistently with the NVT ensemble.
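The slope criterion can be sketched with an ordinary least-squares fit. Note the
assumption: the fit below includes an intercept, whereas the text writes the fitted
function only as f(x) = m × x; for positive distances a line forced through the
origin would always have a positive slope, so the intercept is added here to make
the sign of m informative. The function names are hypothetical.

```python
def fitted_slope(distances, times=None):
    """Least-squares slope m of a straight line fitted to the distance points."""
    n = len(distances)
    if n < 2:
        raise ValueError("at least two distance points are needed")
    if times is None:
        times = list(range(n))  # equally spaced sampling within the cycle
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    numerator = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def keep_cycle(distances):
    """True if the cycle is kept (m < 0), False if it should be restarted."""
    return fitted_slope(distances) < 0
```

Compared with comparing only the two end points of a cycle, the fitted slope is less
sensitive to a single noisy distance value.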

Fig. 4 A graphical representation of the exemplary SuMD algorithm. That particular supervision
algorithm was used for identification of the most probable ligand entrance pathway into CB1
cannabinoid receptor [305]

So far, the SuMD approach has been successfully applied to simulate the
receptor-ligand recognition pathway for various orthosteric and allosteric GPCR
ligands. For instance, the crystallographic poses of A2A adenosine receptor ligands
(ZM241385, T4G, T4E and caffeine) were reproduced with high accuracy after
15–110 ns of SuMD simulation [304]. It has also led to the identification of
meta-binding sites for some of those ligands. The SuMD methodology was also used to
investigate, at the molecular level, the positive allosteric modulation of the human
adenosine A3 receptor mediated by LUF6000 [306] and to sample the putative binding
sites for the A2A AR allosteric modulators ZB1854, ZB268 and ZB418 [307].
Furthermore, the SuMD simulations of CB1 ligand-receptor recognition mech-
anism for two agonists (anandamide and THC) supported the hypothesis that
hydrophobic ligands attain access to the CB1 receptor via the lipid bilayer. Both
tested ligands entered the binding site by crawling between transmembrane helices
TM7 and TM2 (Fig. 5). The hydrophobic tail of the ligand first penetrated the recep-
tor during entry, and then the rest of molecule passed through the gate with the polar
head at the end. Two alternative approaches, SMD and SuMD, used for ligand exit
and entry, respectively, provided the same pathway through the TM7–TM1/TM2
crevice, and also, the orientation of the ligand was the same during its exit and entry
[305].
Supervised molecular dynamics (SuMD) of ticagrelor binding to the P2Y12R purinergic
receptor led to the identification of possible meta-binding sites of that receptor,
indicating interactions between the ligand and the extracellular regions of P2Y12R
[308]. The SuMD methodology was extensively tested not only for GPCRs but also
for other membrane proteins and globular proteins [309].

Fig. 5 Selected frames from SuMD simulation trajectories of two agonists, anandamide (left)
and Δ9-THC (right), entering the binding site of the CB1 cannabinoid receptor. These simulation
results indicate that the most probable ligand entrance pathway for the CB1 cannabinoid receptor
lies between TM7 and TM1/TM2 and that ligands access the binding site directly from the membrane
[305]

The SuMD approach is also very useful for analyzing both orthosteric and allosteric
binding events, broadening our perspectives in several scientific areas from
molecular pharmacology to drug discovery. In particular, it can be applied in a drug
design campaign for lead optimization in order to design novel binders with
preferable pharmacodynamic profiles. Moreover, SuMD represents a powerful tool to
assist the design of site-directed mutagenesis experiments aimed at investigating
the molecular recognition process. Very likely, future drug design will involve
detailed characterization of not only the bound state but also the whole
ligand-protein network of recognition pathways, including all metastable
intermediate states, and for this reason SuMD will become a very useful tool.

7 Formation of Protein Oligomers in the Membrane

Membrane proteins play a crucial role in passing information and transporting small
molecules between membrane-separated compartments. To perform their function
they interact with other proteins, forming transient or more stable homo- or
heterooligomeric complexes [310–313]. Due to difficulties in solving the structures
of membrane proteins using X-ray diffraction or NMR, computational methods of
structure and interaction prediction have become quite important, offering insight
into details at a resolution inaccessible to current experimental methods. In this
section we briefly review selected methods of protein-protein interface prediction
in the context of membrane proteins.
The methods used for protein-protein interface prediction can be classified into
two groups:
• structure-based methods that use atom coordinates and atom types. This category
is employed in case of membrane proteins for which structural information is
available. The most prominent methods in this group are:
– Docking
– Molecular Dynamics (MD)
• sequence-based methods that rely on sequence alignments and residue conserva-
tion.

7.1 Docking

The procedure of docking involves three general steps: (a) generation of a complex
structure followed by (b) filtering out false positives based on a scoring function and
(c) refinement of the best ranked models. Various methods of searching the solution
space and ranking the results are reviewed in [314–316].
The most commonly used protein-protein docking engines are listed in Table 7.
One has to note that they have not been developed specifically for membrane proteins.
This is mostly due to the fact that in order to properly validate any new method,
a sufficient amount of experimental data, such as structures of proteins and their
complexes, is needed. This condition is not easy to meet in case of membrane proteins
due to experimental difficulties in solving their structures. Therefore, complexes of
membrane proteins are underrepresented and hence the docking programs may have
problems delivering good results in this area. Nevertheless, it is possible in many cases
to yield reasonable structures using the available programs. The issues a researcher
has to be aware of while attempting membrane protein docking are briefly outlined.
While some of them are membrane protein-specific, the others are more general.
First of all, the presence of the membrane is not taken into account during the
results ranking stage. Therefore, a solution that would be perfectly valid in a cyto-
plasmic environment is mostly invalid when placed in a lipid bilayer. The burden of
creating a filter that successfully selects and ranks membrane-aware complexes from
a population of results is left upon a researcher but the docking methods themselves
were shown to work even in such hard cases (see for instance [337, 338]). Second,
in case of membrane proteins it may be hard to identify obvious interaction sites like
surface bulges and cavities and the small contact area may not suffice for a good
prediction. Furthermore, if at least one of the proteins undergoes a significant
conformational change during the formation of a protein-protein interface, the docking
engine, particularly if rigid, will likely fail to yield a native-like structure. What is
more, to further validate the model obtained from docking, the stability of a
complex may have to be confirmed by MD. Since this is a time-consuming step, one
should employ other available filters to limit the number of initial configurations.
Last but not least, a docking program may allow the use of certain constraints in
order to limit the search space and produce more significant results. If any experimental
data, such as distance restraints between certain residues or the reciprocal orientation of
complex subunits, are known, one is encouraged to use them to improve the quality of the
generated model. However, this step requires caution, especially when the interpretation
of experimental data is ambiguous. For instance, a mutation of an amino acid at site
A may induce conformational changes in a protein so that a distant binding site B
can no longer interact with its partner. If the aforementioned amino acid is used
to constrain the search, the results will be invalid. In this situation it
is desirable to generate more structures and to use experimental data as a filter. The
crystal structures of protein oligomers that can be employed for testing the above
methods are shown in Fig. 6.

406 D. Latek et al.

Table 7 Selected protein-protein docking programs

Tool              Website                                                      References
ClusPro 2.0       https://cluspro.bu.edu                                       [317–321]
GRAMM-X           http://vakser.bioinformatics.ku.edu/resources/gramm/grammx   [322]
ZDOCK             https://zlab.umassmed.edu/zdock/                             [323–325]
Rosetta Docking2  http://rosie.graylab.jhu.edu/docking2                        [326–330]
HADDOCK           http://www.bonvinlab.org/software/haddock2.2/                [331–335]
PatchDock         http://bioinfo3d.cs.tau.ac.il/PatchDock/                     [336]
SymmDock          http://bioinfo3d.cs.tau.ac.il/SymmDock/                      [336]
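
The strategy of generating many docking poses first and applying experimental data afterwards can be sketched as a simple post-filter. This is only an illustration: the pose representation (a dictionary of residue labels and coordinates in Å), the residue labels and the 8 Å threshold are hypothetical and are not taken from any of the programs in Table 7.

```python
import math

def satisfies(pose, restraints):
    """Return True if every distance restraint holds for this pose.

    pose: dict mapping a residue label to its (x, y, z) position in Angstrom.
    restraints: list of (label_i, label_j, max_distance) tuples.
    """
    return all(math.dist(pose[i], pose[j]) <= dmax for i, j, dmax in restraints)

def filter_poses(poses, restraints):
    """Keep only the poses compatible with all experimental restraints."""
    return [pose for pose in poses if satisfies(pose, restraints)]

# Two hypothetical docking poses, reduced to the two restrained residues.
poses = [
    {"A:45": (0.0, 0.0, 0.0), "B:112": (6.0, 0.0, 0.0)},
    {"A:45": (0.0, 0.0, 0.0), "B:112": (25.0, 0.0, 0.0)},
]
restraints = [("A:45", "B:112", 8.0)]  # assumed cross-linking-like restraint
kept = filter_poses(poses, restraints)
```

Because the restraints are applied after the search, an ambiguous or misleading experimental observation only removes candidates from the list and cannot bias the sampling itself.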

7.2 Molecular Dynamics

Biomolecules are dynamic systems, and exploring their dynamic properties can reveal
their true nature. This is why molecular dynamics is a widely used tool in
computational research. Yet if one attempts to find a proper interface by simulating
a set of random starting complexes (even if the presence of a membrane is taken into
account), one faces a time- and resource-consuming task that is simply too costly
unless the interacting proteins are very small. The reason is that the timescale of
complex formation may not be reachable with MD, particularly when the complex
formation induces large conformational changes. This is why MD is usually used as a
complementary tool with a docking engine of one's choice: docking delivers a set of
starting structures and MD determines whether the complex is transient or stable.

Modeling of Membrane Proteins 407

Fig. 6 The protein-protein interfaces in crystal structures. a The trimer of bacteriorhodopsin. PDB
id: 1BRR. b Two different interfaces in an oligomer of the opioid receptor μOR. PDB id: 4DKL. The
interfaces are encircled by red dashed ellipses. The interacting helices are colored and labeled

As previously noted, docking engines lack proper filters that remove
membrane-infeasible solutions. This drawback shifts the responsibility to the
researcher. The structures that pass the test can be subjected to molecular dynamics
simulation. For the sake of accuracy, the simulations should be carried out in a
membrane environment, which requires a longer system preparation procedure than for
water-soluble proteins. For more details please see Sect. 6.
The trajectory analysis provides valuable information on the properties of the studied
protein complexes: (a) the area and type of the protein-protein interface, (b) the
energy of interaction, (c) various structural changes of protomers upon binding, and
even (d) the kinetics of complex formation/dissociation for sufficiently long
simulations. The role of computational research is not limited to validation of
experimental data. The results of simulations delineate new research paths for
experimental labs, for instance picking residues for mutations and predicting the
resulting interfaces. Therefore, molecular dynamics is an important tool in the
portfolio of a modern scientist interested in the formation of protein-protein complexes.

7.3 Sequence-Based Methods

The protein sequence records vastly outnumber the protein structures solved to date.
It is not uncommon that for a certain protein family very few, if any, protein
structures are known. This was the case with G protein-coupled receptors (GPCRs) at
the beginning of the 2000s, when rhodopsin was the only member of this important
family with a solved structure [339]. The sequence-based methods, often combined with
a reasonable template structure, may still bring valuable information regarding
residues of primary significance for protein structure and function, including
protein-protein interfaces. These methods rely on sequence homology and produce their
output after analyzing multiple sequence alignments. Below is a brief overview of
selected sequence-based approaches for protein-protein interface prediction.
The evolutionary trace (ET) method [340] uses a multiple sequence alignment to
build a phylogenetic tree. The sequences are then divided into several groups by
clustering. The alignment is scanned for residues that are conserved within
each group but differ between groups. Such residues are labeled evolutionary trace
residues and are claimed to be important due to a lower probability of mutation. The
ET residues are subsequently mapped onto the structure of the protein in order to
visualize the location of functional sites. Different flavors of ET analysis were
used to distinguish residues responsible for binding ligands, G proteins and
another monomer [341–343].
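
The core selection step of an evolutionary trace can be sketched as follows. This is a minimal illustration, not the published ET algorithm: it assumes the grouping of sequences is already given (a real trace derives it from the phylogenetic tree at several partition levels), and the toy alignment and group labels are invented.

```python
def trace_positions(alignment, groups):
    """Return 0-based alignment columns that are invariant within every group
    but not identical across all groups.

    alignment: dict mapping sequence name to an aligned sequence (equal lengths).
    groups: dict mapping sequence name to a group label.
    """
    length = len(next(iter(alignment.values())))
    labels = sorted(set(groups.values()))
    result = []
    for col in range(length):
        consensus = []
        for label in labels:
            residues = {seq[col] for name, seq in alignment.items()
                        if groups[name] == label}
            # A group with variation or a gap disqualifies the column.
            if len(residues) != 1 or "-" in residues:
                break
            consensus.append(residues.pop())
        else:
            # Conserved in every group: keep it only if the groups differ.
            if len(set(consensus)) > 1:
                result.append(col)
    return result

# Toy alignment with two hypothetical subfamilies.
alignment = {"r1": "ACDKL", "r2": "ACDRL", "o1": "ACEKL", "o2": "ACERL"}
groups = {"r1": "rhod", "r2": "rhod", "o1": "opsin", "o2": "opsin"}
trace = trace_positions(alignment, groups)
```

In this toy case only the third column is group-specific: the first, second and fifth columns are invariant across the whole alignment, and the fourth varies within each group.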
Correlated mutation analysis (CMA) searches for mutations that occur together
in a multiple sequence alignment [344]. The underlying assumption is that the effect of
one mutation is compensated by the other, and hence the protein-protein interface
remains functional. This method is generally used to determine structurally
important residues, not only between proteins but also within a single protein molecule
(please see Sect. 3.3). It was shown to be useful when applied to membrane
protein interface predictions [342, 345]. The subtractive correlated mutation method
(SCM) can be used for membrane dimers formed by paralogs [346]. A very recent
method, structure-based CMA (SCMA), combines protein structural information and
co-evolutionary information [347] and overcomes the low signal-to-noise ratio, a
well-known disadvantage of CMA that had also been addressed earlier [348].
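
A minimal sketch of the idea of detecting co-varying alignment columns is given below. Note the assumption: mutual information is used here as a simple stand-in covariation score; it is not the scoring function of the CMA, SCM or SCMA methods cited above, and the toy sequences and threshold are invented.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )

def correlated_pairs(sequences, threshold):
    """Return (i, j, score) for column pairs whose score reaches the threshold."""
    columns = list(zip(*sequences))
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            score = mutual_information(columns[i], columns[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

# Toy alignment: columns 1 and 3 always change together (K with E, R with D).
sequences = ["AKLE", "AKLE", "ARLD", "ARLD"]
pairs = correlated_pairs(sequences, threshold=0.5)
```

With so few sequences the score is dominated by noise in any realistic setting, which illustrates the low signal-to-noise problem mentioned above.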
Each method has its strengths and weaknesses. Therefore, to avoid a distorted
view and gain a predictive edge, it is advisable to use both structure- and
sequence-based methods. The importance of careful selection of the input data should
never be underestimated, since the computer only processes what it is given and the
onus is on the researcher to produce meaningful results.

8 Introduction to Implicit Solvent Methods

The environment has a great impact on the properties and function of biomolecules. For
proper modeling of, e.g., proteins, one has to simulate all the necessary surroundings,
mainly water and/or lipids or, more generally, the solvent. However, the number of
solvent atoms is at least one order of magnitude larger than that of the molecule of
interest. Consequently, most of the computer resources in all-atom explicit
simulations are devoted to solvent-solvent interactions.

8.1 Theory

Implicit solvent models usually assume that the examined part of a system is treated
with the full-atom description, whereas the solvent is represented as a continuous
medium with properties that reflect the real, but only average, qualities of the
environment (usually water). This transformation leads to an additional energy term,
the free energy of solvation, which stands for all the effects that the solvent has on
the solute and is thermodynamically represented by the change in free energy when the
molecule is transferred from vacuum to solvent.
Here we present only a very basic theory of implicit solvation; for a more detailed
description please see Roux [349]. In general, the energy of a solvent-solute system
depends on the configuration of the solute (coordinate vector X) and of the solvent
(coordinate vector Y):

$$U(\mathbf{X}, \mathbf{Y}) = U_P(\mathbf{X}) + U_S(\mathbf{Y}) + U_{PS}(\mathbf{X}, \mathbf{Y}) \tag{1}$$

where $U_P$ denotes the internal energy of the solute, $U_S$ is the solvent energy, and
$U_{PS}$ describes the interactions between solvent and solute. The probability of a
given microstate is governed by a function that depends on both the X and Y
configurations. The basic formulation of implicit solvation relies on the so-called
reduced probability, which depends only on the solute configuration X, because the
solvent degrees of freedom have been integrated out and thus averaged. This idea
allows one to introduce an effective energy function, the potential of mean force,
$W(\mathbf{X})$:

$$W(\mathbf{X}) = U_P(\mathbf{X}) + G_{slv}(\mathbf{X}) \tag{2}$$

where $G_{slv}(\mathbf{X})$ is the solvation term, i.e. the averaged influence of the
solvent on the solute at a fixed configuration X. One can decompose the free energy of
solvation $G_{slv}(\mathbf{X})$ into two terms: the nonpolar solvation contribution
$G_{slv}^{np}(\mathbf{X})$ and the electrostatic contribution
$G_{slv}^{elec}(\mathbf{X})$. The latter is mainly the electrostatic potential acting
on the charges of the molecule from the polarized solvent and is commonly called the
reaction field. The nonpolar term is mainly governed by the work needed to displace
solvent molecules from the space occupied by the solute and is commonly called the
cavity formation term. The calculation of $G_{slv}^{np}(\mathbf{X})$ and
$G_{slv}^{elec}(\mathbf{X})$ depends on the specific method for implicit solvation;
these methods can be divided into two groups: those based on continuum electrostatics
and semi-empirical ones. Here we describe the methods that are suitable for molecular
dynamics, i.e. those with analytical derivatives, which allow the forces acting on the
system to be calculated.
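
As a hedged illustration of the reduced-probability idea behind Eq. (2), the sketch below integrates out a toy two-state "solvent" numerically. The energy functions, the two solvent states and the value of kT are assumptions made for this example only; they do not come from any of the cited models. The solvation term is taken relative to the pure solvent, so that it vanishes when solute and solvent do not interact.

```python
import math

KT = 2.494  # kJ/mol, roughly k_B * T at 300 K

def u_solute(x):
    """Toy internal energy of the solute, U_P(X)."""
    return 0.5 * x * x

def u_solvent(y):
    """Toy solvent energy, U_S(Y)."""
    return 0.0

def u_coupling(x, y):
    """Toy solute-solvent interaction, U_PS(X, Y)."""
    return -2.0 * x * y

def free_energy(energy, states, kT=KT):
    """-kT ln of the Boltzmann sum over the discrete solvent states."""
    return -kT * math.log(sum(math.exp(-energy(y) / kT) for y in states))

def solvation_term(x, states, kT=KT):
    """G_slv(X): solvent averaged at fixed X, relative to the pure solvent."""
    coupled = free_energy(lambda y: u_solvent(y) + u_coupling(x, y), states, kT)
    pure = free_energy(u_solvent, states, kT)
    return coupled - pure

def potential_of_mean_force(x, states, kT=KT):
    """W(X) = U_P(X) + G_slv(X), as in Eq. (2)."""
    return u_solute(x) + solvation_term(x, states, kT)

solvent_states = (-1.0, 1.0)  # e.g. two orientations of a solvent dipole
```

For this toy model the sum can be done by hand, giving a solvation term of -kT ln cosh(2x/kT), which is a convenient check of the numerical result.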
Methods based on electrostatics assume that the solute charges reside in a
low-dielectric cavity immersed in a continuous dielectric environment (the solvent).
Therefore, calculations of $G_{slv}^{elec}(\mathbf{X})$ are based on the
Poisson-Boltzmann equation [350, 351], a differential equation derived from Maxwell's
laws that describes the electrostatic potential for a given charge distribution. The
solution to this equation strongly depends on the geometrical factors of the solute,
e.g. the charge distribution or cavity shape. In practice, the solution has an
analytical form only for very basic, symmetrical problems. It can be solved
numerically, but the high computational cost limits its application to stationary
problems, where the solute position is fixed, making it impractical for molecular
dynamics. To overcome these limitations, semi-analytical approximations have been
developed, of which the generalized Born (GB) formalism is the most commonly used
[352]. GB methods estimate $G_{slv}^{elec}(\mathbf{X})$ as a pairwise sum over all
interacting charges with so-called effective Born radii.
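
The pairwise GB sum can be sketched as below. The functional form used here is the widely used expression of Still and co-workers, which is an assumption on our part since the text above does not give a formula; the effective Born radii are taken as input (computing them is the difficult part of any real GB method), and the units (nm, kJ/mol, elementary charges) and dielectric constants are illustrative choices.

```python
import math

COULOMB = 138.935  # 1/(4*pi*eps0) in kJ mol^-1 nm e^-2

def gb_energy(charges, coords, born_radii, eps_in=1.0, eps_out=80.0):
    """Electrostatic solvation energy as a pairwise generalized Born sum.

    The double loop includes the i == j self terms, which reduce to the
    Born energy of an isolated ion.
    """
    prefactor = -0.5 * COULOMB * (1.0 / eps_in - 1.0 / eps_out)
    total = 0.0
    for qi, xi, ri in zip(charges, coords, born_radii):
        for qj, xj, rj in zip(charges, coords, born_radii):
            r2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            # Interpolates between the Born (r -> 0) and Coulomb (large r) limits.
            f_gb = math.sqrt(r2 + ri * rj * math.exp(-r2 / (4.0 * ri * rj)))
            total += qi * qj / f_gb
    return prefactor * total

# A single monovalent ion with an assumed effective Born radius of 0.2 nm.
single_ion = gb_energy([1.0], [(0.0, 0.0, 0.0)], [0.2])
```

Because every term is an analytical function of the coordinates, forces can be obtained by differentiation, which is what makes this class of models usable in molecular dynamics.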
The complete calculation of the solvation term $G_{slv}(\mathbf{X})$ also requires an
estimate of the nonpolar, entropic solvation effects, $G_{slv}^{np}(\mathbf{X})$. This
is achieved by introducing a solvent-accessible surface area (SASA) potential:

$$G_{slv}^{np}(\mathbf{X}) = \lambda S(\mathbf{X}) \tag{3}$$

where $S(\mathbf{X})$ is the surface area of the solute and λ has the interpretation
of a surface tension; λ is adjusted phenomenologically so that one obtains correct
solvation free energies for simple molecules in water, such as alkanes [353]. GB
methods combined with SASA are commonly named GBSA methods and have many variations
[354–356].
Another approach is used in semi-empirical methods like EEF1, which is based on solvent
exclusion functions [357]. The main idea is to take reference parameters for
small model molecules and extrapolate them to bigger systems, like proteins. Hence,
the solvation term is calculated by a combination of experimental knowledge and
theoretical considerations. It is based on reference solvation parameters, $G_{ref}$
(the solvation free energy of a reference molecule), and takes into account the burial
of the group:

$$G_{slv} = G_{ref} - \int f(r)\,dr \tag{4}$$

where the integral is a correction to the solvation because of the presence of
additional surrounding groups. The function f(r) has the interpretation of a solvation
free energy density and varies with the particular atom type (e.g. with its van der
Waals radius). Since solvent electrostatic screening is not explicitly included in the
solvent-exclusion model, a distance-dependent dielectric constant of the form ε = r is
used.
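
The effect of a distance-dependent dielectric can be illustrated with a few lines of code. The unit convention is an assumption of this sketch: distances in Å, charges in elementary charges, energies in kcal/mol, with ε numerically equal to r in Å and a Coulomb constant of about 332.06 kcal mol^-1 Å e^-2.

```python
COULOMB_KCAL = 332.06  # approximate 1/(4*pi*eps0) in kcal mol^-1 Angstrom e^-2

def coulomb_energy(qi, qj, r, dielectric=lambda r: r):
    """Coulomb energy of two point charges separated by r (Angstrom).

    By default the dielectric is distance dependent, eps(r) = r, so the
    interaction decays as 1/r^2 instead of 1/r.
    """
    return COULOMB_KCAL * qi * qj / (dielectric(r) * r)

# Opposite unit charges at 4 Angstrom with eps = r and with a constant eps = 1.
screened = coulomb_energy(1.0, -1.0, 4.0)
unscreened = coulomb_energy(1.0, -1.0, 4.0, dielectric=lambda r: 1.0)
```

The screened interaction is weaker than the vacuum one by a factor equal to the distance in Å, which crudely mimics the damping of long-range electrostatics by the solvent.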
The idea of the implicit membrane (IM) emerged when implicit solvent was used
to study membrane proteins. In this approach the membrane is represented as an
infinite planar slab of a given thickness, with features that differ from those of
the solvent (water) and reflect the real parameters of the membrane. It is usually
centered at the origin, for example in the XY plane. In GB methods it is a
low-dielectric slab whose dielectric constant can vary to mimic the hydrophobic core
of the membrane (e.g. ε = 1), the interface (ε = 8), and the bulk solvent (ε = 80)
[358, 359]. The implicit membrane in the EEF1 method is modeled by applying additional
reference solvation terms [360]. It is assumed that the interior of the membrane is a
nonpolar solvent (e.g. cyclohexane) and that a smooth transition occurs near the
bilayer interface, so that beyond the membrane border the pure solvent is restored.
Hence, the reference solvation energy depends on the absolute position, and a simple
switching function ensures the transition between the interior of the membrane and
pure water. To strengthen the electrostatic interactions, the properties of the
solvent, i.e. the solvation free energy as well as the dielectric constant, can be
changed continuously in the direction perpendicular to the membrane (Fig. 7).
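
A position-dependent reference solvation energy of this kind can be sketched with a smooth switching function along the membrane normal. The sigmoidal form, the exponent n = 10, the membrane thickness and the two reference values below are illustrative assumptions of this sketch, not parameters quoted from the text.

```python
def switch(z, thickness, n=10):
    """Smooth switching function along the membrane normal.

    Returns 0 at the membrane centre (z = 0), 0.5 at the membrane border
    (|z| = thickness / 2) and approaches 1 in bulk water.
    """
    zp = abs(z) / (thickness / 2.0)
    return zp ** n / (1.0 + zp ** n)

def reference_solvation(z, g_water, g_membrane, thickness, n=10):
    """Interpolate a group's reference solvation energy between the
    membrane interior (nonpolar solvent) and pure water."""
    f = switch(z, thickness, n)
    return f * g_water + (1.0 - f) * g_membrane

# Hypothetical polar group: favourable in water, less so in the membrane core.
thickness = 26.0  # Angstrom, assumed hydrophobic thickness
profile = [reference_solvation(z, -5.0, -1.0, thickness) for z in (0.0, 13.0, 40.0)]
```

A larger exponent n makes the transition at the membrane border sharper, while a smaller one spreads the interface over a wider region.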

8.2 Applications to Membrane Proteins

For a newly evaluated force field, one of the most fundamental features is its ability
to recognize the native fold of a protein. To test this ability one can consider two
proteins of the same length but with different sequences and folds. Next, the
3D structure of each is transformed into that of the other to obtain so-called decoys,
i.e. known folds carrying a different sequence. Implicit methods were able to
discriminate between natively folded proteins and decoys based on an energy function
including implicit solvation [361, 362]. The implicit solvent methods are also
employed for protein structure prediction [363] and for ligand docking [364] in Rosetta.

Fig. 7 An implicit solvent method IMM1. a A continuous change of solvation potential in a water-
membrane system. b A rhodopsin simulated in implicit membrane environment. Red surfaces denote
pure hydrophobic part of the membrane, blue surfaces denote bulk water areas

Some basic considerations about the influence of biological membranes on protein
structure and conformational changes were discussed by Im et al. [365]. They examined
three small membrane proteins: melittin, the transmembrane domain of the M2
protein from influenza A (M2-TMP) and the transmembrane domain of glycophorin A
(GpA) with a newly developed implicit membrane model, GB/SA. One of the most
interesting experiments concerned the GpA protein. Starting from two separated
helices in the membrane system, they were able to reproduce the NMR structure of the
GpA dimer and its GxxxG interface with an RMSD as low as 1.2 Å.
The same authors also studied the problem of membrane protein folding [366].
Five artificially designed peptides (WALP16, WALP19, WALP23, TMX-1, TMX-3)
were simulated with replica-exchange molecular dynamics (REX-MD) in the
GB/SA implicit membrane model. The initial configuration was an extended conformation
placed about 30 Å away from the membrane. Four peptides, all the WALPs and
TMX-1, acquired most of their α-helical structure at the membrane surface before
they were able to fully penetrate the bilayer. Only TMX-3 did not insert but
fluctuated at the interface with low helical content. These observations led to the
conclusion that spontaneous peptide insertion requires a very high content of
secondary structure.
The membrane protein folding problem has also been examined by Ulmschneider
et al. [367]. The transmembrane part of virus protein U (Vpu) was subjected to
several Monte Carlo folding simulations, and it was shown that the folded structures
converged to the one obtained in an NMR study. An interesting advantage of implicit
solvation is the straightforward evaluation of free energy. Here, the authors
investigated the free energy landscape of protein insertion into the membrane and the
role of charged terminal residues in insertion profiles. The dependence of
$G_{slv}(\mathbf{X})$ on position and tilt angle was checked both for a charged
N-terminus and for one capped with a neutral methyl group. It was found that the lack
of a charged residue at the N-terminus lowers the energy barrier and could result in
the peptide leaving the membrane. Additionally, they were able to reproduce the
so-called hydrophobic mismatch effect, i.e. an increase of helix tilt with
decreasing hydrophobic thickness of the membrane.
An extension of the IMM1 model is discussed by Mottamal and Lazaridis [368]. They
showed that a transmembrane voltage correction has a great impact on the optimal
orientation of alamethicin helices in membranes. Without the transmembrane potential
the protein lies roughly parallel to the membrane and stays at the interface,
whereas the TM voltage compels the protein to adopt a more perpendicular,
transmembrane orientation.
One of the biggest advantages of MD is that it provides insight into real protein
dynamics. In addition, the IM method allows several independent trajectories to be
obtained. These facts were employed to explore unfolding pathways and stability
during a simulated atomic force microscopy experiment on bacterioopsin [369]. The
authors applied an external force to the C-terminus of bacterioopsin and pulled it
with constant velocity (SMD, steered molecular dynamics) or constant force (CFMD,
constant force molecular dynamics) along the direction perpendicular to the membrane.
The force-distance profiles obtained with SMD simulations were in very good agreement
with the AFM experiment: the number of main peaks, their relative heights and the
distances between the maxima show significant similarity to the AFM studies. This
suggests that the unfolding mechanisms in SMD and AFM are also similar (although the
pulling velocities differ greatly, by about five orders of magnitude). However,
molecular dynamics allows examination at the molecular level with atomic resolution,
so the authors could interpret the AFM force peaks and correlate them with structural
changes during unfolding. Among other things, they explained the origin of the highest
resistance: threading and flipping helix F through the bundle of the other helices in
the membrane.
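
The two pulling protocols can be contrasted with a short sketch. It assumes the usual harmonic-restraint form of constant-velocity pulling, in which a virtual spring of stiffness k is moved at velocity v; the spring constant, velocity and trajectory values are hypothetical and do not come from the cited study.

```python
def smd_force(k, v, t, x, x0=0.0):
    """Force exerted by a spring whose anchor moves at constant velocity.

    k: spring constant, v: pulling velocity, t: elapsed time,
    x: current position of the pulled atom along the pulling direction,
    x0: initial anchor position.
    """
    return k * (x0 + v * t - x)

def force_profile(k, v, dt, positions):
    """Force at each saved frame of a constant-velocity pulling trajectory."""
    return [smd_force(k, v, step * dt, x) for step, x in enumerate(positions)]

def cfmd_force(f_const):
    """In constant-force pulling the applied force does not depend on position."""
    return f_const

# Hypothetical trajectory: the atom lags behind the anchor, then catches up.
positions = [0.0, 0.02, 0.05, 0.25]
forces = force_profile(k=100.0, v=0.01, dt=10.0, positions=positions)
```

In the constant-velocity protocol, force peaks appear while the pulled group resists and the spring stretches, and the force drops when a structural element yields; this is the kind of force-distance pattern compared with AFM above.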
Another successful use of the IM method is the paper of Park et al. [370], where the
effects of palmitylation were investigated both experimentally and computationally.
Palmitate-deficient rhodopsin was examined to study the molecular interactions that
stabilize its structure. The palmitate had the biggest impact on the stability of the
small helix H8, which is believed to mediate transducin activation. Indeed,
experiments show that the activation rate drops significantly in the absence of
palmitylation.
Although implicit membrane methods are still under development and their usage
is limited, the growing number of known crystal structures of membrane proteins
allows for interesting validity tests and applications. In particular, the methods
show great potential in modeling protein-protein interactions. They were able to
reproduce known protein features and interactions (e.g. the GxxxG interface) without
any constraints. They allow fast and reliable energy evaluation, which is extremely
useful for creating free energy landscapes and for gaining basic knowledge of protein
folding with all-atom resolution of the peptide chain. Moreover, the IM methods make
it possible to run hundreds of individual, independent in silico experiments. Such
experiments do not need massively parallel computers to reach biologically relevant
timescales. The absence of periodic boundary conditions, of artifacts arising from a
finite simulation box, and of complicated calculations of crystal electrostatics (e.g.
Ewald summation [371]) makes implicit methods much easier to set up than standard
all-atom molecular dynamics. Recent improvements, like the inclusion of the membrane
dipole potential, make implicit methods more detailed and reliable [372].
Of course, there are still fields where implicit solvation fails. Besides the obvious
applications where solvent-mediated interactions are important, the IM methods are
questionable when one wants to examine protein interactions with the water/membrane
interface. The lack of data on the exact properties of water molecules in the vicinity
of the lipid head groups makes them very hard to incorporate into the present models.
It is also not possible to simulate membrane water channels or to mimic membrane
deformations caused by proteins. These models usually do not include friction terms;
however, this problem may be overcome by solving the Langevin equation of motion
[373]. Finally, the new generation of mixed implicit/explicit methods could overcome
the present difficulties [374, 375].

9 Coarse-Grain Methods

All-atom molecular dynamics simulation of membrane protein
systems is still, despite the progress made in computer hardware and the introduction
of improved atomistic simulation algorithms, computationally very demanding.
Although the method was initially restricted to small and simple protein systems,
it is currently used to simulate even very large objects like ribosomes [376], micelles
[377] or models of viruses [378]. Atomistic simulations provide valuable and detailed
information about the local structural properties of lipids inside membranes and of
proteins. Yet they cannot access the time and length scales required to observe
collective membrane phenomena or large protein oligomers, which take place at and
beyond the millisecond and micrometer scales. Some of the drawbacks of all-atom
MD can be overcome with more powerful and specialized supercomputers,
yet some biological processes we would like to investigate are still beyond the scope
of that method.

9.1 The Idea and History of Coarse-Graining

Although the coarse-grained strategy has recently become very popular and many
researchers have begun to rely on coarse-grained simulations of large biomolecular
systems, it was developed many years ago. The main reason to use coarse-grained
modeling is that it provides a significant speed-up compared with
classical all-atom molecular dynamics simulations. Coarse-grained simulation
allows the investigation of large biological systems by using simplified but
reasonable models that are able to reproduce experimental data. The idea behind
coarse-grained methods is to represent a group of atoms as one united bead and to use
a longer time step, which enables researchers to study the behavior of the system over
longer periods of time.
From the very beginning of MD, scientists have thought about simplified representations
of the investigated systems, and of proteins in particular. The first step toward
building a transferable coarse-grained model was made by Levitt, who reported a
knowledge-based parameterization [379]. Probably the earliest example of the CG idea
in biology was the development of the simplified protein folding model by Levitt and
Warshel [380]. The process of protein folding presents an enormous challenge in light
of the Levinthal paradox [381], which states that it is close to impossible to
rationalize how a protein with so many degrees of freedom is capable of folding within
any reasonable timescale. In 1975 Levitt and Warshel [380], being aware of the fact
that even a minor energy minimization of an all-atom protein takes an extremely long
time, attacked the Levinthal paradox by moving to a drastic simplification of the
protein representation while retaining the main functionality of the system. The much
simpler and less physical Go model [382–385] was also developed at that time.
The first idea of reducing the amino acid representation by grouping atoms into
a bead called a united atom or pseudo-atom was based on the uniaxial Gay-Berne
model [386, 387]. The united-atom approach was further improved by grouping each
carbon with its bonded hydrogen atoms into one united atom [388]. More precisely, an
aliphatic carbon atom and its attached hydrogen atoms were represented as one bead.
The united-atom representation is widely used because it is computationally efficient
and provides results in reasonable agreement with available experimental data. The
idea of united atoms was further extended by coarse-grained force fields in which
several heavy atoms are mapped onto one bead. Coarse-grained force fields are
available for commonly used MD programs. Even though they share the same idea,
they differ in details. In this work we compare popular coarse-grained models used
in GROMACS [270] and NAMD [389], two MD program suites commonly used in
research studies.
In many coarse-grained methods, in which an implicit solvent is used instead of
explicit water molecules and ions, this simplification reduces the size of the system
by one order of magnitude. Representing each amino acid, containing on average 20
atoms, by two beads reduces the number of particles in proteins by a factor of 10. For
large systems, the cost of calculating forces scales with the square of the number
of particles, so the acceleration may reach as much as two orders of magnitude.
The second source of speed-up is the integration time step, which depends
on the fastest motions of the protein; these are about 10 times slower in the
coarse-grained representation than in the all-atom model, so the integration time step
can be proportionally larger. Another source of speed-up is that the
energy landscape is much smoother, with fewer of the local energy minima
that are present in all-atom molecular dynamics. The above assessment of the
possible speed-up is very simplified; the actual gain depends on the applied
coarse-grained method and the investigated system.
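
The back-of-the-envelope estimate above can be written out explicitly. Multiplying the force-evaluation gain by the time-step gain is our own simplification and gives an optimistic upper bound; the default numbers are the ones quoted in the text, and the quadratic scaling ignores cutoffs and neighbor lists used in practice.

```python
def cg_speedup(atoms_per_residue=20, beads_per_residue=2,
               timestep_ratio=10, force_scaling_exponent=2):
    """Naive estimate of the coarse-graining speed-up.

    Returns (particle reduction, force-evaluation speed-up, combined speed-up),
    where the combined value also accounts for the longer time step.
    """
    particle_reduction = atoms_per_residue / beads_per_residue
    force_speedup = particle_reduction ** force_scaling_exponent
    combined = force_speedup * timestep_ratio
    return particle_reduction, force_speedup, combined

reduction, force_gain, total_gain = cg_speedup()
```

With the default values this gives a 10-fold reduction in particle number, a 100-fold cheaper force evaluation and, naively, a 1000-fold gain in simulated time per unit of computing effort.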

9.2 Two Ways to Derive the Coarse-Grain Potentials

Interesting surveys of coarse-grained models of proteins can be found in [390–392],
and one focused entirely on membrane proteins in [393]. The coarse-grained models
of proteins available at present can be divided into two categories based on their
treatment of nonbonded interactions. In the first category, those interactions rely
on an initial (e.g. crystal) structure of the protein, i.e. the initial structure of
the investigated molecule is used in defining the potential of
the system. Such models are widely applied to study the functional dynamics of larger
biomolecules.
The nonbonded interactions of coarse-grained models belonging to the second
category are defined in a similar way as in molecular mechanics force fields.
The initial structure is not considered in the definition of the interactions in the
system. These models are directly or indirectly based on physicochemical interactions.

9.2.1 The ENM and Go-like Models

The elastic network models (ENM) and Go-like models belong to
the first of the categories introduced above and have a very strong structure-based
bias. In the ENM approach the system is represented by a network of beads connected
by harmonic springs. These connections are introduced between beads that are spatially
close to each other in the native structure. Usually one bead represents a whole amino
acid. Despite its simplicity, ENM was able to reproduce the correct pattern of the
principal modes (those with the largest amplitude), which are usually the most
important for protein function. This method was applied to study the mechanism of
pore opening in five different potassium channels [394]. The study revealed that all
five structures display a common gating mechanism and the same intrinsic motions
at their gating region despite differences in their sequences, structures, and
activation mechanisms. The equilibrium dynamics of these five potassium channels were
found to obey similar patterns on a global scale.
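
The construction of an elastic network can be sketched in a few lines. The cutoff distance, the uniform spring constant and the three-bead toy structure are assumptions of this illustration; a real application would use one bead per residue of an experimental structure and would go on to diagonalize the Hessian to obtain the principal modes.

```python
import math

def build_network(native_coords, cutoff):
    """Connect every pair of beads closer than the cutoff in the native
    structure. Returns (i, j, native_distance) tuples."""
    springs = []
    for i in range(len(native_coords)):
        for j in range(i + 1, len(native_coords)):
            d0 = math.dist(native_coords[i], native_coords[j])
            if d0 <= cutoff:
                springs.append((i, j, d0))
    return springs

def enm_energy(coords, springs, k=1.0):
    """Harmonic energy of a conformation relative to the native network."""
    return 0.5 * k * sum(
        (math.dist(coords[i], coords[j]) - d0) ** 2 for i, j, d0 in springs
    )

# Toy "native structure": three beads on a line, 3.8 Angstrom apart.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
springs = build_network(native, cutoff=5.0)
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
```

By construction the native structure is the global energy minimum, which is the structure-based bias mentioned above.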
The Go model was developed by Taketomi et al. in Go's group and published in
1975 [382]; it was later improved and modified [383–385]. In this model a
protein is represented as a chain of beads, each of which represents one amino
acid. The protein structure is biased toward the native conformation by means of simple
attractive and repulsive nonbonded interactions between beads, represented by the
Lennard-Jones potential. Despite its simplicity, this approach was very successful in
reproducing several aspects of the thermodynamics and especially the kinetics of
folding. This is because an inherent feature of the original Go model is that the
system is minimally frustrated, so it can reproduce the folding process of many
proteins. There is a large variety of Go models with many modifications, e.g.
additional energy terms decreasing the frustration of the system.
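
A minimal sketch of a Go-like nonbonded energy is shown below. It assumes one common variant: native contacts interact through a 12-6 Lennard-Jones term whose minimum sits at the native distance, and all other pairs are purely repulsive. The well depth, the repulsive radius and the minimum sequence separation are illustrative choices, not parameters of the cited models.

```python
import math

def go_energy(coords, native_contacts, eps=1.0, sigma_rep=4.0, min_separation=3):
    """Nonbonded energy of a Go-like model with one bead per residue.

    native_contacts: dict mapping (i, j), i < j, to the native distance.
    Pairs closer in sequence than min_separation are skipped.
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            r = math.dist(coords[i], coords[j])
            if (i, j) in native_contacts:
                # Place the LJ minimum (2^(1/6) * sigma) at the native distance.
                sigma = native_contacts[(i, j)] / 2 ** (1 / 6)
                sr6 = (sigma / r) ** 6
                energy += 4.0 * eps * (sr6 * sr6 - sr6)
            else:
                # Non-native pairs only repel each other.
                energy += eps * (sigma_rep / r) ** 12
    return energy

# Four-bead toy chain in which beads 0 and 3 form a native contact at 5 Angstrom.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.0, 4.0, 0.0)]
native_energy = go_energy(coords, {(0, 3): 5.0})
```

Since only native contacts are attractive, every energetically favourable interaction points toward the native state, which is what minimal frustration means in practice.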
The Go-like model was applied to investigate the pulling of a single bacteriorhodopsin
molecule out of the membrane [395]. First, an all-atom representation
of the bacteriorhodopsin-membrane system was generated. Second, a
Go-like model representation of the protein conformation was constructed. The
membrane was kept frozen and represented by C atoms of the phospholipids. Additionally,
it was determined which of those carbon atoms form contacts in the starting
conformation. Those interactions were represented in the same way as the non-local
native interactions within the protein, namely by the Lennard-Jones potential. The
model qualitatively reproduced the experimentally observed
differences between force-extension patterns obtained for bacteriorhodopsin at
different temperatures. Moreover, an asymmetry was observed when pulling by different
termini. The authors also showed that the interactions of the protein with the
membrane play the decisive role in determining the force pattern and thus the stability
of transmembrane proteins.
A different approach to investigating a protein-membrane system with a Go model
is presented by Orlandini et al. [396]. The authors studied the insertion into a
membrane and the folding kinetics of a two-helix fragment of bacteriorhodopsin. The
membrane was introduced as a slab, i.e. a defined region of space. The native contacts
were divided into different classes depending on the location of the residues
comprising a given contact with respect to the membrane position. This model allowed
for the characterization of the thermodynamics and dynamics of the protein folding
process. The authors identified various intermediates and the free energy barriers
between them, and the folding process was predicted to involve many pathways with a
dominant folding channel.

9.2.2 Molecular Mechanics-Like Coarse-Grain Models

Among the models belonging to the second category, the MARTINI force field,
initially developed for coarse-grained simulations of lipids [397–399], currently
receives the most attention. The MARTINI potential for proteins is mainly based
on physicochemical modeling with a weak bias toward the native structure, mostly
through secondary structure constraints. The MARTINI force field was constructed by
extensive calibration of the coarse-grained force field for peptide-bilayer systems
against thermodynamic data, in particular oil/water partitioning
coefficients. In this model, four heavy atoms on average are represented by
one interaction site (bead), and water is represented in the same way. Each bead is
assigned to one of four main types: polar, nonpolar, apolar, or charged. Within each
type there are different subtypes that describe more detailed features of the
interacting sites (like hydrogen bonding capabilities or degree of polarity). Beads
(i, j) interact with each other similarly to atoms in all-atom force fields. The
nonbonded potential
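involves the Lennard-Jones potential:

$$V_{\text{Lennard-Jones}}(r_{ij}) = 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] \tag{5}$$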

The energy parameter ε, which determines the depth of the potential well, depends on
the bead types and varies between 2.0 and 5.6 kJ/mol. All particles have the effective
size σ equal to 0.47 nm, apart from the beads in ring-like molecules (σ =
0.43 nm). Electrostatic interactions between charged beads are incorporated via the
Coulombic potential with an appropriately adjusted relative dielectric constant
($\varepsilon_{rel} = 15$):

$$V_{\text{electrostatic}} = \frac{q_i q_j}{4\pi\varepsilon_0\varepsilon_{rel} r_{ij}} \tag{6}$$
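
Eqs. (5) and (6) translate directly into code. The sketch below uses the parameter values quoted in the text (σ = 0.47 nm, ε between 2.0 and 5.6 kJ/mol, relative dielectric constant 15); the numerical Coulomb constant and the units (nm, kJ/mol, elementary charges) are our assumptions, and the cutoffs and shifting functions applied in production simulations are deliberately left out.

```python
COULOMB = 138.935  # 1/(4*pi*eps0) in kJ mol^-1 nm e^-2

def lennard_jones(r, epsilon, sigma=0.47):
    """Eq. (5): Lennard-Jones energy (kJ/mol) of two beads at distance r (nm)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, qi, qj, eps_rel=15.0):
    """Eq. (6): screened Coulomb energy (kJ/mol) of two charged beads."""
    return COULOMB * qi * qj / (eps_rel * r)

# Strongest bead-bead attraction quoted in the text, evaluated at its minimum.
r_min = 2 ** (1 / 6) * 0.47
well_depth = lennard_jones(r_min, epsilon=5.6)
ion_pair = coulomb(1.0, 1.0, -1.0)
```

The large relative dielectric constant compensates for the absence of explicit partial charges on the coarse-grained water beads, so charged beads attract each other far more weakly than they would in vacuum.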

Bonded interactions are used for chemically bonded sites, to represent chain stiff-
ness, and to impose secondary structure of the peptide backbone. Potential energy
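functions for bonded sites i, j, k and l with the equilibrium distance $d_b$, angle
$\varphi_a$ and dihedral angles $\psi_d$ and $\psi_{id}$ have the following forms:

$$V_b = \frac{1}{2} K_b \left(d_{ij} - d_b\right)^2 \tag{7}$$

$$V_a = \frac{1}{2} K_a \left[\cos(\varphi_{ijk}) - \cos(\varphi_a)\right]^2 \tag{8}$$

$$V_d = K_d \left[1 + \cos(n\psi_{ijkl} - \psi_d)\right] \tag{9}$$

$$V_{id} = K_{id} \left(\psi_{ijkl} - \psi_{id}\right)^2 \tag{10}$$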

The potential $V_b$ describes chemically bonded beads, $V_a$ is an angle
potential imposing chain stiffness, the improper dihedral potential $V_{id}$ prevents
out-of-plane distortions of planar groups, and the proper dihedral potential $V_d$
imposes the secondary structure of the protein chain. The authors underline that, because
the proper dihedral potential is built into the model definition, conformational changes
of protein secondary structure are not adequately modeled. The coarse-grained
representation used in MARTINI is shown in Fig. 8.
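The functional forms of Eqs. (5)–(10) can be collected in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names are illustrative, lengths are taken in nm, energies in kJ/mol and angles in radians, the Coulomb prefactor is the usual value of 1/(4πε0) in these MD units rather than a number quoted in the text, and any force constants passed in are placeholders, not published MARTINI parameters.

```python
import math

# 1/(4*pi*eps0) in kJ mol^-1 nm e^-2 (standard MD-unit value; an assumption
# of this sketch, not a number quoted in the chapter).
COULOMB_CONSTANT = 138.935458


def lennard_jones(r, epsilon, sigma=0.47):
    """Eq. (5): r and sigma in nm, epsilon in kJ/mol."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, qi, qj, eps_rel=15.0):
    """Eq. (6): charges in units of e, r in nm, screened by eps_rel."""
    return COULOMB_CONSTANT * qi * qj / (eps_rel * r)


def bond(d, k_b, d_b):
    """Eq. (7): harmonic bond around the equilibrium distance d_b."""
    return 0.5 * k_b * (d - d_b) ** 2


def angle(phi, k_a, phi_a):
    """Eq. (8): cosine-harmonic angle potential, angles in radians."""
    return 0.5 * k_a * (math.cos(phi) - math.cos(phi_a)) ** 2


def proper_dihedral(psi, k_d, n, psi_d):
    """Eq. (9): periodic dihedral with multiplicity n."""
    return k_d * (1.0 + math.cos(n * psi - psi_d))


def improper_dihedral(psi, k_id, psi_id):
    """Eq. (10): improper dihedral, written without a factor 1/2 as in Eq. (10)."""
    return k_id * (psi - psi_id) ** 2


# The Lennard-Jones minimum lies at r = 2^(1/6) * sigma and has depth -epsilon.
r_min = 2 ** (1 / 6) * 0.47
print(lennard_jones(r_min, epsilon=5.6))  # close to -5.6 kJ/mol
print(coulomb(1.0, +1.0, -1.0))           # about -9.26 kJ/mol at 1 nm
```

The second print statement illustrates how strongly the relative dielectric constant of 15 screens the interaction between two unit charges at 1 nm compared with the unscreened value of about -139 kJ/mol.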
Fig. 8 The MARTINI coarse-graining procedure for membrane components, amino acids and
solvents. New types of potential for grains are specified. Image taken with permission from
http://md.chem.rug.nl/cgmartini/

The MARTINI model has a broad range of applications. On the one hand, this force
field is used to study systems consisting of lipids and surfactants. On the other hand,
it is applied to study transmembrane proteins and their interactions with lipids
and with other proteins in the solvent-lipid environment. In [400] the authors studied
the self-assembly of standard lipid bilayers in the presence of one fukutin transmembrane
domain and simulated that protein in a complex Golgi apparatus membrane
model. In [401] the mechanism of the segregation of transmembrane helices into
disordered lipid domains in model membranes was investigated. The underlying
molecular mechanisms and thermodynamic driving forces of this segregation are not
sufficiently understood. The authors suggested that the driving force for the observed
lipid arrangement is the enthalpic cost associated with the presence of a cylindrical
object (the TM helix) inside the ordered lipid phase. Although synthetic WALP peptides
and the α-helical TM domain of the syntaxin 1A protein were used as generic models,
the proposed mechanism is general and likely to be relevant for protein sorting, also
in vivo. In another example [402], systems with up to 16 rhodopsin molecules at a
protein-to-lipid ratio of 1:100 were simulated for time scales of up to 8 microseconds.
The results obtained for four different phospholipid environments showed that localized
adaptation of the membrane bilayer to the presence of receptors is reproducibly most
pronounced near transmembrane helices 2, 4, and 7 of rhodopsin. That local
membrane deformation appears to be a key factor defining the rate, extent, and
orientation preference of the protein-protein association. Among other protein-membrane
models based on the methods derived for lipids by Marrink et al. [397],
Bond and Sansom [403] explored the interactions of a phospholipid bilayer with
the voltage sensor domain and the S4 helix from the archaebacterial voltage-gated
potassium channel (KvAP).
A simplified MARTINI version was presented in [404]. The authors proposed an
implicit-solvent version called Dry-MARTINI, in which the solvation effect is
introduced only by adjusting the strength of the existing pairwise Lennard-Jones
interactions so as to retain the hydrophobic/hydrophilic behavior of molecules in
standard MARTINI. As a consequence, some bonded parameters were also adapted to keep
the equilibrium values in the studied lipid molecules. The reparametrized model
reproduces the main features of lipid systems observed in standard (wet) MARTINI.
However, Dry-MARTINI does not mimic the aqueous phase realistically enough, which has
an impact on protein interactions in solvent. All nonbonded interactions are attractive
(Lennard-Jones potential), so simulations of soluble proteins would in general lead to
global aggregation of the molecules. The authors, however, suggest the
modifications needed to solve this problem in the future. Moreover, more systematic
testing of peptide-lipid systems is required before applying Dry-MARTINI to study
membrane protein systems.
Shih et al. [405] from Schulten’s group proposed a model that was applied to
simulations of discoidal high-density lipoprotein particles. That model, although based
on the original MARTINI approach [397], differs from the MARTINI protein extension.
Here, each amino acid is represented by only two beads (apart from glycine). The
types of the amino acid side chains were previously defined in the lipid MARTINI
force field. Microsecond simulations of lipoprotein assembly showed that the overall
structural features of high-density lipoproteins were reproduced accurately and
revealed the formation of a protein-lipid complex.
As mentioned above, the MARTINI-like approach imposes a priori
knowledge of the secondary structure on the model. Spijker et al. [406] introduced
a force field that does not incorporate secondary structure information.
This model is an extension of the lipid-water model by Markvoort et al. [407]. Each
amino acid is represented by two sites (one for the backbone and one for the side
chain). For the protein backbone the authors do not use an angle potential in the
harmonic form (as the $V_a$ potential in MARTINI); instead, it is represented by a
double-well potential in the form of a fourth-order polynomial, whose parameters were
derived from MD simulations of two membrane proteins. Torsion terms, described in
MARTINI by the dihedral ($V_d$) and improper ($V_{id}$) potentials, are not present in
this model. Their role of stabilizing the secondary structure of the protein is played
by an additional non-bonded interaction, which mimics the formation of a hydrogen bond
between the i-th and (i + 4)-th backbone beads. The H-bond contribution has the
following form:

$$V_{HB} = -\eta_{ij} \exp\left[-\frac{1}{2}\,\frac{\left(r_{ij} - \mu_{ij}\right)^2}{\kappa_{ij}^2}\right] \qquad (11)$$

where $\mu_{ij}$ is the location of the H-bond minimum, $\kappa_{ij}$ determines the width
of the H-bond well, and $\eta_{ij}$ represents the depth of the H-bond minimum. The authors
used the model in simulations of WALP peptides of different lengths immersed in
lipid membranes of different thickness. The results indicated that, as long as it is
possible, the membrane adapts to the TM helix length. When the membrane thickness
cannot be increased further, the peptides tilt with respect to the membrane normal.
These two responses are not observed simultaneously but sequentially.
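The Gaussian form of the H-bond term in Eq. (11) is easy to probe numerically. The sketch below is an illustration only: the function name and the parameter values (well depth 10 kJ/mol, minimum at 0.6 nm, width 0.05 nm) are hypothetical placeholders chosen for readability and are not the parameters of the force field in [406].

```python
import math


def hbond_energy(r, eta, mu, kappa):
    """Eq. (11): Gaussian well of depth eta, centred at mu, with width kappa."""
    return -eta * math.exp(-0.5 * (r - mu) ** 2 / kappa ** 2)


# Hypothetical parameters, for illustration only (not taken from [406]).
ETA = 10.0    # well depth, kJ/mol
MU = 0.60     # position of the minimum, nm
KAPPA = 0.05  # width of the well, nm

for r in (0.50, 0.55, 0.60, 0.65, 0.70):
    print(f"r = {r:.2f} nm  V_HB = {hbond_energy(r, ETA, MU, KAPPA):8.4f} kJ/mol")
```

At r = μ the energy equals −η, and one width κ away from the minimum it has decayed to −η·exp(−1/2), about 61% of the well depth. The term is therefore short-ranged and only contributes when the i-th and (i + 4)-th beads are close to their helical separation.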
Another coarse-graining approach combines a reduced protein representation
with a fully implicit membrane model. One example is
PRIMO-M [408], which is an extension of PRIMO (PRotein Intermediate
Model) for soluble proteins [409]. To mimic an environment with two phases, the authors
applied the heterogeneous dielectric generalized Born methodology. The PRIMO energy
function consists of standard molecular dynamics energy terms with an additional
hydrogen-bonding potential term. The backbone is represented with N, Cα, and a
combined carbonyl site (CO). The detailed backbone representation, coupled with
preservation of hydrogen bonding, allows an accurate description of the secondary
structure of proteins. Each non-glycine side chain is represented with another CG site.
The PRIMO-M model reproduces such phenomena as the water-to-membrane free
energy of insertion for amino acids and the tilt angles of simulated transmembrane
peptides. This force field also provides trajectories of membrane proteins with
calculated B-factors in agreement with experiment.
Recently, the PRIMO and PRIMO-M models were combined with an all-atom force
field (CHARMM36) within an all-atom/coarse-grained scheme, in a preliminary attempt to
build a hybrid model with the solvent environment treated at the continuum level via the
generalized Born with molecular volume method [410].
The force fields that are commonly used for simulations of the coarse-grained
membrane protein systems are summarized in Table 8.

Table 8 The coarse-grained force fields used for membrane proteins

CG force field         Website                                               References
MARTINI, lipids        http://md.chem.rug.nl/cgmartini/                      [397]
MARTINI, proteins      http://md.chem.rug.nl/cgmartini/                      [398]
Dry-MARTINI, lipids    http://cgmartini.nl/index.php/299-dry-martini-beta    [404]
RBCG                   http://www.ks.uiuc.edu/Research/CG/                   [405]
FREADY                 –                                                     [411]
PRIMO-M                –                                                     [408]

10 Quantum Methods for Membrane Proteins

Due to their large computational requirements and poor scaling, quantum chemistry
(QM) methods are usually not suitable for describing membrane proteins, or proteins
in general. Quantum chemistry is based on converging to the exact solution of
the electronic Schrödinger equation, and while it usually gives very good accuracy, it
is simply not possible to solve this equation for a system of the size of a protein. Still,
QM methods can be very useful and are commonly used for various tasks in the
computational biology/chemistry of proteins; selected examples of such QM treatment
of transmembrane proteins will be given below. The problem of the large computational
cost of QM methods can also be alleviated through various simplification schemes,
which reduce the computational cost and allow large macromolecules, including
proteins, to be treated at an accurate level. Examples of such methods will be given at
the end of this chapter.

10.1 QM Approaches to Retinal Chromophore

The presence and importance of retinal for the activation and action of rhodopsin had
been known for many years before the X-ray structure of this system was obtained in
2000 [339]. Before that date several computational studies were performed to better
understand the chemistry of retinal and the energetics of the cis-to-trans transition. In
1996 Terstegen and Buss performed Hartree-Fock (HF) calculations on three different
retinal conformers and with different protonation states of the N-methyl Schiff base
using the standard 6-31G** basis set [412]. They showed very good agreement
with the experimental data and noticed that protonation is accompanied by the loss
of double-bond fixation. In follow-up articles the authors estimated the energy
minima and transition states of various retinal conformers [413] and also performed
ab initio molecular dynamics [414]. According to their calculations the rotational
barriers around relevant dihedral angles were in the range of 2–5 kcal/mol and ring
inversion barriers in the range of 5–6 kcal/mol, making the whole system labile. Some
of these calculations were repeated in the following year using the density functional
B3LYP method, which gave an improved description of the retinal conformational
space [415]. Another approach towards retinal analysis was presented in a series of
papers by Bifone et al. [416]. They performed Car-Parrinello ab initio molecular
dynamics (CPMD) simulations (using the DFT local density approximation) of all-trans
and 13-cis retinal molecules and showed good agreement with experimental data for the
structure and vibrational modes of this molecule. In all these calculations the protein
part of the system was not included due to computational limitations. In the same year
the first simple model of the rhodopsin chromophore was built based on available NMR
data [417]. Using this very simple model, which included the retinal molecule, a chloride
ion placed in the position of Glu113 and a CH2–CH3 group mimicking the linkage
of the chromophore to Lys296, the authors observed a coherent propagation of a
conjugation defect, which was associated with charge transport along the chromophore
backbone. A year later the same model and approach were used to study the energy storage
mechanism in bathorhodopsin [418]. In the final paper in this series La Penna et al.
used CPMD simulations with an additional external force to obtain information about
the transition state of the 11-cis to all-trans isomerization [419].
Around the same time a series of studies by Garavelli et al. explained the mechanism
of retinal photoisomerization using accurate MC-SCF or CASSCF methods,
though without including the protein environment [420]. This group later continued
the research on photoisomerization of conjugated and protonated imines,
modelling the retinal protonated Schiff base chromophore, using increasingly
sophisticated computational approaches such as multireference configuration interaction
with single and double excitations, multireference second-order perturbation theory,
time-dependent DFT methods and equation-of-motion coupled-cluster methods
[421].
The solution of the first crystal structures of membrane proteins gave rise to a much
more detailed description of the ligand binding sites and much improved calculations.
In a classic paper from 2002 Sugihara et al. [422] used the self-consistent-charge
density functional based tight-binding (SCC-DFTB) method [423] to study the retinal
binding site, which included the retinal molecule and 27 amino acid moieties. Using
structure optimization and MD simulations they were able to investigate the influence
of the protein pocket on the structure of the ligand. They showed that both the 6-s-cis
and 6-s-trans conformations of retinal are tolerated by the binding pocket, and
that the pocket forces the ligand to adopt a slightly distorted conformation.
In the following years similar studies were performed on rhodopsin, but using
various sets of residues from the binding site and various computational methods. To
study rhodopsin chromophore excitation Hufen et al. [424] applied high-level DFT and
ab initio CASSCF/CASPT2 approaches to a model of the binding pocket including
the ligand, two amino acid residues and a water molecule. They obtained good
agreement with the experimental data for the electric dipole moment of the
chromophore upon excitation and showed the importance of using a correlated theoretical
method for a proper description of the protonated Schiff base. Excitation energies of
the protonated Schiff base of retinal were also studied by means of the time-dependent
DFT (TD-DFT) method using a model of the binding site consisting of 23 amino acid
residues and five water molecules, showing good agreement with the experimental
spectral data [425]. In another paper, Sugihara et al. explored the importance of
several counterions of the binding pocket for the stability of the chromophore using a
DFT approach [426]. The performance of various ab initio methods in the description of
retinal was summarized a year later by Blomgren and Larsson [427].

The fast development of new computational methods led to a new set of
publications in which the whole protein was taken into account. This was possible due to
a two-layer description of the system, in which the binding site was simulated using
QM approaches and the rest of the protein was simulated using molecular mechanics
(MM) methods [428]. One such QM/MM method is ONIOM [429], which
was first applied to the rhodopsin system in 2004 by Gascon and Batista [430]. In
this study rhodopsin was divided into an inner layer, consisting of retinal and a part
of Lys296 and treated with the B3LYP/6-31G* and TD-B3LYP/6-31G* methods,
and the rest of the protein, which was simulated using classical MM with the AMBER
force field. The authors of this study obtained very accurate storage energies and
electronic excitation energies for the chromophore, in very good agreement with the
experimental data. A follow-up article using the same method also showed the strength of
the gauge independent atomic orbital (GIAO) method by predicting the NMR
spectrum of the rhodopsin chromophore [431]. Similar QM/MM studies are now routinely
performed for membrane proteins of similar size [432–435] and allow for a precise
description of the pharmacophore interacting with the whole protein, which may be
additionally embedded in the membrane and/or solvent. Some of the approaches used
to study the rhodopsin chromophore are summarized in Fig. 9.
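For orientation, the two-layer ONIOM energy in its basic subtractive form can be written as below. Here "model" denotes the inner layer treated with QM (for example, retinal and a part of Lys296) and "real" denotes the entire system. This is the generic expression of the scheme; practical details such as link atoms at the layer boundary or the embedding variant used in a particular study are not covered by it.

$$E^{\text{ONIOM}} = E^{\text{QM}}_{\text{model}} + E^{\text{MM}}_{\text{real}} - E^{\text{MM}}_{\text{model}}$$

The last term removes the MM description of the inner layer, which would otherwise be counted twice, so that the inner layer is effectively described at the QM level and its interaction with the surroundings at the MM level.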
In recent years the increase in computational power has made it possible to replace
TD-DFT methods with much more accurate CASSCF and CASPT2 schemes in the QM/MM
description of rhodopsins [436]. A thorough description of the history and the most recent
advances in the simulation of double-bond isomerization of biological chromophores is
available in a recent review by Gozem et al. [437].

10.2 QM/MM and Linear-Scaling Methods

The biggest disadvantages of QM/MM methods are the problem of correctly dividing
the system into QM and MM parts, the often questionable description of bonds
spanning the two regions, and the neglect of charge transfer between the regions.
To address these problems linear-scaling methods have been in development over
the last 20 years [438–440]. These methods try to reduce the computational cost
of QM schemes by designing new protocols that scale linearly with system size.
Of the multiple linear-scaling methods available, the most commonly used one is
based on localized molecular orbitals for solving the semiempirical self-consistent field
equations, as implemented in MOZYME [438]. This method currently allows
up to 15,000 atoms to be treated in geometry optimization and 18,000 atoms in
single-point calculations with any semiempirical method, most recently the PM6 method,
which reproduces properties of proteins with good accuracy [441]. The MOZYME engine
allows for all standard types of calculations (including transition state location and
refinement, intrinsic reaction coordinate following and reaction path/grid calculations)
and scales almost linearly with the size of the system up to 10,000 atoms. The
incorrect description of dispersion in the semiempirical PM6 method has been remedied
by the introduction of simple corrections, which resulted in even higher accuracy of
these methods [442, 443].

Fig. 9 Evolution of the structural models of the retinal binding site in rhodopsin used in
classical and hybrid quantum chemical calculations. a A model of the retinal chromophore
[412]. b The model including part of Lys296 [417]. c The model of the chromophore and two
amino acids used to study excitation [424]. d The model used for the study of counterions
in the retinal binding pocket [426]. e An extended retinal binding site model including
27 amino acids [422]. f All-atom rhodopsin model used in the QM/MM approach [430]
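The practical meaning of near-linear scaling can be illustrated with simple arithmetic. The sketch below compares the growth in cost for an idealized linear-scaling method with that of a method whose cost grows with the third power of the system size. The cubic exponent is an assumption standing in for conventional SCF procedures dominated by dense matrix diagonalization, and prefactors, memory limits and the actual MOZYME timings are deliberately ignored.

```python
def relative_cost(n_atoms, n_ref, exponent):
    """Cost of an n_atoms calculation relative to an n_ref one, for cost ~ N**exponent."""
    return (n_atoms / n_ref) ** exponent


N_REF = 1_000  # reference system size in atoms (arbitrary choice for this sketch)

for n in (1_000, 5_000, 10_000):
    cubic = relative_cost(n, N_REF, 3)   # assumed conventional O(N^3) behaviour
    linear = relative_cost(n, N_REF, 1)  # idealized linear scaling
    print(f"{n:>6} atoms: cubic x{cubic:,.0f}, linear x{linear:,.0f}")
```

Under these assumptions, going from 1,000 to 10,000 atoms multiplies the cost by 1,000 for cubic scaling but only by 10 for linear scaling, which is why whole-protein semiempirical calculations become feasible.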

The linear-scaling MOZYME approach has been used in several membrane protein
studies. In 2001 Ren et al. [444] studied microbial sensory rhodopsin II and
optimized the chromophore within its binding site using MOZYME, which allowed
them to identify, using other semiempirical methods, the principal mechanism and
residues responsible for the spectral blue shift in this protein. They showed that their
calculations can reproduce well the experimental observations on the formation of Schiff
bases at various residues. In a study from 2006 Lee et al. [445] used this computational
approach to obtain an all-atom model of a bacteriorhodopsin mutant and the
electrostatic difference map of the whole protein. A recent study by the author of the
PM6 method describes in detail its strengths and disadvantages in protein modeling
[446]. The MOZYME approach can also be combined with other computational methods
within the ONIOM framework; the most commonly used implementation, combining
MOZYME with DFT, was developed in 2001 by Ohno et al. and used for pKa
prediction of various proteins [447], including membrane proteins [448].
A second area of biological system calculations where QM is very important is
the determination of molecular interaction potentials and, more specifically, the
determination of partial charges of ligands. Many membrane proteins interact with
various ligands and form complexes, e.g. drug-receptor systems. The binding of a ligand
occurs via a recognition process at relatively large distances, and the electrostatic
field surrounding each molecule (as well as other molecular features like polarizability
and hydrophobicity) plays an important role in this process. Also, molecular
docking simulations usually need a proper parametrization of ligands, including partial
charges. In most computational approaches the electron distribution in molecules is
mimicked by a set of partial charges assigned to each atom/nucleus center of the system.
For amino acids these partial charges are usually parametrized in each force field
to reproduce a large range of experimental data and are rarely changed. If one wants to
consider a protein complex with a ligand, a set of partial charges has to be calculated
for the ligand, and this is usually a task for QM methods.
Partial charges can be derived from wavefunctions using very different procedures;
a comparison of different schemes is available [449]. Traditionally,
Mulliken population analysis has been the most widely used method for determining
atomic charges, though it gives unphysical values in a number of cases and depends
strongly on the basis set used [450, 451]. The ESP method, which is also commonly
used, derives partial charges by fitting the molecular electrostatic potential available
from the calculations or crystallographic data [452]. Most of these methods give
reasonable results even when using moderate-size basis sets. In some cases it is
advisable to validate the calculated partial charges by deriving a theoretical dipole
moment and comparing it to the experimental one, which is usually easy to obtain or
find in the literature. A recent example of a force field improvement that is important
from the point of view of membrane proteins is an advanced parametrization of the
tyrosine-choline cation-π interaction, based on a very accurate symmetry-adapted
perturbation theory potential energy surface [453].
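The dipole-moment check mentioned above amounts to summing charge times position over all atoms. The following minimal sketch does this for a three-site water molecule. The function name is illustrative, and the geometry and charges are the commonly quoted TIP3P-like values, used here only as a convenient test case and not taken from this chapter; the factor 4.80320 converts e·Å to debye.

```python
import math

E_ANGSTROM_TO_DEBYE = 4.80320  # 1 e*Angstrom expressed in debye


def dipole_moment(charges, coords):
    """Magnitude of the dipole (debye) of point charges (e) at coords (Angstrom).

    For a neutral molecule the result does not depend on the choice of origin.
    """
    components = [
        sum(q * xyz[k] for q, xyz in zip(charges, coords)) for k in range(3)
    ]
    return E_ANGSTROM_TO_DEBYE * math.sqrt(sum(c * c for c in components))


# Assumed TIP3P-like water: O-H distance 0.9572 Angstrom, H-O-H angle 104.52 degrees.
half_angle = math.radians(104.52 / 2.0)
r_oh = 0.9572
coords = [
    (0.0, 0.0, 0.0),                                               # O
    (r_oh * math.sin(half_angle), r_oh * math.cos(half_angle), 0.0),   # H1
    (-r_oh * math.sin(half_angle), r_oh * math.cos(half_angle), 0.0),  # H2
]
charges = [-0.834, 0.417, 0.417]

mu = dipole_moment(charges, coords)
print(f"dipole moment = {mu:.2f} D")  # roughly 2.35 D for these inputs
```

For a ligand, the same function would be applied to the QM-derived partial charges and the optimized geometry. Note that fixed-charge models intended for the condensed phase are often deliberately more polar than the isolated molecule (the experimental gas-phase dipole of water is about 1.85 D), so the comparison has to be made against the appropriate reference.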
The previously mentioned MOZYME method may also be used to facilitate
protein-ligand docking. One of the most commonly used docking programs,
AutoDock, uses simple Gasteiger partial charges both for the protein and the ligand, which
in some cases leads to a poor description of the complex [454]. It has been shown that
the accuracy of AutoDock docking may be enhanced by using MOZYME-derived
partial charges [455]. In another study from 2010 Fanfrlik et al. [456] used the corrected
PM6-DH2 method of MOZYME combined with the AMBER interaction entropy
and the SMD deformation and desolvation energies of the ligand to construct a fast and
reliable docking scheme. They showed a dramatic improvement of results over standard
DOCK results, which were not able to distinguish between binders and non-binders.
Finally, there are a number of problems in studying membrane proteins where the
use of a QM/MM approach is indispensable, or at least desirable, for an accurate
description of mechanistic features of the system. The first example is any redox system,
where the QM part is needed for the elucidation of the electron transfer mechanism,
as in the previously described rhodopsins. A recent example of such an approach is a
B3LYP/CHARMM investigation of the respiratory complex I, a redox-driven proton
pump activated by the reduction of a quinone molecule [457]. Results obtained
from the study involving more than 800,000 atoms revealed that the initial
activation steps involve a charge imbalance arising from quinone reduction in the
soluble domain, leading to a local proton-coupled electron transfer process in the
quinone-binding site, and that the effect of the excess charge is transmitted by concerted
side-chain reorientations of charged residues at the interface of the soluble and
membrane domains. The second problem is the accurate description of ion selectivity in
ion channels, an important group of membrane proteins. While the mechanisms of ion
conductance and channel gating can be, and have been, studied extensively and in detail
with classical MD approaches [458, 459], the proper description of ion selectivity can
be a challenging problem due to the relative simplicity of the force-field-based
description of ions. To overcome this challenge Sadhu et al. [460] used a DFT approach to
obtain accurate binding free energies of Na+, K+ and Cs+ ions at different, well-defined
ion-chelating sites of the NaK channel, which, combined with an MD approach, gave a
more realistic description of channel permeabilities.

References

1. Chou, K.C., Elrod, D.W.: Prediction of membrane protein types and subcellular locations.
Proteins 34(1), 137–153 (1999)
2. White, S.H., Snider, C.: http://blanco.biomol.uci.edu/mpstruc/listAll/list
3. Lomize, M.A., Pogozheva, I.D., Joo, H., Mosberg, H.I., Lomize, A.L.: OPM database and
PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res.
40(Database issue), D370–376 (2012). https://doi.org/10.1093/nar/gkr703
4. Jayasinghe, S., Hristova, K., White, S.H.: MPtopo: a database of membrane protein topology.
Protein Sci. 10(2), 455–458 (2001). https://doi.org/10.1110/ps.43501
5. Tusnady, G.E., Dosztanyi, Z., Simon, I.: PDB_TM: selection and membrane localization of
transmembrane proteins in the protein data bank. Nucleic Acids Res. 33(Database issue),
D275–278 (2005). https://doi.org/10.1093/nar/gki002
6. Kozma, D., Simon, I., Tusnady, G.E.: PDBTM: protein data bank of transmembrane proteins
after 8 years. Nucleic Acids Res. 41(Database issue), D524–529 (2013).
https://doi.org/10.1093/nar/gks1169

7. Raman, P., Cherezov, V., Caffrey, M.: The membrane protein data bank. Cell. Mol. Life Sci.
63(1), 36–51 (2006). https://doi.org/10.1007/s00018-005-5350-6
8. Kazius, J., Wurdinger, K., van Iterson, M., Kok, J., Back, T., Ijzerman, A.P.: GPCR NaVa
database: natural variants in human G protein-coupled receptors. Hum. Mutat. 29(1), 39–44
(2008). https://doi.org/10.1002/humu.20638
9. Okuno, Y., Tamon, A., Yabuuchi, H., Niijima, S., Minowa, Y., Tonomura, K., Kunimoto, R.,
Feng, C.: GLIDA: GPCR—ligand database for chemical genomics drug discovery–database
and tools update. Nucleic Acids Res. 36(Database issue), D907–912 (2008).
https://doi.org/10.1093/nar/gkm948
10. Zhang, J., Zhang, Y.: GPCRRD: G protein-coupled receptor spatial restraint database for
3D structure modeling and function annotation. Bioinformatics 26(23), 3004–3005 (2010).
https://doi.org/10.1093/bioinformatics/btq563
11. Tsirigos, K.D., Bagos, P.G., Hamodrakas, S.J.: OMPdb: a database of beta-barrel outer
membrane proteins from Gram-negative bacteria. Nucleic Acids Res. 39(Database issue),
D324–331 (2011). https://doi.org/10.1093/nar/gkq863
12. Vroling, B., Sanders, M., Baakman, C., Borrmann, A., Verhoeven, S., Klomp, J., Oliveira,
L., de Vlieg, J., Vriend, G.: GPCRDB: information system for G protein-coupled receptors.
Nucleic Acids Res. 39(Database issue), D309–319 (2011).
https://doi.org/10.1093/nar/gkq1009
13. Isberg, V., Mordalski, S., Munk, C., Rataj, K., Harpsoe, K., Hauser, A.S., Vroling, B., Bojarski,
A.J., Vriend, G., Gloriam, D.E.: GPCRdb: an information system for G protein-coupled recep-
tors. Nucleic Acids Res. 44(D1), D356–D364 (2016). https://doi.org/10.1093/nar/gkv1178
14. Pandy-Szekeres, G., Munk, C., Tsonkov, T.M., Mordalski, S., Harpsoe, K., Hauser, A.S.,
Bojarski, A.J., Gloriam, D.E.: GPCRdb in 2018: adding GPCR structure models and ligands.
Nucleic Acids Res. 46(D1), D440–D446 (2018). https://doi.org/10.1093/nar/gkx1109
15. Worth, C.L., Kreuchwig, A., Kleinau, G., Krause, G.: GPCR-SSFE: a comprehensive database
of G-protein-coupled receptor template predictions and homology models. BMC Bioinform.
12, 185 (2011). https://doi.org/10.1186/1471-2105-12-185
16. Worth, C.L., Kreuchwig, F., Tiemann, J.K.S., Kreuchwig, A., Ritschel, M., Kleinau, G.,
Hildebrand, P.W., Krause, G.: GPCR-SSFE 2.0-a fragment-based molecular modeling web
tool for Class A G-protein coupled receptors. Nucleic Acids Res. (2017).
https://doi.org/10.1093/nar/gkx399
17. Sharman, J.L., Mpamhanga, C.P., Spedding, M., Germain, P., Staels, B., Dacquet, C., Laudet,
V., Harmar, A.J.: IUPHAR-DB: new receptors and tools for easy searching and visualization
of pharmacological data. Nucleic Acids Res. 39(Database issue), D534–538 (2011).
https://doi.org/10.1093/nar/gkq1062
18. Harding, S.D., Sharman, J.L., Faccenda, E., Southan, C., Pawson, A.J., Ireland, S., Gray,
A.J.G., Bruce, L., Alexander, S.P.H., Anderton, S., Bryant, C., Davenport, A.P., Doerig,
C., Fabbro, D., Levi-Schaffer, F., Spedding, M., Davies, J.A., NC-IUPHAR: The IUPHAR/BPS
guide to PHARMACOLOGY in 2018: updates and expansion to encompass the new guide
to IMMUNOPHARMACOLOGY. Nucleic Acids Res. (2017).
https://doi.org/10.1093/nar/gkx1121
19. Saier, M.H., Jr., Yen, M.R., Noto, K., Tamang, D.G., Elkan, C.: The transporter classification
database: recent advances. Nucleic Acids Res. 37(Database issue), D274–278 (2009).
https://doi.org/10.1093/nar/gkn862
20. Saier, M.H., Jr., Reddy, V.S., Tamang, D.G., Vastermark, A.: The transporter classification
database. Nucleic Acids Res. 42(Database issue), D251–258 (2014).
https://doi.org/10.1093/nar/gkt1097
21. Neumann, S., Fuchs, A., Mulkidjanian, A., Frishman, D.: Current status of membrane protein
structure classification. Proteins 78(7), 1760–1773 (2010). https://doi.org/10.1002/prot.22692
22. Bernsel, A., Viklund, H., Elofsson, A.: Remote homology detection of integral membrane
proteins using conserved sequence features. Proteins 71(3), 1387–1399 (2008).
https://doi.org/10.1002/prot.21825

23. Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S.,
Pagni, M., Sigrist, C.J.: The PROSITE database. Nucleic Acids Res. 34(Database issue),
D227–230 (2006). https://doi.org/10.1093/nar/gkj063
24. Tusnady, G.E., Kalmar, L., Hegyi, H., Tompa, P., Simon, I.: TOPDOM: database of domains
and motifs with conservative location in transmembrane proteins. Bioinformatics 24(12),
1469–1470 (2008). https://doi.org/10.1093/bioinformatics/btn202
25. Senes, A., Engel, D.E., DeGrado, W.F.: Folding of helical membrane proteins: the role of polar,
GxxxG-like and proline motifs. Curr. Opin. Struct. Biol. 14(4), 465–479 (2004).
https://doi.org/10.1016/j.sbi.2004.07.007
26. Shen, H.B., Yang, J., Chou, K.C.: Fuzzy KNN for predicting membrane protein types from
pseudo-amino acid composition. J. Theor. Biol. 240(1), 9–13 (2006).
https://doi.org/10.1016/j.jtbi.2005.08.016
27. Cai, Y.D., Ricardo, P.W., Jen, C.H., Chou, K.C.: Application of SVM to predict membrane
protein types. J. Theor. Biol. 226(4), 373–376 (2004).
https://doi.org/10.1016/j.jtbi.2003.08.015
28. Wang, S.-Q., Yang, J., Chou, K.-C.: Using stacked generalization to predict membrane protein
types based on pseudo-amino acid composition. J. Theor. Biol. 242(4), 941–946 (2006).
https://doi.org/10.1016/j.jtbi.2006.05.006
29. Cedano, J., Aloy, P., Perez-Pons, J.A., Querol, E.: Relation between amino acid composition
and cellular location of proteins. J. Mol. Biol. 266(3), 594–600 (1997).
https://doi.org/10.1006/jmbi.1996.0804
30. Kyte, J., Doolittle, R.F.: A simple method for displaying the hydropathic character of a protein.
J. Mol. Biol. 157(1), 105–132 (1982)
31. Steitz, T.A., Goldman, A., Engelman, D.M.: Quantitative application of the helical hairpin
hypothesis to membrane proteins. Biophys. J. 37(1), 124–125 (1982)
32. Engelman, D.M., Steitz, T.A.: The spontaneous insertion of proteins into and across mem-
branes: the helical hairpin hypothesis. Cell 23(2), 411–422 (1981)
33. Hedin, L.E., Illergard, K., Elofsson, A.: An introduction to membrane proteins. J. Proteome
Res. 10(8), 3324–3331 (2011). https://doi.org/10.1021/pr200145a
34. Elofsson, A., von Heijne, G.: Membrane protein structure: prediction versus reality. Annu. Rev.
Biochem. 76, 125–140 (2007). https://doi.org/10.1146/annurev.biochem.76.052705.163539
35. Bernsel, A., Viklund, H., Falk, J., Lindahl, E., von Heijne, G., Elofsson, A.: Prediction of
membrane-protein topology from first principles. Proc. Natl. Acad. Sci. U.S.A. 105(20),
7177–7181 (2008)
36. Attwood, T.K., Findlay, J.B.: Fingerprinting G-protein-coupled receptors. Protein Eng. 7(2),
195–203 (1994)
37. Fredriksson, R., Lagerström, M.C., Lundin, L.-G., Schiöth, H.B.: The G-protein-coupled
receptors in the human genome form five main families. Phylogenetic analysis, paralogon
groups, and fingerprints. Mol. Pharmacol. 63(6), 1256–1272 (2003).
https://doi.org/10.1124/mol.63.6.1256
38. Otaki, J.M., Mori, A., Itoh, Y., Nakayama, T., Yamamoto, H.: Alignment-free classifica-
tion of G-protein-coupled receptors using self-organizing maps. J. Chem. Inf. Model. 46(3),
1479–1490 (2006). https://doi.org/10.1021/ci050382y
39. Deville, J., Rey, J., Chabbert, M.: An indel in transmembrane helix 2 helps to trace the
molecular evolution of class A G-protein-coupled receptors. J. Mol. Evol. 68(5), 475–489
(2009)
40. Surgand, J.S., Rodrigo, J., Kellenberger, E., Rognan, D.: A chemogenomic analysis of
the transmembrane binding cavity of human G-protein-coupled receptors. Proteins 62(2),
509–538 (2006)
41. Pele, J., Abdi, H., Moreau, M., Thybert, D., Chabbert, M.: Multidimensional scaling reveals
the main evolutionary pathways of class A G-protein-coupled receptors. PLoS ONE 6(4),
e19094 (2011)
42. Lu, G., Wang, Z., Jones, A.M., Moriyama, E.N.: 7TMRmine: a Web server for hierarchical
mining of 7TMR proteins. BMC Genom. 10, 275 (2009). https://doi.org/10.1186/1471-2164-
10-275
43. Park, K.-J., Gromiha, M.M., Horton, P., Suwa, M.: Discrimination of outer membrane proteins
using support vector machines. Bioinformatics 21(23), 4223–4229 (2005). https://doi.org/10.
1093/bioinformatics/bti697
44. Gromiha, M.M., Suwa, M.: Discrimination of outer membrane proteins using machine learn-
ing algorithms. Proteins 63(4), 1031–1037 (2006). https://doi.org/10.1002/prot.20929
45. Gromiha, M.M., Ahmad, S., Suwa, M.: Neural network-based prediction of transmembrane
beta-strand segments in outer membrane proteins. J. Comput. Chem. 25(5), 762–767 (2004).
https://doi.org/10.1002/jcc.10386
46. Martelli, P.L., Fariselli, P., Krogh, A., Casadio, R.: A sequence-profile-based HMM for
predicting and discriminating beta barrel membrane proteins. Bioinformatics 18(Suppl 1),
S46–S53 (2002)
47. Remmert, M., Linke, D., Lupas, A.N., Soding, J.: HHomp–prediction and classification of
outer membrane proteins. Nucleic Acids Res. 37(Web Server issue), W446–451 (2009).
https://doi.org/10.1093/nar/gkp325
48. Garrow, A.G., Agnew, A., Westhead, D.R.: TMB-Hunt: an amino acid composition based
method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinform. 6, 56
(2005). https://doi.org/10.1186/1471-2105-6-56
49. Gromiha, M.M., Ahmad, S., Suwa, M.: Application of residue distribution along the sequence
for discriminating outer membrane proteins. Comput. Biol. Chem. 29(2), 135–142 (2005).
https://doi.org/10.1016/j.compbiolchem.2005.02.006
50. Yan, R.-X., Chen, Z., Zhang, Z.: Outer membrane proteins can be simply identified using
secondary structure element alignment. BMC Bioinform. 12(1), 76 (2011)
51. Berven, F.S., Flikka, K., Jensen, H.B., Eidhammer, I.: BOMP: a program to predict integral β-
barrel outer membrane proteins encoded within genomes of Gram-negative bacteria. Nucleic
Acids Res. 32(suppl 2), W394–W399 (2004). https://doi.org/10.1093/nar/gkh351
52. Freeman, T.C., Wimley, W.C.: A highly accurate statistical approach for the prediction of
transmembrane β-barrels. Bioinformatics 26(16), 1965–1974 (2010). https://doi.org/10.1093/
bioinformatics/btq308
53. van Geest, M., Lolkema, J.S.: Membrane topology and insertion of membrane proteins: search
for topogenic signals. Microbiol. Mol. Biol. Rev. 64(1), 13–33 (2000). https://doi.org/10.1128/
mmbr.64.1.13-33.2000
54. Fu, D., Libson, A., Miercke, L.J., Weitzman, C., Nollert, P., Krucinski, J., Stroud, R.M.:
Structure of a glycerol-conducting channel and the basis for its selectivity. Science 290(5491),
481–486 (2000)
55. Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S.: Improved prediction of signal peptides:
SignalP 3.0. J. Mol. Biol. 340(4), 783–795 (2004). https://doi.org/10.1016/j.jmb.2004.05.028
56. Emanuelsson, O., Brunak, S., von Heijne, G., Nielsen, H.: Locating proteins in the cell using
TargetP, SignalP and related tools. Nat. Protoc. 2(4), 953–971 (2007). https://doi.org/10.1038/
nprot.2007.131
57. Kall, L., Krogh, A., Sonnhammer, E.L.: An HMM posterior decoder for sequence feature pre-
diction that includes homology information. Bioinformatics 21(Suppl 1), i251–i257 (2005).
https://doi.org/10.1093/bioinformatics/bti1014
58. Kall, L., Krogh, A., Sonnhammer, E.L.: Advantages of combined transmembrane topology
and signal peptide prediction—the Phobius web server. Nucleic Acids Res. 35(Web Server
issue), W429–432 (2007). https://doi.org/10.1093/nar/gkm256
59. Viklund, H., Granseth, E., Elofsson, A.: Structural classification and prediction of reentrant
regions in alpha-helical transmembrane proteins: application to complete genomes. J. Mol.
Biol. 361(3), 591–603 (2006). https://doi.org/10.1016/j.jmb.2006.06.037
60. Viklund, H., Elofsson, A.: OCTOPUS: improving topology prediction by two-track ANN-
based preference scores and an extended topological grammar. Bioinformatics 24(15),
1662–1668 (2008). https://doi.org/10.1093/bioinformatics/btn221
61. von Heijne, G.: Membrane protein structure prediction: hydrophobicity analysis and the
positive-inside rule. J. Mol. Biol. 225(2), 487–494 (1992). https://doi.org/10.1016/0022-
2836(92)90934-c
62. Engelman, D.M., Zaccai, G.: Bacteriorhodopsin is an inside-out protein. Proc. Natl. Acad.
Sci. U.S.A. 77(10), 5894–5898 (1980)
63. Stevens, T.J., Arkin, I.T.: Turning an opinion inside-out: Rees and Eisenberg’s commentary
(Proteins 2000;38:121–122) on “Are membrane proteins ‘inside-out’ proteins?” (Proteins
1999;36:135–143). Proteins: Struct. Funct. Bioinf. 40(3), 463–464 (2000)
64. Adamian, L., Liang, J.: Interhelical hydrogen bonds and spatial motifs in membrane proteins:
polar clamps and serine zippers. Proteins 47(2), 209–218 (2002)
65. Hofmann, K.: TMbase—a database of membrane spanning proteins segments. Biol. Chem.
Hoppe-Seyler 374, 166 (1993)
66. Rost, B., Sander, C., Casadio, R., Fariselli, P.: Transmembrane helices predicted at 95%
accuracy. Protein Sci. 4(3), 521–533 (1995)
67. Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., Honigschmid,
P., Schafferhans, A., Roos, M., Bernhofer, M., Richter, L., Ashkenazy, H., Punta, M., Sch-
lessinger, A., Bromberg, Y., Schneider, R., Vriend, G., Sander, C., Ben-Tal, N., Rost, B.:
PredictProtein–an open resource for online prediction of protein structural and functional
features. Nucleic Acids Res. 42(Web Server issue), W337–343 (2014). https://doi.org/10.
1093/nar/gku366
68. Cserzo, M., Wallin, E., Simon, I., von Heijne, G., Elofsson, A.: Prediction of transmembrane
alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein
Eng. 10(6), 673–676 (1997)
69. Hirokawa, T., Boon-Chieng, S., Mitaku, S.: SOSUI: classification and secondary structure
prediction system for membrane proteins. Bioinformatics 14(4), 378–379 (1998)
70. Pasquier, C., Promponas, V.J., Palaios, G.A., Hamodrakas, J.S., Hamodrakas, S.J.: A novel
method for predicting transmembrane segments in proteins based on a statistical analysis of
the SwissProt database: the PRED-TMR algorithm. Protein Eng. 12(5), 381–385 (1999)
71. Tusnady, G.E., Simon, I.: The HMMTOP transmembrane topology prediction server. Bioin-
formatics 17(9), 849–850 (2001)
72. Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E.L.: Predicting transmembrane protein
topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305(3),
567–580 (2001). https://doi.org/10.1006/jmbi.2000.4315
73. Juretic, D., Zoranic, L., Zucic, D.: Basic charge clusters and predictions of membrane protein
topology. J. Chem. Inf. Comput. Sci. 42(3), 620–632 (2002)
74. Liu, Q., Zhu, Y.S., Wang, B.H., Li, Y.X.: A HMM-based method to predict the transmembrane
regions of beta-barrel membrane proteins. Comput. Biol. Chem. 27(1), 69–76 (2003)
75. Jones, D.T.: Improving the accuracy of transmembrane protein topology prediction using
evolutionary information. Bioinformatics 23(5), 538–544 (2007). https://doi.org/10.1093/
bioinformatics/btl677
76. Peters, C., Tsirigos, K.D., Shu, N., Elofsson, A.: Improved topology prediction using the
terminal hydrophobic helices rule. Bioinformatics 32(8), 1158–1162 (2016). https://doi.org/
10.1093/bioinformatics/btv709
77. Viklund, H., Bernsel, A., Skwark, M., Elofsson, A.: SPOCTOPUS: a combined predictor of
signal peptides and membrane protein topology. Bioinformatics 24(24), 2928–2929 (2008)
78. Snider, C., Jayasinghe, S., Hristova, K., White, S.H.: MPEx: a tool for exploring membrane
proteins. Protein Sci. 18(12), 2624–2628 (2009). https://doi.org/10.1002/pro.256
79. Bernsel, A., Viklund, H., Hennerdal, A., Elofsson, A.: TOPCONS: consensus prediction of
membrane protein topology. Nucleic Acids Res. 37(Web Server issue), W465–468 (2009).
https://doi.org/10.1093/nar/gkp363
80. Tsirigos, K.D., Peters, C., Shu, N., Kall, L., Elofsson, A.: The TOPCONS web server for
consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Res.
43(W1), W401–W407 (2015). https://doi.org/10.1093/nar/gkv485
81. Klammer, M., Messina, D.N., Schmitt, T., Sonnhammer, E.L.: MetaTM—a consensus method
for transmembrane protein topology prediction. BMC Bioinform. 10, 314 (2009). https://doi.
org/10.1186/1471-2105-10-314
82. Ahmad, S., Singh, Y.H., Paudel, Y., Mori, T., Sugita, Y., Mizuguchi, K.: Integrated prediction
of one-dimensional structural features and their relationships with conformational flexibility
in helical membrane proteins. BMC Bioinform. 11, 533 (2010). https://doi.org/10.1186/1471-
2105-11-533
83. Jacoboni, I., Martelli, P.L., Fariselli, P., De Pinto, V., Casadio, R.: Prediction of the trans-
membrane regions of beta-barrel membrane proteins with a neural network-based predictor.
Protein Sci. 10(4), 779–787 (2001). https://doi.org/10.1110/ps.37201
84. Bagos, P.G., Liakopoulos, T.D., Spyropoulos, I.C., Hamodrakas, S.J.: PRED-TMBB: a web
server for predicting the topology of beta-barrel outer membrane proteins. Nucleic Acids Res.
32(Web Server issue), W400–404 (2004). https://doi.org/10.1093/nar/gkh417
85. Natt, N.K., Kaur, H., Raghava, G.P.: Prediction of transmembrane regions of beta-barrel
proteins using ANN- and SVM-based methods. Proteins: Struct. Funct. Bioinf. 56(1), 11–18
(2004). https://doi.org/10.1002/prot.20092
86. Bagos, P.G., Liakopoulos, T.D., Hamodrakas, S.J.: Evaluation of methods for predicting the
topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC
Bioinform. 6, 7 (2005). https://doi.org/10.1186/1471-2105-6-7
87. Bigelow, H., Rost, B.: PROFtmb: a web server for predicting bacterial transmembrane beta
barrel proteins. Nucleic Acids Res. 34(Web Server issue), W186–188 (2006). https://doi.org/
10.1093/nar/gkl262
88. Waldispuhl, J., Berger, B., Clote, P., Steyaert, J.M.: Predicting transmembrane beta-barrels
and interstrand residue interactions from sequence. Proteins 65(1), 61–74 (2006). https://doi.
org/10.1002/prot.21046
89. Randall, A., Cheng, J., Sweredoski, M., Baldi, P.: TMBpro: secondary structure, beta-contact
and tertiary structure prediction of transmembrane beta-barrel proteins. Bioinformatics 24(4),
513–520 (2008). https://doi.org/10.1093/bioinformatics/btm548
90. Hayat, S., Elofsson, A.: BOCTOPUS: improved topology prediction of transmem-
brane beta barrel proteins. Bioinformatics 28(4), 516–522 (2012). https://doi.org/10.1093/
bioinformatics/btr710
91. Hayat, S., Peters, C., Shu, N., Tsirigos, K.D., Elofsson, A.: Inclusion of dyad-repeat pattern
improves topology prediction of transmembrane beta-barrel proteins. Bioinformatics 32(10),
1571–1573 (2016). https://doi.org/10.1093/bioinformatics/btw025
92. Eisenberg, D., Weiss, R.M., Terwilliger, T.C.: The hydrophobic moment detects periodicity
in protein hydrophobicity. Proc. Natl. Acad. Sci. U.S.A. 81(1), 140–144 (1984)
93. Claros, M.G., von Heijne, G.: TopPred II: an improved software for membrane protein struc-
ture predictions. Comput. Appl. Biosci. 10(6), 685–686 (1994)
94. Jayasinghe, S., Hristova, K., White, S.H.: Energetics, stability, and prediction of transmem-
brane helices. J. Mol. Biol. 312(5), 927–934 (2001). https://doi.org/10.1006/jmbi.2001.5008
95. Koehler, J., Woetzel, N., Staritzbichler, R., Sanders, C.R., Meiler, J.: A unified hydrophobicity
scale for multispan membrane proteins. Proteins 76(1), 13–29 (2009). https://doi.org/10.1002/
prot.22315
96. Deber, C.M., Wang, C., Liu, L.P., Prior, A.S., Agrawal, S., Muskat, B.L., Cuticchia, A.J.:
TM Finder: a prediction program for transmembrane protein segments using a combination
of hydrophobicity and nonpolar phase helicity scales. Protein Sci. 10(1), 212–219 (2001).
https://doi.org/10.1110/ps.30301
97. Zhou, H., Zhou, Y.: Predicting the topology of transmembrane helical proteins using mean
burial propensity and a hidden-Markov-model-based method. Protein Sci. 12(7), 1547–1555
(2003). https://doi.org/10.1110/ps.0305103
98. Ganapathiraju, M., Balakrishnan, N., Reddy, R., Klein-Seetharaman, J.: Transmembrane helix
prediction using amino acid property features and latent semantic analysis. BMC Bioinform.
9(Suppl 1), S4 (2008)
99. Persson, B., Argos, P.: Prediction of membrane protein topology utilizing multiple sequence
alignments. J. Protein Chem. 16(5), 453–457 (1997)
100. Shen, H., Chou, J.J.: MemBrain: improving the accuracy of predicting transmembrane helices.
PLoS ONE 3(6), e2399 (2008). https://doi.org/10.1371/journal.pone.0002399
101. Cserzo, M., Bernassau, J.M., Simon, I., Maigret, B.: New alignment strategy for transmem-
brane proteins. J. Mol. Biol. 243(3), 388–396 (1994). https://doi.org/10.1006/jmbi.1994.1666
102. Kitsas, I.K., Panas, S.M., Hadjileontiadis, L.J.: Linear discrimination of transmembrane from
non-transmembrane segments in proteins using higher-order crossings. Conf. Proc. IEEE Eng.
Med. Biol. Soc. 1, 5818–5821 (2006)
103. Lio, P., Vannucci, M.: Wavelet change-point prediction of transmembrane proteins. Bioinfor-
matics 16(4), 376–382 (2000)
104. Nugent, T., Jones, D.T.: Transmembrane protein topology prediction using support vector
machines. BMC Bioinform. 10, 159 (2009). https://doi.org/10.1186/1471-2105-10-159
105. Osmanbeyoglu, H.U., Wehner, J.A., Carbonell, J.G., Ganapathiraju, M.K.: Active machine
learning for transmembrane helix prediction. BMC Bioinform. 11(Suppl 1), S58 (2010).
https://doi.org/10.1186/1471-2105-11-s1-s58
106. Schulz, G.E.: Beta-Barrel membrane proteins. Curr. Opin. Struct. Biol. 10(4), 443–447 (2000).
https://doi.org/10.1016/s0959-440x(00)00120-2
107. Bagos, P.G., Liakopoulos, T.D., Spyropoulos, I.C., Hamodrakas, S.J.: A Hidden Markov
Model method, capable of predicting and discriminating beta-barrel outer membrane proteins.
BMC Bioinform. 5, 29 (2004). https://doi.org/10.1186/1471-2105-5-29
108. Ou, Y., Chen, S., Gromiha, M.M.: Prediction of membrane spanning segments and topology
in β-barrel membrane proteins at better accuracy. J. Comput. Chem. 31(1), 217–223 (2010)
109. Gromiha, M.M., Suwa, M.: A simple statistical method for discriminating outer membrane
proteins with better accuracy. Bioinformatics 21(7), 961–968 (2005). https://doi.org/10.1093/
bioinformatics/bti126
110. Park, Y., Hayat, S., Helms, V.: Prediction of the burial status of transmembrane residues of
helical membrane proteins. BMC Bioinform. 8, 302 (2007). https://doi.org/10.1186/1471-
2105-8-302
111. Yuan, Z., Zhang, F., Davis, M.J., Boden, M., Teasdale, R.D.: Predicting the solvent accessi-
bility of transmembrane residues from protein sequence. J. Proteome Res. 5(5), 1063–1070
(2006). https://doi.org/10.1021/pr050397b
112. Illergard, K., Callegari, S., Elofsson, A.: MPRAP: an accessibility predictor for α-helical
transmembrane proteins that performs well inside and outside the membrane. BMC Bioinform.
11, 333 (2010). https://doi.org/10.1186/1471-2105-11-333
113. Beuming, T., Weinstein, H.: A knowledge-based scale for the analysis and prediction of buried
and exposed faces of transmembrane domain proteins. Bioinformatics 20(12), 1822–1835
(2004). https://doi.org/10.1093/bioinformatics/bth143
114. von Heijne, G.: Proline kinks in transmembrane alpha-helices. J. Mol. Biol. 218(3), 499–503
(1991)
115. Yohannan, S., Faham, S., Yang, D., Whitelegge, J.P., Bowie, J.U.: The evolution of trans-
membrane helix kinks and the structural diversity of G protein-coupled receptors. Proc. Natl.
Acad. Sci. U.S.A. 101(4), 959–963 (2004)
116. Meruelo, A.D., Samish, I., Bowie, J.U.: TMKink: a method to predict transmembrane helix
kinks. Protein Sci. 20(7), 1256–1264 (2011). https://doi.org/10.1002/pro.653
117. Kneissl, B., Mueller, S.C., Tautermann, C.S., Hildebrandt, A.: String kernels and high-quality
data set for improved prediction of kinked helices in alpha-helical membrane proteins. J.
Chem. Inf. Model. 51(11), 3017–3025 (2011). https://doi.org/10.1021/ci200278w
118. Göbel, U., Sander, C., Schneider, R., Valencia, A.: Correlated mutations and residue contacts
in proteins. Proteins: Struct. Funct. Bioinf. 18(4), 309–317 (1994)
119. Latek, D., Kolinski, A.: Contact prediction in protein modeling: scoring, folding and refine-
ment of coarse-grained models. BMC Struct. Biol. 8, 36 (2008). https://doi.org/10.1186/1472-
6807-8-36
120. Michino, M., Brooks 3rd, C.L.: Predicting structurally conserved contacts for homologous
proteins using sequence conservation filters. Proteins 77(2), 448–453 (2009). https://doi.org/
10.1002/prot.22456
121. Fuchs, A., Martin-Galiano, A.J., Kalman, M., Fleishman, S., Ben-Tal, N., Frishman, D.: Co-
evolving residues in membrane proteins. Bioinformatics 23(24), 3312–3319 (2007). https://
doi.org/10.1093/bioinformatics/btm515
122. Taylor, W.R., Jones, D.T., Green, N.M.: A method for alpha-helical integral membrane protein
fold prediction. Proteins 18(3), 281–294 (1994). https://doi.org/10.1002/prot.340180309
123. Walters, R.F., DeGrado, W.F.: Helix-packing motifs in membrane proteins. Proc. Natl. Acad.
Sci. U.S.A. 103(37), 13658–13663 (2006). https://doi.org/10.1073/pnas.0605878103
124. Langosch, D., Heringa, J.: Interaction of transmembrane helices by a knobs-into-holes packing
characteristic of soluble coiled coils. Proteins 31(2), 150–159 (1998)
125. Russ, W.P., Engelman, D.M.: The GxxxG motif: a framework for transmembrane helix-helix
association. J. Mol. Biol. 296(3), 911–919 (2000). https://doi.org/10.1006/jmbi.1999.3489
126. Pilpel, Y., Ben-Tal, N., Lancet, D.: kPROT: a knowledge-based scale for the propensity of
residue orientation in transmembrane segments. Application to membrane protein structure
prediction. J. Mol. Biol. 294(4), 921–935 (1999). https://doi.org/10.1006/jmbi.1999.3257
127. Lo, A., Chiu, Y.Y., Rodland, E.A., Lyu, P.C., Sung, T.Y., Hsu, W.L.: Predicting helix-helix
interactions from residue contacts in membrane proteins. Bioinformatics 25(8), 996–1003
(2009). https://doi.org/10.1093/bioinformatics/btp114
128. MacKenzie, K.R., Engelman, D.M.: Structure-based prediction of the stability of transmem-
brane helix-helix interactions: the sequence dependence of glycophorin A dimerization. Proc.
Natl. Acad. Sci. U.S.A. 95(7), 3583–3590 (1998)
129. Hildebrand, P.W., Lorenzen, S., Goede, A., Preissner, R.: Analysis and prediction of helix-
helix interactions in membrane channels and transporters. Proteins 64(1), 253–262 (2006).
https://doi.org/10.1002/prot.20959
130. Rose, A., Lorenzen, S., Goede, A., Gruening, B., Hildebrand, P.W.: RHYTHM–a server to
predict the orientation of transmembrane helices in channels and membrane-coils. Nucleic
Acids Res. 37(Web Server issue), W575–580 (2009). https://doi.org/10.1093/nar/gkp418
131. Isberg, V., de Graaf, C., Bortolato, A., Cherezov, V., Katritch, V., Marshall, F.H., Mordalski, S.,
Pin, J.P., Stevens, R.C., Vriend, G., Gloriam, D.E.: Generic GPCR residue numbers—aligning
topology maps while minding the gaps. Trends Pharmacol. Sci. 36(1), 22–31 (2015). https://
doi.org/10.1016/j.tips.2014.11.001
132. Kolinski, A., Skolnick, J.: Reduced models of proteins and their applications. Polymer 45(2),
511–524 (2004). https://doi.org/10.1016/j.polymer.2003.10.064
133. Yarov-Yarovoy, V., Schonbrun, J., Baker, D.: Multipass membrane protein structure prediction
using Rosetta. Proteins 62(4), 1010–1025 (2006). https://doi.org/10.1002/prot.20817
134. Wu, H.H., Chen, C.C., Chen, C.M.: Replica exchange Monte-Carlo simulations of helix
bundle membrane proteins: rotational parameters of helices. J. Comput. Aided Mol. Des.
26(3), 363–374 (2012). https://doi.org/10.1007/s10822-012-9562-1
135. Ueno, Y., Kawasaki, K., Saito, O., Arai, M., Suwa, M.: Folding elastic transmembrane helices
to fit in a low-resolution image by electron microscopy. J. Bioinform. Comput. Biol. 9(Suppl
1), 37–50 (2011)
136. Hurwitz, N., Pellegrini-Calace, M., Jones, D.T.: Towards genome-scale structure prediction
for transmembrane proteins. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361(1467), 465–475
(2006). https://doi.org/10.1098/rstb.2005.1804
137. Porter, J.R., Weitzner, B.D., Lange, O.F.: A framework to simplify combined sampling strate-
gies in Rosetta. PLoS ONE 10(9), e0138220 (2015). https://doi.org/10.1371/journal.pone.
0138220
138. Weiner, B.E., Woetzel, N., Karakas, M., Alexander, N., Meiler, J.: BCL:MP-fold: folding
membrane proteins through assembly of transmembrane helices. Structure 21(7), 1107–1117
(2013). https://doi.org/10.1016/j.str.2013.04.022
139. Pellegrini-Calace, M., Carotti, A., Jones, D.T.: Folding in lipid membranes (FILM): a novel
method for the prediction of small membrane protein 3D structures. Proteins 50(4), 537–545
(2003). https://doi.org/10.1002/prot.10304
140. Pieper, U., Webb, B.M., Barkan, D.T., Schneidman-Duhovny, D., Schlessinger, A., Braberg,
H., Yang, Z., Meng, E.C., Pettersen, E.F., Huang, C.C., Datta, R.S., Sampathkumar, P., Mad-
husudhan, M.S., Sjolander, K., Ferrin, T.E., Burley, S.K., Sali, A.: ModBase, a database of
annotated comparative protein structure models, and associated resources. Nucleic Acids Res.
39(Database issue), D465–474 (2011). https://doi.org/10.1093/nar/gkq1091
141. Kelm, S., Shi, J., Deane, C.M.: MEDELLER: homology-based coordinate generation
for membrane proteins. Bioinformatics 26(22), 2833–2840 (2010). https://doi.org/10.1093/
bioinformatics/btq554
142. Miszta, P., Pasznik, P., Jakowiecki, J., Sztyler, A., Latek, D., Filipek, S.: GPCRM: a homology
modelling web service with triple membrane-fitted quality assessment of GPCR models.
Nucleic Acids Res. 46(W1), W387–W395 (2018). https://doi.org/10.1093/nar/gky429
143. Rodríguez, D., Bello, X., Gutiérrez-de-Terán, H.: Molecular modelling of G protein-coupled
receptors through the web. Mol. Inform. 31(5), 334–341 (2012)
144. Sandal, M., Duy, T.P., Cona, M., Zung, H., Carloni, P., Musiani, F., Giorgetti, A.: GOMoDo:
a GPCRs online modeling and docking webserver. PLoS ONE 8(9), e74092 (2013). https://
doi.org/10.1371/journal.pone.0074092
145. Latek, D., Pasznik, P., Carlomagno, T., Filipek, S.: Towards improved quality of GPCR models
by usage of multiple templates and profile-profile comparison. PLoS ONE 8(2), e56742
(2013). https://doi.org/10.1371/journal.pone.0056742
146. Ng, P.C., Henikoff, J.G., Henikoff, S.: PHAT: a transmembrane-specific substitution matrix.
Predicted hydrophobic and transmembrane. Bioinformatics 16(9), 760–766 (2000)
147. Muller, T., Rahmann, S., Rehmsmeier, M.: Non-symmetric score matrices and the detection
of homologous transmembrane proteins. Bioinformatics 17(Suppl 1), S182–S189 (2001)
148. Jimenez-Morales, D., Adamian, L., Liang, J.: Detecting remote homologues using scoring
matrices calculated from the estimation of amino acid substitution rates of beta-barrel mem-
brane proteins. Conf. Proc. IEEE Eng. Med. Biol. Soc. 1347–1350 (2008)
149. Pirovano, W., Feenstra, K.A., Heringa, J.: PRALINETM: a strategy for improved multiple
alignment of transmembrane proteins. Bioinformatics 24(4), 492–497 (2008). https://doi.org/
10.1093/bioinformatics/btm636
150. Hill, J.R., Kelm, S., Shi, J., Deane, C.M.: Environment specific substitution tables improve
membrane protein alignment. Bioinformatics 27(13), i15–i23 (2011). https://doi.org/10.1093/
bioinformatics/btr230
151. Forrest, L.R., Tang, C.L., Honig, B.: On the accuracy of homology modeling and sequence
alignment methods applied to membrane proteins. Biophys. J. 91(2), 508–517 (2006). https://
doi.org/10.1529/biophysj.106.082313
152. Shafrir, Y., Guy, H.R.: STAM: simple transmembrane alignment method. Bioinformatics
20(5), 758–769 (2004). https://doi.org/10.1093/bioinformatics/btg482
153. Kufareva, I., Rueda, M., Katritch, V., Stevens, R.C., Abagyan, R.: Status of GPCR modeling
and docking as reflected by community-wide GPCR Dock 2010 assessment. Structure 19(8),
1108–1126 (2011)
154. Khafizov, K., Staritzbichler, R., Stamm, M., Forrest, L.R.: A study of the evolution of
inverted-topology repeats from LeuT-fold transporters using AlignMe. Biochemistry 49(50),
10702–10713 (2010). https://doi.org/10.1021/bi101256x
155. Rychlewski, L., Jaroszewski, L., Li, W., Godzik, A.: Comparison of sequence profiles. Strate-
gies for structural predictions using sequence information. Protein Sci. 9(2), 232–241 (2000).
https://doi.org/10.1110/ps.9.2.232
156. Fiser, A., Sali, A.: Modeller: generation and refinement of homology-based protein struc-
ture models. Methods Enzymol. 374, 461–491 (2003). https://doi.org/10.1016/S0076-
6879(03)74020-8
157. Krieger, E., Darden, T., Nabuurs, S.B., Finkelstein, A., Vriend, G.: Making optimal use of
empirical energy functions: Force-field parameterization in crystal space. Proteins: Struct.
Funct. Bioinf. 57(4), 678–683 (2004)
158. Schwede, T., Kopp, J., Guex, N., Peitsch, M.C.: SWISS-MODEL: an automated protein
homology-modeling server. Nucleic Acids Res. 31(13), 3381–3385 (2003)
159. Raman, S., Vernon, R., Thompson, J., Tyka, M., Sadreyev, R., Pei, J., Kim, D., Kellogg, E.,
DiMaio, F., Lange, O., Kinch, L., Sheffler, W., Kim, B.-H., Das, R., Grishin, N.V., Baker,
D.: Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins: Struct.
Funct. Bioinf. 77(S9), 89–99 (2009)
160. Zhang, Y.: I-TASSER server for protein 3D structure prediction. BMC Bioinform. 9, 40
(2008). https://doi.org/10.1186/1471-2105-9-40
161. Latek, D.: Rosetta Broker for membrane protein structure prediction: concentrative nucleoside
transporter 3 and corticotropin-releasing factor receptor 1 test cases. BMC Struct. Biol. 17(1),
8 (2017). https://doi.org/10.1186/s12900-017-0078-8
162. Recanatini, M., Cavalli, A., Masetti, M.: Modeling HERG and its interactions with drugs:
recent advances in light of current potassium channel simulations. ChemMedChem 3(4),
523–535 (2008). https://doi.org/10.1002/cmdc.200700264
163. Latek, D., Kolinski, M., Ghoshdastider, U., Debinski, A., Bombolewski, R., Plazinska, A.,
Jozwiak, K., Filipek, S.: Modeling of ligand binding to G protein coupled receptors: cannabi-
noid CB1, CB2 and adrenergic beta 2 AR. J. Mol. Model. 17(9), 2353–2366 (2011). https://
doi.org/10.1007/s00894-011-0986-7
164. Arora, B., Coudrat, T., Wootten, D., Christopoulos, A., Noronha, S.B., Sexton, P.M.: Prediction
of loops in G protein-coupled receptor homology models: effect of imprecise surroundings
and constraints. J. Chem. Inf. Model. 56(4), 671–686 (2016). https://doi.org/10.1021/acs.jcim.
5b00554
165. Shen, M.Y., Sali, A.: Statistical potential for assessment and prediction of protein structures.
Protein Sci. 15(11), 2507–2524 (2006). https://doi.org/10.1110/ps.062416606
166. Hildebrand, P.W., Goede, A., Bauer, R.A., Gruening, B., Ismer, J., Michalsky, E., Preissner,
R.: SuperLooper–a prediction server for the modeling of loops in globular and membrane
proteins. Nucleic Acids Res. 37(Web Server issue), W571–574 (2009). https://doi.org/10.
1093/nar/gkp338
167. Jamroz, M., Kolinski, A.: Modeling of loops in proteins: a multi-method approach. BMC
Struct. Biol. 10, 5 (2010). https://doi.org/10.1186/1472-6807-10-5
168. Canutescu, A.A., Dunbrack Jr., R.L.: Cyclic coordinate descent: a robotics algorithm for
protein loop closure. Protein Sci. 12(5), 963–972 (2003). https://doi.org/10.1110/ps.0242703
169. Kolinski, M., Filipek, S.: Study of a structurally similar kappa opioid receptor agonist and
antagonist pair by molecular dynamics simulations. J. Mol. Model. 16(10), 1567–1576 (2010).
https://doi.org/10.1007/s00894-010-0678-8
170. Mandell, D.J., Coutsias, E.A., Kortemme, T.: Sub-angstrom accuracy in protein loop recon-
struction by robotics-inspired conformational sampling. Nat. Methods 6(8), 551–552 (2009).
https://doi.org/10.1038/nmeth0809-551
171. Jacobson, M.P., Pincus, D.L., Rapp, C.S., Day, T.J., Honig, B., Shaw, D.E., Friesner, R.A.:
A hierarchical approach to all-atom protein loop prediction. Proteins 55(2), 351–367 (2004).
https://doi.org/10.1002/prot.10613
172. Heim, A.J., Li, Z.: Developing a high-quality scoring function for membrane protein structures
based on specific inter-residue interactions. J. Comput. Aided Mol. Des. 26(3), 301–309
(2012). https://doi.org/10.1007/s10822-012-9556-z
173. Ray, A., Lindahl, E., Wallner, B.: Model quality assessment for membrane proteins. Bioin-
formatics 26(24), 3067–3074 (2010). https://doi.org/10.1093/bioinformatics/btq581
174. Gao, C., Stern, H.A.: Scoring function accuracy for membrane protein structure prediction.
Proteins 68(1), 67–75 (2007). https://doi.org/10.1002/prot.21421
175. Law, R.J., Capener, C., Baaden, M., Bond, P.J., Campbell, J., Patargias, G., Arinaminpathy,
Y., Sansom, M.S.: Membrane protein structure quality in molecular dynamics simulation. J.
Mol. Graph. Model. 24(2), 157–165 (2005). https://doi.org/10.1016/j.jmgm.2005.05.006
176. Woetzel, N., Karakas, M., Staritzbichler, R., Muller, R., Weiner, B.E., Meiler, J.:
BCL:score–knowledge based energy potentials for ranking protein models represented by
idealized secondary structure elements. PLoS ONE 7(11), e49242 (2012). https://doi.org/10.
1371/journal.pone.0049242
177. Latek, D., Bajda, M., Filipek, S.: A hybrid approach to structure and function modeling of
G protein-coupled receptors. J. Chem. Inf. Model. 56(4), 630–641 (2016). https://doi.org/10.
1021/acs.jcim.5b00451
178. Mordalski, S., Witek, J., Smusz, S., Rataj, K., Bojarski, A.J.: Multiple conformational
states in retrospective virtual screening—homology models vs. crystal structures: beta-2
adrenergic receptor case study. J. Cheminform. 7, 13 (2015). https://doi.org/10.1186/s13321-
015-0062-x
179. Coudrat, T., Simms, J., Christopoulos, A., Wootten, D., Sexton, P.M.: Improving virtual
screening of G protein-coupled receptors via ligand-directed modeling. PLoS Comput. Biol.
13(11), e1005819 (2017). https://doi.org/10.1371/journal.pcbi.1005819
180. Kufareva, I., Katritch, V., Participants of GPCR DOCK 2013, Stevens, R.C., Abagyan, R.:
Advances in GPCR modeling evaluated by the GPCR Dock 2013 assessment: meeting new
challenges. Structure 22(8), 1120–1139 (2014). https://doi.org/10.1016/j.str.2014.06.012
181. Bissantz, C., Bernard, P., Hibert, M., Rognan, D.: Protein-based virtual screening of chemical
databases. II. Are homology models of G-protein coupled receptors suitable targets? Proteins
50(1), 5–25 (2003). https://doi.org/10.1002/prot.10237
182. Barth, P., Schonbrun, J., Baker, D.: Toward high-resolution prediction and design of trans-
membrane helical protein structures. Proc. Natl. Acad. Sci. U.S.A. 104(40), 15682–15687
(2007). https://doi.org/10.1073/pnas.0702515104
183. Barth, P., Wallner, B., Baker, D.: Prediction of membrane protein structures with complex
topologies using limited constraints. Proc. Natl. Acad. Sci. U.S.A. 106(5), 1409–1414 (2009).
https://doi.org/10.1073/pnas.0808323106
184. Michino, M., Chen, J., Stevens, R.C., Brooks 3rd, C.L.: FoldGPCR: structure prediction
protocol for the transmembrane domain of G protein-coupled receptors from class A. Proteins
78(10), 2189–2201 (2010). https://doi.org/10.1002/prot.22731
185. Abrol, R., Griffith, A.R., Bray, J.K., Goddard 3rd, W.A.: Structure prediction of G protein-
coupled receptors and their ensemble of functionally important conformations. Complementary
experimental and computational techniques to study membrane protein structure, dynamics
and interactions (Methods in Molecular Biology) (2011)
186. Shacham, S., Marantz, Y., Bar-Haim, S., Kalid, O., Warshaviak, D., Avisar, N., Inbal, B.,
Heifetz, A., Fichman, M., Topf, M., Naor, Z., Noiman, S., Becker, O.M.: PREDICT modeling
and in-silico screening for G-protein coupled receptors. Proteins 57(1), 51–86 (2004).
https://doi.org/10.1002/prot.20195
187. Abrol, R., Bray, J.K., Goddard 3rd, W.A.: Bihelix: towards de novo structure prediction of
an ensemble of G-protein coupled receptor conformations. Proteins 80(2), 505–518 (2011).
https://doi.org/10.1002/prot.23216
188. Trabanino, R.J., Hall, S.E., Vaidehi, N., Floriano, W.B., Kam, V.W., Goddard 3rd, W.A.: First
principles predictions of the structure and function of G-protein-coupled receptors: validation
for bovine rhodopsin. Biophys. J. 86(4), 1904–1921 (2004).
https://doi.org/10.1016/S0006-3495(04)74256-3
189. Chun, L., Zhang, W.H., Liu, J.F.: Structure and ligand recognition of class C GPCRs. Acta
Pharmacol. Sin. 33(3), 312–323 (2012). https://doi.org/10.1038/aps.2011.186
190. Nussinov, R., Tsai, C.J., Csermely, P.: Allo-network drugs: harnessing allostery in cellular
networks. Trends Pharmacol. Sci. 32(12), 686–693 (2011).
https://doi.org/10.1016/j.tips.2011.08.004
191. Canals, M., Sexton, P.M., Christopoulos, A.: Allostery in GPCRs: ‘MWC’ revisited. Trends
Biochem. Sci. 36(12), 663–672 (2011). https://doi.org/10.1016/j.tibs.2011.08.005
192. Levinthal, C., Wodak, S.J., Kahn, P., Dadivanian, A.K.: Hemoglobin interaction in sickle cell
fibers. I: theoretical approaches to the molecular contacts. Proc. Natl. Acad. Sci. U.S.A. 72(4),
1330–1334 (1975)
193. Brylinski, M., Konieczny, L., Roterman, I.: Ligation site in proteins recognized in silico.
Bioinformation 1(4), 127–129 (2006)
194. Soga, S., Shirai, H., Kobori, M., Hirayama, N.: Use of amino acid composition to predict
ligand-binding sites. J. Chem. Inf. Model. 47(2), 400–406 (2007).
https://doi.org/10.1021/ci6002202
195. Koczyk, G., Wyrwicz, L.S., Rychlewski, L.: LigProf: a simple tool for in silico prediction of
ligand-binding sites. J. Mol. Model. 13(3), 445–455 (2007).
https://doi.org/10.1007/s00894-006-0165-4
196. Lo, Y.T., Wang, H.W., Pai, T.W., Tzou, W.S., Hsu, H.H., Chang, H.T.: Protein-ligand binding
region prediction (PLB-SAVE) based on geometric features and CUDA acceleration. BMC
Bioinform. 14 Suppl 4, S4 (2013). https://doi.org/10.1186/1471-2105-14-s4-s4
197. Chang, D.T., Weng, Y.Z., Lin, J.H., Hwang, M.J., Oyang, Y.J.: Protemot: prediction of protein
binding sites with automatically extracted geometrical templates. Nucleic Acids Res. 34(Web
Server issue), W303–309 (2006). https://doi.org/10.1093/nar/gkl344
198. Dundas, J., Ouyang, Z., Tseng, J., Binkowski, A., Turpaz, Y., Liang, J.: CASTp: computed atlas
of surface topography of proteins with structural and topographical mapping of functionally
annotated residues. Nucleic Acids Res. 34, W116–W118 (2006).
https://doi.org/10.1093/nar/gkl282
199. Chang, D.T., Oyang, Y.J., Lin, J.H.: MEDock: a web server for efficient prediction of ligand
binding sites based on a novel optimization algorithm. Nucleic Acids Res. 33(Web Server
issue), W233–238 (2005)
200. Brady Jr., G.P., Stouten, P.F.: Fast prediction and visualization of protein binding pockets with
PASS. J. Comput. Aided Mol. Des. 14(4), 383–401 (2000)
201. Molecular Operating Environment (MOE), 2013.08. Chemical Computing Group ULC, 1010
Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7 (2017)
202. Dimitropoulos, D., Ionides, J., Henrick, K.: Using PDBeChem to search the PDB ligand
dictionary. Curr. Protoc. Bioinform. 14.13.11–14.13.13 (2006)
203. Irwin, J.J., Sterling, T., Mysinger, M.M., Bolstad, E.S., Coleman, R.G.: ZINC: a free tool to
discover chemistry for biology. J. Chem. Inf. Model. (2012). https://doi.org/10.1021/ci3001277
204. Sterling, T., Irwin, J.J.: ZINC 15–Ligand discovery for everyone. J. Chem. Inf. Model. 55(11),
2324–2337 (2015). https://doi.org/10.1021/acs.jcim.5b00559
205. Li, Q., Cheng, T., Wang, Y., Bryant, S.H.: PubChem as a public resource for drug discovery.
Drug Discov. Today 15(23–24), 1052–1057 (2010).
https://doi.org/10.1016/j.drudis.2010.10.003
206. Kim, S., Thiessen, P.A., Bolton, E.E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He,
S., Shoemaker, B.A., Wang, J., Yu, B., Zhang, J., Bryant, S.H.: PubChem substance and
compound databases. Nucleic Acids Res. 44(D1), D1202–D1213 (2016).
https://doi.org/10.1093/nar/gkv951
207. Liu, T., Lin, Y., Wen, X., Jorissen, R.N., Gilson, M.K.: BindingDB: a web-accessible
database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res.
35(Database issue), D198–201 (2007). https://doi.org/10.1093/nar/gkl999
208. Gaulton, A., Bellis, L.J., Bento, A.P., Chambers, J., Davies, M., Hersey, A., Light, Y.,
McGlinchey, S., Michalovich, D., Al-Lazikani, B., Overington, J.P.: ChEMBL: a large-scale
bioactivity database for drug discovery. Nucleic Acids Res. 40(Database issue), D1100–1107
(2012). https://doi.org/10.1093/nar/gkr777
209. Zsoldos, Z., Reid, D., Simon, A., Sadjad, B.S., Johnson, A.P.: eHITS: an innovative approach
to the docking and scoring function problems. Curr. Protein Pept. Sci. 7(5), 421–435 (2006)
210. Vaque, M., Ardrevol, A., Blade, C., Salvado, M.J., Blay, M., Fernandez-Larrea, J., Arola,
L., Pujadas, G.: Protein-ligand docking: a review of recent advances and future perspectives.
Curr. Pharm. Anal. 4(1), 1–19 (2008)
211. Curco, D., Rodriguez-Ropero, F., Aleman, C.: Force-field parametrization of retro-inverso
modified residues: development of torsional and electrostatic parameters. J. Comput. Aided
Mol. Des. 20(1), 13–25 (2006). https://doi.org/10.1007/s10822-005-9032-0
212. Bohm, H.J.: The computer program LUDI: a new method for the de novo design of enzyme
inhibitors. J. Comput. Aided Mol. Des. 6(1), 61–78 (1992)
213. Ewing, T.J.A., Kuntz, I.D.: Critical evaluation of search algorithms for automated molecular
docking and database screening. J. Comput. Chem. 18(9), 1175–1189 (1997)
214. Rarey, M., Kramer, B., Lengauer, T., Klebe, G.: A fast flexible docking method using an
incremental construction algorithm. J. Mol. Biol. 261(3), 470–489 (1996)
215. Mizutani, M.Y., Tomioka, N., Itai, A.: Rational automatic search method for stable docking
models of protein and ligand. J. Mol. Biol. 243(2), 310–326 (1994)
216. Halgren, T.A., Murphy, R.B., Friesner, R.A., Beard, H.S., Frye, L.L., Pollard, W.T., Banks,
J.L.: Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment
factors in database screening. J. Med. Chem. 47(7), 1750–1759 (2004).
https://doi.org/10.1021/jm030644s
217. Friesner, R.A., Banks, J.L., Murphy, R.B., Halgren, T.A., Klicic, J.J., Mainz, D.T., Repasky,
M.P., Knoll, E.H., Shelley, M., Perry, J.K., Shaw, D.E., Francis, P., Shenkin, P.S.: Glide: a
new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking
accuracy. J. Med. Chem. 47(7), 1739–1749 (2004). https://doi.org/10.1021/jm0306430
218. McGann, M.R., Almond, H.R., Nicholls, A., Grant, J.A., Brown, F.K.: Gaussian docking
functions. Biopolymers 68(1), 76–90 (2003). https://doi.org/10.1002/bip.10207
219. Abagyan, R., Totrov, M., Kuznetsov, D.: ICM - a new method for protein modeling and
design—applications to docking and structure prediction from the distorted native
conformation. J. Comput. Chem. 15(5), 488–506 (1994)
220. McMartin, C., Bohacek, R.S.: QXP: powerful, rapid computer algorithms for structure-based
drug design. J. Comput. Aided Mol. Des. 11(4), 333–344 (1997)
221. Trosset, J.Y., Scheraga, H.A.: PRODOCK: software package for protein modeling and
docking. J. Comput. Chem. 20(4), 412–427 (1999)
222. Liu, M., Wang, S.M.: MCDOCK: A Monte Carlo simulation approach to the molecular
docking problem. J. Comput. Aided Mol. Des. 13(5), 435–451 (1999)
223. Jones, G., Willett, P., Glen, R.C., Leach, A.R., Taylor, R.: Development and validation of a
genetic algorithm for flexible docking. J. Mol. Biol. 267(3), 727–748 (1997)
224. Namasivayam, V., Gunther, R.: A fast flexible molecular docking program based on swarm
intelligence. Chem. Biol. Drug Des. 70(6), 475–484 (2007).
https://doi.org/10.1111/j.1747-0285.2007.00588.x
225. Grosdidier, A., Zoete, V., Michielin, O.: SwissDock, a protein-small molecule docking web
service based on EADock DSS. Nucleic Acids Res. 39, W270–W277 (2011).
https://doi.org/10.1093/nar/gkr366
226. Pasznik, P., Rutkowska, E., Niewieczerzal, S., Cielecka-Piontek, J., Filipek, S., Latek, D.:
GUT-DOCK—a web-service to predict off-target interactions of drugs with gut hormone
GPCRs. Submitted
227. Labbe, C.M., Rey, J., Lagorce, D., Vavrusa, M., Becot, J., Sperandio, O., Villoutreix, B.O.,
Tuffery, P., Miteva, M.A.: MTiOpenScreen: a web server for structure-based virtual screening.
Nucleic Acids Res. 43(W1), W448–W454 (2015). https://doi.org/10.1093/nar/gkv306
228. Wang, R.X., Liu, L., Lai, L.H., Tang, Y.Q.: SCORE: a new empirical method for estimating
the binding affinity of a protein-ligand complex. J. Mol. Model. 4(12), 379–394 (1998)
229. Eldridge, M.D., Murray, C.W., Auton, T.R., Paolini, G.V., Mee, R.P.: Empirical scoring
functions. 1. The development of a fast empirical scoring function to estimate the binding affinity
of ligands in receptor complexes. J. Comput. Aided Mol. Des. 11(5), 425–445 (1997)
230. Gohlke, H., Hendlich, M., Klebe, G.: Knowledge-based scoring function to predict protein-
ligand interactions. J. Mol. Biol. 295(2), 337–356 (2000)
231. DeWitte, R.S., Shakhnovich, E.: SMoG: De novo design method based on simple, fast and
accurate free energy estimates. Abstr. Pap. Am. Chem. Soc. 214, 6-Comp (1997)
232. DeWitte, R.S., Ishchenko, A.V., Shakhnovich, E.I.: SMoG: De novo design method based on
simple, fast, and accurate free energy estimates. 2. Case studies in molecular design. J. Am.
Chem. Soc. 119(20), 4608–4617 (1997)
233. Mitchell, J.B.O., Laskowski, R.A., Alex, A., Thornton, J.M.: BLEEP—potential of mean
force describing protein-ligand interactions: I. Generating potential. J. Comput. Chem. 20(11),
1165–1176 (1999)
234. Mitchell, J.B.O., Laskowski, R.A., Alex, A., Forster, M.J., Thornton, J.M.: BLEEP - Potential
of mean force describing protein-ligand interactions: II. Calculation of binding energies and
comparison with experimental data. J. Comput. Chem. 20(11), 1177–1185 (1999)
235. Mooij, W.T.M., Verdonk, M.L.: General and targeted statistical potentials for protein-ligand
interactions. Proteins 61(2), 272–287 (2005). https://doi.org/10.1002/Prot.20588
236. Sherman, W., Day, T., Jacobson, M.P., Friesner, R.A., Farid, R.: Novel procedure for modeling
ligand/receptor induced fit effects. J. Med. Chem. 49(2), 534–553 (2006).
https://doi.org/10.1021/jm050540c
237. Hanson, M.A., Roth, C.B., Jo, E., Griffith, M.T., Scott, F.L., Reinhart, G., Desale, H., Clemons,
B., Cahalan, S.M., Schuerer, S.C., Sanna, M.G., Han, G.W., Kuhn, P., Rosen, H., Stevens,
R.C.: Crystal structure of a lipid G protein-coupled receptor. Science 335(6070), 851–855
(2012). https://doi.org/10.1126/science.1215904
238. Shoichet, B.K., Kobilka, B.K.: Structure-based drug screening for G-protein-coupled
receptors. Trends Pharmacol. Sci. 33(5), 268–272 (2012).
https://doi.org/10.1016/j.tips.2012.03.007
239. Kandt, C., Schlitter, J., Gerwert, K.: Dynamics of water molecules in the bacteriorhodopsin
trimer in explicit lipid/water environment. Biophys. J. 86(2), 705–717 (2004).
https://doi.org/10.1016/S0006-3495(04)74149-1
240. Lemkul, J.A., Allen, W.J., Bevan, D.R.: Practical considerations for building GROMOS-
compatible small-molecule topologies. J. Chem. Inf. Model. 50(12), 2221–2235 (2010).
https://doi.org/10.1021/Ci100335w
241. Malde, A.K., Zuo, L., Breeze, M., Stroet, M., Poger, D., Nair, P.C., Oostenbrink, C., Mark,
A.E.: An automated force field topology builder (ATB) and repository: Version 1.0. J. Chem.
Theory Comput. 7(12), 4026–4037 (2011). https://doi.org/10.1021/ct200196m
242. Schuttelkopf, A.W., van Aalten, D.M.F.: PRODRG: a tool for high-throughput
crystallography of protein-ligand complexes. Acta Crystallogr. Sect. D-Biol. Crystallogr. 60, 1355–1363
(2004). https://doi.org/10.1107/S0907444904011679
243. Zoete, V., Cuendet, M.A., Grosdidier, A., Michielin, O.: SwissParam: a fast force field
generation tool for small organic molecules. J. Comput. Chem. 32(11), 2359–2368 (2011).
https://doi.org/10.1002/jcc.21816
244. Vanommeslaeghe, K., Hatcher, E., Acharya, C., Kundu, S., Zhong, S., Shim, J., Darian, E.,
Guvench, O., Lopes, P., Vorobyov, I., Mackerell Jr., A.D.: CHARMM general force field: a
force field for drug-like molecules compatible with the CHARMM all-atom additive biological
force fields. J. Comput. Chem. 31(4), 671–690 (2010). https://doi.org/10.1002/jcc.21367
245. Ribeiro, A.A.S.T., Horta, B.A.C., de Alencastro, R.B.: MKTOP: a program for automatic
construction of molecular topologies. J. Brazil Chem. Soc. 19(7), 1433–1435 (2008)
246. Sousa da Silva, A.W., Vranken, W.F., Laue, E.: ACPYPE—AnteChamber PYthon Parser interfacE.
In
247. Sousa da Silva, A.W., Vranken, W.F.: ACPYPE—anteChamber PYthon parser interfacE.
BMC Res. Notes 5, 367 (2012). https://doi.org/10.1186/1756-0500-5-367
248. Jakalian, A., Jack, D.B., Bayly, C.I.: Fast, efficient generation of high-quality atomic charges.
AM1-BCC model: II. Parameterization and validation. J. Comput. Chem. 23(16), 1623–1641
(2002). https://doi.org/10.1002/Jcc.10128
249. Caleman, C., van Maaren, P.J., Hong, M.Y., Hub, J.S., Costa, L.T., van der Spoel, D.: Force
field benchmark of organic liquids: density, enthalpy of vaporization, heat capacities, surface
tension, isothermal compressibility, volumetric expansion coefficient, and dielectric constant.
J. Chem. Theory Comput. 8(1), 61–74 (2012). https://doi.org/10.1021/Ct200731v
250. van der Spoel, D., van Maaren, P.J., Caleman, C.: GROMACS molecule & liquid database.
Bioinformatics 28(5), 752–753 (2012). https://doi.org/10.1093/bioinformatics/bts020
251. Domanski, J., Stansfeld, P.J., Sansom, M.S., Beckstein, O.: Lipidbook: a public repository
for force-field parameters used in membrane simulations. J. Membr. Biol. 236(3), 255–258
(2010). https://doi.org/10.1007/s00232-010-9296-8
252. Adamian, L., Naveed, H., Liang, J.: Lipid-binding surfaces of membrane proteins:
evidence from evolutionary and structural analysis. Biochim. Biophys. Acta 1808(4), 1092–1102
(2011). https://doi.org/10.1016/j.bbamem.2010.12.008
253. Opekarova, M., Tanner, W.: Specific lipid requirements of membrane proteins—a putative
bottleneck in heterologous expression. Biochim. Biophys. Acta-Biomembr. 1610(1), 11–22
(2003). https://doi.org/10.1016/S0005-2736(02)00708-3
254. Sanders, C.R., Mittendorf, K.F.: Tolerance to changes in membrane lipid composition as a
selected trait of membrane proteins. Biochemistry 50(37), 7858–7867 (2011).
https://doi.org/10.1021/bi2011527
255. Berger, C., Ho, J.T.C., Kimura, T., Hess, S., Gawrisch, K., Yeliseev, A.: Preparation of stable
isotope-labeled peripheral cannabinoid receptor CB2 by bacterial fermentation. Protein Expr.
Purif. 70(2), 236–247 (2010). https://doi.org/10.1016/j.pep.2009.12.011
256. Soubias, O., Gawrisch, K.: The role of the lipid matrix for structure and function of the
GPCR rhodopsin. Biochim. Biophys. Acta 1818(2), 234–240 (2012).
https://doi.org/10.1016/j.bbamem.2011.08.034
257. Lee, S.Y., Lee, A., Chen, J.Y., MacKinnon, R.: Structure of the KvAP voltage-dependent
K+ channel and its dependence on the lipid membrane. Proc. Natl. Acad. Sci. U.S.A. 102(43),
15441–15446 (2005). https://doi.org/10.1073/pnas.0507651102
258. Oostenbrink, C., Villa, A., Mark, A.E., Van Gunsteren, W.F.: A biomolecular force field based
on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5
and 53A6. J. Comput. Chem. 25(13), 1656–1676 (2004). https://doi.org/10.1002/jcc.20090
259. Scott, W.R.P., Hunenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda,
A.E., Huber, T., Kruger, P., van Gunsteren, W.F.: The GROMOS biomolecular simulation
program package. J. Phys. Chem. A 103(19), 3596–3607 (1999)
260. Foloppe, N., MacKerell, A.D.: All-atom empirical force field for nucleic acids: I. Parameter
optimization based on small molecule and condensed phase macromolecular target data. J.
Comput. Chem. 21(2), 86–104 (2000)
261. Klauda, J.B., Venable, R.M., Freites, J.A., O’Connor, J.W., Tobias, D.J., Mondragon-Ramirez,
C., Vorobyov, I., MacKerell Jr., A.D., Pastor, R.W.: Update of the CHARMM all-atom additive
force field for lipids: validation on six lipid types. J. Phys. Chem. B 114(23), 7830–7843
(2010). https://doi.org/10.1021/jp101759q
262. MacKerell, A.D., Bashford, D., Bellott, M., Dunbrack, R.L., Evanseck, J.D., Field, M.J.,
Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera, K., Lau,
F.T.K., Mattos, C., Michnick, S., Ngo, T., Nguyen, D.T., Prodhom, B., Reiher, W.E., Roux,
B., Schlenkrich, M., Smith, J.C., Stote, R., Straub, J., Watanabe, M., Wiorkiewicz-Kuczera,
J., Yin, D., Karplus, M.: All-atom empirical potential for molecular modeling and dynamics
studies of proteins. J. Phys. Chem. B 102(18), 3586–3616 (1998)
263. Wang, J.M., Wolf, R.M., Caldwell, J.W., Kollman, P.A., Case, D.A.: Development and testing
of a general amber force field. J. Comput. Chem. 25(9), 1157–1174 (2004)
264. Jorgensen, W.L., Maxwell, D.S., Tirado-Rives, J.: Development and testing of the OPLS all-
atom force field on conformational energetics and properties of organic liquids. J. Am. Chem.
Soc. 118(45), 11225–11236 (1996)
265. Kaminski, G.A., Friesner, R.A., Tirado-Rives, J., Jorgensen, W.L.: Evaluation and
reparametrization of the OPLS-AA force field for proteins via comparison with accurate
quantum chemical calculations on peptides. J. Phys. Chem. B 105(28), 6474–6487 (2001).
https://doi.org/10.1021/Jp003919d
266. Jambeck, J.P., Lyubartsev, A.P.: Derivation and systematic validation of a refined all-atom
force field for phosphatidylcholine lipids. J. Phys. Chem. B 116(10), 3164–3179 (2012).
https://doi.org/10.1021/jp212503e
267. Marrink, S.J., Risselada, H.J., Yefimov, S., Tieleman, D.P., de Vries, A.H.: The MARTINI
force field: coarse grained model for biomolecular simulations. J. Phys. Chem. B 111(27),
7812–7824 (2007). https://doi.org/10.1021/jp071097f
268. Sansom, M.S.P., Scott, K.A., Bond, P.J.: Coarse-grained simulation: a high-throughput
computational approach to membrane proteins. Biochem. Soc. Trans. 36, 27–32 (2008).
https://doi.org/10.1042/bst0360027
269. Scott, K.A., Bond, P.J., Ivetac, A., Chetwynd, A.P., Khalid, S., Sansom, M.S.P.: Coarse-
grained MD simulations of membrane protein-bilayer self-assembly. Structure 16(4), 621–630
(2008). https://doi.org/10.1016/j.str.2008.01.014
270. Berendsen, H.J.C., van der Spoel, D., van Drunen, R.: GROMACS: a message-passing parallel
molecular dynamics implementation. Comput. Phys. Commun. 91(1–3), 43–56 (1995)
271. Hess, B., Kutzner, C., van der Spoel, D., Lindahl, E.: GROMACS 4: algorithms for highly
efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3),
435–447 (2008)
272. Lindahl, E., Hess, B., van der Spoel, D.: GROMACS 3.0: a package for molecular simulation
and trajectory analysis. J. Mol. Model. 7(8), 306–317 (2001)
273. Van der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A.E., Berendsen, H.J.C.:
GROMACS: fast, flexible, and free. J. Comput. Chem. 26(16), 1701–1718 (2005).
https://doi.org/10.1002/jcc.20291
274. Abraham, M.J., Murtola, T., Schulz, R., Páll, S., Smith, J.C., Hess, B., Lindahl, E.:
GROMACS: high performance molecular simulations through multi-level parallelism from
laptops to supercomputers. SoftwareX 1–2, 19–25 (2015).
https://doi.org/10.1016/j.softx.2015.06.001
275. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel,
R.D., Kale, L., Schulten, K.: Scalable molecular dynamics with NAMD. J. Comput. Chem.
26(16), 1781–1802 (2005)
276. Brooks, B.R., Brooks 3rd, C.L., MacKerell Jr., A.D., Nilsson, L., Petrella, R.J., Roux, B., Won, Y., Archontis,
G., Bartels, C., Boresch, S., Caflisch, A., Caves, L., Cui, Q., Dinner, A.R., Feig, M., Fischer,
S., Gao, J., Hodoscek, M., Im, W., Kuczera, K., Lazaridis, T., Ma, J., Ovchinnikov, V., Paci,
E., Pastor, R.W., Post, C.B., Pu, J.Z., Schaefer, M., Tidor, B., Venable, R.M., Woodcock,
H.L., Wu, X., Yang, W., York, D.M., Karplus, M.: CHARMM: the biomolecular simulation
program. J. Comput. Chem. 30(10), 1545–1614 (2009)
277. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A.,
Simmerling, C., Wang, B., Woods, R.J.: The amber biomolecular simulation programs. J.
Comput. Chem. 26(16), 1668–1688 (2005). https://doi.org/10.1002/Jcc.20290
278. Jo, S., Kim, T., Iyer, V.G., Im, W.: CHARMM-GUI: a web-based graphical user interface for
CHARMM. J. Comput. Chem. 29(11), 1859–1865 (2008). https://doi.org/10.1002/jcc.20945
279. Jo, S., Lim, J.B., Klauda, J.B., Im, W.: CHARMM-GUI membrane builder for mixed bilayers
and its application to yeast membranes. Biophys. J. 97(1), 50–58 (2009).
https://doi.org/10.1016/j.bpj.2009.04.013
280. Jo, S., Kim, T., Im, W.: Automated builder and database of protein/membrane complexes
for molecular dynamics simulations. PLoS ONE 2(9), e880 (2007).
https://doi.org/10.1371/journal.pone.0000880
281. Humphrey, W., Dalke, A., Schulten, K.: VMD: visual molecular dynamics. J. Mol. Graph.
Model. 14(1), 33–38 (1996)
282. Kandt, C., Ash, W.L., Tieleman, D.P.: Setting up and running molecular dynamics simulations
of membrane proteins. Methods 41(4), 475–488 (2007).
https://doi.org/10.1016/j.ymeth.2006.08.006
283. Wolf, M.G., Hoefling, M., Aponte-Santamaria, C., Grubmuller, H., Groenhof, G.: g_membed:
efficient insertion of a membrane protein into an equilibrated lipid bilayer with minimal
perturbation. J. Comput. Chem. 31(11), 2169–2174 (2010). https://doi.org/10.1002/jcc.21507
284. Krieger, E., Darden, T., Nabuurs, S.B., Finkelstein, A., Vriend, G.: Making optimal use
of empirical energy functions: force-field parameterization in crystal space. Proteins 57(4),
678–683 (2004)
285. Wassenaar, T.A., Ingolfsson, H.I., Bockmann, R.A., Tieleman, D.P., Marrink, S.J.:
Computational lipidomics with insane: a versatile tool for generating custom membranes for molecular
simulations. J. Chem. Theory Comput. 11(5), 2144–2155 (2015).
https://doi.org/10.1021/acs.jctc.5b00209
286. Wassenaar, T.A., Pluhackova, K., Bockmann, R.A., Marrink, S.J., Tieleman, D.P.: Going
backward: a flexible geometric approach to reverse transformation from coarse grained to
atomistic models. J. Chem. Theory Comput. 10(2), 676–690 (2014).
https://doi.org/10.1021/ct400617g
287. Stansfeld, P.J., Goose, J.E., Caffrey, M., Carpenter, E.P., Parker, J.L., Newstead, S., Sansom,
M.S.: MemProtMD: automated insertion of membrane protein structures into explicit lipid
membranes. Structure 23(7), 1350–1361 (2015). https://doi.org/10.1016/j.str.2015.05.006
288. Qi, Y., Ingolfsson, H.I., Cheng, X., Lee, J., Marrink, S.J., Im, W.: CHARMM-GUI Martini
maker for coarse-grained simulations with the Martini force field. J. Chem. Theory Comput.
11(9), 4486–4494 (2015). https://doi.org/10.1021/acs.jctc.5b00513
289. Wu, E.L., Cheng, X., Jo, S., Rui, H., Song, K.C., Davila-Contreras, E.M., Qi, Y., Lee, J.,
Monje-Galvan, V., Venable, R.M., Klauda, J.B., Im, W.: CHARMM-GUI membrane builder
toward realistic biological membrane simulations. J. Comput. Chem. 35(27), 1997–2004
(2014). https://doi.org/10.1002/jcc.23702
290. Ribeiro, J.V., Bernardi, R.C., Rudack, T., Stone, J.E., Phillips, J.C., Freddolino, P.L., Schulten,
K.: QwikMD—integrative molecular dynamics toolkit for novices and experts. Sci. Rep. 6,
26536 (2016). https://doi.org/10.1038/srep26536
291. Humphrey, W., Dalke, A., Schulten, K.: VMD: visual molecular dynamics. J. Mol. Graph. 14(1),
33–38, 27–28 (1996)
292. Doerr, S., Harvey, M.J., Noe, F., De Fabritiis, G.: HTMD: high-throughput molecular
dynamics for molecular discovery. J. Chem. Theory Comput. 12(4), 1845–1852 (2016).
https://doi.org/10.1021/acs.jctc.6b00049
293. Lu, H., Isralewitz, B., Krammer, A., Vogel, V., Schulten, K.: Unfolding of titin
immunoglobulin domains by steered molecular dynamics simulation. Biophys. J. 75(2), 662–671 (1998).
https://doi.org/10.1016/S0006-3495(98)77556-3
294. Kappel, C., Grubmuller, H.: Velocity-dependent mechanical unfolding of bacteriorhodopsin
is governed by a dynamic interaction network. Biophys. J. 100(4), 1109–1119 (2011).
https://doi.org/10.1016/j.bpj.2011.01.004
295. Grubmuller, H., Heymann, B., Tavan, P.: Ligand binding: molecular mechanics calculation
of the streptavidin-biotin rupture force. Science 271(5251), 997–999 (1996)
296. Wriggers, W., Schulten, K.: Stability and dynamics of G-actin: back-door water diffusion
and behavior of a subdomain 3/4 loop. Biophys. J. 73(2), 624–639 (1997).
https://doi.org/10.1016/S0006-3495(97)78098-6
297. Izrailev, S., Stepaniants, S., Isralewitz, B., Kosztin, D., Lu, H., Molnar, F., Wriggers, W.,
Schulten, K.: Steered molecular dynamics. In: Deuflhard, P., Hermans, J., Leimkuhler, B.,
Mark, A.E., Reich, S., Skeel, R.D. (eds.) Computational Molecular Dynamics: Challenges,
Methods, Ideas, vol. 4. pp. 39–65. Springer, Berlin (1998)
298. Izrailev, S., Stepaniants, S., Balsera, M., Oono, Y., Schulten, K.: Molecular dynamics study
of unbinding of the avidin-biotin complex. Biophys. J. 72(4), 1568–1581 (1997).
https://doi.org/10.1016/S0006-3495(97)78804-0
299. Fanelli, F., Seeber, M.: Structural insights into retinitis pigmentosa from unfolding
simulations of rhodopsin mutants. FASEB J. 24(9), 3196–3209 (2010).
https://doi.org/10.1096/fj.09-151084
300. Isralewitz, B., Izrailev, S., Schulten, K.: Binding pathway of retinal to bacterio-opsin: a
prediction by molecular dynamics simulations. Biophys. J. 73(6), 2972–2979 (1997).
https://doi.org/10.1016/S0006-3495(97)78326-7
301. Wroblowski, B., Diaz, J.F., Schlitter, J., Engelborghs, Y.: Modelling pathways of alpha-
chymotrypsin activation and deactivation. Protein Eng. 10(10), 1163–1174 (1997)
302. Cheng, X., Wang, H., Grant, B., Sine, S.M., McCammon, J.A.: Targeted molecular dynamics
study of C-loop closure and channel gating in nicotinic receptors. PLoS Comput. Biol. 2(9),
e134 (2006). https://doi.org/10.1371/journal.pcbi.0020134
303. Grayson, P., Tajkhorshid, E., Schulten, K.: Mechanisms of selectivity in channels and enzymes
studied with interactive molecular dynamics. Biophys. J. 85(1), 36–48 (2003).
https://doi.org/10.1016/S0006-3495(03)74452-X
304. Sabbadin, D., Moro, S.: Supervised molecular dynamics (SuMD) as a helpful tool to depict
GPCR-ligand recognition pathway in a nanosecond time scale. J. Chem. Inf. Model. 54(2),
372–376 (2014). https://doi.org/10.1021/ci400766b
305. Jakowiecki, J., Filipek, S.: Hydrophobic ligand entry and exit pathways of the CB1
cannabinoid receptor. J. Chem. Inf. Model. 56(12), 2457–2466 (2016).
https://doi.org/10.1021/acs.jcim.6b00499
306. Deganutti, G., Cuzzolin, A., Ciancetta, A., Moro, S.: Understanding allosteric interactions in G
protein-coupled receptors using supervised molecular dynamics: a prototype study analysing
the human A3 adenosine receptor positive allosteric modulator LUF6000. Bioorg. Med. Chem.
23(14), 4065–4071 (2015). https://doi.org/10.1016/j.bmc.2015.03.039
307. Deganutti, G., Moro, S.: Supporting the identification of novel fragment-based positive
allosteric modulators using a supervised molecular dynamics approach: a retrospective
analysis considering the human A2A adenosine receptor as a key example. Molecules 22(5) (2017).
https://doi.org/10.3390/molecules22050818
308. Paoletta, S., Sabbadin, D., von Kugelgen, I., Hinz, S., Katritch, V., Hoffmann, K.,
Abdelrahman, A., Strassburger, J., Baqi, Y., Zhao, Q., Stevens, R.C., Moro, S., Muller, C.E., Jacobson,
K.A.: Modeling ligand recognition at the P2Y12 receptor in light of X-ray structural
information. J. Comput. Aided Mol. Des. 29(8), 737–756 (2015).
https://doi.org/10.1007/s10822-015-9858-z
309. Cuzzolin, A., Sturlese, M., Deganutti, G., Salmaso, V., Sabbadin, D., Ciancetta, A., Moro, S.:
Deciphering the complexity of ligand-protein recognition pathways using supervised
molecular dynamics (SuMD) simulations. J. Chem. Inf. Model. 56(4), 687–705 (2016).
https://doi.org/10.1021/acs.jcim.5b00702
310. Fotiadis, D., Liang, Y., Filipek, S., Saperstein, D.A., Engel, A., Palczewski, K.: Atomic-force
microscopy: rhodopsin dimers in native disc membranes. Nature 421(6919), 127–128 (2003).
https://doi.org/10.1038/421127a
311. Gorman, P.M., Kim, S., Guo, M., Melnyk, R.A., McLaurin, J., Fraser, P.E., Bowie, J.U.,
Chakrabartty, A.: Dimerization of the transmembrane domain of amyloid precursor proteins
and familial Alzheimer’s disease mutants. BMC Neurosci. 9, 17 (2008).
https://doi.org/10.1186/1471-2202-9-17
312. George, S.R., O’Dowd, B.F., Lee, S.P.: G-protein-coupled receptor oligomerization and its
potential for drug discovery. Nat. Rev. Drug Discov. 1(10), 808–820 (2002).
https://doi.org/10.1038/nrd913
313. De Strooper, B.: Aph-1, Pen-2, and Nicastrin with Presenilin generate an active gamma-
Secretase complex. Neuron 38(1), 9–12 (2003)
314. Janin, J.: Protein-protein docking tested in blind predictions: the CAPRI experiment. Mol.
BioSyst. 6(12), 2351–2362 (2010). https://doi.org/10.1039/c005060c
315. Moreira, I.S., Fernandes, P.A., Ramos, M.J.: Protein-protein docking dealing with the
unknown. J. Comput. Chem. 31(2), 317–342 (2010). https://doi.org/10.1002/jcc.21276
316. Zacharias, M.: Accounting for conformational changes during protein-protein docking. Curr.
Opin. Struct. Biol. 20(2), 180–186 (2010). https://doi.org/10.1016/j.sbi.2010.02.001
317. Comeau, S.R., Gatchell, D.W., Vajda, S., Camacho, C.J.: ClusPro: a fully automated algorithm
for protein-protein docking. Nucleic Acids Res. 32(Web Server issue), W96–99 (2004).
https://doi.org/10.1093/nar/gkh354
318. Comeau, S.R., Gatchell, D.W., Vajda, S., Camacho, C.J.: ClusPro: an automated docking and
discrimination method for the prediction of protein complexes. Bioinformatics 20(1), 45–50
(2004)
319. Kozakov, D., Brenke, R., Comeau, S.R., Vajda, S.: PIPER: an FFT-based protein docking
program with pairwise potentials. Proteins 65(2), 392–406 (2006).
https://doi.org/10.1002/prot.21117
320. Kozakov, D., Beglov, D., Bohnuud, T., Mottarella, S.E., Xia, B., Hall, D.R., Vajda, S.: How
good is automated protein docking? Proteins 81(12), 2159–2166 (2013).
https://doi.org/10.1002/prot.24403
321. Kozakov, D., Hall, D.R., Xia, B., Porter, K.A., Padhorny, D., Yueh, C., Beglov, D., Vajda, S.:
The ClusPro web server for protein-protein docking. Nat. Protoc. 12(2), 255–278 (2017).
https://doi.org/10.1038/nprot.2016.169
322. Tovchigrechko, A., Vakser, I.A.: GRAMM-X public web server for protein-protein docking.
Nucleic Acids Res. 34(Web Server issue), W310–314 (2006).
https://doi.org/10.1093/nar/gkl206
323. Pierce, B.G., Hourai, Y., Weng, Z.: Accelerating protein docking in ZDOCK using an advanced
3D convolution library. PLoS ONE 6(9), e24657 (2011).
https://doi.org/10.1371/journal.pone.0024657
324. Chen, R., Li, L., Weng, Z.: ZDOCK: an initial-stage protein-docking algorithm. Proteins
52(1), 80–87 (2003). https://doi.org/10.1002/prot.10389
325. Li, L., Chen, R., Weng, Z.: RDOCK: refinement of rigid-body protein docking predictions.
Proteins 53(3), 693–707 (2003). https://doi.org/10.1002/prot.10460
326. Chaudhury, S., Gray, J.J.: Conformer selection and induced fit in flexible backbone protein-
protein docking using computational and NMR ensembles. J. Mol. Biol. 381(4), 1068–1087
(2008). https://doi.org/10.1016/j.jmb.2008.05.042
327. Lyskov, S., Gray, J.J.: The RosettaDock server for local protein-protein docking. Nucleic
Acids Res. 36(Web Server issue), W233–238 (2008). https://doi.org/10.1093/nar/gkn216
328. Gray, J.J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C.A., Baker,
D.: Protein-protein docking with simultaneous optimization of rigid-body displacement and
side-chain conformations. J. Mol. Biol. 331(1), 281–299 (2003)
329. Lyskov, S., Chou, F.C., Conchuir, S.O., Der, B.S., Drew, K., Kuroda, D., Xu, J., Weitzner,
B.D., Renfrew, P.D., Sripakdeevong, P., Borgo, B., Havranek, J.J., Kuhlman, B., Kortemme,
T., Bonneau, R., Gray, J.J., Das, R.: Serverification of molecular modeling applications: the
Rosetta Online Server that Includes Everyone (ROSIE). PLoS ONE 8(5), e63906 (2013).
https://doi.org/10.1371/journal.pone.0063906
330. Chaudhury, S., Berrondo, M., Weitzner, B.D., Muthu, P., Bergman, H., Gray, J.J.: Benchmark-
ing and analysis of protein docking performance in Rosetta v3.2. PLoS ONE 6(8), e22477
(2011). https://doi.org/10.1371/journal.pone.0022477
331. de Vries, S.J., van Dijk, M., Bonvin, A.M.: The HADDOCK web server for data-driven
biomolecular docking. Nat. Protoc. 5(5), 883–897 (2010). https://doi.org/10.1038/nprot.2010.
32
332. Karaca, E., Melquiond, A.S., de Vries, S.J., Kastritis, P.L., Bonvin, A.M.: Building macro-
molecular assemblies by information-driven docking: introducing the HADDOCK multibody
docking server. Mol. Cell. Proteomics: MCP 9(8), 1784–1794 (2010). https://doi.org/10.1074/
mcp.M000051-MCP201
333. de Vries, S.J., van Dijk, A.D., Krzeminski, M., van Dijk, M., Thureau, A., Hsu, V., Wassenaar,
T., Bonvin, A.M.: HADDOCK versus HADDOCK: new features and performance of HAD-
DOCK2.0 on the CAPRI targets. Proteins 69(4), 726–733 (2007). https://doi.org/10.1002/
prot.21723
334. Dominguez, C., Boelens, R., Bonvin, A.M.: HADDOCK: a protein-protein docking approach
based on biochemical or biophysical information. J. Am. Chem. Soc. 125(7), 1731–1737
(2003). https://doi.org/10.1021/ja026939x
335. van Zundert, G.C.P., Rodrigues, J., Trellet, M., Schmitz, C., Kastritis, P.L., Karaca, E.,
Melquiond, A.S.J., van Dijk, M., de Vries, S.J., Bonvin, A.: The HADDOCK2.2 Web Server:
user-friendly integrative modeling of biomolecular complexes. J. Mol. Biol. 428(4), 720–725
(2016). https://doi.org/10.1016/j.jmb.2015.09.014
336. Schneidman-Duhovny, D., Inbar, Y., Nussinov, R., Wolfson, H.J.: PatchDock and SymmDock:
servers for rigid and symmetric docking. Nucleic Acids Res. 33(Web Server issue), W363–367
(2005). https://doi.org/10.1093/nar/gki481
337. Casciari, D., Seeber, M., Fanelli, F.: Quaternary structure predictions of transmembrane pro-
teins starting from the monomer: a docking-based approach. BMC Bioinform. 7, 340 (2006).
https://doi.org/10.1186/1471-2105-7-340
338. Canals, M., Marcellino, D., Fanelli, F., Ciruela, F., de Benedetti, P., Goldberg, S.R., Neve,
K., Fuxe, K., Agnati, L.F., Woods, A.S., Ferre, S., Lluis, C., Bouvier, M., Franco, R.:
Adenosine A2A-dopamine D2 receptor-receptor heteromerization: qualitative and quantita-
tive assessment by fluorescence and bioluminescence energy transfer. J. Biol. Chem. 278(47),
46741–46749 (2003). https://doi.org/10.1074/jbc.M306451200
339. Palczewski, K., Kumasaka, T., Hori, T., Behnke, C.A., Motoshima, H., Fox, B.A., Le Trong,
I., Teller, D.C., Okada, T., Stenkamp, R.E., Yamamoto, M., Miyano, M.: Crystal structure of
rhodopsin: A G protein-coupled receptor. Science 289(5480), 739–745 (2000)
Modeling of Membrane Proteins 445

340. Lichtarge, O., Bourne, H.R., Cohen, F.E.: An evolutionary trace method defines binding
surfaces common to protein families. J. Mol. Biol. 257(2), 342–358 (1996)
341. Madabushi, S., Gross, A.K., Philippi, A., Meng, E.C., Wensel, T.G., Lichtarge, O.: Evolution-
ary trace of G protein-coupled receptors reveals clusters of residues that determine global and
class-specific functions. J. Biol. Chem. 279(9), 8126–8132 (2004). https://doi.org/10.1074/
jbc.M312671200
342. Gouldson, P.R., Higgs, C., Smith, R.E., Dean, M.K., Gkoutos, G.V., Reynolds, C.A.: Dimer-
ization and domain swapping in g-protein-coupled receptors: a computational study. Neu-
ropsychopharmacology 23(4), S60–S77 (2000)
343. Dean, M.K., Higgs, C., Smith, R.E., Bywater, R.P., Snell, C.R., Scott, P.D., Upton, G.J.G.,
Howe, T.J., Reynolds, C.A.: Dimerization of G-protein-coupled receptors. J. Med. Chem.
44(26), 4595–4614 (2001)
344. Gobel, U., Sander, C., Schneider, R., Valencia, A.: Correlated mutations and residue contacts
in proteins. Proteins 18(4), 309–317 (1994)
345. Gouldson, P.R., Dean, M.K., Snell, C.R., Bywater, R.P., Gkoutos, G., Reynolds, C.A.: Lipid-
facing correlated mutations and dimerization in G-protein coupled receptors. Protein Eng.
14(10), 759–767 (2001)
346. Filizola, M., Olmea, O., Weinstein, H.: Prediction of heterodimerization interfaces of G-
protein coupled receptors with a new subtractive correlated mutation method. Protein Eng.
15(11), 881–885 (2002)
347. Park, K., Kim, D.: Structure-based rebuilding of coevolutionary information reveals functional
modules in rhodopsin structure. Biochim. Biophys. Acta (2012). https://doi.org/10.1016/j.
bbapap.2012.05.015
348. Noivirt, O., Eisenstein, M., Horovitz, A.: Detection and reduction of evolutionary noise in
correlated mutation analysis. Protein Eng. Des. Sel. 18(5), 247–253 (2005). https://doi.org/
10.1093/protein/gzi029
349. Roux, B.: Implicit solvent models. In: Becker, O.M., MacKerell Jr, A.D., Roux, B. (eds.)
Computational Biochemistry and Biophysics. CRC Press (2001)
350. Jackson, J.D.: Classical Electrodynamics. New York (1975)
351. Landau, L.D., Lifshitz, E.M., Pitaevskii, L.P.: Electrodynamics of Continuous Media.
Butterworth-Heinenann, Boston (1982)
352. Still, W.C., Tempczyk, A., Hawley, R.C., Hendrickson, T.: Semianalytical treatment of sol-
vation for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6129 (1990)
353. Lee, B., Richards, F.M.: The interpretation of protein structures: estimation of static accesi-
bility. J. Mol. Biol. 55, 379–400 (1971)
354. Lee, M.S., Salsbury, F.R., Brooks, C.L.: Novel generalized Born methods. J. Chem. Phys.
116(24), 10606–10614 (2002). https://doi.org/10.1063/1.1480013
355. Gallicchio, E., Levy, R.M.: AGBNP: an analytic implicit solvent model suitable for molec-
ular dynamics simulations and high-resolution modeling. J. Comput. Chem. 25(4), 479–499
(2004). https://doi.org/10.1002/Jcc.10400
356. Lee, M.S., Feig, M., Salsbury, F.R., Brooks, C.L.: New analytic approximation to the standard
molecular volume definition and its application to generalized born calculations. J. Comput.
Chem. 24(11), 1348–1356 (2003). https://doi.org/10.1002/Jcc.10272
357. Lazaridis, T., Karplus, M.: Effective energy function for proteins in solution. Proteins 35(2),
133–152 (1999)
358. Spassov, V.Z., Yan, L., Szalma, S.: Introducing an implicit membrane in generalized
Born/solvent accessibility continuum solvent models. J. Phys. Chem. B 106(34), 8726–8738
(2002). https://doi.org/10.1021/Jp020674r
359. Tanizaki, S., Feig, M.: A generalized Born formalism for heterogeneous dielectric environ-
ments: Application to the implicit modeling of biological membranes. J. Chem. Phys. 122(12)
(2005). doi:Artn 124706. https://doi.org/10.1063/1.1865992
360. Lazaridis, T.: Effective energy function for proteins in lipid membranes. Proteins 52(2),
176–192 (2003)
446 D. Latek et al.

361. Lazaridis, T., Karplus, M.: Discrimination of the native from misfolded protein models with
an energy function including implicit solvation. J. Mol. Biol. 288(3), 477–487 (1999)
362. Felts, A.K., Gallicchio, E., Wallqvist, A., Levy, R.M.: Distinguishing native conformations
of proteins from decoys with an effective free energy estimator based on the OPLS all-atom
force field and the surface generalized born solvent model. Proteins 48(2), 404–422 (2002).
https://doi.org/10.1002/Prot.10171
363. Rohl, C.A., Strauss, C.E., Misura, K.M., Baker, D.: Protein structure prediction using Rosetta.
Methods Enzymol. 383, 66–93 (2004). https://doi.org/10.1016/S0076-6879(04)83004-0
364. Davis, I.W., Baker, D.: RosettaLigand docking with full ligand and receptor flexibility. J. Mol.
Biol. 385(2), 381–392 (2009). https://doi.org/10.1016/j.jmb.2008.11.010
365. Im, W., Feig, M., Brooks, C.L.: An implicit membrane generalized born theory for the study
of structure, stability, and interactions of membrane proteins. Biophys. J. 85(5), 2900–2918
(2003)
366. Im, W., Brooks, C.L.: Interfacial folding and membrane insertion of designed peptides studied
by molecular dynamics simulations. Proc. Natl. Acad. Sci. U.S.A. 102(19), 6771–6776 (2005).
https://doi.org/10.1073/pnas.0408135102
367. Ulmschneider, J.P., Ulmschneider, M.B.: Folding Simulations of the transmembrane helix of
virus protein U in an implicit membrane model. J. Chem. Theory Comput. 3(6), 2335–2346
(2007). https://doi.org/10.1021/Ct700103k
368. Mottamal, M., Lazaridis, T.: Voltage-dependent energetics of alamethicin monomers in the
membrane. Biophys. Chem. 122(1), 50–57 (2006). https://doi.org/10.1016/j.bpc.2006.02.005
369. Seeber, M., Fanelli, F., Paci, E., Caflisch, A.: Sequential unfolding of individual helices of
bacterioopsin observed in molecular dynamics simulations of extraction from the purple mem-
brane. Biophys. J. 91(9), 3276–3284 (2006). https://doi.org/10.1529/biophysj.106.088591
370. Park, P.S.H., Sapra, K.T., Jastrzebska, B., Maeda, T., Maeda, A., Pulawski, W., Kono, M.,
Lem, J., Crouch, R.K., Filipek, S., Muller, D.J., Palczewski, K.: Modulation of molecular
interactions and function by rhodopsin palmitylation. Biochemistry 48(20), 4294–4304 (2009)
371. Ewald, P.P.: Die Berchnung optischer und elektrostatischer Gitterpotentiale. Ann. Phys. 64,
253–287 (1921)
372. Zhan, H., Lazaridis, T.: Influence of the membrane dipole potential on peptide binding to lipid
bilayers. Biophys. Chem. 161, 1–7 (2012). https://doi.org/10.1016/j.bpc.2011.10.002
373. Zagrovic, B., Pande, V.: Solvent viscosity dependence of the folding rate of a small protein:
distributed computing study. J. Comput. Chem. 24(12), 1432–1436 (2003). https://doi.org/
10.1002/Jcc.10297
374. Lee, M.S., Olson, M.A.: Evaluation of poisson solvation models using a hybrid
explicit/implicit solvent method. J. Phys. Chem. B 109(11), 5223–5236 (2005). https://doi.
org/10.1021/Jp046377z
375. Kelly, C.P., Cramer, C.J., Truhlar, D.G.: Adding explicit solvent molecules to continuum
solvent calculations for the calculation of aqueous acid dissociation constants. J. Phys. Chem.
A 110(7), 2493–2499 (2006). https://doi.org/10.1021/J055336f
376. Stagg, S.M., Harvey, S.C.: Exploring the flexibility of ribosome recycling factor using molec-
ular dynamics. Biophys. J. 89(4), 2659–2666 (2005). https://doi.org/10.1529/biophysj.104.
052373
377. Bast, T., Hentschke, R.: Molecular dynamics simulation of a micellar system. J. Mol. Model.
2(9), 330–340 (1996)
378. Freddolino, P.L., Arkhipov, A.S., Larson, S.B., McPherson, A., Schulten, K.: Molecular
dynamics simulations of the complete satellite tobacco mosaic virus. Structure 14(3), 437–449
(2006). https://doi.org/10.1016/j.str.2005.11.014
379. Levitt, M.: A simplified representation of protein conformations for rapid simulation of protein
folding. J. Mol. Biol. 104(1), 59–107 (1976)
380. Levitt, M., Warshel, A.: Computer simulation of protein folding. Nature 253(5494), 694–698
(1975)
381. Levinthal, C.: Are there pathways for protein folding? J. Chim. Phys. 65, 44–45 (1968)
Modeling of Membrane Proteins 447

382. Taketomi, H., Ueda, Y., Go, N.: Studies on protein folding, unfolding and fluctuations by
computer simulation. I. The effect of specific amino acid sequence represented by specific
inter-unit interactions. Int. J. Pept. Protein Res. 7(6), 445–459 (1975)
383. Ueda, Y., Taketomi, H., Gō, N.: Studies on protein folding, unfolding, and fluctuations by
computer simulation. II. A. Three-dimensional lattice model of lysozyme. Biopolymers 17(6),
1531–1548 (1978)
384. Go, N., Taketomi, H.: Studies on protein folding, unfolding and fluctuations by computer
simulation. III. Effect of short-range interactions. Int. J. Pept. Protein Res. 13(3), 235–252
(1979)
385. Go, N., Taketomi, H.: Studies on protein folding, unfolding and fluctuations by computer
simulation. IV. Hydrophobic interactions. Int. J. Pept. Protein Res. 13(5), 447–461 (1979)
386. Gay, J.G., Berne, B.J.: Modification of the overlap potential to mimic a linear site-site potential.
J. Chem. Phys. 74(6), 3316–3319 (1981)
387. Berne, B.J., Pechukas, P.: Gaussian model potentials for molecular interactions. J. Chem.
Phys. 56(8), 4213–4216 (1972)
388. Smith, G.D., Paul, W.: United atom force field for molecular dynamics simulations of 1,4-
Polybutadiene based on quantum chemistry calculations on model molecules. J. Phys. Chem.
A 102(7), 1200–1208 (1998)
389. Kale, L., Skeel, R., Bhandarkar, M., Brunner, R., Gursoy, A., Krawetz, N., Phillips, J., Shi-
nozaki, A., Varadarajan, K., Schulten, K.: NAMD2: greater scalability for parallel molecular
dynamics. J. Comput. Phys. 151(1), 283–312 (1999)
390. Takada, S.: Coarse-grained molecular simulations of large biomolecules. Curr. Opin. Struct.
Biol. 22(2), 130–137 (2012)
391. Tozzini, V.: Coarse-grained models for proteins. Curr. Opin. Struct. Biol. 15(2), 144–150
(2005)
392. Rader, A.J.: Coarse-grained models: getting more with less. Curr. Opin. Pharmacol. 10(6),
753–759 (2010)
393. Lindahl, E., Sansom, M.S.: Membrane proteins: molecular dynamics simulations. Curr. Opin.
Struct. Biol. 18(4), 425–431 (2008)
394. Shrivastava, I.H., Bahar, I.: Common mechanism of pore opening shared by five different
potassium channels. Biophys. J. 90(11), 3929–3940 (2006)
395. Cieplak, M., Filipek, S., Janovjak, H., Krzysko, K.A.: Pulling single bacteriorhodopsin out of
a membrane: comparison of simulation and experiment. Biochem. Biophys. Acta. 1758(4),
537–544 (2006)
396. Orlandini, E., Seno, F., Banavar, J.R., Laio, A., Maritan, A.: Deciphering the folding kinetics
of transmembrane helical proteins. Proc. Natl. Acad. Sci. U.S.A. 97(26), 14229–14234 (2000)
397. Marrink, S.J., de Vries, A.H., Mark, A.E.: Coarse grained model for semiquantitative lipid
simulations. J. Phys. Chem. B 108(2), 750–760 (2004)
398. Monticelli, L., Kandasamy, S.K., Periole, X., Larson, R.G., Tieleman, D.P., Marrink, S.-J.:
The MARTINI coarse-grained force field: extension to proteins. J. Chem. Theory Comput.
4(5), 819–834 (2008). https://doi.org/10.1021/ct700324x
399. Yesylevskyy, S.O., Schafer, L.V., Sengupta, D., Marrink, S.J.: Polarizable water model for
the coarse-grained MARTINI force field. PLoS Comput. Biol. 6(6), e1000810 (2010)
400. Holdbrook, D.A., Leung, Y.M., Piggot, T.J., Marius, P., Williamson, P.T., Khalid, S.: Stability
and membrane orientation of the fukutin transmembrane domain: a combined multiscale
molecular dynamics and circular dichroism study. Biochemistry 49(51), 10796–10802 (2010)
401. Schafer, L.V., de Jong, D.H., Holt, A., Rzepiela, A.J., de Vries, A.H., Poolman, B., Killian,
J.A., Marrink, S.J.: Lipid packing drives the segregation of transmembrane helices into disor-
dered lipid domains in model membranes. Proc. Natl. Acad. Sci. U.S.A. 108(4), 1343–1348
(2010)
402. Periole, X., Huber, T., Marrink, S.J., Sakmar, T.P.: G protein-coupled receptors self-assemble
in dynamics simulations of model bilayers. J. Am. Chem. Soc. 129(33), 10126–10132 (2007)
403. Bond, P.J., Sansom, M.S.P.: Bilayer deformation by the Kv channel voltage sensor domain
revealed by self-assembly simulations. Proc Natl Acad Sci USA 104(8), 2631–2636 (2007).
https://doi.org/10.1073/pnas.0606822104
448 D. Latek et al.

404. Arnarez, C., Uusitalo, J.J., Masman, M.F., Ingolfsson, H.I., de Jong, D.H., Melo, M.N., Periole,
X., de Vries, A.H., Marrink, S.J.: Dry Martini, a coarse-grained force field for lipid membrane
simulations with implicit solvent. J. Chem. Theory Comput. 11(1), 260–275 (2015). https://
doi.org/10.1021/ct500477k
405. Shih, A.Y., Arkhipov, A., Freddolino, P.L., Schulten, K.: Coarse grained protein-lipid model
with application to lipoprotein particles. J. Phys. Chem. B 110(8), 3674–3684 (2006)
406. Spijker, P., van Hoof, B., Debertrand, M., Markvoort, A.J., Vaidehi, N., Hilbers, P.A.: Coarse
grained molecular dynamics simulations of transmembrane protein-lipid systems. Int. J. Mol.
Sci. 11(6), 2393–2420 (2010)
407. Markvoort, A.J., Pieterse, K., Steijaert, M.N., Spijker, P., Hilbers, P.A.: The bilayer-vesicle
transition is entropy driven. J. Phys. Chem. B 109(47), 22649–22654 (2005)
408. Kar, P., Gopal, S.M., Cheng, Y.M., Panahi, A., Feig, M.: Transferring the PRIMO coarse-
grained force field to the membrane environment: simulations of membrane proteins and
helix-helix association. J. Chem. Theory Comput. 10(8), 3459–3472 (2014). https://doi.org/
10.1021/ct500443v
409. Kar, P., Gopal, S.M., Cheng, Y.M., Predeus, A., Feig, M.: PRIMO: a transferable coarse-
grained force field for proteins. J. Chem. Theory Comput. 9(8), 3769–3788 (2013). https://
doi.org/10.1021/ct400230y
410. Kar, P., Feig, M.: Hybrid all-atom/coarse-grained simulations of proteins by direct coupling
of CHARMM and PRIMO force fields. J. Chem. Theory Comput. 13(11), 5753–5765 (2017).
https://doi.org/10.1021/acs.jctc.7b00840
411. Májek, P., Elber, R.: A coarse-grained potential for fold recognition and molecular dynamics
simulations of proteins. Proteins: Struct. Funct. Bioinf. 76(4), 822–836 (2009). https://doi.
org/10.1002/prot.22388
412. Terstegen, F., Buss, V.: All-trans- and 11-cis-retinal, their N-methyl Schiff base and N-methyl
protonated Schiff base derivatives: a comparative ab initio study. Theochem-J Mol Struc 369,
53–65 (1996)
413. Terstegen, F., Buss, V.: Geometries and interconversion pathways of free and protonated beta-
ionone Schiff bases. An ab initio study of photoreceptor chromophore model compounds.
Chem. Phys. 225(1–3), 163–171 (1997). https://doi.org/10.1016/s0301-0104(97)00194-8
414. Terstegen, F., Carter, E.A., Buss, V.: Interconversion pathways of the protonated beta-ionone
Schiff base: An ab initio molecular dynamics study. Int. J. Quantum Chem. 75(3), 141–145
(1999). https://doi.org/10.1002/(sici)1097-461x(1999)75:3%3c141::aid-qua4%3e3.3.co;2-0
415. Terstegen, F., Buss, V.: Influence of DFT-calculated electron correlation on energies and
geometries of retinals and of retinal derivatives related to the bacteriorhodopsin and rhodopsin
chromophores. Theochem-J. Mol. Struc. 430, 209–218 (1998)
416. Bifone, A., deGroot, H.J.M., Buda, F.: Ab initio molecular dynamics of retinals. Chem. Phys.
Lett. 248(3–4), 165–172 (1996). https://doi.org/10.1016/0009-2614(95)01312-1
417. Buda, F., deGroot, H.J.M., Bifone, A.: Charge localization and dynamics in rhodopsin. Phys.
Rev. Lett. 77(21), 4474–4477 (1996). https://doi.org/10.1103/PhysRevLett.77.4474
418. Bifone, A., deGroot, H.J.M., Buda, F.: Energy storage in the primary photoproduct of vision.
J. Phys. Chem. B 101(15), 2954–2958 (1997). https://doi.org/10.1021/jp9623397
419. La Penna, G., Buda, F., Bifone, A., de Groot, H.J.M.: The transition state in the isomeriza-
tion of rhodopsin. Chem. Phys. Lett. 294(6), 447–453 (1998). https://doi.org/10.1016/s0009-
2614(98)00870-7
420. Garavelli, M., Negri, F., Olivucci, M.: Initial excited-state relaxation of the isolated 11-cis
protonated schiff base of retinal: evidence for in-plane motion from ab initio quantum chemical
simulation of the resonance Raman spectrum. J. Am. Chem. Soc. 121(5), 1023–1029 (1999).
https://doi.org/10.1021/ja981719y
421. Gozem, S., Melaccio, F., Lindh, R., Krylov, A.I., Granovsky, A.A., Angeli, C., Olivucci,
M.: Mapping the excited state potential energy surface of a retinal chromophore model with
multireference and equation-of-motion coupled-cluster methods. J. Chem. Theory Comput.
9(10), 4495–4506 (2013). https://doi.org/10.1021/ct400460h
Modeling of Membrane Proteins 449

422. Sugihara, M., Buss, V., Entel, P., Elstner, M., Frauenheim, T.: 11-cis-retinal protonated Schiff
base: influence of the protein environment on the geometry of the rhodopsin chromophore.
Biochemistry 41(51), 15259–15266 (2002). https://doi.org/10.1021/bi020533f
423. Elstner, M., Porezag, D., Jungnickel, G., Elsner, J., Haugk, M., Frauenheim, T., Suhai, S.,
Seifert, G.: Self-consistent-charge density-functional tight-binding method for simulations
of complex materials properties. Phys. Rev. B 58(11), 7260–7268 (1998). https://doi.org/10.
1103/PhysRevB.58.7260
424. Hufen, J., Sugihara, M., Buss, V.: How the counterion affects ground- and excited-state prop-
erties of the rhodopsin chromophore. J. Phys. Chem. B 108(52), 20419–20426 (2004). https://
doi.org/10.1021/jp046147k
425. Tachikawa, H., Kawabata, H.: Effects of the residues on the excitation energies of protonated
Schiff base of retinal (PSBR) in bR: A TD-DFT study. J. Photochem. Photobiol. B-Biol.
79(3), 191–195 (2005). https://doi.org/10.1016/j.jphotobiol.2005.01.004
426. Sugihara, M., Buss, V., Entel, P., Hafner, J.: The nature of the complex counterion of the
chromophore in rhodopsin. J. Phys. Chem. B 108(11), 3673–3680 (2004). https://doi.org/10.
1021/jp0362786
427. Blomgren, F., Larsson, S.: Exploring the potential energy surface of retinal, a comparison of
the performance of different methods. J. Comput. Chem. 26(7), 738–742 (2005). https://doi.
org/10.1002/jcc.20210
428. Maseras, F., Morokuma, K.: IMOMM—a new integrated ab-initio plus molecular mechanics
geometry optimization scheme of equilibrium structures and transition-states. J. Comput.
Chem. 16(9), 1170–1179 (1995). https://doi.org/10.1002/jcc.540160911
429. Warshel, A., Levitt, M.: Theoretical studies of enzymic reactions—dielectric, electrostatic and
steric stabilization of carbonium-ion in reaction of lysozyme. J. Mol. Biol. 103(2), 227–249
(1976). https://doi.org/10.1016/0022-2836(76)90311-9
430. Gascon, J.A., Batista, V.S.: QM/MM study of energy storage and molecular rearrangements
due to the primary event in vision. Biophys. J. 87(5), 2931–2941 (2004)
431. Gascon, J.A., Sproviero, E.M., Batista, V.S.: QM/MM study of the NMR spectroscopy of the
retinyl chromophore in visual rhodopsin. J. Chem. Theory Comput. 1(4), 674–685 (2005).
https://doi.org/10.1021/ct0500850
432. Gascon, J.A., Sproviero, E.M., Batista, V.S.: Computational studies of the primary photo-
transduction event in visual rhodopsin. Acc. Chem. Res. 39(3), 184–193 (2006). https://doi.
org/10.1021/ar050027t
433. Illingworth, C.J.R., Gooding, S.R., Winn, P.J., Jones, G.A., Ferenczy, G.G., Reynolds, C.A.:
Classical polarization in hybrid QM/MM methods. J. Phys. Chem. A 110(20), 6487–6497
(2006). https://doi.org/10.1021/jp046944i
434. Altun, A., Yokoyama, S., Morokuma, K.: Spectral tuning in visual pigments: an ONIOM(QM:
MM) study on bovine rhodopsin and its mutants. J. Phys. Chem. B 112(22), 6814–6827 (2008).
https://doi.org/10.1021/jp709730b
435. Wiliam Hernandez-Rodriguez, E., Sanchez-Garcia, E., Crespo-Otero, R., Lilian Montero-
Alejo, A., Alberto Montero, L., Thiel, W.: Understanding rhodopsin mutations linked to the
retinitis pigmentosa disease: a QM/MM and DFT/MRCI Study. J. Phys. Chem. B 116(3),
1060–1076 (2012). https://doi.org/10.1021/jp2037334
436. Manathunga, M., Yang, X., Luk, H.L., Gozem, S., Frutos, L.M., Valentini, A., Ferre, N.,
Olivucci, M.: Probing the photodynamics of rhodopsins with reduced retinal chromophores.
J. Chem. Theory Comput. 12(2), 839–850 (2016). https://doi.org/10.1021/acs.jctc.5b00945
437. Gozem, S., Luk, H.L., Schapiro, I., Olivucci, M.: Theory and simulation of the ultrafast
double-bond isomerization of biological chromophores. Chem. Rev. 117(22), 13502–13565
(2017). https://doi.org/10.1021/acs.chemrev.7b00177
438. Stewart, J.J.P.: Application of localized molecular orbitals to the solution of semiempirical
self-consistent field equations. Int. J. Quantum Chem. 58(2), 133–146 (1996). https://doi.org/
10.1002/(sici)1097-461x(1996)58:2%3c133::aid-qua2%3e3.0.co;2-z
439. Daniels, A.D., Millam, J.M., Scuseria, G.E.: Semiempirical methods with conjugate gradient
density matrix search to replace diagonalization for molecular systems containing thousands
of atoms. J. Chem. Phys. 107(2), 425–431 (1997). https://doi.org/10.1063/1.474404
450 D. Latek et al.

440. Dixon, S.L., Merz, K.M.: Fast, accurate semiempirical molecular orbital calculations for
macromolecules. J. Chem. Phys. 107(3), 879–893 (1997). https://doi.org/10.1063/1.474386
441. Stewart, J.J.P.: Optimization of parameters for semiempirical methods V: modification of
NDDO approximations and application to 70 elements. J. Mol. Model. 13(12), 1173–1213
(2007). https://doi.org/10.1007/s00894-007-0233-4
442. Rezac, J., Fanfrlik, J., Salahub, D., Hobza, P.: Semiempirical quantum chemical PM6 method
augmented by dispersion and H-bonding correction terms reliably describes various types of
noncovalent complexes. J. Chem. Theory Comput. 5(7), 1749–1760 (2009). https://doi.org/
10.1021/ct9000922
443. Rezac, J., Hobza, P.: Advanced corrections of hydrogen bonding and dispersion for semiem-
pirical quantum mechanical methods. J. Chem. Theory Comput. 8(1), 141–151 (2012). https://
doi.org/10.1021/ct200751e
444. Ren, L., Martin, C.H., Wise, K.J., Gillespie, N.B., Luecke, H., Lanyi, J.K., Spudich, J.L.,
Birge, R.R.: Molecular mechanism of spectral tuning in sensory rhodopsin II. Biochemistry
40(46), 13906–13914 (2001). https://doi.org/10.1021/bi0116487
445. Lee, I., Greenbaum, E., Budy, S., Hillebrecht, J.R., Birge, R.R., Stuart, J.A.: Photoinduced
surface potential change of bacteriorhodopsin mutant D96N measured by scanning surface
potential microscopy. J. Phys. Chem. B 110(22), 10982–10990 (2006). https://doi.org/10.
1021/jp052948r
446. Stewart, J.J.P.: Application of the PM6 method to modeling proteins. J. Mol. Model. 15(7),
765–805 (2009). https://doi.org/10.1007/s00894-008-0420-y
447. Ohno, K., Kamiya, N., Asakawa, N., Inoue, Y., Sakurai, M.: Application of an integrated
MOZYME plus DFT method to pKa calculations for proteins. Chem. Phys. Lett. 341(3–4),
387–392 (2001). https://doi.org/10.1016/s0009-2614(01)00499-7
448. Yoda, M., Inoue, Y., Sakurai, M.: Effect of protein environment on pK(a) shifts in the active
site of photoactive yellow protein. J. Phys. Chem. B 107(51), 14569–14575 (2003). https://
doi.org/10.1021/jp0364102
449. Gross, K.C., Seybold, P.G., Hadad, C.M.: Comparison of different atomic charge schemes for
predicting pK(a) variations in substituted anilines and phenols. Int. J. Quantum Chem. 90(1),
445–458 (2002). https://doi.org/10.1002/qua.10108
450. Mulliken, R.S.: Electronic population analysis on LCAO-MO molecular wave functions.1. J.
Chem. Phys. 23(10), 1833–1840 (1955). https://doi.org/10.1063/1.1740588
451. Reed, A.E., Weinstock, R.B., Weinhold, F.: Natural-population analysis. J. Chem. Phys. 83(2),
735–746 (1985). https://doi.org/10.1063/1.449486
452. Wang, B., Ford, G.P.: Atomic charges derived from a fast and accurate method for electrostatic
potentials based on modified AM1 calculations. J. Comput. Chem. 15(2), 200–207 (1994).
https://doi.org/10.1002/jcc.540150210
453. Khan, H.M., Grauffel, C., Broer, R., MacKerell Jr., A.D., Havenith, R.W., Reuter, N.: Improv-
ing the force field description of tyrosine-choline cation-pi interactions: QM investigation of
Phenol-N(Me)4(+) interactions. J. Chem. Theory Comput. 12(11), 5585–5595 (2016). https://
doi.org/10.1021/acs.jctc.6b00654
454. Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K., Olson, A.J.:
Automated docking using a Lamarckian genetic algorithm and an empirical binding free
energy function. J. Comput. Chem. 19(14), 1639–1662 (1998)
455. Bikadi, Z., Hazai, E.: Application of the PM6 semi-empirical method to modeling proteins
enhances docking accuracy of AutoDock. J. Cheminform. 1 (2009). https://doi.org/10.1186/
1758-2946-1-15
456. Fanfrlik, J., Bronowska, A.K., Rezac, J., Prenosil, O., Konvalinka, J., Hobza, P.: A reliable
Docking/scoring scheme based on the semiempirical quantum mechanical PM6-DH2 method
accurately covering dispersion and H-bonding: HIV-1 protease with 22 ligands. J. Phys. Chem.
B 114(39), 12666–12678 (2010). https://doi.org/10.1021/jp1032965
457. Sharma, V., Belevich, G., Gamiz-Hernandez, A.P., Rog, T., Vattulainen, I., Verkhovskaya,
M.L., Wikstrom, M., Hummer, G., Kaila, V.R.: Redox-induced activation of the proton pump
in the respiratory complex I. Proc Natl Acad Sci USA 112(37), 11571–11576 (2015). https://
doi.org/10.1073/pnas.1503761112
Modeling of Membrane Proteins 451

458. Maffeo, C., Bhattacharya, S., Yoo, J., Wells, D., Aksimentiev, A.: Modeling and simulation
of ion channels. Chem. Rev. 112(12), 6250–6284 (2012). https://doi.org/10.1021/cr3002609
459. Kutzner, C., Kopfer, D.A., Machtens, J.P., de Groot, B.L., Song, C., Zachariae, U.: Insights
into the function of ion channels by computational electrophysiology simulations. Biochim.
Biophys. Acta 1858(7 Pt B), 1741–1752 (2016). https://doi.org/10.1016/j.bbamem.2016.02.
006
460. Sadhu, B., Sundararajan, M., Bandyopadhyay, T.: Selectivity of a singly permeating ion in
nonselective NaK channel: combined QM and MD based investigations. J. Phys. Chem. B
119(40), 12783–12797 (2015). https://doi.org/10.1021/acs.jpcb.5b05996
Peptide Folding in Cellular Environments: A Monte Carlo and Markov Modeling Approach

Daniel Nilsson, Sandipan Mohanty and Anders Irbäck

Abstract Steric interactions with surrounding macromolecules tend to favor the
compact native state of a globular protein over its unfolded state. However, in
experiments conducted in cells and concentrated protein solutions, both stabilization
and destabilization of proteins have been observed, compared to dilute-solution
conditions. Therefore, in order to understand the effects of surrounding macromolecules
on protein properties such as stability, there is a need for computational modeling
beyond the level of hard-sphere crowders. Here, we discuss some recent exploratory
studies of peptide folding in the presence of explicit protein crowders, carried out
by us using an all-atom Monte Carlo-based approach along with an implicit solvent
force field. For interpreting the simulation data, time-lagged independent component
analysis and Markov state modeling are used.

D. Nilsson · A. Irbäck (✉)
Department of Astronomy and Theoretical Physics, Lund University,
Sölvegatan 14A, SE-223 62 Lund, Sweden
e-mail: anders@thep.lu.se
D. Nilsson
e-mail: daniel.nilsson@thep.lu.se
S. Mohanty
Institute for Advanced Simulation, Jülich Supercomputing Centre,
Forschungszentrum Jülich, D-52425 Jülich, Germany
e-mail: s.mohanty@fz-juelich.de

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio- and
Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_13

1 Introduction

In the crowded interior of living cells, proteins are surrounded by high concentrations
of macromolecules. For instance, the cytosol of Escherichia coli bacteria has been
estimated to contain 300–400 g/L of proteins and RNA [1]. However, biophysical
studies of proteins are usually conducted in dilute solutions. A fundamental and
long-standing question, therefore, is how macromolecular crowding affects reactions
such as protein folding, binding and aggregation. This question is currently being
intensely studied by both experimental [2, 3] and computational [4, 5] methods.
Most computational/theoretical studies have so far focused on the universal
excluded-volume effect [6, 7], which is independent of the precise nature of the crowders. This
effect favors reactions that increase the available volume, such as the folding of a
globular protein to its compact native state, or the binding of proteins to each other.
Its implications have been extensively studied through simulations, typically with
hard spheres as crowders [8–19]. In particular, it was shown that volume exclusion
can lead to a significant stabilization of globular proteins, depending on the size and
density of the crowders [8, 11, 13]. Moreover, good agreement was found between
simulations with hard-sphere crowders and experiments with inert crowders [10].
While universal, the excluded-volume effect need not dominate the interaction of a
protein with surrounding macromolecules. In fact, both stabilization and
destabilization of globular proteins have been observed in experiments conducted in cells and
concentrated protein solutions [20, 21]. However, the precise nature of the non-steric
effects involved remains incompletely understood.
Recent years have seen increasing efforts to conduct protein simulations with
explicit crowder molecules [22, 23], rather than hard-sphere crowders. One approach
is to build crowding environments mimicking cellular conditions [24]. A recent
example is the detailed and extensive model of a bacterial cytoplasm (Mycoplasma
genitalium) developed by Feig et al. [25], which includes proteins, RNAs, protein/RNA
complexes, metabolites, ions as well as explicit solvent molecules. Another approach
is to use simplified homogeneous crowding environments [26–28], as in experiments
conducted in concentrated protein solutions. In this case, the number of crowder
molecules can be smaller, so that larger timescales can be reached. A common choice
is to have around ten crowder molecules. Nevertheless, even with a moderate number
of crowder molecules, examining the conformational properties of the test protein in
question represents a challenge.
In this article, we summarize some recent Monte Carlo (MC) studies of peptide
folding in the presence of explicit protein crowders [29–32], performed by us using an
all-atom protein model along with an implicit solvent force field. The peptides studied
are the compact α-helical trp-cage [33] and the β-hairpin-forming GB1m3 [34]. Each
peptide is studied using two different crowding agents, namely bovine pancreatic
trypsin inhibitor (BPTI) and the B1 domain of streptococcal protein G (GB1). Both
these proteins are thermally highly stable [35, 36] and therefore modeled using a
fixed-backbone approximation, whereas the peptides are free to fold and unfold in
the simulations.
A challenge when analyzing data from crowding simulations is in identifying
the relevant states and dynamical modes, which may not be easily anticipated. Two
methods that can be used to tackle this problem are time-lagged independent
component analysis (TICA) [37–40] and Markov state modeling [41–45]. These methods
have in recent years found widespread use in studies of biomolecular processes such
as folding and binding [46, 47]. In this article, we briefly discuss the results obtained
Peptide Folding in Cellular Environments: A Monte Carlo … 455

when using these techniques to elucidate the interplay between peptide folding and
peptide-crowder interactions in our simulations of the β-hairpin-forming GB1m3
peptide [32].
This article is organized as follows. Section 2 briefly describes the systems
studied and our computational methodology. Section 3 gives an overview of our main
findings. The article ends with a brief summary in Sect. 4.

2 Methods

This section describes the simulated systems and outlines the biophysical model,
sampling techniques and data analysis methods used.

2.1 Simulated Systems

Throughout this article, we consider systems consisting of one test molecule (trp-
cage or GB1m3) and eight crowder molecules (BPTI or GB1), confined to a cubic
box and subject to periodic boundary conditions. The crowder density is around 100
g/L. This value is somewhat lower than that for the E. coli cytosol mentioned earlier,
but sufficiently high for the presence of the crowders to have significant effects on
the test peptides in the simulations (see below). The volume fraction occupied by the
crowders is around 7%.
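The quoted density and volume fraction can be checked with a short back-of-the-envelope calculation. The sketch below uses the eight crowders and the (95 Å)³ box stated in Sect. 3.1. The molar masses (about 6500 g/mol for BPTI and 6200 g/mol for GB1) and the partial specific volume (0.73 mL/g, a typical value for globular proteins) are assumptions introduced here for illustration and are not taken from the text. The function names are likewise hypothetical.

```python
# Rough consistency check of the crowder concentration and volume fraction.
AVOGADRO = 6.02214076e23            # molecules per mole
LITRES_PER_CUBIC_ANGSTROM = 1e-27   # 1 Å^3 = 1e-30 m^3 = 1e-27 L


def mass_concentration(n_molecules, molar_mass_g_per_mol, box_edge_angstrom):
    """Mass concentration in g/L for identical molecules in a cubic box."""
    mass_g = n_molecules * molar_mass_g_per_mol / AVOGADRO
    volume_l = box_edge_angstrom ** 3 * LITRES_PER_CUBIC_ANGSTROM
    return mass_g / volume_l


def volume_fraction(conc_g_per_l, specific_volume_ml_per_g=0.73):
    """Occupied volume fraction; the default specific volume is an assumption."""
    return conc_g_per_l * specific_volume_ml_per_g / 1000.0


# Assumed (approximate) molar masses, not given in the text.
bpti = mass_concentration(8, 6500.0, 95.0)
gb1 = mass_concentration(8, 6200.0, 95.0)
print(f"BPTI: {bpti:.0f} g/L, volume fraction {volume_fraction(bpti):.3f}")
print(f"GB1:  {gb1:.0f} g/L, volume fraction {volume_fraction(gb1):.3f}")
```

Under these assumptions, both crowders come out close to the values quoted above, which suggests that the stated box size, crowder number, density and volume fraction are mutually consistent.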
The trp-cage peptide is a designed mini-protein with 20 residues [33]. Its NMR-
derived native fold is compact and helical. The 16-residue GB1m3 peptide is an
optimized variant of the second β-hairpin (residues 41–56) in protein GB1, with
enhanced stability [34]. It differs from the original sequence at 7 of the 16 positions.
To our knowledge, no experimental structure is available for GB1m3, but its native
fold is expected to be similar to the parent β-hairpin in GB1.
Both proteins used as crowders, BPTI and GB1, are small but thermally highly
stable [35, 36], with 58 and 56 residues, respectively.

2.2 Biophysical Model

Our simulations use an all-atom protein representation with torsional degrees of
freedom, and an implicit solvent force field [48]. A detailed description of the force
field can be found elsewhere [48]. In brief, the interaction potential consists of four
main terms, $E = E_{\mathrm{loc}} + E_{\mathrm{ev}} + E_{\mathrm{hb}} + E_{\mathrm{sc}}$. One term ($E_{\mathrm{loc}}$) represents local
interactions between atoms separated by only a few covalent bonds. The other,
non-local terms represent excluded-volume effects ($E_{\mathrm{ev}}$), hydrogen bonding ($E_{\mathrm{hb}}$), and
residue-specific interactions between pairs of side-chains, based on hydrophobicity
and charge ($E_{\mathrm{sc}}$). This potential is an effective energy function for protein folding
simulations, parameterized through folding thermodynamics studies for a structurally
diverse set of peptides and small proteins [48, 49]. In multi-chain simulations,
intermolecular interaction terms are taken to have the same form and strength as the
corresponding intramolecular ones.
The model has been applied to study folding/unfolding properties of several
proteins with >90 residues [50–55]. Previous applications also include simulations of
peptide aggregation [56–60].
As indicated above, the thermally highly stable BPTI and GB1 proteins are modeled
with side-chain rotations as their only internal degrees of freedom; their backbones
are held fixed in the simulations. The assumed backbone conformations are
model approximations of the crystal structures (PDB codes 4PTI and 2GB1), derived
by MC with minimization. The structures were selected for both low energy and high
similarity to the experimental structures. The root-mean-square deviations from the
experimental structures were 1 Å.

2.3 MC Simulations

The model described above is implemented in the open-source MC simulation code
PROFASI [61]. All simulations discussed below were run with this program, using
both vector and thread parallelization.
The efficiency with which the conformational space is sampled in a MC simulation
depends critically on the move set used. Our simulations are based on the following
four elementary moves: (i) pivot-type rotation about individual backbone bonds, (ii) a
semi-local backbone update, Biased Gaussian Steps (BGS) [62], involving concerted
rotation of up to eight angles, (iii) rotation of individual side-chain angles, and (iv)
rigid-body translation or rotation of whole chains. The pivot move can generate
large-scale deformations of a chain, and can, despite its simplicity, be very useful
for unfolded chains in implicit solvent. The semi-local BGS move is an important
complement to the pivot update, especially for folded chains. There are also strictly
local torsion-angle updates available [63, 64], but the computationally convenient
BGS move works well for the peptides studied in this article.
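To illustrate how a move set with both large-scale and small-step updates enters a Metropolis simulation, the following toy sketch samples a chain of independent torsion angles. It is not the PROFASI implementation: the energy function, the function names and all parameter values are hypothetical, and the biased BGS update is not reproduced. Both proposals used here are symmetric, so the plain Metropolis acceptance rule applies.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical torsional energy with its minimum at all angles zero."""
    return sum(1.0 - math.cos(a) for a in angles)


def pivot_move(angles, rng):
    """Large-scale move: redraw one torsion angle uniformly in [-pi, pi)."""
    new = list(angles)
    new[rng.randrange(len(new))] = rng.uniform(-math.pi, math.pi)
    return new


def small_step_move(angles, rng, width=0.1):
    """Small-step move: Gaussian perturbation of one angle, wrapped to [-pi, pi)."""
    new = list(angles)
    k = rng.randrange(len(new))
    new[k] = (new[k] + rng.gauss(0.0, width) + math.pi) % (2 * math.pi) - math.pi
    return new


def metropolis(n_angles, n_steps, beta, pivot_fraction, seed=0):
    """Run Metropolis MC; return the final angles and the acceptance rate."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    energy = toy_energy(angles)
    accepted = 0
    for _ in range(n_steps):
        move = pivot_move if rng.random() < pivot_fraction else small_step_move
        trial = move(angles, rng)
        trial_energy = toy_energy(trial)
        # Accept with probability min(1, exp(-beta * dE)).
        if rng.random() < math.exp(min(0.0, -beta * (trial_energy - energy))):
            angles, energy = trial, trial_energy
            accepted += 1
    return angles, accepted / n_steps


final_angles, acceptance = metropolis(10, 20000, beta=5.0, pivot_fraction=0.2)
print(f"final energy {toy_energy(final_angles):.2f}, acceptance {acceptance:.2f}")
```

Setting `pivot_fraction=0.0` leaves only the small-step update, so the chain can then change its state only gradually.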
A potentially valuable addition to the move set above would be to include rigid-
body motion of whole clusters of interacting molecules, based, for example, on the
stochastic cluster construction procedure in [65, 66].
The simulations discussed in this article are of two types. Our first set of
simulations focuses entirely on the equilibrium thermodynamics of the systems. These
simulations use the full move set described above (i–iv), and the replica exchange,
or parallel tempering, technique [67]. This method, and extensions of it [68–70], are
often used with the aim of enhancing the sampling efficiency. Here, we used replica
exchange primarily as a convenient method to study a range of temperatures in a
single simulation.
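For readers unfamiliar with replica exchange, the core step is an exchange attempt between two replicas at inverse temperatures $\beta_i$ and $\beta_j$, accepted with the standard probability $\min\{1, \exp[(\beta_i - \beta_j)(E_i - E_j)]\}$. The sketch below shows this rule together with one sweep over neighbouring temperatures. The function names are hypothetical, and the code is a generic illustration rather than the scheme used in our simulation program.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Acceptance probability for exchanging the configurations of two replicas."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(betas, energies, rng):
    """One sweep of neighbour swaps; order[k] is the replica held at betas[k]."""
    order = list(range(len(betas)))
    for k in range(len(betas) - 1):
        e_k, e_next = energies[order[k]], energies[order[k + 1]]
        if rng.random() < swap_probability(betas[k], betas[k + 1], e_k, e_next):
            order[k], order[k + 1] = order[k + 1], order[k]
    return order
```

A swap that moves the lower-energy configuration to the lower temperature is always accepted, while the opposite swap is accepted with an exponentially suppressed probability. This is what allows a single run to cover a range of temperatures.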
Our second set of simulations is generated at a constant temperature, and uses a
restricted move set consisting of “small-step” elementary moves only. In particular,
this means that the global pivot update is omitted. This restriction ensures that the
system cannot jump between free-energy minima without climbing the intervening
barriers. With this setup, the simulations should capture some basics of the
long-time dynamics. It is worth noting that the MC evolution of the simulated systems,
even with these restrictions, was sufficiently fast to permit us to generate trajectories
containing multiple folding/unfolding and binding/unbinding events. To interpret the
data from these simulations, we used TICA and built Markov state models (MSMs).

2.4 TICA and MSM Analysis

TICA and MSM methods are becoming increasingly popular tools for analyzing
biomolecular simulations, and several software packages are available for this kind
of analysis [71–74]. The calculations discussed in this article were done using the
pyEMMA software [71].
TICA can be used as a dimensionality reduction method. It is somewhat similar
to principal component analysis, but identifies high-autocorrelation (or slow) rather
than high-variance coordinates. Given time trajectories of a set of observables, $\{o_n\}$,
one constructs the time-lagged covariance matrix
$c_{nm}(\tau_{\mathrm{cm}}) = \langle o_n(t)\, o_m(t + \tau_{\mathrm{cm}}) \rangle_t - \langle o_n(t) \rangle_t \langle o_m(t + \tau_{\mathrm{cm}}) \rangle_t$,
where $\tau_{\mathrm{cm}}$ is the lag time and $\langle \cdot \rangle_t$ denotes an average over
time $t$. By solving the generalized eigenvalue problem $C(\tau_{\mathrm{cm}}) \hat{v}_i = \hat{\lambda}_i C(0) \hat{v}_i$, slow
linear combinations of the original observables can be identified.
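The construction can be made concrete with a minimal two-observable sketch in plain Python. It builds the time-lagged covariance matrix as defined above and solves the generalized eigenvalue problem in closed form for the 2×2 case. The symmetrization of $C(\tau_{\mathrm{cm}})$, the synthetic test signals and all function names are assumptions made for this illustration. This is not the pyEMMA implementation used for the calculations in this article.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def lagged_covariance(a, b, lag):
    """<a(t) b(t+lag)>_t - <a(t)>_t <b(t+lag)>_t over the overlapping window."""
    n = len(a) - lag
    head, tail = a[:n], b[lag:lag + n]
    return mean([x * y for x, y in zip(head, tail)]) - mean(head) * mean(tail)


def tica_2d(o1, o2, lag):
    """Solve C(lag) v = lambda C(0) v for two observables, slowest mode first."""
    obs = (o1, o2)
    c0 = [[lagged_covariance(p, q, 0) for q in obs] for p in obs]
    ct = [[lagged_covariance(p, q, lag) for q in obs] for p in obs]
    # Assumption: symmetrize C(lag), as appropriate for reversible dynamics.
    ct[0][1] = ct[1][0] = 0.5 * (ct[0][1] + ct[1][0])
    # M = C(0)^{-1} C(lag) has the same eigenpairs as the generalized problem.
    det0 = c0[0][0] * c0[1][1] - c0[0][1] * c0[1][0]
    inv0 = [[c0[1][1] / det0, -c0[0][1] / det0],
            [-c0[1][0] / det0, c0[0][0] / det0]]
    m = [[sum(inv0[i][k] * ct[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    modes = []
    for lam in (trace / 2.0 + root, trace / 2.0 - root):
        v = (m[0][1], lam - m[0][0])          # from the first row of (M - lam I) v = 0
        if abs(v[0]) + abs(v[1]) < 1e-12:
            v = (lam - m[1][1], m[1][0])      # fall back to the second row
        norm = math.hypot(*v)
        modes.append((lam, (v[0] / norm, v[1] / norm)))
    return modes


# Synthetic data: one slowly and one quickly decorrelating signal, mixed.
rng = random.Random(1)
slow, fast, s, f = [], [], 0.0, 0.0
for _ in range(20000):
    s = 0.95 * s + rng.gauss(0.0, 1.0)
    f = 0.20 * f + rng.gauss(0.0, 1.0)
    slow.append(s)
    fast.append(f)
o1 = [a + b for a, b in zip(slow, fast)]
o2 = [a - b for a, b in zip(slow, fast)]
modes = tica_2d(o1, o2, lag=5)
```

In this toy example, the slowest mode should weight the two observables with the same sign, because their sum isolates the slow signal, whereas a variance-based method would not single out slowness as such.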
To build an MSM, the state space needs to be discretized. In our calculations,
following [40], the discretization is achieved by clustering the data with the k-means
algorithm [75] in a low-dimensional subspace spanned by slow TICA coordinates. By
computing the probabilities of transition among these clusters in a time $\tau_{\mathrm{tm}}$ (which,
like $\tau_{\mathrm{cm}}$, is an adjustable parameter), a transition matrix is obtained. Assuming
Markovian dynamics, the eigenvectors of this matrix have relaxation times given by

$\tilde{t}_i = -\tau_{\mathrm{tm}} / \ln \tilde{\lambda}_i(\tau_{\mathrm{tm}})$    (1)

where $1 = \tilde{\lambda}_0 > \tilde{\lambda}_1 \geq \tilde{\lambda}_2 \geq \cdots > 0$ are the eigenvalues. The eigenvalue $\tilde{\lambda}_0$
corresponds to a stationary distribution ($\tilde{t}_0 = \infty$), whereas all other eigenvalues
correspond to relaxation modes with finite timescales $\tilde{t}_i$. The timescales obtained using
Eq. (1) are expected to reproduce the dominant relaxation times of the full system if
the discretization is sufficiently fine [76, 77], or if the lag time is sufficiently large
[77, 78]. However, for a given discretization and a given lag time, the use of Eq. (1)
may entail significant systematic errors.
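As a minimal illustration of Eq. (1), the sketch below estimates a transition matrix from a discretized trajectory by simple transition counting and converts an eigenvalue into a relaxation time. The plain row-normalized count estimator, the restriction to two states when extracting the second eigenvalue, and the function names are simplifying assumptions. Dedicated MSM packages typically use more refined estimators.

```python
import math


def transition_matrix(dtraj, n_states, lag):
    """Row-normalized counts of transitions between states `lag` steps apart."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total for c in row] if total else [0.0] * n_states)
    return matrix


def implied_timescale(eigenvalue, lag):
    """Eq. (1): relaxation time corresponding to a transition-matrix eigenvalue."""
    return -lag / math.log(eigenvalue)


def second_eigenvalue_two_state(matrix):
    """A 2x2 stochastic matrix has eigenvalues 1 and (trace - 1)."""
    return matrix[0][0] + matrix[1][1] - 1.0


# Hypothetical two-state trajectory (state labels 0 and 1).
dtraj = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
T = transition_matrix(dtraj, n_states=2, lag=1)
t1 = implied_timescale(second_eigenvalue_two_state(T), lag=1)
```

Repeating the calculation for several values of `lag` and checking whether the resulting timescale is stable is one way to expose the lag-time dependence mentioned above.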
Another way of estimating the relaxation times of the MSM eigenfunctions is
by computing their autocorrelations. The (normalized) autocorrelation function of a
general property $f$ is given by
$C_f(\tau) = [\langle f(t) f(t + \tau) \rangle_t - \langle f(t) \rangle_t \langle f(t + \tau) \rangle_t] / \sigma_f^2$,
where $\sigma_f^2$ is the variance of $f$. Let $\psi_i^{\mathrm{MSM}}$ be the $i$th eigenfunction of a given MSM,
and let $\psi_i$ be the true $i$th eigenfunction of the system’s time transfer operator [45].
The autocorrelation function of $\psi_i^{\mathrm{MSM}}$, $C_i(\tau)$, may be expanded as

$C_i(\tau) = \sum_j c_j e^{-\tau / t_j}$    (2)

where $c_j = |\langle \psi_j, \psi_i^{\mathrm{MSM}} \rangle|^2$ and $t_j$ is the exact $j$th relaxation time. Now, if $\psi_i^{\mathrm{MSM}}$
is a good approximation of $\psi_i$, then $c_j \ll c_i$ for $j \neq i$. If this holds, $C_i(\tau)$ decays
approximately as $e^{-\tau / t_i}$ for not too large $\tau$ (compared to $t_i$), so that $t_i$ can be estimated
through a simple exponential fit. In the calculations discussed below, we used data
for $C_i(\tau)$ in the range of $\tau$ where $0.2 < C_i(\tau) < 0.8$. Over this range, $C_i(\tau)$ was
approximately single exponential for all MSM eigenfunctions studied. It is worth
noting that the upper bound on $\tau$ is set primarily by statistical uncertainties, rather
than by deviations from single-exponential behavior.
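The fitting procedure can be sketched as follows. The code computes a normalized autocorrelation function and estimates a relaxation time from a straight-line fit to $\ln C(\tau)$ over the window $0.2 < C(\tau) < 0.8$. The use of an unweighted least-squares fit with a free intercept, and the function names, are assumptions of this sketch; the exact fitting protocol of the original analysis is described in [32].

```python
import math


def autocorrelation(series, max_lag):
    """Normalized autocorrelation C_f(tau) for tau = 0, 1, ..., max_lag."""
    n = len(series)
    mu = sum(series) / n
    var = sum((x - mu) ** 2 for x in series) / n
    result = []
    for lag in range(max_lag + 1):
        m = n - lag
        head, tail = series[:m], series[lag:]
        cov = (sum(a * b for a, b in zip(head, tail)) / m
               - (sum(head) / m) * (sum(tail) / m))
        result.append(cov / var)
    return result


def fit_relaxation_time(taus, values, lower=0.2, upper=0.8):
    """Fit ln C = const - tau / t using only points with lower < C < upper."""
    points = [(t, math.log(c)) for t, c in zip(taus, values) if lower < c < upper]
    if len(points) < 2:
        raise ValueError("need at least two points inside the fitting window")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    return -1.0 / slope


# Check on exact single-exponential data with a hypothetical time of 50 steps.
taus = list(range(200))
t_fit = fit_relaxation_time(taus, [math.exp(-tau / 50.0) for tau in taus])
```

Because the window excludes both the short-time region and the noisy tail, the fit is insensitive to small admixtures of faster modes and to statistical noise at large $\tau$.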

3 Results

This section briefly describes the main findings of our studies of the trp-cage and
GB1m3 peptides in the presence of protein crowders (BPTI or GB1) [29–32]. The
first two subsections describe results obtained using the replica-exchange method.
The third and final subsection discusses findings obtained by applying TICA and MSM
techniques to data from constant-temperature simulations.

3.1 The Two Peptides Respond Differently When Adding Crowders

Using replica exchange with a wide range of temperatures, the folding thermodynamics
of the trp-cage and GB1m3 peptides was studied under the following conditions:
with BPTI crowders, with GB1 crowders, with hard-sphere crowders, and without
crowders. The three systems with crowders had the same number of crowders, eight,
and the same box size, (95 Å)³. However, the volume of the hard spheres was taken
to be approximately three times larger than that of the BPTI and GB1 molecules, to enhance
the otherwise very weak effects of these crowders.
Figure 1 compares the behavior of trp-cage in the different simulated environments.
To this end, the temperature dependence of four structural properties of trp-cage
is shown, namely the helix content, the radius of gyration, the root-mean-square
deviation from the native structure, and the end-to-end distance. The effects of the
purely steric crowders are, despite their larger size, modest. As expected, the effects
are largest at high temperatures, where the peptide is unfolded and requires the most
volume. The smaller protein crowders cause only tiny changes at these temperatures.
At low temperatures, the BPTI and GB1 crowders tend to distort the native structure
of trp-cage. In the GB1 case, this effect is weak but noticeable, and in line with a
previous molecular dynamics-based study [27]. In the BPTI case, the distortion is
easily visible, especially from the data for the end-to-end distance (Fig. 1d). BPTI
interacts primarily with the C-terminal tail of trp-cage (see below), and this interaction
prevents a native-like packing of this part against the N-terminal α-helix, which
leads to an increased end-to-end distance.

Fig. 1 Folding thermodynamics of trp-cage without crowders (red line), with hard-sphere crowders
(red dashes), with BPTI crowders (blue), and with GB1 crowders (magenta). The properties shown
are (a) the helix content, $H$, (b) the radius of gyration, $R_{\mathrm{g}}$, (c) the root-mean-square deviation from the
native state, and (d) the end-to-end distance, $R_{\mathrm{ee}}$. Reproduced from [30], with the permission of
AIP Publishing

Figure 2 shows a similar compilation of data from the GB1m3 simulations. When
adding hard-sphere crowders, the response of GB1m3 resembles that of trp-cage.
However, GB1m3 responds differently than trp-cage upon the addition of BPTI or
GB1 crowders. While distorting the trp-cage fold, these crowders have a stabilizing
effect on GB1m3 (Fig. 2c). A comparison with the results obtained using hard-sphere
crowders shows that this stabilization cannot be explained in terms of steric
interactions alone. Rather, the main cause is the ability of the folded GB1m3 to interact
favorably with both BPTI and GB1. The results obtained with BPTI crowders suggest
an increase in the melting temperature of GB1m3 by as much as roughly 15 K.

Fig. 2 Folding thermodynamics of GB1m3 without crowders (red line), with hard-sphere crowders
(red dashes), with BPTI crowders (blue), and with GB1 crowders (magenta). The properties shown
are (a) the strand content, $S$, (b) the radius of gyration, $R_{\mathrm{g}}$, (c) a hydrogen bond-based measure of
nativeness, $q$, and (d) the end-to-end distance, $R_{\mathrm{ee}}$. Reproduced from [30], with the permission of
AIP Publishing

3.2 Specific Surface Patches Dominate the Crowder Interactions

The above comparison with data obtained using hard spheres strongly indicates that
attractive peptide-crowder interactions play an important role in the systems with
protein crowders. Insight into the nature of these attractive interactions can be gained
by computing test peptide-crowder protein residue-pair contact maps. Figure 3 shows
contact maps for all the four test peptide-crowder protein combinations studied,
calculated at the melting temperatures of the respective free peptides, where the
peptides sample a wide spectrum of conformations.
The contact maps reveal that both BPTI and GB1 have specific surface patches
that dominate their interaction with the peptides. A large majority of the contacts
formed by BPTI involve a hydrophobic surface patch centered around its proline
residues Pro8 and Pro9. On GB1, which contains a four-stranded β-sheet, a similar,
although somewhat less dominant, role is played by the two edge strands.

Fig. 3 Test peptide-crowder protein residue-pair contact maps for the simulated trp-cage–BPTI (left
upper panel), trp-cage–GB1 (left lower panel), GB1m3-BPTI (right upper panel) and GB1m3-GB1
(right lower panel) systems, calculated at the melting temperatures of the respective free peptides.
The color indicates the average number of contacts that a given residue in the test peptide forms
with residues in a given position in any of the eight crowder proteins. Note the differences in scale.
Two residues are in contact if their Cα atoms are within 8 Å from each other. Red lines indicate
the hydrophobic surface patch of BPTI mentioned in the text and the two edge strands of GB1.
Reproduced from [30], with the permission of AIP Publishing
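The contact-map definition in the caption of Fig. 3 can be expressed compactly. The sketch below assumes that the Cα–Cα distances between each peptide residue and each crowder residue have already been computed for every stored conformation; the nested-list data layout and the function name are hypothetical choices made for this illustration.

```python
def average_contact_map(frames, cutoff=8.0):
    """Frame-averaged number of crowders in contact, per residue pair.

    frames[f][c][i][j] is the Calpha-Calpha distance (in Å) in frame f between
    peptide residue i and residue j of crowder molecule c (assumed layout).
    """
    n_pep = len(frames[0][0])
    n_res = len(frames[0][0][0])
    totals = [[0.0] * n_res for _ in range(n_pep)]
    for frame in frames:
        for crowder in frame:
            for i in range(n_pep):
                for j in range(n_res):
                    if crowder[i][j] <= cutoff:   # "within 8 Å"
                        totals[i][j] += 1.0
    return [[t / len(frames) for t in row] for row in totals]


# Two hypothetical frames, two crowders, one peptide residue, two crowder residues.
frames = [
    [[[5.0, 20.0]], [[7.9, 8.1]]],
    [[[9.0, 3.0]], [[30.0, 30.0]]],
]
contact_map = average_contact_map(frames)
```

Because contacts with the same residue position in different crowder molecules are summed, an entry of the map can exceed one when several crowders bind the peptide simultaneously.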

The GB1m3-GB1 system is special, because GB1m3 is an optimized variant of
the second β-hairpin in GB1. The part of GB1 most prone to form contacts with
GB1m3 in our simulations is indeed an edge strand (residues 42–46) that belongs to
the second β-hairpin.

3.3 Slow Modes Can Be Identified by TICA and MSM Techniques

The previous two subsections dealt separately with the folding properties of the
peptides and their interactions with the crowders. For a proper understanding of the
systems, one also has to analyze the interplay between peptide folding and
peptide-crowder interactions. To this end, one needs to identify suitable coordinates in a
high-dimensional space with both intra- and intermolecular degrees of freedom,
which are not easy to guess. A possible approach to this problem is to use TICA
and MSM techniques. These methods have proven useful for analyzing biomolecular
simulations [46, 47], but the systems studied were typically relatively small.
Recently, we tested the usefulness of these methods for analyzing data from crowding
simulations, by applying them to data from constant-temperature simulations of
GB1m3 with BPTI and GB1 crowders [32].
This analysis used time trajectories for a broad set of observables, consisting of
all (non-constant) intramolecular Cα–Cα distances within the peptide as well as a
collection of intermolecular distances between the peptide and the crowders, called
$d_{ij}$. Specifically, $d_{ij}$ was defined as the shortest Cα–Cα (periodic) distance between
peptide residue $i$ and residue $j$ in any of the eight crowder molecules. The total
number of intra- and intermolecular distances used as input for the analysis was
around 1000 for each of the two systems studied. Using TICA, a handful of slow
linear combinations of these observables were identified in each system.
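The intermolecular observables $d_{ij}$ and the size of the input set can be illustrated with a short sketch. The minimum-image treatment of a cubic periodic box follows from the setup in Sect. 2.1. The assumption that only Cα pairs adjacent along the chain have constant distances is ours, introduced to reproduce the quoted count of around 1000; the function names are hypothetical.

```python
import math


def periodic_distance(p, q, box):
    """Minimum-image distance between two points in a cubic periodic box."""
    total = 0.0
    for a, b in zip(p, q):
        d = (a - b) % box        # separation wrapped into [0, box)
        d = min(d, box - d)      # nearest periodic image
        total += d * d
    return math.sqrt(total)


def intermolecular_features(peptide_ca, crowders_ca, box):
    """d[i][j]: shortest distance from peptide residue i to residue j of any crowder."""
    n_res = len(crowders_ca[0])
    return [[min(periodic_distance(p, crowder[j], box) for crowder in crowders_ca)
             for j in range(n_res)]
            for p in peptide_ca]


def feature_count(n_peptide, n_crowder_residues, skip_neighbors=1):
    """Number of input distances, assuming that only Calpha pairs at most
    `skip_neighbors` apart along the peptide chain are constant (an assumption)."""
    intra = sum(1 for i in range(n_peptide)
                for j in range(i + 1 + skip_neighbors, n_peptide))
    return intra + n_peptide * n_crowder_residues


# Hypothetical coordinates (Å): two peptide residues, two one-residue "crowders".
d = intermolecular_features([(1, 1, 1), (10, 10, 10)],
                            [[(94, 1, 1)], [(1, 5, 1)]], box=95.0)
print(feature_count(16, 56), feature_count(16, 58))  # GB1m3 with GB1 and with BPTI
```

With 16 peptide residues and 56 or 58 crowder residues, this counting gives totals consistent with the figure of around 1000 observables per system.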
The slow TICA coordinates turned out to be capable of separating the major
free-energy minima of the peptide. Additionally, the slow TICA coordinates were used to
define a low-dimensional subspace in which the simulated conformations could be
efficiently clustered. After this discretization, MSMs were built and used to estimate
the dominant (longest) relaxation times. Relaxation times can be conveniently
estimated from the MSM eigenvalues via Eq. (1), which, however, assumes Markovian
dynamics. Unfortunately, the results obtained this way showed a strong dependence
on the lag time $\tau_{\mathrm{tm}}$. A more direct way of estimating relaxation times from the MSMs
is to measure and analyze the autocorrelations of the eigenfunctions. It turned out
that fits to autocorrelation data for the MSM eigenfunctions yield much more robust
relaxation time estimates, with essentially no $\tau_{\mathrm{tm}}$ dependence. A detailed discussion
of these findings can be found in [32].

4 Concluding Remarks

Knowledge of how proteins are affected by macromolecular crowding is needed in
order to understand how proteins function under cellular conditions. Computational
modeling of these effects is a multifaceted challenge, to which there is no
one-size-fits-all solution. In this article, we have discussed results obtained with MC methods,
based on an all-atom protein model with an implicit solvent force field. With this
approach, it was possible for us to investigate the folding thermodynamics of peptides
in the presence of interacting protein crowders in a statistically controlled manner,
through simulations containing multiple folding/unfolding and binding/unbinding
events. This goal would have been computationally very costly to accomplish if
explicit solvent had been included in the simulations; as far as we know, no such
study has been reported.
Our results suggest that the two peptides studied respond differently to the addition of
the crowders; both crowders (GB1 and BPTI) cause a distortion of the trp-cage fold,
while having a stabilizing effect on the GB1m3 β-hairpin. In the simulations, the
interaction of the crowders with the peptides is dominated by distinct patches on the
respective crowder surfaces. Although universal, the excluded-volume effect plays
only a minor role, as shown by reference simulations with hard-sphere crowders.
Finally, our analysis shows that TICA and MSM techniques provide useful tools for
identifying relevant (slow) coordinates in these high-dimensional systems.
An obvious next step is to extend the scope of these simulations to larger test
molecules. Furthermore, it is of great interest to go beyond folding reactions and
investigate also binding and aggregation reactions under crowding conditions. A
difficult but important task is to validate the simulations against experiments. The
amount of relevant experimental data available for this purpose has so far been
limited, but is growing.

Acknowledgements The work discussed in this article was in part supported by the Swedish
Research Council (Grant no. 621-2014-4522) and the Swedish strategic research program eSSENCE.
The simulations were performed on resources provided by the Swedish National Infrastructure for
Computing (SNIC) at LUNARC, Lund University, Sweden, and Jülich Supercomputing Centre,
Forschungszentrum Jülich, Germany.

References

1. Zimmerman, S.B., Trach, S.O.: Estimation of macromolecule concentrations and excluded
volume effects for the cytoplasm of Escherichia coli. J. Mol. Biol. 222, 599 (1991)
2. Theillet, F.X., Binolfi, A., Frembgen-Kesner, T., Hingorani, K., Sarkar, M., Kyne, C., Li, C.,
Crowley, P.B., Gierasch, L., Pielak, G.J., Elcock, A.H., Gershenson, A., Selenko, P.: Physico-
chemical properties of cells and their effects on intrinsically disordered proteins (IDPs). Chem.
Rev. 114, 6661 (2014)
3. Smith, A.E., Zhang, Z., Pielak, G.J., Li, C.: NMR studies of protein folding and binding in
cells and cell-like environments. Curr. Opin. Struct. Biol. 30, 7 (2015)
4. Zhou, H.X.: Influence of crowded cellular environments on protein folding, binding, and
oligomerization: biological consequences and potentials of atomistic modeling. FEBS Lett.
587, 1053 (2013)
5. Feig, M., Sugita, Y.: Reaching new levels of realism in modeling biological macromolecules
in cellular environments. J. Mol. Graph. Model. 45, 144 (2013)
6. Ellis, R.J.: Macromolecular crowding: obvious but underappreciated. Trends Biochem. Sci.
26, 597 (2001)
7. Zhou, H.X., Rivas, G., Minton, A.P.: Macromolecular crowding and confinement: biochemical,
biophysical, and potential physiological consequences. Annu. Rev. Biophys. 37, 375 (2008)
8. Cheung, M.S., Klimov, D., Thirumalai, D.: Molecular crowding enhances native state stability
and refolding rates of globular proteins. Proc. Natl. Acad. Sci. USA 102, 4753 (2005)
9. Minh, D.D.L., Chang, C.E., Trylska, J., Tozzini, V., McCammon, J.A.: The influence of macro-
molecular crowding on HIV-1 protease internal dynamics. J. Am. Chem. Soc. 128, 6006 (2006)
10. Stagg, L., Zhang, S.Q., Cheung, M.S., Wittung-Stafshede, P.: Molecular crowding enhances
native structure and stability of α/β protein flavodoxin. Proc. Natl. Acad. Sci. USA 104, 18976
(2007)
11. Qin, S., Zhou, H.X.: Atomistic modeling of macromolecular crowding predicts modest
increases in protein folding and binding stability. Biophys. J. 97, 12 (2009)
12. Jefferys, B.R., Kelley, L.A., Sternberg, M.J.E.: Protein folding requires crowd control in a
simulated cell. J. Mol. Biol. 397, 1329 (2010)
13. Tsao, D., Dokholyan, N.V.: Macromolecular crowding induces polypeptide compaction and
decreases folding cooperativity. Phys. Chem. Chem. Phys. 12, 3491 (2010)
14. Mittal, J., Best, R.B.: Dependence of protein folding stability and dynamics on the density and
composition of macromolecular crowders. Biophys. J. 98, 315 (2010)
15. Samiotakis, A., Cheung, M.S.: Folding dynamics of trp-cage in the presence of chemical
interference and macromolecular crowding. I. J. Chem. Phys. 135, 175101 (2011)
16. Qin, S., Zhou, H.X.: Effects of macromolecular crowding on the conformational ensembles of
disordered proteins. J. Phys. Chem. Lett. 4, 3429 (2013)
17. Kang, H., Pincus, P.A., Hyeon, C., Thirumalai, D.: Effects of macromolecular crowding on the
collapse of biopolymers. Phys. Rev. Lett. 114, 068303 (2015)
18. Latshaw II, D.C., Hall, C.K.: Effects of hydrophobic macromolecular crowders on amyloid β
(16–22) aggregation. Biophys. J. 109, 124 (2015)
19. Miller, C.M., Kim, Y.C., Mittal, J.: Protein composition determines the effect of crowding on
the properties of disordered proteins. Biophys. J. 111, 28 (2016)
20. Miklos, A.C., Sarkar, M., Wang, Y., Pielak, G.J.: Protein crowding tunes protein stability. J.
Am. Chem. Soc. 133, 7116 (2011)
21. Guzman, I., Gelman, H., Tai, J., Gruebele, M.: The extracellular protein VlsE is destabilized
inside cells. J. Mol. Biol. 426, 11 (2014)
22. Feig, M., Yu, I., Wang, P.H., Nawrocki, G., Sugita, Y.: Crowding in cellular environments at
an atomistic level from computer simulations. J. Phys. Chem. B 121, 8009 (2017)
23. Qin, S., Zhou, H.X.: Protein folding, binding, and droplet formation in cell-like conditions.
Curr. Opin. Struct. Biol. 43, 28 (2017)
24. McGuffee, S.R., Elcock, A.H.: Diffusion, crowding & protein stability in a dynamic molecular
model of the bacterial cytoplasm. PLOS Comput. Biol. 6, e1000694 (2010)
25. Yu, I., Mori, T., Ando, T., Harada, R., Jung, J., Sugita, Y., Feig, M.: Biomolecular interactions
modulate macromolecular structure and dynamics in atomistic model of a bacterial cytoplasm.
eLife 5, 18457 (2016)
26. Feig, M., Sugita, Y.: Variable interactions between protein crowders and biomolecular solutes
are important in understanding cellular crowding. J. Phys. Chem. B 116, 599 (2012)
27. Predeus, A.V., Gul, S., Gopal, S.M., Feig, M.: Conformational sampling of peptides in the
presence of protein crowders from AA/CG-multiscale simulations. J. Phys. Chem. B 116,
8610 (2012)
28. Macdonald, B., McCarley, S., Noeen, S., van Giessen, A.E.: Protein–protein interactions affect
alpha helix stability in crowded environments. J. Phys. Chem. B 119, 2956 (2015)
29. Bille, A., Linse, B., Mohanty, S., Irbäck, A.: Equilibrium simulation of trp-cage in the presence
of protein crowders. J. Chem. Phys. 143, 175102 (2015)
30. Bille, A., Mohanty, S., Irbäck, A.: Peptide folding in the presence of interacting protein crow-
ders. J. Chem. Phys. 144, 175105 (2016)
31. Irbäck, A., Mohanty, S.: Protein folding/unfolding in the presence of interacting macromolec-
ular crowders. Eur. Phys. J. - Spec. Top. 226, 627 (2017)
32. Nilsson, D., Mohanty, S., Irbäck, A.: Markov modeling of peptide folding in the presence of
protein crowders. J. Chem. Phys. 148, 055101 (2018)
33. Neidigh, J.W., Fesinmeyer, R.M., Andersen, N.H.: Designing a 20-residue protein. Nat. Struct.
Biol. 9, 425 (2002)
34. Fesinmeyer, R.M., Hudson, F.M., Andersen, N.H.: Enhanced hairpin stability through loop
design: the case of the protein G B1 domain hairpin. J. Am. Chem. Soc. 126, 7238 (2004)
35. Moses, E., Hinz, H.J.: Basic pancreatic trypsin inhibitor has unusual thermodynamic stability
parameters. J. Mol. Biol. 170, 765 (1983)
36. Gronenborn, A.M., Filpula, D.R., Essig, N.Z., Achari, A., Whitlow, M., Wingfield, P.T., Clore,
G.M.: A novel, highly stable fold of the immunoglobulin binding domain of streptococcal
protein G. Science 253, 657 (1991)
37. Molgedey, L., Schuster, H.G.: Separation of a mixture of independent signals using time delayed
correlations. Phys. Rev. Lett. 72, 3634 (1994)
38. Naritomi, Y., Fuchigami, S.: Slow dynamics of a protein backbone in molecular dynamics
simulation revealed by time-structure based independent component analysis. J. Chem. Phys.
139, 215102 (2013)
39. Schwantes, C.R., Pande, V.S.: Improvements in Markov state model construction reveal many
non-native interactions in the folding of NTL9. J. Chem. Theor. Comput. 9, 2000 (2013)
40. Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G., Noé, F.: Identification of slow
molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013)
41. Schütte, C., Fischer, A., Huisinga, W., Deuflhard, P.: A direct approach to conformational
dynamics based on Hybrid Monte Carlo. J. Comput. Phys. 151, 146 (1999)
42. Chodera, J.D., Singhal, N., Pande, V.S., Dill, K.A., Swope, W.C.: Automatic discovery of
metastable states for the construction of Markov models of macromolecular conformational
dynamics. J. Chem. Phys. 126, 155101 (2007)
43. Buchete, N.V., Hummer, G.: Coarse master equations for peptide folding dynamics. J. Phys.
Chem. B 112, 6057 (2008)
44. Bowman, G.R., Beauchamp, K.A., Boxer, G., Pande, V.S.: Progress and challenges in the
automated construction of Markov state models for full protein systems. J. Chem. Phys. 131,
124101 (2009)
45. Prinz, J.H., Wu, H., Sarich, M., Keller, B., Senne, M., Held, M., Chodera, J.D., Schütte, C.,
Noé, F.: Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 134,
174105 (2011)
46. Chodera, J.D., Noé, F.: Markov state models of biomolecular conformational dynamics. Curr.
Opin. Struct. Biol. 25, 135 (2014)
47. Noé, F., Clementi, C.: Collective variables for the study of long-time kinetics from molecular
trajectories: theory and methods. Curr. Opin. Struct. Biol. 43, 141 (2017)
48. Irbäck, A., Mitternacht, S., Mohanty, S.: An effective all-atom potential for proteins. BMC
Biophys. 2, 2 (2009)
49. Irbäck, A., Mohanty, S.: Folding thermodynamics of peptides. Biophys. J. 88, 1560 (2005)
50. Mitternacht, S., Luccioli, S., Torcini, A., Imparato, A., Irbäck, A.: Changing the mechanical
unfolding pathway of FnIII10 by tuning the pulling strength. Biophys. J. 96, 429 (2009)
51. Jónsson, S.Æ., Mohanty, S., Irbäck, A.: Distinct phases of free α-synuclein – a Monte Carlo
study. Proteins 80, 2169 (2012)
52. Mohanty, S., Meinke, J.H., Zimmermann, O.: Folding of Top7 in unbiased all-atom Monte
Carlo simulations. Proteins 81, 1446 (2013)
53. Bille, A., Jónsson, S.Æ., Akke, M., Irbäck, A.: Local unfolding and aggregation mechanisms
of SOD1 – a Monte Carlo exploration. J. Phys. Chem. B 117, 9194 (2013)
54. Jónsson, S.Æ., Mitternacht, S., Irbäck, A.: Mechanical resistance in unstructured proteins.
Biophys. J. 104, 2725 (2013)
55. Petrlova, J., Bhattacherjee, A., Boomsma, W., Wallin, S., Lagerstedt, J.O., Irbäck, A.: Confor-
mational and aggregation properties of the 1–93 fragment of apolipoprotein A-I. Protein Sci.
23, 1559 (2014)
56. Favrin, G., Irbäck, A., Mohanty, S.: Oligomerization of amyloid Aβ16−22 peptides using hydro-
gen bonds and hydrophobicity forces. Biophys. J. 87, 3657 (2004)
57. Cheon, M., Chang, I., Mohanty, S., Luheshi, L.M., Dobson, C.M., Vendruscolo, M., Favrin,
G.: Structural reorganisation and potential toxicity of oligomeric species formed during the
assembly of amyloid fibrils. PLOS Comput. Biol. 3, e173 (2007)
58. Irbäck, A., Mitternacht, S.: Spontaneous β-barrel formation: an all-atom Monte Carlo study of
Aβ(16–22) oligomerization. Proteins 71, 207 (2008)
59. Li, D., Mohanty, S., Irbäck, A., Huo, S.: Formation and growth of oligomers: a Monte Carlo
study of an amyloid tau fragment. PLOS Comput. Biol. 4, e1000238 (2008)
60. Mitternacht, S., Staneva, I., Härd, T., Irbäck, A.: Monte Carlo study of the formation and
conformational properties of dimers of Aβ42 variants. J. Mol. Biol. 410, 357 (2011)
61. Irbäck, A., Mohanty, S.: PROFASI: a Monte Carlo simulation package for protein folding and
aggregation. J. Comput. Chem. 27, 1548 (2006)
62. Favrin, G., Irbäck, A., Sjunnesson, F.: Monte Carlo update for chain molecules: biased Gaussian
steps in torsional space. J. Chem. Phys. 114, 8154 (2001)
63. Dodd, L.R., Boone, T.D., Theodorou, D.N.: A concerted rotation algorithm for atomistic Monte
Carlo simulation of polymer melts and glasses. Mol. Phys. 78, 961 (1993)
64. Zamuner, S., Rodriguez, A., Seno, F., Trovato, A.: An efficient algorithm to perform local
concerted movements of a chain molecule. PLOS One 10, e0118342 (2015)
65. Irbäck, A., Jónsson, S.Æ., Linnemann, N., Linse, B., Wallin, S.: Aggregate geometry in amyloid
fibril nucleation. Phys. Rev. Lett. 110, 058101 (2013)
66. Irbäck, A., Wessén, J.: Thermodynamics of amyloid formation and the role of intersheet inter-
actions. J. Chem. Phys. 143, 105104 (2015)
67. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo simulation of spin glasses. Phys. Rev. Lett.
57, 2607 (1986)
68. Neuhaus, T., Hager, J.S.: Free-energy calculations with multiple Gaussian modified ensembles.
Phys. Rev. E 74, 036702 (2006)
69. Kim, J., Straub, J.E.: Generalized simulated tempering for exploring strong phase transitions.
J. Chem. Phys. 133, 154101 (2010)
70. Lindahl, V., Lidmar, J., Hess, B.: Accelerated weight histogram method for exploring free
energy landscapes. J. Chem. Phys. 141, 044110 (2014)
71. Scherer, M.K., Trendelkamp-Schroer, B., Paul, F., Pérez-Hernández, G., Hoffmann, M., Plat-
tner, N., Wehmeyer, C., Prinz, J.H., Noé, F.: PyEMMA 2: a software package for estimation,
validation, and analysis of Markov models. J. Chem. Theor. Comput. 11, 5525 (2015)
72. Seeber, M., Felline, A., Raimondi, F., Muff, S., Friedman, R., Rao, F., Caflisch, A., Fanelli,
F.: Wordom: A user-friendly program for the analysis of molecular structures, trajectories, and
free energy surfaces. J. Comput. Chem. 32, 1183 (2010)
73. Biarnés, X., Pietrucci, F., Marinelli, F., Laio, A.: METAGUI. A VMD interface for analyzing
metadynamics and molecular dynamics simulations. Comput. Phys. Commun. 183, 203 (2012)
74. Harrigan, M.P., Sultan, M.M., Hernández, C.X., Husic, B.E., Eastman, P., Schwantes, C.R.,
Beauchamp, K.A., McGibbon, R.T., Pande, V.S.: MSMBuilder: statistical models for biomolec-
ular dynamics. Biophys. J. 112, 10 (2017)
75. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28, 129 (1982)
76. Kube, S., Weber, M.: A coarse graining method for the identification of transition rates between
molecular conformations. J. Chem. Phys. 126, 024103 (2007)
77. Djurdjevac, N., Sarich, M., Schütte, C.: Estimating the eigenvalue error of Markov state models.
Multiscale Model. Simul. 10, 61 (2012)
78. Prinz, J.H., Chodera, J.D., Noé, F.: Spectral rate theory for two-state kinetics. Phys. Rev. X 4,
011020 (2014)
Molecular Dynamics Studies
on Amyloidogenic Proteins

Sylwia Rodziewicz-Motowidło, Emilia Sikorska and Justyna Iwaszkiewicz

Abstract Molecular dynamics (MD) simulations, coupled with experimental investigations,
could improve our understanding of the aggregation and fibrillization of amyloidogenic
proteins. Computational tools are being applied to the protein aggregation and
fibrillization problem, providing insight into amyloid structures and aggregation
mechanisms. Experimental studies of protein aggregation are unfortunately limited by
the structure of the aggregates and their insolubility in water. These difficulties have
stimulated the development of new experimental methods and intensive efforts to match
computational results with the results of experimental investigations. The number of
papers published on simulations of amyloidogenic proteins has increased rapidly during
the last decade. The simulated systems range from simple peptides (Alzheimer Aβ peptides
or peptide fragments of amyloidogenic proteins) to large proteins (transthyretin, prion
protein, cystatin C, β2-microglobulin, etc.). In studies of aggregation, the integration
of experimental and computational methods is very important. Computational simulations
constitute an “analytical tool” for obtaining and processing biological information, for
explaining the physicochemical principles of amyloidogenesis, and for understanding the
role of amino-acid sequences in amyloidogenic proteins. Very efficient theoretical models
for predicting protein aggregation propensities from primary structures have been
proposed. At a minimal computational cost, some of these models can identify putative
aggregation-prone regions (“hot-spots”) within a protein sequence. In silico simulations
thus increase our understanding of the protein aggregation process. This chapter presents
MD studies of amyloidogenic proteins such as prion protein, transthyretin and human
cystatin C. The MD studies of these proteins reveal the first steps of amyloid formation.
MD studies of protein fibrils are also presented.
Based on MD simulations of fibril models it is possible to interpret some experimental
results and suggest a mechanism of elongation for the fibril protofilament formation.

S. Rodziewicz-Motowidło (✉) · E. Sikorska
Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
e-mail: s.rodziewicz-motowidlo@ug.edu.pl
J. Iwaszkiewicz
Swiss Institute of Bioinformatics, Molecular Modeling Group, Bâtiment Genopode, Quartier
Sorge, 1015 Lausanne, Switzerland

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_14

1 Introduction

Amyloid is an insoluble protein (or peptide) aggregate with a characteristic fiber-like
shape. Abnormal accumulation of amyloids in tissues and organs can cause diseases
called amyloidoses. The name “amyloid” was proposed by Virchow in the mid-19th century
[1]. At that time, it referred to a starch-like material (it could be stained with iodine
and sulfuric acid, giving a light blue color) and was used to describe an abnormal
extracellular structure observed in post-mortem examination of the liver. Later, Virchow
found that amyloids absorb Congo red and turn the transmitted light red. Amyloidoses
constitute a serious social problem, since they affect a significant percentage of the
human population. Amyloidosis is caused by abnormal folding of proteins, which disturbs
their spatial structure [2]. Incorrect protein folding occurs as a result of mutations in
the amino acid sequence or because of contact with an infectious protein species. Apart
from the intrinsic amyloidogenic potential of a misfolded protein, other factors can
synergistically contribute to amyloid accumulation. These factors include the patient’s
age and altered physiological conditions (low pH, oxidative conditions, elevated
temperature, suppressed proteolysis, metal ions, increased concentration of
homocysteine, etc.), which shift the equilibrium towards the amyloidogenic, incompletely
folded state. Different paths of a denatured or partly unfolded protein are presented in
Fig. 1. The protein turns from its native form into an insoluble form, leading to the
formation of fibrils (amyloid deposits) that accumulate in various organs, such as the
brain, heart, kidneys or liver. Because of their pathogenesis, the resulting diseases are
also called conformational diseases. The most frequent diseases associated with the
accumulation of amyloid deposits are Alzheimer’s disease (amyloid β peptide, Aβ),
Creutzfeldt–Jakob disease (prion protein), Parkinson’s disease (α-synuclein), and
Huntington’s disease (huntingtin protein). Amyloid diseases can be hereditary (sometimes
characteristic of certain races or ethnic groups) or infectious.
It is believed that the fibrils, or the intermediates leading to their formation, are
toxic. Amyloidoses are most often diagnosed in middle-aged people, and the diagnosis is
often difficult because of the lack of specific symptoms and the uncertain clinical
picture. The only way to confirm amyloidosis is to detect amyloid deposits in tissues. To
confirm the disease, ELISA tests are performed, in which specific antibodies bind to
amyloid. In spite of the variety of protein precursors, all amyloid fibers seem to have a
very similar structure and show relatively strong resistance to proteolysis [3].
Understanding the mechanisms leading to the formation of abnormal protein deposits is a
huge challenge for scientists, since these processes occur more and more frequently as
human life expectancy increases.
Protein synthesis takes place on ribosomes, which are densely distributed throughout the
cell. Ribosomes consist of two subunits, small and large, and play the role of a protein
“factory” in the cell. Proteins are folded in the ribosome within seconds, thus gaining
their secondary structure. Tertiary structure is obtained within a few minutes in the
cytosol or endoplasmic reticulum [4]. This process is assisted by additional proteins
(so-called chaperones) and by disulfide isomerases. Before a protein is transported to
its final destination, it is subjected to a control that rejects misfolded proteins.
Properly folded proteins enter the Golgi apparatus, from where they are directed into the
cytosol. Misfolded proteins are intercepted by the proteasome and “digested” by a group
of proteolytic enzymes. If a misfolded protein is not intercepted and “digested” in time,
it gets into the cell, where it can be “repaired”. Unfortunately, despite this
sophisticated quality control system, aggregates and amyloid deposits may still form.

Fig. 1 Formation of proteins and various paths of denatured or partly unfolded proteins
(according to [6])
Oligomerization of proteins may occur spontaneously. This process can have
physiological functions or can be an adverse phenomenon. Both pathological and normal
processes are based on the same mechanisms. It is generally accepted that in the initial
stages of amyloid fibril formation monomeric proteins adopt a partly unfolded
conformation caused by partial denaturation or misfolding. As a result, hydrophobic areas
of the protein become exposed to the solvent, which promotes aggregation. The first stage
of amyloid fibril formation, common to all processes of amyloid aggregation, is the
formation of metastable oligomers. During the phase preceding amyloid formation, some
amyloid proteins have been shown to form round, non-fibrillar structures resembling
little “tires”. These “tires” can form channels (pores) in the cell membrane. It is now
thought that they, rather than the amyloid fibrils, are the toxic and pathogenic factor.
The round intermediate structures have been observed during the formation of amyloid
fibrils of amyloid β peptide (Aβ) [5], transthyretin [6], insulin [7], β2-microglobulin
[8], immunoglobulin light chains [9], lysozyme [10] and cystatin C [11]. It is assumed
that these aggregates share common structural features, since oligomers formed through
the aggregation of different proteins bind specific anti-oligomer antibodies [12]. On the
other hand, they also show different characteristics, because in some cases the oligomers
retain the original, native structure of the monomers to some extent [13]. There are
indications that oligomeric forms are highly cytotoxic, and it has been claimed that they
are the main pathogenic factor in many amyloidoses. The second stage of amyloid fibril
formation is the aggregation of the above-mentioned oligomers [14], which leads to the
formation of amyloid fibrils or amorphous deposits, so-called inclusion bodies, detected
for example in Parkinson’s disease.

2 Computational Simulations as a Tool for Characterization of Amyloidogenic Proteins

Experimental studies of protein aggregation are unfortunately limited by the
non-crystallizable structure of the aggregates, their insolubility in water and, often,
their association with the cell membrane. These difficulties have stimulated the use of
computational methods in studies of amyloid structure, the development of new
experimental methods, and intensive efforts to match computational results with the
results of experimental investigations. The number of papers published on simulations of
amyloidogenic proteins has increased rapidly during the last decade. The simulated
systems range from simple peptides (Alzheimer Aβ peptides or peptide fragments of
amyloidogenic proteins) [15–18] to large proteins (transthyretin, prion protein, cystatin
C, β2-microglobulin, etc.) [19–25]. In studies of aggregation, the complementarity and
comparison of experimental and computational results are very important. Computational
simulations constitute an “analytical tool” to explain the mechanisms of amyloidogenesis
and to understand the role of amino-acid sequences in amyloidogenic proteins. Theoretical
methods for predicting protein aggregation propensities from the primary sequence have
been proposed [26–28]. These computational methods can predict putative aggregation-prone
regions (“hot-spots”) within a protein sequence, whose experimental determination is very
expensive and time-consuming. In general, in silico simulations increase our
understanding of the protein aggregation process.
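
The idea of locating aggregation-prone regions from the sequence alone can be illustrated with a deliberately simplified sketch. It does not reproduce the predictors cited in [26–28]. As a stand-in, it averages the Kyte-Doolittle hydropathy scale over a sliding window and reports the stretches that score above a threshold. The window length, the threshold and the function names are illustrative assumptions.

```python
# Toy sliding-window scan for hydrophobic stretches in a protein sequence.
# Assumption: Kyte-Doolittle hydropathy serves as a crude stand-in for an
# aggregation-propensity scale; published predictors use more elaborate scoring.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(sequence, window=5):
    """Mean hydropathy of every window of length `window` along the sequence."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


def hot_spots(sequence, window=5, threshold=2.0):
    """Merge overlapping windows scoring above `threshold` into segments.

    Returns a list of (first, last) residue numbers, 1-based and inclusive.
    """
    segments = []
    for i, score in enumerate(window_scores(sequence, window)):
        if score <= threshold:
            continue
        first, last = i + 1, i + window
        if segments and first <= segments[-1][1] + 1:
            # Overlaps or touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], last))
        else:
            segments.append((first, last))
    return segments


if __name__ == "__main__":
    # Sequence of the Alzheimer Abeta(1-42) peptide.
    abeta42 = "DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA"
    for first, last in hot_spots(abeta42):
        print(first, last, abeta42[first - 1:last])
```

A scan of this kind takes a fraction of a second per sequence, which is the sense in which sequence-based prediction is cheap compared with experimental mapping of the same regions.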
In recent years, many theoretical methods have been applied to model fibril formation,
but most simulation studies have aimed at understanding the molecular mechanism of
protein aggregation. Currently, MD simulations are the major computational tool used to
help define the structure of many molecular systems, including amyloid proteins and
fibrils. MD is now an important tool for understanding conformational and aggregation
phenomena at the molecular level [29].
Different algorithms and parameters have been used, depending on the problem to be
solved. Solvation is treated either explicitly [30, 31] or with the Generalized Born
model [32]. Traditional all-atom simulations with explicit water or another solvent are
most often used to test the stability of β-structures in amyloid fibrils and oligomers
[17, 33]. Simulations can also be used to study the conformational changes of native or
intermediate states that trigger amyloid fibril formation [34, 35]. Because in
traditional MD simulations the protein molecule can get trapped in a local minimum,
enhanced sampling techniques such as replica exchange MD (REMD) are often used to
overcome this problem. REMD is particularly useful for simulating the large
conformational changes involved in the association of misfolded proteins [36, 37]. The
conformational changes of proteins and aggregation processes have also been simulated
with discrete molecular dynamics (DMD) [15, 30] and with the ‘activation-relaxation
technique’ [31]. Coarse-grained models at a number of resolutions have been applied to
model protein folding and aggregation, because they allow longer simulation times at a
lower computational cost than all-atom simulations [38, 39]. United-atom MD simulations
have been used, for example, to investigate Aβ peptide folding [40]. Coarse-grained models
(united-atom, united-residue, or other) have also been applied to study protein
aggregation [39, 41]. Another approach uses simplified models in which the polypeptide
chain is represented by a tube and the interactions between amino acids are determined by
geometry and symmetry [42]. Very interesting results were obtained by combining the REMD
technique with a united-residue model to study the Aβ peptide aggregation process [43].
Using this method, Smith and Hall proposed a description of the mechanism of fibril
growth at the molecular level (see Fig. 2).
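
The way REMD helps a trapped replica escape a local minimum can be made concrete with the standard Metropolis exchange criterion for temperature replica exchange. The sketch below is a minimal illustration, not the protocol of any study cited above. The energy units (kcal/mol), the function names and the example temperatures are assumptions.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); energies assumed in kcal/mol


def exchange_probability(energy_i, energy_j, temp_i, temp_j):
    """Metropolis acceptance probability for swapping two replicas.

    Replica i (potential energy energy_i) runs at temp_i, replica j at temp_j.
    P = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]), with beta = 1 / (k_B T).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(energies, temperatures, rng, offset=0):
    """Attempt swaps between neighbouring temperatures (pairs k, k + 1).

    energies[k] is the energy of the configuration currently held at
    temperatures[k]; accepted swaps exchange the entries in place.
    Returns the list of accepted index pairs.
    """
    accepted = []
    for k in range(offset, len(temperatures) - 1, 2):
        p = exchange_probability(energies[k], energies[k + 1],
                                 temperatures[k], temperatures[k + 1])
        if rng.random() < p:
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
            accepted.append((k, k + 1))
    return accepted


if __name__ == "__main__":
    rng = random.Random(1)
    temps = [300.0, 320.0, 340.0, 360.0]       # hypothetical temperature ladder
    energies = [-100.0, -120.0, -90.0, -95.0]  # hypothetical energies
    print(attempt_swaps(energies, temps, rng), energies)
```

A swap is always accepted when the colder replica holds the higher-energy configuration, so low-energy structures found at high temperature migrate down the ladder, while structures stuck at low temperature are periodically heated.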

3 Understanding the Instability of Amyloidogenic Proteins

Conformational instability and fluctuations of the monomeric forms of amyloidogenic
proteins may be directly connected with their propensity to aggregate [44]. A
conformational change from the native state of the protein is the first and necessary
step toward amyloid formation. Both α-helical and β-sheet proteins can form amyloid
fibrils, but proteins that already have extensive β-sheets in their native state require
only a slight conformational change to form amyloid. The situation is more complex for
proteins in which extensive secondary-structure or global conformational changes take
place upon amyloid formation. For several proteins, larger conformational deviations from
the native state have been postulated. One of the proposed mechanisms leading to amyloid
formation in larger proteins is domain swapping [45, 46]. Deviations from the native
structure have also been related to the formation of “α-pleated” sheet structure [47]. It
was postulated that the stability of the intermediate state may control the rate of
amyloid formation, the “optimal” intermediate for amyloid formation being one of medium
stability [48].

Fig. 2 Selected snapshots along a representative trajectory of a monomer binding to a
four-chain fibril. The monomer is initially placed in an extended conformation and
positioned 20 Å away from the end of the fibril [43]
For proteins with stable native structures, the folding characteristics as well as the
intermediate state affect the rate of aggregation [41]. The denatured or intermediate
state can be obtained in simulations by using high temperature or low/high pH to provide
unfolding conditions [35, 49]. Unfolding simulations explore the available conformational
space at several points of the unfolding process, thus supplying atomic-scale details to
experimental studies of protein unfolding [50]. MD simulations have been carried out for
amyloidogenic proteins under various unfolding conditions (high temperature, low pH) and
for amyloidogenesis-prone variants. Examples of proteins for which MD studies were
performed to understand the connection between the misfolded structure and amyloid
formation are: prion protein and its mutants [51, 52], transthyretin [20, 53, 54],
β2-microglobulin [19], cystatin C [25], the WW domain, which is a protein domain with two
highly conserved tryptophan residues that binds proline-rich peptide motifs [55], and the
immunoglobulin light chain and its mutants [56]. Simulations at slightly elevated
temperature are appropriate to enhance sampling in the vicinity of the native-like state.
Unfolding simulations at high temperature show that proteins unfold so quickly that, in
some cases, it is not possible to observe the temperature-dependent amyloidogenic
potential. Therefore, in some cases, simulations at 300–350 K are performed, which not
only serve as controls but also provide important information about flexibility under
physiological conditions. The MD results for a few amyloidogenic proteins and their
mutants/variants under different conditions are described below.

3.1 Prion Protein

Prion diseases are neurodegenerative disorders characterized by deposits of misfolded
prion protein (PrPSc) in various regions of the brain, depending on the disease. Prion
diseases, collectively called transmissible spongiform encephalopathies (TSEs), can be
infectious, genetic, or sporadic, and are untreatable and fatal [57, 58]. The normal form
of prion protein (PrPC) is a secreted cell-surface glycoprotein whose mature form is made
up of residues 23–231. PrPC is attached to the cell membrane via a
glycosylphosphatidylinositol anchor at its C terminus [59]. It has a single disulfide
bridge and two glycosylation sites. The amino acid sequences of the normal form (PrPC)
and the toxic form (PrPSc) are identical, but the two forms differ in conformation [59,
60]. PrPC has an unstructured N-terminal tail (residues 23–120) and a structured
C-terminal domain (residues 121–231) containing a small, two-stranded β-sheet and three
α-helices [61]. Though the exact structure of PrPSc is not known, studies indicate that
it contains a significant amount of β-sheet structure [62–66] in the region 90–145 of the
human sequence. MD simulations have suggested that the C-terminal α-helical region may
have a tendency to form β-sheets [67, 68], which was supported by site-directed
spin-labeling studies of in vitro grown fibrils [69]. The conversion of prion protein
from its cellular form to its toxic, scrapie form is the key event at the onset of prion
diseases [70]. However, the molecular basis of this conversion is currently unknown.
As for other amyloidogenic proteins, MD simulations of prion protein were employed to
understand the transformation between the native and the misfolded structure. Daggett and
coworkers suggested that the low-pH-induced conversion of hamster prion protein to the
scrapie form might start with an extension of the N-terminal β-sheet, as indicated by a
10-ns MD simulation at low pH, and proposed a crucial role for the Asp178 residue [52].
However, Gsponer et al. did not find substantial differences between the flexibility of
the wt murine prion protein and that of its aggregation-prone Asp178Asn mutant in
nanosecond-scale simulations [51]. Nevertheless, simulations of wild-type human prion
protein and the Asp178Asn mutant at 500 K confirmed the lower stability of the variant
[71]. Parchment et al. performed MD simulations of mouse and Syrian hamster PrPC, probing
their stability and discussing the implications of the differences for activity [72].
Zuegg et al. studied human prion protein, emphasizing the importance of proper treatment
of electrostatic interactions in MD simulations [73]. Liu and coworkers compared the
influence of low pH and temperature on human prion protein using explicit-water
simulations; they found different unfolding paths triggered by these two factors and
emphasized the greater impact of low pH on prion stability [49]. In line with that,
spectroscopic data show a strong pH dependence of PrP stability and conformation
[74–78]. An equilibrium unfolding intermediate of PrP125–228 that shows spectral
characteristics similar to those of β-sheet proteins has been observed exclusively at
acidic pH [74]. Both an acidic and a high-temperature environment can lead to partial
unfolding of the PrP protein. MD simulations point to the high flexibility of the loop
167–171 and of the loop between helix 2 and helix 3. The high flexibility of these two
loops may cause the characteristic instability of the PrP protein [51, 52, 71], which was
also confirmed by NMR studies [61].
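
Flexibility of a loop in an MD trajectory is commonly quantified by the root-mean-square fluctuation (RMSF) of its atoms about their average positions. The text does not state which measure the cited studies used, so the sketch below is only an assumed, minimal illustration. It expects frames that have already been superimposed on a common reference, and the data layout and function name are hypothetical.

```python
import math


def rmsf(frames):
    """Per-atom root-mean-square fluctuation.

    `frames` is a list of snapshots; each snapshot is a list of (x, y, z)
    tuples, one per atom, in the same atom order. The frames are assumed to
    be already superimposed on a common reference structure.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    values = []
    for atom in range(n_atoms):
        # Average position of this atom over the trajectory.
        mean = [sum(frame[atom][d] for frame in frames) / n_frames
                for d in range(3)]
        # Mean squared deviation from the average position.
        msd = sum(
            sum((frame[atom][d] - mean[d]) ** 2 for d in range(3))
            for frame in frames
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


if __name__ == "__main__":
    # Two-frame toy trajectory: atom 0 is rigid, atom 1 moves along x.
    toy = [
        [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    ]
    print(rmsf(toy))
```

Applied to the Cα atoms of a protein, a profile of this kind shows loops as peaks and helices or strands as valleys, which is how a flexible region such as a surface loop would stand out.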
The MD studies also pointed to the delicate stability of the PrP native structure and the
great impact of disturbed electrostatic interactions on the wt conformation. The main
conformational changes observed were the extension of the already present β-sheet and a
different position and structure of helix H1 and the adjacent S1-H1 loop.
Mutation of some amino acids in prion protein can influence its conformational
transition from PrPC to PrPSc [79–82]. Human familial prion diseases are associated with
about 40 point mutations of the gene coding for prion protein (PrP), most of which are
located in the globular domain of the protein [83]. Many simulations were performed on
prion protein variants involved in prion diseases, e.g. D202N, E211Q, Q217R [79], D178N
[71], and protonated Asp202 and Glu196 [79]. As most of the destabilizing mutations
involve polar residues, special attention should be paid to the proper treatment of
electrostatic interactions. Zuegg and Gready [73] and El-Bastawissy et al. [71] reported
that stabilization of the native structure of PrPC could only be achieved by treating the
long-range electrostatic interactions with the particle mesh Ewald (PME) method and by
neutralizing the system with counterions.
All-atom MD simulations of the D202N, E211Q, and Q217R variants, which carry
substitutions in the third native α-helix of human PrP (see Fig. 3), show that the
globular domain was stable during the simulations of wt PrP and its variants, with only
minor changes in the secondary structure, although an increase in the solvent-accessible
area was also reported. The results indicate that the substitutions have subtle effects
on the protein structure but substantially influence the electrostatic potential
distribution. These changes may affect intermolecular interactions and facilitate the
aggregation process [79]. MD studies of the D178N PrP variant by Gsponer and coworkers
showed only a slight increase in β-sheet content and no other significant structural
changes [51]. The authors suggested that the Arg164–Asp178 salt bridge did not seem to
contribute to the overall stability of mPrPC. In contrast, all-atom simulations of mouse
and Syrian hamster PrPC indicated the importance of three salt bridges
(Glu146/Asp144–Arg208, Arg164–Asp178, Arg156–Glu196) for the stability of PrPC [72]. Gu
et al. investigated the roles of Glu196 and Asp202 in salt bridge formation with MD
simulations by studying the effect of their protonation [49]. In these simulations, some
conformational changes, such as partial unfolding of helix 2, bending of helix 3, or
elongation of the overall structure without bending of helix 3, could be observed. The
results indicated that the elimination of even a single charge at certain positions may
significantly disturb the native conformation [49].
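
One simple way to judge from a trajectory whether a salt bridge such as Arg164–Asp178 persists is to count the fraction of frames in which the two charged groups lie within a distance cutoff. The sketch below is an assumed illustration, not the analysis used in [51], [72] or [49]. The 4 Å cutoff, the use of one representative atom per charged group and the function names are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in range(3)))


def salt_bridge_occupancy(acid_positions, base_positions, cutoff=4.0):
    """Fraction of frames in which a salt bridge is formed.

    acid_positions[k] and base_positions[k] are the (x, y, z) coordinates, in
    angstroms, of one representative atom of the acidic and the basic group
    in frame k. The bridge counts as formed when the distance is at most
    `cutoff` (4 angstroms here, an assumed criterion).
    """
    if len(acid_positions) != len(base_positions):
        raise ValueError("both groups need one position per frame")
    if not acid_positions:
        raise ValueError("at least one frame is required")
    formed = sum(
        1 for a, b in zip(acid_positions, base_positions)
        if distance(a, b) <= cutoff
    )
    return formed / len(acid_positions)


if __name__ == "__main__":
    # Hypothetical four-frame trajectory: the bridge breaks halfway through.
    acid = [(0.0, 0.0, 0.0)] * 4
    base = [(3.0, 0.0, 0.0), (3.5, 0.0, 0.0), (6.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
    print(salt_bridge_occupancy(acid, base))
```

Comparing the occupancy in the wild-type simulation with that in a mutant or protonated variant gives a single number per salt bridge that can be set against the observed structural changes.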
Fig. 3 NMR structure of the globular domain of wt human PrP (PDB ID: 1HJN), residues
125–228. The secondary structure elements of the C-terminal globular domain are labeled:
three α-helices (H1, H2, H3) and a very short anti-parallel β-sheet (S1, S2). The mutated
residues analyzed in another study [79] are shown as sticks

3.2 Transthyretin

Transthyretin (TTR) is a plasma protein responsible for the transport of thyroid
hormone. It also binds to retinol-binding protein, which in turn associates with retinol.
Structures of wild-type TTR and some of its amyloidogenic single-site mutants have been
determined by high-resolution X-ray crystallography [84, 85]. The native state of TTR is
a homotetramer. Each monomer contains eight β-strands (named A to H) organized into two
β-sheets (inner, DAGH, and outer, CBEF), which together form a β-sandwich. Two monomeric
units form a dimer through extensive hydrogen bonding between the adjacent H and F
strands of each monomer (Fig. 4).
Two dimers associate along a crystallographic twofold axis, thus forming a tetramer. A
central channel, surrounded by two DAGH sheets, runs through the center of the tetramer
and holds thyroxine molecules. A pathway of TTR amyloid formation has been proposed: the
TTR tetramer first dissociates into native monomers, which was shown to be a
rate-limiting step in the formation of fibrils [86–88]. Afterwards, the monomeric species
partially unfold to form aggregation intermediates. Once such intermediates are formed,
the self-assembly process that follows is a straightforward polymerization [89]. Tetramer
dissociation into monomers is necessary, but not sufficient, to initiate fibril
formation, because native monomers are non-amyloidogenic unless they are partially
denatured [90]. Conformational changes within the monomers are required for aggregation.
These changes can be facilitated either by partial denaturation (low pH or high
temperature) or by a point mutation.

Fig. 4 Three-dimensional structure of wt-TTR in the tetrameric form. The eight β-strands
are named from A to H. The inner sheet (DAGH) is shown at the front, whereas the outer
sheet (CBEF) is at the back [151]
The structural details of the amyloidogenic pathway of TTR remain unknown. Since
experimental studies of this phenomenon are very difficult, molecular dynamics
simulations are used to examine the intrinsic conformational properties of TTR and to
provide clues about the amyloidogenic transitions. The results of various experiments
suggest that, at pH 3.6–5.2, the amyloidogenic intermediate of transthyretin is partially
unfolded, with a partially disrupted though native-like tertiary structure. Therefore,
Daggett and coworkers studied monomeric TTR with MD simulations at neutral (pH 6–7),
medium (approximately pH 4–6), and low pH (approximately pH 2–4.2), as well as at
elevated temperatures [91]. The results obtained at low and medium pH, which span the
experimentally determined amyloidogenic pH range, showed destabilization of the
CBEF-sheet. The interactions of main-chain amide groups with the solvent observed during
the simulations were consistent with experimental studies (proteolysis and hydrogen
exchange data collected at pH 4.5) [92]. Based on their observations, the authors suggest
that TTR aggregation involves a transition from β-sheet to α-sheet secondary structure,
particularly in the DAGH-sheet.
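
In conventional fixed-charge MD, a pH regime is usually represented by fixing the protonation states of the titratable residues. The text does not state which protonation states were used in [91], so the sketch below only illustrates, through the Henderson–Hasselbalch relation, why pH windows like those quoted above map naturally onto different protonation patterns. The model pKa values are typical textbook figures taken here as assumptions; pKa values inside a folded protein can differ considerably.

```python
# Assumed model-compound pKa values for side chains (illustrative only).
MODEL_PKA = {"Asp": 3.9, "Glu": 4.3, "His": 6.0}


def protonated_fraction(ph, pka):
    """Henderson-Hasselbalch: fraction of a titratable group that is protonated."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def dominant_states(ph, pka_table=MODEL_PKA):
    """Dominant protonation state of each residue type at the given pH."""
    return {
        residue: "protonated" if protonated_fraction(ph, pka) > 0.5
        else "deprotonated"
        for residue, pka in pka_table.items()
    }


if __name__ == "__main__":
    for ph in (7.0, 5.0, 3.0):
        fractions = {r: round(protonated_fraction(ph, pka), 3)
                     for r, pka in MODEL_PKA.items()}
        print(ph, dominant_states(ph), fractions)
```

With these assumed pKa values, histidine is the first residue type to switch when the pH drops below neutral, followed by glutamate and aspartate at lower pH, which gives three distinct protonation patterns for the neutral, medium and low pH regimes.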
More than 80 disease-related mutations in the TTR protein have been identified so far
[93]. For example, the L55P, V30M, and Y116S variants form amyloid protofibrils after two
months of incubation under physiological conditions (pH 7.5 and 37 °C), whereas the
wild-type TTR protein is stable and non-amyloidogenic [94]. Comparison of all TTR crystal
structures available in the Protein Data Bank, including amyloidogenic variants, has led
to the conclusion that the structural differences between the wild-type protein and its
variants are insignificant [95]. Thus, additional information about dynamical behaviour
and stability is needed to highlight the molecular basis of their amyloidogenic
potential. MD simulations carried out for the L55P and V30M variants in an implicit
solvent model [53] showed that the D strand is intrinsically unstable, which is
consistent with recent X-ray data showing that the D strand is trapped in two discrete
conformations [96, 97]. The L55P mutation in the D strand results in large global
conformational changes in the inner sheets. Under partially denaturing conditions, the
L55P variant is more flexible than the wild-type protein and the V30M variant. The D
strand of wt-TTR can exist in two conformations: the native conformation and the
amyloidogenic fold, which resembles the surface loop of residues 54–55 of the L55P
variant. The authors provide a detailed description of the plausible changes in the rest
of the TTR structure leading to the amyloidogenic transition state [53].
Other MD studies performed for an important, amyloidogenic Y116S variant
of TTR indicate that this mutation leads to disruption of secondary structure and
the hydrogen bonds of the inner DAGH-sheet of the protein. The thyroxin binding
residues conformation is also affected and the overall instability of the Y116S leads
to amyloidogenesis [98].
Solvent behavior around TTR point mutations was studied using MD together with density
and spatial distribution entropy maps of the solvent [99]. The authors
found that water resides for a long time around stability-conferring mutations, whereas
water around amyloidogenic mutations exchanges rapidly with the bulk water. The
behavior of the solvent around these regions is probably crucial for folding and
aggregation processes [99].
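
The contrast between long-lived and rapidly exchanging water can be made concrete with a minimal sketch of a residence-time estimate. It assumes that, for each water molecule, a per-frame record of whether it lies within the hydration shell of the residue of interest has already been extracted from a trajectory. The function names, the frame spacing and the occupancy records are hypothetical and serve only as an illustration; this is not the density or entropy-map analysis used in [99].

```python
def residence_runs(in_shell):
    """Return lengths (in frames) of consecutive stretches spent inside the shell."""
    runs, current = [], 0
    for present in in_shell:
        if present:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def mean_residence_time(trajectories, dt_ps):
    """Average continuous residence time (ps) over all waters and all visits."""
    runs = [r for traj in trajectories for r in residence_runs(traj)]
    if not runs:
        return 0.0
    return dt_ps * sum(runs) / len(runs)


# Hypothetical occupancy records (1 = inside the shell) for three waters per site.
slow_site = [[1, 1, 1, 1, 1, 0], [0, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
fast_site = [[1, 0, 1, 0, 0, 1], [0, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0]]

slow_ps = mean_residence_time(slow_site, dt_ps=2.0)  # assumed 2 ps between frames
fast_ps = mean_residence_time(fast_site, dt_ps=2.0)
```

In this toy data the first site holds its water for several consecutive frames, whereas the second site exchanges water at almost every frame, which is the kind of difference the cited study describes qualitatively.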

3.3 Human Cystatin C

Human cystatin C (HCC) is a small cysteine proteinase inhibitor (120 amino acids)
present in all human body fluids at physiologically relevant concentrations [100].
The physiological role of HCC is to regulate the activity of endogenous cysteine
proteases [101]. HCC monomer structure consists of a core composed of a five-
stranded antiparallel β-sheet wrapped around a central α-helix. Two hairpin loops
(L1 and L2), together with the N-terminal fragment are involved in interactions with
target proteolytic enzymes [102]. In pathological processes, HCC and its mutant
(L68Q) form part of the amyloid deposits in the brain arteries of young adults, which
leads to brain hemorrhages and finally to death of patients with Hereditary Cystatin C
Amyloid Angiopathy [103–105]. Presumably the aggregation begins with formation
of a stable dimer through a mechanism of three-dimensional (3D) domain swapping
[45]. The L68Q variant of HCC forms dimers in human body fluids more easily than
the wild type [106]. Whereas the L68Q cystatin C variant undergoes dimerization at
human body temperature, dimerization of wt cystatin C is promoted by elevated
temperature, low pH or low levels of a denaturant [107–109]. To date, the only
X-ray structure of HCC is that of a domain-swapped symmetric dimer [45],
while the experimental structures of the monomer and dimer of the L68Q variant remain
elusive. Each of the two domain-swapped HCC dimer subunits is composed of an
α-helix and a β-strand coming from one molecule and a β-sheet coming from the other.
The linker region between the two subunits (βL) is formed by a new β-sheet (Ile56-
Gly59) structure, which corresponds to the L1 loop in monomeric HCC (Fig. 5).

Fig. 5 Superposition of the Cα atoms of the final wt (green) and L68Q (blue) cystatin C structures.
The small figure in the right corner shows the placement of Leu68 and Gln68 in a hydrophobic
pocket formed by the β-sheet and α-helix residues [25]

MD simulations of native cystatins and their variants were used as a tool to
analyze the influence of a single-point mutation on the secondary and tertiary
conformation [25, 110–112]. The MD results at the temperature of 300 K [111] or 308 K
[25] indicate that L68Q cystatin C monomer undergoes substantially bigger struc-
tural changes during the simulation than the wt cystatin C monomer. However, the
global structure remains native-like in both proteins, although some hydrogen bonds
between the β4 and β5 strands were broken. As a result, the β5 strand was disrupted in the
wt and L68Q molecules at the end of the simulations. Contrary to the experimental
data [113], no significant changes in the α-helix structures of the investigated
proteins were observed during the MD simulations. According to the simulations, the
fragments with the highest flexibility were the N- and C-termini, the AS structure, and the L1 and L2
loops. The L1 and L2 loops are less hindered and were more flexible during the MD
simulations in wt cystatin C than in the L68Q variant. Although the investigated
proteins adopted a very similar 3D structure shape, the dynamic properties of β1-α-
β2 fragment suggest that the β1-α-β2 fragment of L68Q cystatin C variant is more
dynamic, than the same fragment in the wt protein. This increased flexibility of the
exchangeable fragment could explain the higher tendency for dimer formation shown
by L68Q variant, in comparison with wt cystatin C. The studies also show that three
salt bridges: Glu20/Glu21-Lys54 (between helix and β2), Asp40-Arg70 (between
β2 and β3), and His43-Asp81 (between helix and AS), which act like “molecular
pins”, play essential roles in the stability of the monomeric HCC structure. In both
proteins, the number of salt bridges and hydrogen bonds connecting the swapped
domain with the rest of the structure is small. In L68Q monomer the salt bridges and
hydrogen bonds are weaker and have lower occupancy, than in wt cystatin C [25].
This confirms that the general feature of domain-swapping proteins having few (if
any) salt bridges and hydrogen bonds connecting swapped domains holds true also
for HCC [114]. Moreover, the large difference in the non-bonded interactions between
the α- and β-interfaces of the two monomers is an additional source of strong destabilization
of the L68Q variant and an additional driving force of the dimerization process [25]. In
addition, the mutation substituting a hydrophobic residue with a hydrophilic one makes
the interior core of the L68Q variant unstable, which facilitates domain swapping. MD
simulation results for the L68Q [25, 110, 111] and I108T [112] cystatin variants support
the hypothesis that mutations in the hydrophobic core might be associated with
3D domain swapping of cystatins and with amyloid formation.
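
The notion of salt-bridge occupancy used above can be illustrated with a minimal sketch. It assumes that occupancy is defined as the fraction of trajectory frames in which two oppositely charged atoms lie within a distance cutoff. The 4 Å cutoff, the function name and the coordinates are assumptions made for this example and are not taken from [25].

```python
import math

CUTOFF_ANGSTROM = 4.0  # assumed distance criterion, not taken from the cited studies


def salt_bridge_occupancy(acid_coords, base_coords, cutoff=CUTOFF_ANGSTROM):
    """Fraction of frames in which the two charged atoms are within the cutoff."""
    if len(acid_coords) != len(base_coords):
        raise ValueError("both atoms need one coordinate per frame")
    if not acid_coords:
        return 0.0
    formed = sum(
        1 for a, b in zip(acid_coords, base_coords) if math.dist(a, b) <= cutoff
    )
    return formed / len(acid_coords)


# Hypothetical per-frame positions (in Å) of a carboxylate oxygen and a
# guanidinium nitrogen, e.g. for an Asp-Arg pair; four frames each.
acid = [(0.0, 0.0, 0.0)] * 4
wt_base = [(3.0, 0.0, 0.0), (3.5, 0.0, 0.0), (3.2, 0.0, 0.0), (5.0, 0.0, 0.0)]
mutant_base = [(3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (7.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

wt_occupancy = salt_bridge_occupancy(acid, wt_base)
mutant_occupancy = salt_bridge_occupancy(acid, mutant_base)
```

With such a definition, a "weaker" salt bridge in one variant shows up as a lower occupancy for the same residue pair. In practice several atom pairs per residue pair would usually be monitored.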
To investigate the atomic details of the conformational changes of cystatins
responsible for the early stages of amyloid formation, MD simulations at high
temperature (500 K) and different pH levels (pH 2, 4, and 7) were performed [110,
111]. The elevated temperature significantly destabilizes the tertiary structures of both monomeric
L68Q and wt HCC, particularly under low pH conditions, since this
environment disrupts the salt bridges. When the salt bridges are destroyed, further
protein unfolding, leading to domain swapping, may be induced in both HCC and its
amyloidogenic L68Q variant. In addition, the MD simulations at 500 K demonstrate
that the disruption of the hydrophobic core at elevated temperature results in the helix
movement away from the β-region and that high temperature (500 K) and low pH
(pH 2) trigger the domain swapping process of HCC. Based on the MD results, the
domain swapping mechanism of HCC was proposed which follows four steps: (1)
the α-helix moves away from the β region; (2) contacts between β2 and β3-AS are
disrupted; (3) β2-L1- β3 hairpin unfolds; and finally (4) HCC dimer is formed (see
Fig. 6) [110, 111].

Fig. 6 The proposed mechanism of domain swapping in monomeric HCC. a The closed-form of
monomeric HCC with a hydrophobic core intact; b partially unfolded monomeric HCC with a
disrupted hydrophobic core; c partially unfolded monomeric HCC with the central helix moving
away from the β-region; d partially unfolded monomeric HCC with the β2-L1-β3 hairpin unfolded
via destruction of three salt bridges following the “zip-up” mechanism; and e open-form structure
of monomeric HCC [111]

Staniforth et al. [115] proposed a possible mechanism of dimerization for the
cystatin protein family in which the role of a “molecular spring” is played by a
conserved valine residue in the β2-L1-β3 loop. In the crystallographic structures of stefin
B, a backbone torsional angle of the Val55 residue (Val57 in HCC) lies in the unfavoured region of
the Ramachandran plot [116, 117]. The conformation of Val55 and other conserved
Val residues in the L1 loop of cystatins might be important for the interactions with
the inhibited enzyme. MD simulations of cystatin C
fragments containing point mutations at position Val57 confirm the significance of
this position in the L1 loop of human cystatin C for the loop structure [118]. We replaced
Val57 in the L1 loop with residues known to stabilize (Asp, Asn) or destabilize (Pro)
β-turns in proteins and conducted MD simulations of the mutated loops and of the wt loop.
We observed the expansion of the wt HCC L1 loop that may have been caused by
an alleviation of distortions present in the loop with Val57. During MD simulation
of HCC monomer the size of L1 loop remains stable (data not shown), which is
probably caused by the interactions with the rest of the protein not allowing the
expansion of L1 loop. The L1 loops with V57N and V57D mutations do not expand
during MD simulations whereas the loop with the V57P mutation expands to greater
extent, compared with the wt loop. This implies that the residue in position 57 is
of great importance to the conformation of the β2-L1-β3 fragment of HCC. It seems
that the Val57 residue, whose conformation is imposed by interactions with the
rest of the protein and may be strained, has an intrinsic tendency to expand the loop and
thereby adopt a more favorable conformation. In addition to the influence of the L68Q mutation on the
stability of the hydrophobic part of the protein, the tendency of the L1 loop to expand
may trigger the partial unfolding of the HCC monomer, leading to dimerization and
oligomerization. The opening of the monomeric HCC structure takes place only in
the L68Q mutant or in native HCC protein under denaturing conditions. This suggests
that the strained Val57 conformation in the L1 loop of the HCC protein does not
provide sufficient force to open the monomeric structure by itself, but can provide such
force when combined with other mutations or under denaturing conditions [118].
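
Whether a residue such as Val57 falls in an unfavoured region of the Ramachandran plot is decided by its backbone torsional angles, which can be computed from the positions of four consecutive backbone atoms (C(i-1), N, Cα, C for φ; N, Cα, C, N(i+1) for ψ). The following is a minimal sketch of such a dihedral calculation; the function names and test coordinates are illustrative assumptions and are not taken from the cited studies.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral_deg(p0, p1, p2, p3):
    """Signed torsional angle (degrees) defined by four points; cis is 0."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Idealized test geometries (arbitrary units), not protein coordinates.
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter_turn = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Applied to every frame of a trajectory, such a function gives the time course of φ and ψ for a chosen residue, which is one simple way to follow the relaxation of a strained conformation.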

4 Protein Aggregation—Protofibril Structure

Polypeptides and proteins able to form amyloid do not share any common struc-
tural features. However, amyloid deposits show homogenous morphology. X-ray
diffraction images of amyloid fibrils show characteristic reflections: a meridional one
at around 4.75 Å and an equatorial one at 10 Å [119, 120]. Such a diffraction image is
characteristic of β-sheet structures, so it is generally accepted that the amyloid structure
is an extended β-sheet in which the β-strands run perpendicular to the long
axis of the fibril, and the hydrogen bonds between the main chains of the β-strands run
parallel to that axis. The presence of a β-structure in amyloid is confirmed by
the thioflavin T binding test. This binding is characteristic of proteins that are rich
in β-structures. Amyloid fibrils can also be stained with Congo red, which results
in apple-green birefringence under polarized light [121, 122]. Fibrillar structures that
form amyloid have been investigated by transmission electron microscopy (EM) and
atomic force microscopy (AFM) [123]. It has been shown that the amyloid fibril is
an extended structure most frequently consisting of a few protofilaments of 2–5 nm
in diameter, which are twisted around each other forming fibrils of 7–13 nm in diam-
eter and 1000–1600 nm long [124]. Protofibrils are transitional structures observed
in vitro during formation of mature amyloid fibrils.
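
The two characteristic spacings quoted above can be related to scattering angles through Bragg's law, nλ = 2d sin θ. The sketch below converts a spacing d into the diffraction angle 2θ. The X-ray wavelength (Cu Kα, 1.5418 Å) is an assumption made for the example, since the text does not state which radiation was used in [119, 120].

```python
import math

CU_K_ALPHA_ANGSTROM = 1.5418  # assumed X-ray wavelength, not stated in the text


def bragg_two_theta_deg(d_spacing, wavelength=CU_K_ALPHA_ANGSTROM, order=1):
    """Diffraction angle 2-theta (degrees) for a lattice spacing given in Å."""
    s = order * wavelength / (2.0 * d_spacing)
    if not 0.0 < s <= 1.0:
        raise ValueError("no reflection possible for this spacing and wavelength")
    return 2.0 * math.degrees(math.asin(s))


meridional = bragg_two_theta_deg(4.75)  # spacing between strands along the fibril axis
equatorial = bragg_two_theta_deg(10.0)  # spacing between sheets across the fibril
```

The smaller spacing diffracts to the larger angle, which is why the 4.75 Å reflection lies further from the beam centre than the 10 Å reflection.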
In case of amyloidogenic proteins three models of the oligomerization mechanism
have been proposed by Nelson and Eisenberg [125]: refolding, natively disordered,
and gain of interaction (see Fig. 7).
In the refolding model, the protein unfolds and then folds into a defective structure
which is stabilized mostly by hydrogen bonds (Fig. 7a). The hydrogen bonds influence
the structure and stability of fibrils. This model was proposed for the SH3 domain,
insulin, and prion protein [126, 127]. The natively disordered model (Fig. 7b) was
proposed for amyloid β peptide and huntingtin [128, 129]. In the process of fibril
formation, part or all of the previously unstructured polypeptides are organized in
β-sheets that form the core of amyloid fibrils. The gain of interaction model (Fig. 7c) is
based on conformational changes that expose previously inaccessible
fragments of the structure. This enables interaction between those fragments,
thus leading to fibril formation. The model includes four sub-models: direct
stacking, cross-β spine, three-dimensional domain swapping, and three-dimensional
domain swapping with a cross-β spine. In the stacking model, the newly formed
fragments of identical molecules stack on each other, forming fibrils (Fig. 8a). This
model was proposed for transthyretin [130]. In the cross-β spine model (Fig. 8b), β-sheet
structures align antiparallel to those of other, identical molecules. In this way, a β-spine is
created. The rest of the structural fragments protrude from the spine. An example of
protein which forms fibrils according to this mechanism is β2-microglobulin [131].

Fig. 7 Formation of fibrils according to different models: refolding (a), natively disordered (b),
gain of interaction (c); (according to [125])

In the three-dimensional (3D) domain swapping model, fragments of one molecule are
swapped with identical fragments of another molecule, which leads to formation
of a “chain” structure (Fig. 8c). Oligomerization according to this model is characteristic
of cystatin C [11]. Three-dimensional domain swapping is observed in
around 60 proteins that are mostly amyloidogenic. The proteins do not show any
structural similarity, and the swapped domains can be located both at the C- and
N-terminus [125]. Predicting whether a protein has a tendency to dimerize and oligomerize
via three-dimensional domain swapping is difficult. Studies involving theoretical
and statistical methods show that this tendency is often associated with
the presence of fragments with a strained conformation and/or the presence
of amino acid residues such as proline in the loops [118]. The three-dimensional
domain swapping mechanism can result in aggregation of proteins, but it can also
have physiological functions. For example, domain swapping could regulate
protein function in the organism [132]. Three-dimensional domain swapping with a
cross-β spine is characteristic of ribonuclease A (Fig. 8d). The model is similar to
three-dimensional domain swapping, but in addition, a β-spine is formed as a result
of β-sheet interactions [125].

Fig. 8 Sub-models of fibril formation of the “gain of interaction” model: stacking (a), cross-β spine
(b), three-dimensional domain swapping with a cross-β spine (c), and three-dimensional domain
swapping (d) (according to [125])

The attempts to construct possible amyloid protofilaments using the conformations
generated from MD simulations illustrated the connection between the misfolded
structure and amyloid formation. Similar approaches to build protofilaments
have also been tested by starting directly from the native state. However, until now no
algorithms have been aimed specifically at the problem of amyloid structure prediction. Currently
available docking algorithms are not useful, because they are either highly inefficient
in terms of computational time, or do not take into account relevant biological and
chemical features. Other procedures and models should be used for amyloidogenic
peptides and amyloidogenic proteins. Due to computational limitations, it is not possible
to study the final shape and size of a fibril. However, it is possible to model
smaller structural units of a fibril, such as aggregates or protofibrils. MD calculations
could also suggest a mechanism of elongation of the fibril protofilament. A number
of MD simulation studies have been reported on the stability and dynamics of
pre-formed aggregates consisting of peptides in extended (strand) or bent conformations.
The lengths of the peptides varied from 4 to 40 residues, and the sequences
were either designed de novo (STVIIE) [133], derived from peptides such as β-amyloid
[16, 18, 134], or taken from proteins such as IAPP [135], calcitonin [136],
insulin [137], Sup35 [138] and β2-microglobulin [139]. The number of sheets used
for simulations ranged from 1 to 4, with the most prevalent being 1 or 2. Parallel,
anti-parallel, and mixed arrangements have been considered, with parallel being the
most common. The number of strands per sheet ranged from 1 to 5. The termini are
usually capped in the studies with Aβ, whereas both capped and charged termini have
been considered in other studies. The temperature used was either 300 or 330 K in
most of the MD simulations. The organization of the oligomer and other structural
features considered in those simulations are based on experimental data. There are
many computational studies that provide insight into the characteristics of the short
segments of amyloid-like aggregates [140]. For example, the contributions of differ-
ent structural elements of trimeric and pentameric, full-length Aβ (1–42) peptides to
the aggregation in solution were analyzed [141]. Kent et al. reported that a solvent-
exposed hydrophobic patch is important for the aggregation of Aβ(10–35) [142].
Nussinov and coworkers studied Aβ40 elongation, association, and the aggregation
pathway of β2-microglobulin amyloid [143]. Wang et al. studied the disaggrega-
tion behaviour of GNNQQNY oligomers during the microsecond-scale simulations
[144]. Gnanakaran et al. investigated the aggregation of a simple amyloid β peptide
dimer with the replica exchange MD (REMD) technique [145]. The MD results indicate that studies of short
peptide aggregation could reveal some common, fundamental mechanisms of fibril
formation.
Many computational studies have characterized
short segments of protofibrils or aggregates built from short peptides [15,
18, 140, 143, 145–148], whereas for protein structures mainly docking procedures were
used to model the protofilament of the fibril, e.g., for the prion protofibril [24] (Fig. 9).
MD studies of protofilaments were carried out, for example, for transthyretin [149]
(Fig. 10) and ribonuclease A [150].
To build amyloid protofilaments of transthyretin from partially disrupted TTR
monomeric structures, a docking-and-alignment protocol was used [149]. The constructed
model of the TTR protofibril was in good agreement with known experimental
data and general amyloid properties. The final structure was formed by two extended
continuous β-sheets with the β-strands nearly perpendicular to the main axis of the
protofilament. The protofilament, with a diameter of 50 Å, was twisted along its
helical axis with a period of 48 β-strands, that is, 16 monomeric units with two
three-stranded β-sheets each (BEF and AGH) (Fig. 10). After a 100 ps-long MD simulation
the global fold of the protofilament was not changed. However, not all the features of the
model are in agreement with the experimental data; for instance, there are differences
in the helical period. The model of the TTR protofibril can therefore be further refined
using new experimentally derived constraints.
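
The relation between the number of monomeric units and the helical period of the model can be checked with simple arithmetic: 16 monomers, each contributing three strands to a given sheet, give 48 strands per period. The sketch below also converts this into an approximate pitch and a twist per strand. The inter-strand rise of 4.75 Å is an assumption borrowed from the meridional reflection of amyloid fibrils, so the resulting pitch is only an illustrative estimate and not a value reported in [149].

```python
STRAND_SPACING_ANGSTROM = 4.75  # assumed rise per β-strand along the fibril axis


def helical_period(monomers_per_period, strands_per_sheet_per_monomer,
                   rise=STRAND_SPACING_ANGSTROM):
    """Strand count, approximate pitch (Å) and twist per strand (degrees) of one period."""
    strands = monomers_per_period * strands_per_sheet_per_monomer
    return {
        "strands": strands,
        "pitch_angstrom": strands * rise,
        "twist_per_strand_deg": 360.0 / strands,
    }


# 16 monomeric units, each contributing three strands to one continuous sheet.
ttr_model = helical_period(16, 3)
```

Under these assumptions one period spans 48 strands, with a twist of 7.5° between neighbouring strands.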
In our laboratory we studied oligomers of HCC using the MD method
and built a model of the HCC protofibril. The results are described below.

4.1 Molecular Structure and Dynamics of Human Cystatin C Oligomers

Based on the data published so far [45, 115] we developed four models of HCC
oligomers with domain-swapped HCC dimer serving as a building block. In the first
proposed model of HCC oligomer, the dimers with swapped domains were arranged
one after another interacting with “front-back” surfaces, i.e. alternately with β-sheet
and α-helix surfaces (Fig. 11). The dimers were aligned evenly one after another, thus
forming an oligomer which, by analogy to nucleic acids, can be called an oligomer
with “blunt ends”. The propagation of such an oligomer occurs through addition of
consecutive domain-swapped dimers to the already associated ones.

Fig. 9 Dimensions of PrP protofibril and higher-order oligomers. a A diglycosylated PrPSc-like
trimer with circumferences (dashed circles) of the β-extended core (magenta), all protein atoms
(gray), and the diglycosylated protofibril (cyan). b Same view as in a of a 48-mer protofibril with
the protein surface shown gray and the sugars shown in cyan. c Side view of a 48-mer protofibril.
Bars at the top indicate diameters of the 35-Å extended β-core (magenta), 65-Å protein diameter
(gray), and a 110-Å diglycosylated protofibril (cyan) [24]

The second considered model was proposed by Janowski et al. [45]. The HCC
dimers are stacked one on another and form the oligomer through the interactions
of top and bottom surfaces of consecutive dimers (Fig. 12). As in the previous
model, the oligomer formed this way can be called an oligomer with “blunt ends”.
Propagation of this oligomer also occurs through addition of consecutive dimers with
swapped domains to the oligomer.

Fig. 10 a Schematic representation of the TTR protofilament model, showing the size of half of
the repeating unit. b Protofilament cross-section dimension including only the core β-strands [149]

Fig. 11 Model I of HCC oligomer structure. The picture shows the numbering of the dimers. Figure
based on [45]

In the third model, proposed for the cystatin family in general by Staniforth et al. [115],
the oligomer consists of dimers swapping their domains in an unsymmetrical way,
with an unpaired monomer at the end of the structure (Fig. 13). In contrast to the
mechanism of propagation in the previous models, the propagation of this oligomer
occurs not through addition of domain-swapped dimers, but through addition of
“open” monomers, which allows domain swapping. By analogy to nucleic acids,
such an oligomer can be called an oligomer with “sticky ends”, because of the unpaired
monomer at its end. In model III, the oligomer was built with the use of an HCC dimer
subunit in which the conformation of the β-L structure was changed in order to allow
domain swapping between the subunits, which are positioned at an angle, and not
in parallel as in a dimer.

Fig. 12 Model II of HCC oligomer structure. The picture contains numeration of dimers. According
to [45]

Model IV (Fig. 14) has a topology similar to that of model II, but the domain-swapped
HCC dimers that stack one upon another are rotated around the long axis of the
oligomer by an angle of 55°.

Fig. 13 Model III of HCC oligomer structure. The picture shows the numbering of the dimers. According
to [45]

Fig. 14 Model IV of HCC oligomer structure. The picture shows the numbering of the dimers. Based
on Fig. 2 in [45]

Fig. 15 Hydrogen bonds (shown as blue lines) in a fragment of model III

The analysis of the stability of the models after nanosecond-scale MD simulations suggests
that the most stable structures were models II and III. The first tested type of dimer
organization, model I, was clearly unstable. All three dimers involved in the oligomer
changed their positions relative to each other, and the dimer structure itself was also
unstable. Model IV was also unstable, as one of the dimers
involved in it changed its position relative to the rest of the oligomer. Thus it seems
that the structures of the oligomers in models I and IV did not maintain the “fibril-like”
topology, i.e. the elongated shape, during the simulation. Moreover, these two
models show a higher (less favorable) energy of interaction between the subunits within the oligomer,
determined with the MM-GBSA (Molecular Mechanics Generalized Born Surface Area)
method, than models II and III. On the other hand, the topologies of the oligomers
of models II and III were stable during the simulation, partly owing to
hydrogen bonds between subunits. Model II, built with the dimers stacked on
one another, showed high stability. The dimers formed stable hydrogen bonds and
salt bridges between each other. The dimer building blocks in this oligomer did not
shift significantly relative to each other and showed only minor changes in their
inner structure. The top and bottom surfaces of HCC are populated with many polar
or charged amino acids capable of forming salt bridges and hydrogen bonds, which
favours this arrangement. The arrangement of subunits in model III, which used
unfolded monomers arranged so that domain swapping was possible,
was stable. The subunits approached each other during the simulation and formed a
network of stable hydrogen bonds (Fig. 15). In model III it was also possible for the dimers
stacked one on another to form a continuous β structure, as suggested by Wahlbom
et al. [11]. However, during the simulation, only side-chain hydrogen bonds were
created. The results are consistent with the values of Gibbs energy of the interactions
between oligomer subunits. The most favorable energy was observed between
the subunits of model III. The second most favorable energy was observed in
model II. The highest energy of interaction between subunits was found in the least
stable model I.
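
In MM-GBSA-type approaches, the interaction free energy between two parts of a complex is commonly estimated as the difference between the energy of the complex and the energies of the separated parts, where each energy is the sum of a molecular mechanics term, a Generalized Born polar solvation term and a surface-area nonpolar term, averaged over trajectory snapshots. The sketch below shows only this bookkeeping. The energy components are hypothetical numbers, the entropic contribution is omitted, and the function names are assumptions for the example rather than the protocol used for the HCC oligomers.

```python
from statistics import mean


def gbsa_energy(e_mm, g_gb, g_sa):
    """Sum of gas-phase MM energy, GB polar solvation and surface-area terms."""
    return e_mm + g_gb + g_sa


def interaction_energy(frames):
    """Average over snapshots of G(complex) - G(part A) - G(part B)."""
    per_frame = [
        gbsa_energy(*f["complex"])
        - gbsa_energy(*f["part_a"])
        - gbsa_energy(*f["part_b"])
        for f in frames
    ]
    return mean(per_frame)


# Hypothetical energy components in kcal/mol for two snapshots (not chapter data).
frames = [
    {"complex": (-5200.0, -1800.0, 60.0),
     "part_a": (-2550.0, -930.0, 35.0),
     "part_b": (-2560.0, -920.0, 34.0)},
    {"complex": (-5210.0, -1795.0, 61.0),
     "part_a": (-2552.0, -928.0, 35.0),
     "part_b": (-2561.0, -921.0, 34.0)},
]

delta_g = interaction_energy(frames)
```

A more negative value corresponds to a more favorable interaction between subunits, which is the sense in which the models are ranked in the text.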
It is believed that domain swapping is associated with the formation of amyloid
deposits of HCC. The dimers with swapped domains, or the monomers which swap
domains in an oligomer, are the building blocks of oligomers and amyloid protofibrils
of HCC. We have tested the stability of four possible arrangements of subunits in
an oligomer/protofibril of HCC. Model I, in which the subunits interacted with each
other through front and back surfaces, turned out to be unstable. Moreover, the free
energies of interaction between dimers calculated with the MM-GBSA method suggest
that the energy of interaction in this arrangement of subunits is the least favorable
of all the models. Model IV, in which subunits moved relative to each other, was
also unstable. Its energy of interaction according to MM-GBSA is lower than
in model I, but higher than in the other two models. Model II, in which dimers
with swapped domains are stacked evenly one on another and interact through “top-bottom”
surfaces, showed high stability, and the energy of interaction between the
subunit and the oligomer was lower than in models I and IV. The subunits in model
III, called the “sticky ends” model, showed the lowest energy of interaction with
the neighboring monomers compared to the other models. The topology of this model
was stable. Moreover, in this model it is possible to form a continuous β
structure (Fig. 13). The bottleneck of amyloid formation
involving domain swapping is the domain swapping process itself, since it requires
reorganization of the monomer structure and large conformational changes. HCC dimers
are also detected in the blood of patients who have HCCAA (Hereditary Cystatin C Amyloid Angiopathy),
i.e. patients in whom the L68Q variant of HCC is deposited in the brain blood vessels
[101]. If fibrils were formed according to model III, the
monomers or dimers would have to be unfolded in the first place so that the fibril could
be propagated. However, one cannot exclude that unfolded monomers are also present in the blood of
patients who have HCCAA. Their levels may be undetectable due
is much higher in model III. Considering all the theoretical and experimental data we
conclude that the most likely structure of the HCC oligomer is model III. The dimers
with swapped domains could be a side-product, a “dead end” of fibril elongation
process. Within the dimers there may occur interactions described in model II, though
with a possibility to form higher-level oligomers.
These conclusions are consistent with the latest experimental data obtained by
the group of Anders Grubb from Lund University [11], who reported that
oligomer formation, and consequently amyloid fibril formation, by HCC most likely occurs
through domain swapping in which “sticky ends” are left unbound. The experiments
showed that a dimer with swapped domains, stabilized in this conformation by two
disulfide bonds between the domains, forms neither oligomers nor amyloid fibrils.
Upon reduction of the disulfide bonds, propagation of the oligomer took place. This
indicates that HCC is not able to form oligomers by simply stacking the dimers with
swapped domains on each other.
Based on model III, an oligomer consisting of 24 HCC molecules was built and
subjected to molecular dynamics analysis (Fig. 16). After 100 ps of MD simulations
the global fold of the protofilament was not affected and its structure probably cor-
responds to the structure of protofibrils formed in HCCAA. However, in order to
verify the model, precise experimental imaging studies would be necessary.

Fig. 16 Schematic representation of the HCC protofilament model obtained for 24 HCC units with
swapped domains (built according to model III)

5 Conclusion

Elucidating the molecular background of amyloidosis remains a great challenge
for science. To understand the mechanism of these diseases, computational
simulations can be used as a research tool to complement experimental studies. MD
simulations help to understand the initiation as well as further steps of the protein aggregation
and fibrillization process. The in silico techniques can provide insight into
the aggregation mechanism and reliably reproduce many experimental observations.
Despite many efforts, we are still far from understanding conformational diseases
and from being able to cure them. We can hope that the combination of computational
and experimental approaches in this area of research will be helpful in reaching
this goal.

References

1. Virchow, R.: Ueber eine im Gehirn und Rückenmark des Menschen aufgefundene Substanz
mit der chemischen Reaction der Cellulose. Acad. Sci. (Paris) 37, 860–861 (1854)
2. Gertz, M.A., Lacy, M.Q., Dispenzieri, A., Hayman, S.R.: Amyloidosis. Best. Pract. Res. Clin.
Haematol. 18, 709–727 (2005)
3. Hawkins, P.N.: Diagnosis and treatment of amyloidosis. Ann. Rheum. Dis. 56, 631–633 (1997)
4. Stryer, L., Berg, J.M.: Biochemistry 5e+ Hemoglobin Chapter for Biochem 6e. W H Freeman
& Company, New York (2005)
5. Harper, J.D., Wong, S.S., Lieber, C.M., Lansbury, P.T.: Observation of metastable Abeta
amyloid protofibrils by atomic force microscopy. Chem. Biol. 4, 119–125 (1997)
6. Reixach, N., Deechongkit, S., Jiang, X., Kelly, J.W., Buxbaum, J.N.: Tissue damage in the
amyloidoses: transthyretin monomers and nonnative oligomers are the major cytotoxic species
in tissue culture. Proc. Natl. Acad. Sci. U.S.A. 101, 2817–2822 (2004)
7. Krebs, M.R.H., Macphee, C.E., Miller, A.F., Dunlop, I.E., Dobson, C.M., Donald, A.M.: The
formation of spherulites by amyloid fibrils of bovine insulin. Proc. Natl. Acad. Sci. U.S.A.
101, 14420–14424 (2004)
8. Gosal, W.S., Morten, I.J., Hewitt, E.W., Smith, D.A., Thomson, N.H., Radford, S.E.: Compet-
ing pathways determine fibril morphology in the self-assembly of beta2-microglobulin into
amyloid. J. Mol. Biol. 351, 850–864 (2005)
9. Ionescu-Zanetti, C., Khurana, R., Gillespie, J.R., Petrick, J.S., Trabachino, L.C., Minert, L.J.,
Carter, S.A., Fink, A.L.: Monitoring the assembly of Ig light-chain amyloid fibrils by atomic
force microscopy. Proc. Natl. Acad. Sci. U S A 96, 13175–13179 (1999)
10. Malisauskas, M., Zamotin, V., Jass, J., Noppe, W., Dobson, C.M., Morozova-Roche, L.A.:
Amyloid protofilaments from the calcium-binding protein equine lysozyme: formation of ring
and linear structures depends on pH and metal ion concentration. J. Mol. Biol. 330, 879–890
(2003)
11. Wahlbom, M., Wang, X., Lindström, V., Carlemalm, E., Jaskolski, M., Grubb, A.: Fibrillogenic
oligomers of human cystatin C are formed by propagated domain swapping. J. Biol. Chem.
282, 18318–18326 (2007)
12. Kayed, R., Head, E., Thompson, J.L., McIntire, T.M., Milton, S.C., Cotman, C.W., Glabe,
C.G.: Common structure of soluble amyloid oligomers implies common mechanism of patho-
genesis. Science 300, 486–489 (2003)
13. Rousseau, F., Wilkinson, H., Villanueva, J., Serrano, L., Schymkowitz, J.W.H., Itzhaki, L.S.:
Domain swapping in p13suc1 results in formation of native-like, cytotoxic aggregates. J. Mol.
Biol. 363, 496–505 (2006)
14. Xu, S.: Aggregation drives “misfolding” in protein amyloid fiber formation. Amyloid 14,
119–131 (2007)
15. Nguyen, H.D., Hall, C.K.: Spontaneous fibril formation by polyalanines; discontinuous molec-
ular dynamics simulations. J. Am. Chem. Soc. 128, 1890–1901 (2006)
16. Buchete, N.-V., Tycko, R., Hummer, G.: Molecular dynamics simulations of Alzheimer’s
β-amyloid protofilaments. J. Mol. Biol. 353, 804–821 (2005)
17. Haspel, N., Zanuy, D., Ma, B., Wolfson, H., Nussinov, R.: A comparative study of amyloid
fibril formation by residues 15–19 of the human calcitonin hormone: a single beta-sheet model
with a small hydrophobic core. J. Mol. Biol. 345, 1213–1227 (2005)
18. Röhrig, U.F., Laio, A., Tantalo, N., Parrinello, M., Petronzio, R.: Stability and structure of
oligomers of the Alzheimer peptide Abeta16-22: from the dimer to the 32-mer. Biophys. J.
91, 3217–3229 (2006)
19. Deng, N.-J., Yan, L., Singh, D., Cieplak, P.: Molecular basis for the Cu2+ binding-induced
destabilization of β2-microglobulin revealed by molecular dynamics simulation. Biophys. J.
90, 3865–3879 (2006)
20. Yang, M., Lei, M., Huo, S.: Why is Leu55 → Pro55 transthyretin variant the most amyloido-
genic: Insights from molecular dynamics simulations of transthyretin monomers. Protein Sci.
12, 1222–1231 (2003)
21. Park, S., Saven, J.G.: Simulation of pH-dependent edge strand rearrangement in human beta-2
microglobulin. Protein Sci. 15, 200–207 (2005)
22. Armen, R.S., Daggett, V.: Characterization of two distinct beta2-microglobulin unfolding
intermediates that may lead to amyloid fibrils of different morphology. Biochemistry 44,
16098–16107 (2005)
23. Santini, S., Derreumaux, P.: Helix H1 of the prion protein is rather stable against environmental
perturbations: molecular dynamics of mutation and deletion variants of PrP(90-231). Cell.
Mol. Life Sci. 61, 951–960 (2004)
24. DeMarco, M.L., Daggett, V.: From conversion to aggregation: protofibril formation of the
prion protein. Proc. Natl. Acad. Sci. U S A 101, 2293–2298 (2004)
25. Rodziewicz-Motowidło, S., Wahlbom, M., Wang, X., Lagiewka, J., Janowski, R., Jaskolski,
M., Grubb, A., Grzonka, Z.: Checking the conformational stability of cystatin C and its L68Q
variant by molecular dynamics studies: why is the L68Q variant amyloidogenic? J. Struct.
Biol. 154, 68–78 (2006)
26. DuBay, K.F.K., Pawar, A.P.A., Chiti, F.F., Zurdo, J.J., Dobson, C.M.C., Vendruscolo, M.M.:
Prediction of the absolute aggregation rates of amyloidogenic polypeptide chains. J. Mol.
Biol. 341, 10–10 (2004)
27. Fernandez-Escamilla, A.-M., Rousseau, F., Schymkowitz, J., Serrano, L.: Prediction of
sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat.
Biotechnol. 22, 1302–1306 (2004)
28. Tartaglia, G.G., Cavalli, A., Pellarin, R., Caflisch, A.: Prediction of aggregation rate and
aggregation-prone segments in polypeptide sequences. Protein Sci. 14, 2723–2734 (2005)
29. Ma, B., Nussinov, R.: Simulations as analytical tools to understand protein aggregation and
predict amyloid conformation. Curr. Opin. Chem. Biol. 10, 445–452 (2006)
30. Borreguero, J.M., Urbanc, B., Lazo, N.D., Buldyrev, S.V., Teplow, D.B., Stanley, H.E.: Folding
events in the 21-30 region of amyloid beta-protein (Abeta) studied in silico. Proc. Natl. Acad.
Sci. U S A 102, 6015–6020 (2005)
31. Wei, G., Mousseau, N., Derreumaux, P.: Sampling the self-assembly pathways of KFFE
hexamers. Biophys. J. 87, 9–9 (2004)
32. Baumketner, A., Shea, J.-E.: Free energy landscapes for amyloidogenic tetrapeptides dimer-
ization. Biophys. J. 89, 1493–1503 (2005)
33. Han, W., Wu, Y.-D.: A strand-loop-strand structure is a possible intermediate in fibril elon-
gation: long time simulations of amyloid-beta peptide (10-35). J. Am. Chem. Soc. 127,
15408–15416 (2005)
34. Ma, B., Nussinov, R.: Molecular dynamics simulations of the unfolding of β2-microglobulin
and its variants. Protein Eng. Des. Sel. 16, 561–575 (2003)
35. Moraitakis, G., Goodfellow, J.M.: Simulations of human lysozyme: probing the conformations
triggering amyloidosis. Biophys. J. 84, 10–10 (2003)
36. Tsai, H.-H.G., Reches, M., Tsai, C.-J., Gunasekaran, K., Gazit, E., Nussinov, R.: Energy land-
scape of amyloidogenic peptide oligomerization by parallel-tempering molecular dynamics
simulation: significant role of Asn ladder. Proc. Natl. Acad. Sci. U S A 102, 8174–8179 (2005)
37. Wu, K.-P., Weinstock, D.S., Narayanan, C., Levy, R.M., Baum, J.: Structural reorganization
of alpha-synuclein at low pH observed by NMR and REMD simulations. J. Mol. Biol. 391,
784–796 (2009)
38. Li, M.S., Klimov, D.K., Straub, J.E., Thirumalai, D.: Probing the mechanisms of fibril for-
mation using lattice models. J. Chem. Phys. 129, 175101 (2008)
39. Zhang, J., Muthukumar, M.: Simulations of nucleation and elongation of amyloid fibrils. J.
Chem. Phys. 130, 035102 (2009)
40. Rojas, A., Liwo, A., Browne, D., Scheraga, H.A.: Mechanism of fiber assembly: treatment
of Aβ peptide aggregation with a coarse-grained united-residue force field. J. Mol. Biol. 404,
537–552 (2010)
41. Fawzi, N.L., Chubukov, V., Clark, L.A., Brown, S., Head-Gordon, T.: Influence of denatured
and intermediate states of folding on protein aggregation. Protein Sci. 14, 993–1003 (2005)
42. Auer, S., Dobson, C.M., Vendruscolo, M.: Characterization of the nucleation barriers for
protein aggregation and amyloid formation. HFSP J. 1, 137–146 (2007)
43. Smith, A.V., Hall, C.K.: Protein refolding versus aggregation: computer simulations on an
intermediate-resolution protein model. J. Mol. Biol. 312, 16–16 (2001)
44. Thirumalai, D., Klimov, D.K., Dima, R.I.: Emerging ideas on the molecular basis of protein
and peptide aggregation. Curr. Opin. Struct. Biol. 13, 14–14 (2003)
45. Janowski, R., Kozak, M., Jankowska, E., Grzonka, Z., Grubb, A., Abrahamson, M., Jaskól-
ski, M.: Human cystatin C, an amyloidogenic protein, dimerizes through three-dimensional
domain swapping. Nat. Struct. Mol. Biol. 8, 316–320 (2001)
46. Bennett, M.J., Sawaya, M.R., Eisenberg, D.: Deposition diseases and 3D domain swapping.
Structure 14, 811–824 (2006)
47. Armen, R.S., DeMarco, M.L., Alonso, D.O.V., Daggett, V.: Pauling and Corey’s alpha-pleated
sheet structure may define the prefibrillar amyloidogenic intermediate in amyloid disease.
Proc. Natl. Acad. Sci. U S A 101, 11622–11627 (2004)
48. Ma, B., Nussinov, R.: The Stability of monomeric intermediates controls amyloid formation:
Aβ25-35 and its N27Q mutant. Biophys. J. 90, 10–10 (2006)
49. Gu, W., Wang, T., Zhu, Y., Shi, J., Liu, H.: Molecular dynamics simulation of the unfolding
of the human prion protein domain under low pH and high temperature conditions. Biophys.
Chem. 104, 16–16 (2003)
50. Alonso, D.O., Alm, E., Daggett, V.: Characterization of the unfolding pathway of the cell-
cycle protein p13suc1 by molecular dynamics simulations: implications for domain swapping.
Structure 8, 101–110 (2000)
51. Gsponer, J., Ferrara, P., Caflisch, A.: Flexibility of the murine prion protein and its Asp178Asn
mutant investigated by molecular dynamics simulations. J. Mol. Graph. Model. 20, 169–182
(2001)
52. Alonso, D.O., DeArmond, S.J., Cohen, F.E., Daggett, V.: Mapping the early steps in the pH-
induced conformational conversion of the prion protein. Proc. Natl. Acad. Sci. U S A 98,
2985–2989 (2001)
53. Yang, M., Lei, M., Bruschweiler, R., Huo, S.: Initial conformational changes of human
transthyretin under partially denaturing conditions. Biophys. J. 89, 11–11 (2005)
54. Skoulakis, S., Goodfellow, J.M.: The pH-dependent stability of wild-type and mutant
transthyretin oligomers. Biophys. J. 84, 10–10 (2003)
55. Mu, Y., Nordenskiöld, L., Tam, J.P.: Folding, misfolding, and amyloid protofibril formation
of WW domain FBP28. Biophys. J. 90, 10–10 (2006)
56. Nowak, M.: Immunoglobulin kappa light chain and its amyloidogenic mutants: a molecular
dynamics study. Proteins 55, 11–21 (2004)
57. Prusiner, S.B.: Biology and genetics of prion diseases. Annu. Rev. Microbiol. 48, 655–686
(1994)
58. Prusiner, S.B.: Neurodegenerative diseases and prions. N. Engl. J. Med. 344, 1516–1526
(2001)
59. Stahl, N., Prusiner, S.B.: Prions and prion proteins (1991)
60. Riesner, D.: Biochemistry and structure of PrP(C) and PrP(Sc). Br. Med. Bull. 66, 21–33
(2003)
61. Zahn, R.: NMR solution structure of the human prion protein. Proc. Natl. Acad. Sci. 97,
145–150 (2000)
62. Cox, D.L., Lashuel, H., Lee, K.Y.C., Singh, R.R.P.: The materials science of protein aggre-
gation. MRS Bull. 30, 452–457 (2005)
63. Lansbury, P.T., Lashuel, H.A.: A century-old debate on protein aggregation and neurodegen-
eration enters the clinic. Nature 443, 774–779 (2006)
64. Dima, R.I., Thirumalai, D.: Exploring the propensities of helices in PrPC to form β sheet
using NMR structures and sequence alignments. Biophys. J. 83, 1268–1280 (2002)
65. Lu, X., Wintrode, P.L., Surewicz, W.K.: Beta-sheet core of human prion protein amyloid
fibrils as determined by hydrogen/deuterium exchange. Proc. Natl. Acad. Sci. U S A 104,
1510–1515 (2007)
66. Cohen, F.E., Pan, K.M., Huang, Z., Baldwin, M., Fletterick, R.J., Prusiner, S.B.: Structural
clues to prion replication. Science 264, 530–531 (1994)
67. Dima, R.I., Thirumalai, D.: Probing the instabilities in the dynamics of helical fragments from
mouse PrPC. Proc. Natl. Acad. Sci. U S A 101, 15335–15340 (2004)
68. Kunes, K.C., Clark, S.C., Cox, D.L., Singh, R.R.P.: Left handed beta helix models for mam-
malian prion fibrils. Prion 2, 81–90 (2008)
69. Cobb, N.J., Apetri, A.C., Surewicz, W.K.: Prion protein amyloid formation under native-like
conditions involves refolding of the C-terminal alpha-helical domain. J. Biol. Chem. 283,
34704–34711 (2008)
70. Prusiner, S.B., McKinley, M.P., Bowman, K.A., Bolton, D.C., Bendheim, P.E., Groth, D.F.,
Glenner, G.G.: Scrapie prions aggregate to form amyloid-like birefringent rods. Cell 35,
349–358 (1983)
71. El-Bastawissy, E., Knaggs, M.H., Gilbert, I.H.: Molecular dynamics simulations of wild-type
and point mutation human prion protein at normal and elevated temperature. J. Mol. Graph.
Model. 20, 145–154 (2001)
72. Parchment, O.G., Essex, J.W.: Molecular dynamics of mouse and Syrian hamster PrP: impli-
cations for activity. Proteins 38, 327–340 (2000)
73. Zuegg, J., Gready, J.E.: Molecular dynamics simulations of human prion protein: importance
of correct treatment of electrostatic interactions. Biochemistry 38, 13862–13876 (1999)
74. Hornemann, S., Glockshuber, R.: A scrapie-like unfolding intermediate of the prion protein
domain PrP(121-231) induced by acidic pH. Proc. Natl. Acad. Sci. U S A 95, 6010–6014
(1998)
75. Swietnicki, W., Morillas, M., Chen, S.G., Gambetti, P., Surewicz, W.K.: Aggregation and
fibrillization of the recombinant human prion protein huPrP90-231. Biochemistry 39, 424–431
(2000)
76. Swietnicki, W., Petersen, R., Gambetti, P., Surewicz, W.K.: pH-dependent stability and
conformation of the recombinant human prion protein PrP(90-231). J. Biol. Chem. 272,
27517–27520 (1997)
77. Zhang, H., Stockel, J., Mehlhorn, I., Groth, D., Baldwin, M.A., Prusiner, S.B., James, T.L.,
Cohen, F.E.: Physical studies of conformational plasticity in a recombinant prion protein.
Biochemistry 36, 3543–3553 (1997)
78. Jackson, G.S., Hosszu, L.L., Power, A., Hill, A.F., Kenney, J., Saibil, H., Craven, C.J., Waltho,
J.P., Clarke, A.R., Collinge, J.: Reversible conversion of monomeric human prion protein
between native and fibrilogenic conformations. Science 283, 1935–1937 (1999)
79. Guo, J., Ren, H., Ning, L., Liu, H., Yao, X.: Exploring structural and thermodynamic stabilities
of human prion protein pathogenic mutants D202N, E211Q and Q217R. J. Struct. Biol. 178,
225–232 (2012)
80. Collinge, J.: Prion diseases of humans and animals: their causes and molecular basis. Ann.
Rev. Neurosci. 519–550 (2001)
81. Mead, S.: Prion disease genetics. Eur. J. Hum. Genet. 14, 273–281 (2006)
82. van der Kamp, M.W., Daggett, V.: The consequences of pathogenic mutations to the human
prion protein. Protein Eng. Des. Sel. 22, 461–468 (2009)
83. Rossetti, G., Cong, X., Caliandro, R., Legname, G., Carloni, P.: Common structural traits
across pathogenic mutants of the human prion protein and their implications for familial
prion diseases. J. Mol. Biol. 411, 13–13 (2011)
84. Hamilton, J.A., Steinrauf, L.K., Braden, B.C., Liepnieks, J., Benson, M.D., Holmgren, G.,
Sandgren, O., Steen, L.: The x-ray crystal structure refinements of normal human transthyretin
and the amyloidogenic Val-30–> Met variant to 1.7-A resolution. J. Biol. Chem. 268,
2416–2424 (1993)
85. Sebastião, M.P., Saraiva, M.J., Damas, A.M.: The crystal structure of amyloidogenic Leu55–>
Pro transthyretin variant reveals a possible pathway for transthyretin polymerization into
amyloid fibrils. J. Biol. Chem. 273, 24715–24722 (1998)
86. Hammarström, P.: Trans-suppression of misfolding in an amyloid disease. Science 293,
2459–2462 (2001)
87. Hammarström, P., Jiang, X., Hurshman, A.R., Powers, E.T., Kelly, J.W.: Sequence-dependent
denaturation energetics: a major determinant in amyloid disease diversity. Proc. Natl. Acad.
Sci. U S A 99(Suppl 4), 16427–16432 (2002)
88. Schneider, F., Hammarström, P., Kelly, J.W.: Transthyretin slowly exchanges subunits under
physiological conditions: a convenient chromatographic method to study subunit exchange
in oligomeric proteins. Protein Sci. 10, 1606–1613 (2001)
89. Hurshman, A.R., White, J.T., Powers, E.T., Kelly, J.W.: Transthyretin aggregation under par-
tially denaturing conditions is a downhill polymerization. Biochemistry 43, 7365–7381 (2004)
90. Jiang, X., Smith, C.S., Petrassi, H.M., Hammarström, P., White, J.T., Sacchettini, J.C., Kelly,
J.W.: An engineered transthyretin monomer that is nonamyloidogenic, unless it is partially
denatured. Biochemistry 40, 11442–11452 (2001)
91. Armen, R.S., Alonso, D.O.V., Daggett, V.: Anatomy of an amyloidogenic intermediate -
conversion of β-sheet to α-sheet structure in transthyretin at acidic pH. Structure 12, 17–17
(2004)
92. Liu, K., Cho, H.S., Hoyt, D.W., Nguyen, T.N., Olds, P., Kelly, J.W., Wemmer, D.E.: Deuterium-
proton exchange on the native wild-type transthyretin tetramer identifies the stable core of the
individual subunits and indicates mobility at the subunit interface. J. Mol. Biol. 303, 555–565
(2000)
93. Saraiva, M.J.: Transthyretin mutations in hyperthyroxinemia and amyloid diseases. Hum.
Mutat. 17, 493–503 (2001)
94. Lashuel, H.A., Lai, Z., Kelly, J.W.: Characterization of the transthyretin acid denaturation path-
ways by analytical ultracentrifugation: implications for wild-type, V30M, and L55P amyloid
fibril formation. Biochemistry 37, 17851–17864 (1998)
95. Hörnberg, A., Eneqvist, T., Olofsson, A., Lundgren, E., Sauer-Eriksson, A.E.: A comparative
analysis of 23 structures of the amyloidogenic protein transthyretin. J. Mol. Biol. 302, 21–21
(2000)
96. Wojtczak, A., Neumann, P., Cody, V.: Structure of a new polymorphic monoclinic form of
human transthyretin at 3 Å resolution reveals a mixed complex between unliganded and
T4-bound tetramers of TTR. Acta Crystallogr. D: Biol. Crystallogr. 57, 957–967 (2001)
97. Hörnberg, A., Olofsson, A., Eneqvist, T., Lundgren, E., Sauer-Eriksson, A.E.: The beta-
strand D of transthyretin trapped in two discrete conformations. Biochim. Biophys. Acta
1700, 93–104 (2004)
98. Banerjee, A., Bairagya, H.R., Mukhopadhyay, B.P.B., Nandi, T.K., Bera, A.K.: Structural
insight to mutated Y116S transthyretin by molecular dynamics simulation. Indian J. Biochem.
Biophys. 47, 197–202 (2010)
99. Xu, X., Wang, X., Xiao, Z., Li, Y., Wang, Y.: Probing the structural and functional link
between mutation- and pH-dependent hydration dynamics and amyloidosis of transthyretin.
Soft Matter 8, 324–336 (2011)
100. Abrahamson, M., Barrett, A.J., Salvesen, G., Grubb, A.: Isolation of six cysteine proteinase
inhibitors from human urine. Their physicochemical and enzyme kinetic properties and con-
centrations in biological fluids. J. Biol. Chem. 261, 11282–11289 (1986)
101. Grubb, A.O.: Cystatin C-properties and use as diagnostic marker. In: Advances in Clinical
Chemistry. Elsevier, pp. 63–99 (2001)
102. Grzonka, Z., Jankowska, E., Kasprzykowski, F., et al.: Structural studies of cysteine proteases
and their inhibitors. Acta Biochim. Pol. 48, 1–20 (2001)
103. Ghiso, J., Jensson, O., Frangione, B.: Amyloid fibrils in hereditary cerebral hemorrhage with
amyloidosis of Icelandic type is a variant of gamma-trace basic protein (cystatin C). Proc.
Natl. Acad. Sci. U S A 83, 2974–2978 (1986)
104. Abrahamson, M.: Molecular basis for amyloidosis related to hereditary brain hemorrhage.
Scand. J. Clin. Lab. Invest. Suppl. 226, 47–56 (1996)
105. Olafsson, I., Grubb, A.O.: Hereditary cystatin C amyloid angiopathy. Amyloid 7, 70–79
(2000)
106. Gerhartz, B., Ekiel, I., Abrahamson, M.: Two stable unfolding intermediates of the disease-
causing L68Q variant of human cystatin C. Biochemistry 37, 17309–17317 (1998)
107. Abrahamson, M., Grubb, A.: Increased body temperature accelerates aggregation of the Leu-
68–> Gln mutant cystatin C, the amyloid-forming protein in hereditary cystatin C amyloid
angiopathy. Proc. Natl. Acad. Sci. U S A 91, 1416–1420 (1994)
108. Jankowska, E., Wiczk, W., Grzonka, Z.: Thermal and guanidine hydrochloride-induced denat-
uration of human cystatin C. Eur. Biophys. J. 33, 454–461 (2004)
109. Nilsson, M., Wang, X., Rodziewicz-Motowidlo, S., Janowski, R., Lindström, V., Onnerfjord,
P., Westermark, G., Grzonka, Z., Jaskolski, M.M., Grubb, A.A.: Prevention of domain swap-
ping inhibits dimerization and amyloid fibril formation of cystatin C: use of engineered disul-
fide bridges, antibodies, and carboxymethylpapain to stabilize the monomeric form of cystatin
C. J. Biol. Chem. 279, 24236–24245 (2004)
110. Liu, H.-L., Lin, Y.-M., Zhao, J.-H., Hsieh, M.-C., Lin, H.-Y., Huang, C.-H., Fang, H.-W., Ho,
Y., Chen, W.-Y.: Molecular dynamics simulations of human cystatin C and its L68Q varient
to investigate the domain swapping mechanism. J. Biomol. Struct. Dyn. 25, 135–144 (2007)
111. Lin, Y.-M., Liu, H.-L., Zhao, J.-H., Huang, C.-H., Fang, H.-W., Ho, Y., Chen, W.-Y.: Molecular
dynamics simulations to investigate the domain swapping mechanism of human cystatin C.
Biotechnol. Prog. 23, 577–584 (2008)
112. Yu, Y., Wang, Y., He, J., Liu, Y., Li, H., Zhang, H., Song, Y.: Structural and dynamic properties
of a new amyloidogenic chicken cystatin mutant I108T. J. Biomol. Struct. Dyn. 27, 641–649
(2010)
113. Ekiel, I., Abrahamson, M., Fulton, D.B., et al.: NMR structural studies of human cystatin C
dimers and monomers. J. Mol. Biol. 271, 12–12 (1997)
114. Sinha, N., Tsai, C.J., Nussinov, R.: A proposed structural model for amyloid fibril elongation:
domain swapping forms an interdigitating beta-structure polymer. Protein Eng. 14, 93–103
(2001)
115. Staniforth, R.A., Giannini, S., Higgins, L.D., Conroy, M.J., Hounslow, A.M., Jerala, R.,
Craven, C.J., Waltho, J.P.: Three-dimensional domain swapping in the folded and molten-
globule states of cystatins, an amyloid-forming structural superfamily. EMBO J. 20,
4774–4781 (2001)
116. Stubbs, M.T., Laber, B., Bode, W., Huber, R., Jerala, R., Lenarcic, B., Turk, V.: The refined
2.4 A X-ray crystal structure of recombinant human stefin B in complex with the cysteine
proteinase papain: a novel type of proteinase inhibitor interaction. EMBO J. 9, 1939–1947
(1990)
117. Engh, R.A., Dieckmann, T., Bode, W., Auerswald, E.A., Turk, V., Huber, R., Oschkinat, H.:
Conformational variability of chicken cystatin. Comparison of structures determined by X-ray
diffraction and NMR spectroscopy. J. Mol. Biol. 234, 1060–1069 (1993)
118. Rodziewicz-Motowidło, S., Iwaszkiewicz, J., Sosnowska, R., Czaplewska, P., Sobolewski,
E., Szymańska, A., Stachowiak, K., Liwo, A.: The role of the Val57 amino-acid residue in the
hinge loop of the human cystatin C. Conformational studies of the beta2-L1-beta3 segments
of wild-type human cystatin C and its mutants. Biopolymers 91, 373–383 (2009)
119. Sunde, M., Serpell, L.C., Bartlam, M., Fraser, P.E., Pepys, M.B., Blake, C.C.: Common core
structure of amyloid fibrils by synchrotron X-ray diffraction. J. Mol. Biol. 273, 11–11 (1997)
120. Blake, C., Serpell, L.: Synchrotron X-ray studies suggest that the core of the transthyretin
amyloid fibril is a continuous β-sheet helix. Structure 4, 10–10 (1996)
121. Cohen, A.S., Shirahama, T., Skinner, M.: Electron microscopy of amyloid. Electron
microscopy of proteins 3, 165–205 (1982)
122. Puchtler, H., Sweat, F.: Congo red as a stain for fluorescence microscopy of amyloid. J.
Histochem. Cytochem. 13, 693–694 (1965)
123. Chiti, F., Dobson, C.M.: Protein misfolding, functional amyloid, and human disease. Ann.
Rev. Biochem. 75, 333–366 (2006)
124. Serpell, L.C., Sunde, M., Benson, M.D., Tennent, G.A., Pepys, M.B., Fraser, P.E.: The protofil-
ament substructure of amyloid fibrils. J. Mol. Biol. 300, 1033–1039 (2000)
125. Nelson, R., Eisenberg, D.: Recent atomic models of amyloid fibril structure. Curr. Opin.
Struct. Biol. 16, 260–265 (2006)
126. Jiménez, J.L., Guijarro, J.I., Orlova, E., Zurdo, J., Dobson, C.M., Sunde, M., Saibil, H.R.:
Cryo-electron microscopy structure of an SH3 amyloid fibril and model of the molecular
packing. EMBO J. 18, 815–821 (1999)
127. Govaerts, C., Wille, H., Prusiner, S.B., Cohen, F.E.: Evidence for assembly of prions with
left-handed beta-helices into trimers. Proc. Natl. Acad. Sci. U S A 101, 8342–8347 (2004)
128. Sikorski, P., Atkins, E.: New model for crystalline polyglutamine assemblies and their con-
nection with amyloid fibrils. Biomacromol 6, 425–432 (2005)
129. Lührs, T., Ritter, C., Adrian, M., Riek-Loher, D., Bohrmann, B., Döbeli, H., Schubert, D.,
Riek, R.: 3D structure of Alzheimer’s amyloid-beta(1-42) fibrils. Proc. Natl. Acad. Sci. U S
A 102, 17342–17347 (2005)
130. Serag, A.A., Altenbach, C., Gingery, M., Hubbell, W.L., Yeates, T.O.: Arrangement of subunits
and ordering of beta-strands in an amyloid sheet. Nat. Struct. Biol. 9, 734–739 (2002)
131. Ivanova, M.I., Sawaya, M.R., Gingery, M., Attinger, A., Eisenberg, D.: An amyloid-forming
segment of beta2-microglobulin suggests a molecular model for the fibril. Proc. Natl. Acad.
Sci. U S A 101, 10584–10589 (2004)
132. Gronenborn, A.M.: Protein acrobatics in pairs—dimerization via domain swapping. Curr.
Opin. Struct. Biol. 19, 39–49 (2009)
133. la Paz de, M.L., de Mori, G.M.S., Serrano, L., Colombo, G.: Sequence dependence of amyloid
fibril formation: insights from molecular dynamics simulations. J. Mol. Biol. 349, 14–14
(2005)
134. Li, L., Darden, T.A., Bartolotti, L., Kominos, D., Pedersen, L.G.: An atomic model for the
pleated beta-sheet structure of Abeta amyloid protofilaments. Biophys. J. 76, 2871–2878
(1999)
135. Zanuy, D., Nussinov, R.: The sequence dependence of fiber organization. A comparative
molecular dynamics study of the islet amyloid polypeptide segments 22-27 and 22-29. J.
Mol. Biol. 329, 20–20 (2003)
136. Haspel, N., Gunasekaran, K., Ma, B., Tsai, C.-J.C., Nussinov, R.: The stability and dynamics
of the human calcitonin amyloid peptide DFNKF. Biophys. J. 87, 13–13 (2004)
137. Ye, W., Chen, Y., Wang, W., Yu, Q., Li, Y., Zhang, J., Chen, H.-F.: Insight into the stability
of cross-β amyloid fibril from VEALYL short peptide with molecular dynamics simulation.
PLoS ONE 7, e36382 (2012)
138. Periole, X., Rampioni, A., Vendruscolo, M., Mark, A.E.: Factors that affect the degree of twist
in beta-sheet structures: A molecular dynamics simulation study of a cross-beta filament of
the GNNQQNY peptide. J. Phys. Chem. B 113, 10548–10548 (2009)
139. Song, W., Wei, G., Mousseau, N., Derreumaux, P.: Self-assembly of the beta2-microglobulin
NHVTLSQ peptide using a coarse-grained protein model reveals a beta-barrel species. J.
Phys. Chem. B 112, 4410–4418 (2008)
140. Berryman, J.T., Radford, S.E., Harris, S.A.: Systematic examination of polymorphism in
amyloid fibrils by molecular-dynamics simulation. Biophys. J. 100, 9–9 (2011)
141. Connelly, L., Jang, H., Arce, F.T., Capone, R., Kotler, S.A., Ramachandran, S., Kagan, B.L.,
Nussinov, R., Lal, R.: Atomic force microscopy and MD simulations reveal pore-like struc-
tures of all-d-enantiomer of Alzheimer’s β-amyloid peptide: relevance to the ion channel
mechanism of AD pathology. J. Phys. Chem. B 116, 1728–1735 (2012)
142. Kent, A., Jha, A.K., Fitzgerald, J.E., Freed, K.F.: Benchmarking implicit solvent folding
simulations of the amyloid beta(10-35) fragment. J. Phys. Chem. B 112, 6175–6186 (2008)
143. Zheng, J., Jang, H., Nussinov, R.: Beta2-microglobulin amyloid fragment organization and
morphology and its comparison to Abeta suggests that amyloid aggregation pathways are
sequence specific. Biochemistry 47, 2497–2509 (2008)
144. Wang, J., Tan, C., Chen, H.-F., Luo, R.: All-atom computer simulations of amyloid fibrils
disaggregation. Biophys. J. 95, 5037–5047 (2008)
145. Gnanakaran, S., Nussinov, R., García, A.E.: Atomic-level description of amyloid beta-dimer
formation. J. Am. Chem. Soc. 128, 2158–2159 (2006)
146. Boucher, G., Mousseau, N., Derreumaux, P.: Aggregating the amyloid Abeta(11-25) peptide
into a four-stranded beta-sheet structure. Proteins 65, 877–888 (2006)
147. Lipfert, J., Franklin, J., Wu, F., Doniach, S.: Protein misfolding and amyloid formation for the
peptide GNNQQNY from yeast prion protein Sup35: simulation by reaction path annealing.
J. Mol. Biol. 349, 11–11 (2005)
148. Soto, P., Cladera, J., Mark, A.E., Daura, X.: Stability of SIV gp32 fusion-peptide single-layer
protofibrils as monitored by molecular-dynamics simulations. Angew. Chem. 117, 1089–1091
(2005)
149. Correia, B.E., Loureiro-Ferreira, N., Rodrigues, J.R., Brito, R.M.M.: A structural model of
an amyloid protofilament of transthyretin. Protein Sci. 15, 28–32 (2005)
150. Colombo, G., Meli, M., De Simone, A.: Computational studies of the structure, dynamics and
native content of amyloid-like fibrils of ribonuclease A. Proteins 70, 863–872 (2007)
151. Cendron, L., Trovato, A., Seno, F., Folli, C., Alfieri, B., Zanotti, G., Berni, R.: Amyloidogenic
potential of transthyretin variants: insights from structural and computational analyses. J. Biol.
Chem. 284, 25832–25841 (2009)

Raman and Infrared Spectra of Acoustical, Functional Modes of Proteins from All-Atom and Coarse-Grained Normal Mode Analysis

Adrien Nicolaï, Patrice Delarue and Patrick Senet

Abstract The directions of the largest thermal fluctuations of the structure of a
protein in its native state are the directions of its low-frequency modes (below 1 THz),
named acoustical modes by analogy with the acoustical phonons of a material. The
acoustical modes of a protein assist its conformational changes and are related to its
biological functions. Low-frequency modes are difficult to detect experimentally. A
survey of experimental data on the low-frequency modes of proteins is presented.
Theoretical approaches based on normal mode analysis are of primary interest for
understanding the role of the acoustical modes in proteins. In this chapter, the
fundamentals of normal mode analysis using all-atom models and coarse-grained elastic
models are reviewed. They are then applied to two proteins: conalbumin, a protein
studied in recent single-molecule experiments, and the 70 kDa Heat-Shock Protein
(Hsp70), a protein intimately related to human diseases. Conalbumin consists of
two homologous lobes (the N- and C-lobes) and was recently used as a benchmark protein
for Extraordinary Acoustic Raman (EAR) spectroscopy. The present all-atom calculations
demonstrate that the acoustical modes of conalbumin recently measured experimentally
are both infrared and Raman active. The molecular chaperone Hsp70 is an exemplary
model for illustrating the different properties of the low-frequency modes of a
multi-domain protein that occurs in two clearly distinct structural states (open and
closed), which might also be detectable in the sub-THz frequency range by
single-molecule spectroscopy. The role of the low-frequency modes in the transition
between the two states of Hsp70 is analyzed in detail. It is shown that the
low-frequency modes provide an easy means of communication between protein
domains separated by a large distance.

A. Nicolaï · P. Delarue · P. Senet (corresponding author)
Laboratoire Interdisciplinaire Carnot de Bourgogne, Unité Mixte de Recherche 6303 Centre
National de la Recherche Scientifique-Université de Bourgogne, Université de Bourgogne
Franche-Comté, 9 Avenue Alain Savary BP 47870, Dijon Cedex 21078, France
e-mail: psenet@u-bourgogne.fr

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_15

1 Introduction

1.1 General Information

Nanostructures, such as thin films, nanoclusters, and proteins, can support “confined
acoustical phonons”. Confined acoustical phonons are low-frequency standing waves
corresponding approximately to the acoustical phonons of the material at a wavelength
$\lambda = 2Nd$, where $d$ is the smallest dimension of the nanostructure and $N$ is an
integer [1, 2]. The lowest-frequency acoustical mode of a globular protein of radius $R$
has a wavelength $\lambda \approx 2R$, which is much larger than the distances between the
atoms because $R$ typically varies between 1 and 10 nm. Therefore, the properties of a
confined acoustical mode do not depend on the details of the interactions between the
atoms and can be described by regarding the protein as an effective elastic medium [3, 4]
or an effective coarse-grained elastic network [5–7]. In elastic continuum theory, the
lowest frequency of the confined longitudinal acoustical phonon of a sphere of radius $R$
is given by $\nu_L \approx V_L/2R$, where $V_L$ is the longitudinal velocity of sound in
the material [8]. If the protein is represented by an elastic sphere of radius $R = 10$ nm
and $V_L = 2000$ m s−1 [9–12], we deduce $\nu_L \approx 100$ GHz. In spectroscopic
notation, this corresponds to a wavenumber $\tilde{\nu}_L \equiv \nu_L/c \approx 3.3$ cm−1,
where $c$ is the velocity of light in vacuum. For a transverse confined acoustical phonon,
using $V_T = 700$ m s−1, measured in protein crystals [9], we estimate
$\nu_T \approx 36$ GHz and $\tilde{\nu}_T \approx 1.2$ cm−1. The evaluations of the
lowest frequency modes of globular proteins by using realistic potentials between the
atoms lead to similar values [13–18]. For an elastic sphere, the low-frequency modes can
be separated into breathing, torsional and spheroidal modes. Because the shape of a
protein departs from a sphere and a protein generally possesses several domains or
subunits which move rigidly at low frequencies, the modes are better classified as hinge,
shear and twist motions [13, 16, 17, 19–21]. These motions correspond to segmental motions
of the main chain with large amplitudes of the torsional angles of the protein backbone.
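The order-of-magnitude estimates of this paragraph can be checked numerically. The following is a minimal sketch of the relation $\nu \approx V/2R$ and of the conversion to wavenumbers; the function name `confined_phonon` is illustrative and not taken from the chapter.

```python
# Order-of-magnitude estimate of the confined acoustical phonon frequency of
# an elastic sphere, nu ~ V / (2R), and its conversion to a wavenumber.
C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def confined_phonon(velocity_m_s: float, radius_nm: float) -> tuple[float, float]:
    """Return (frequency in GHz, wavenumber in cm^-1) for nu = V / (2R)."""
    radius_m = radius_nm * 1e-9
    nu_hz = velocity_m_s / (2.0 * radius_m)
    return nu_hz / 1e9, nu_hz / C_CM_PER_S


nu_L, wn_L = confined_phonon(2000.0, 10.0)  # longitudinal, V_L = 2000 m/s
nu_T, wn_T = confined_phonon(700.0, 10.0)   # transverse,  V_T = 700 m/s
print(f"longitudinal: {nu_L:.0f} GHz, {wn_L:.2f} cm^-1")
print(f"transverse:   {nu_T:.0f} GHz, {wn_T:.2f} cm^-1")
```

With $R = 10$ nm this gives about 100 GHz (3.3 cm−1) for the longitudinal mode and about 35 GHz (1.2 cm−1) for the transverse mode, close to the rounded values quoted above.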
Because the Mean-Square Fluctuations (MSF) of the atoms are inversely proportional
to the square of the frequency of the normal modes of a molecule [22] (see Eq. 24
below), a few low-frequency modes of a protein can account for 50% of the atomic
MSF [14]. As the thermal energy at room temperature corresponds to approximately
200 cm−1, all modes below this frequency contribute significantly to the enthalpy
and to the entropy of the protein. The importance of the collective modes for the protein
entropy was recognized early [23].
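The value of about 200 cm−1 for the thermal energy follows from expressing $k_B T$ as a wavenumber, $k_B T/hc$. A short sketch, using the exact SI values of the constants (the function name is illustrative):

```python
# Thermal energy k_B T expressed as a wavenumber, k_B T / (h c).
K_B = 1.380649e-23          # Boltzmann constant, J/K (exact SI value)
H = 6.62607015e-34          # Planck constant, J s (exact SI value)
C_CM_PER_S = 2.99792458e10  # speed of light, cm/s


def thermal_wavenumber(temperature_k: float) -> float:
    """Return k_B T / (h c) in cm^-1."""
    return K_B * temperature_k / (H * C_CM_PER_S)


print(f"{thermal_wavenumber(300.0):.1f} cm^-1 at 300 K")
```

At 300 K the result is close to 208 cm−1, consistent with the rounded figure used in the text.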

1.2 Low-Frequency Modes and the Biological Function of Proteins

The directions of the largest thermal fluctuations of the structure of a protein in its
native state are the directions of its low-frequency modes (below 1 THz), named
acoustical modes by analogy with the acoustical phonons of a material. The acous-
tical modes assist the conformational changes of proteins necessary to perform their
function [24, 25]. The low-frequency modes are related to the amino-acid sequence
of the protein because they depend on the tertiary structure. Proteins for which the
amino-acid sequences lead to the same fold (having the same main-chain conforma-
tion) have similar confined acoustical modes because the lowest frequency modes
depend mainly on the connectivity of the main chain of the protein and not on
the atomistic details. Natural selection of an amino-acid sequence not only selects a
structure, and thus a biological function, but also the low-frequency collective modes
associated with it.
For nearly four decades, there has been considerable interest in establishing the
possible role of the low-frequency (<200 cm−1) modes of proteins in their biological
function [13, 19–21, 26–31]. To perform their functions, most proteins need to alternate
between different states separated by an activation barrier. The passage from one state
to another is coupled to the binding/release of one or several ligands and could be
assisted by confined acoustical modes [29, 30, 32]. The directions of the low-frequency
modes provide the direction of the largest deformation at thermal equilibrium and can
serve as collective coordinates to describe the conformational changes [19–21]. The
intrinsic dynamics of proteins correlates with the structural changes induced by ligand
or protein binding [32–34]. In enzymes, molecular dynamics (MD) simulations revealed
long-range interactions, which manifest as correlated motions of distant residues and
might play a role in enzyme catalysis [31, 35, 36]. However, the details and the
importance of the collective modes of proteins for their biological function are still
not fully understood. This is because both theoretical and experimental approaches have
so far been unable to follow the biological events on the multiple timescales on which
these events occur (from femtoseconds to seconds, Fig. 1) [37, 38]. In spite of these
limitations, a perturbative approach, based on the dynamics of a protein on a harmonic
all-atom or coarse-grained potential energy surface, has proven useful for understanding
the conformational changes of proteins.

Fig. 1 Timescales and frequencies of typical motions in proteins. Timescale axis is in log
scale. Motions and their corresponding timescales are indicated above the axis

The chapter is organized as follows. The fundamentals of the theory of the vibrational
modes in the harmonic approximation (also named normal modes) for proteins
are reviewed in Sect. 2. There, we present the equations to compute the absorption
(infrared) and Raman spectra of proteins. The application of normal mode analysis
(NMA) to describe transition pathways is briefly reviewed and the limit of the
harmonic approximation on which NMA is based is described. In Sect. 3, we present
a survey of the measurements of the low-frequency modes of proteins in their native
state by Extraordinary Acoustic Raman (EAR) spectroscopy, and their relation with
the present all-atom normal mode calculations of a model protein, conalbumin. As
shown theoretically elsewhere [39], acoustical modes of proteins studied by EAR
spectroscopy are both infrared and Raman active modes, with a remarkable agreement
between theory and experiments. In Sect. 4, we describe an analysis of the low-frequency
collective motions of a large multi-domain protein of prime interest in medicine, the human
70 kDa heat-shock protein (hHsp70). The vibrational modes of hHsp70
were studied in the vicinity (harmonic approximation) of the two main local min-
ima of its free-energy landscape: the nucleotide-free or ADP-bound hHsp70 (named
closed state) and the ATP-bound hHsp70 (named open state). As shown elsewhere
[40, 41], the open and closed states of human Hsp70 represent initial and final struc-
tures of the conformational transitions of the functional cycle of this chaperone. In an
attempt to identify the functionally important motions for the transition between the
open and closed states, we computed the collective modes of the open model and the
closed model of hHsp70 using, first, a coarse-grained normal mode analysis based on the
popular Anisotropic Network Model (ANM) [7], and second, all-atom normal mode
analysis. All-atom and coarse-grained calculations of the low-frequency motions of
hHsp70 were compared. The chapter ends with concluding remarks.

2 Theory

2.1 Introduction

For several decades [4, 13–17, 26], NMA has been used successfully to determine
protein slow motions, which are coupled to conformational changes. The NMA
method describes all possible (small) deformations a protein can undergo around its
native state by representing the protein by a set of harmonic oscillators [42, 43]. The
vibrational low-frequency modes correspond to collective or global motions, whereas
the higher frequency modes correspond to local deformations. Several studies have
shown that these low-frequency modes are related to relevant, much slower motions
in proteins and that conformational transitions often follow one or a combination of
a few normal modes [18, 30, 44–47].
Two main NMA approaches have been used in the present chapter. The first is the
all-atom NMA (aa-NMA) with the standard all-atom representation of the protein
(Fig. 2a) and an all-atom force-field. The aa-NMA is limited to proteins of hundreds
of residues due to the memory requirements for the diagonalization of the 3N × 3N
force constant (Hessian) matrix, where N is the number of atoms. This is the main
computational limitation of aa-NMA. A reduction of these degrees of freedom is
commonly used to reduce the size of the Hessian matrix. This can be achieved
by holding the bond lengths and angles fixed for example [48], or by considering
only the rotation of several residues [49]. The second NMA approach used here is
the coarse-grained NMA, where each residue of a protein is represented by a point
(effective) mass (Fig. 2b). The best-known coarse-grained model for NMA is
the elastic network model (ENM). In ENM, the all-atom force field is replaced by a
ball-and-spring harmonic potential with a single force constant parameter (Fig. 2c).
The first elastic network model was proposed by Tirion [50], who showed that an
all-atom homogeneous elastic network model can reproduce the shape of the low-
frequency part of the density of states of a protein as well as the fluctuations of
its Cα atoms very well. Later, Hinsen introduced a simplified coarse-grained elastic
network model, based on the position of the Cα atoms only and demonstrated its
usefulness to identify dynamical domains in proteins [51]. Since then, many variants
were developed [7, 52] and applied to a large number of proteins [53–55]. Combined
with a coarse-grained representation of the protein where only the Cα atoms are
considered (Fig. 2b), ENM has emerged as the preferred approach to perform NMA
on large systems. Although simple and efficient (calculation could be done within
seconds on a regular desktop computer), it has been shown to provide robust and
reliable results.
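The memory argument made above for aa-NMA can be made concrete with a rough count of the storage needed for a dense $3N \times 3N$ matrix in double precision. The sketch below is only an illustration; the atom and residue counts are hypothetical round numbers, not values from this chapter.

```python
def hessian_memory_gib(n_sites: int, bytes_per_value: int = 8) -> float:
    """Memory (GiB) needed to store a dense 3N x 3N matrix."""
    dim = 3 * n_sites
    return dim * dim * bytes_per_value / 2**30


# Hypothetical protein of about 600 residues: roughly 10,000 atoms in an
# all-atom model versus 600 nodes in a C-alpha coarse-grained model.
for label, n in [("all-atom", 10_000), ("C-alpha", 600)]:
    print(f"{label:9s} N = {n:6d}: {hessian_memory_gib(n):8.3f} GiB")
```

The quadratic growth with $N$ is what makes the coarse-grained representation attractive: reducing the number of sites by a factor of about 16 reduces the storage by a factor of about 280.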

2.2 Gaussian Model for the Structural Fluctuations

The collective modes of a protein can be defined from the eigenvectors and eigen-
values of the covariance matrix of the displacements of the atoms (or group of atoms
in a coarse-grained representation) relative to their equilibrium positions [42]. The
structure of the protein is described by a set of point masses $M_1, M_2, \ldots, M_N = \{M_i\}$
located at $\mathbf{R}_1, \mathbf{R}_2, \ldots, \mathbf{R}_N = \{\mathbf{R}_i\}$, respectively. The most probable position of the
mass $M_i$ is $\mathbf{R}_i^0$ ($i = 1$ to $N$). Each mass represents either an atomic mass (in an
all-atom representation of the protein) or an effective mass (in a coarse-grained
representation of the protein). The probability distribution of the displacements of the
point masses relative to their equilibrium positions, $\Delta\mathbf{R}_i = \mathbf{R}_i - \mathbf{R}_i^0$, is assumed to
be a multivariate Gaussian distribution

$$P(\{\Delta\mathbf{R}_i\}) = P(0)\exp\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathbf{A}_{ij} : \Delta\mathbf{R}_i\,\Delta\mathbf{R}_j\right], \tag{1}$$

where $P(0)$ is a normalization constant and $\mathbf{A}$ is a $3N \times 3N$ symmetric positive
semi-definite matrix.
The matrix A of a single protein in solution and in absence of external forces
must obey two properties. Indeed, a translation of the protein as a whole as well as
a rotation of the whole protein around a fixed axis have no effect on the probability
distribution of the displacements of the point masses representing the molecule. In
other words, for a rigid translation $\mathbf{U}$ of the molecule, one has

$$P(\{\Delta\mathbf{R}_i = \mathbf{U}\}) = P(0), \tag{2}$$

and for a rigid rotation of the molecule around an axis of direction $\boldsymbol{\Omega}$, one also finds

$$P(\{\Delta\mathbf{R}_i = \mathbf{R}_i^0 \times \boldsymbol{\Omega}\}) = P(0). \tag{3}$$

Fig. 2 a Representative hydrated structure of hHsp70 in the closed state used as input for both
coarse-grained and all-atom NMA. The color code is the following: NBD-IA = blue, NBD-IB =
marine, NBD-IIA = lightblue, NBD-IIB = cyan, linker = magenta, SBD-β = green, SBD-α = red and
C-terminal = gray. b Structure of hHsp70 in the closed state where only the Cα atoms are represented,
in the same view as in panel a. The color code is the same as in panel a. Elastic network connections
between the Cα atoms used to construct the ANM force constant matrix: Cα atoms within a cutoff
of 11 Å are shown connected via a “bond” in black. c Schematic representation of nodes in the elastic
network of ANM. Every node is connected to its spatial neighbors by uniform springs. d Distance
vector between two nodes, i and j, is shown by an arrow and labeled $\mathbf{R}_{ij}$. Equilibrium positions
of the ith and jth nodes, $\mathbf{R}_i^0$ and $\mathbf{R}_j^0$, are shown in the xyz coordinate system. $R_{ij}^0$ is the equilibrium
distance between nodes i and j. Instantaneous fluctuation vectors, $\Delta\mathbf{R}_i$ and $\Delta\mathbf{R}_j$, and instantaneous
distance vector, $\mathbf{R}_{ij}$, are shown by dashed arrows. Panels c and d were prepared with PyMOL
(http://www.pymol.org)

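By using Eqs. 2 and 3 in Eq. 1, we deduce that the matrix $\mathbf{A}$ obeys two relations

$$\sum_{j=1}^{N} A_{ij}^{\alpha\beta} = 0, \tag{4}$$

and

$$\sum_{j=1}^{N}\sum_{\beta} A_{ij}^{\alpha\beta}\left(\mathbf{R}_j^0 \times \boldsymbol{\Omega}\right)^{\beta} = 0, \tag{5}$$

where $\alpha$ and $\beta$ denote Cartesian coordinates.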


The symmetric matrix $\mathbf{A}$ possesses $3N$ eigenvalues $a_k \geq 0$ and $3N$ eigenvectors
$\mathbf{e}_k$, i.e.

$$\mathbf{A}\,\mathbf{e}_k = a_k\,\mathbf{e}_k, \tag{6}$$

$$\sum_{j=1}^{N}\mathbf{A}_{ij}\,\mathbf{e}_k(j) = a_k\,\mathbf{e}_k(i). \tag{7}$$

In Eq. 7, $\mathbf{e}_k(i) = \left(e_k^x(i), e_k^y(i), e_k^z(i)\right)$ is the projection of the $3N$-dimensional
eigenvector $\mathbf{e}_k$ on the site $i$ located at $\mathbf{R}_i = \left(R_i^x, R_i^y, R_i^z\right)$. It is easy to show that
the relation given in Eq. 4 implies that three eigenvalues of $\mathbf{A}$ are zero. Each of these
modes has a normalized eigenvector corresponding to a rigid translation along one
of the Cartesian axes. Similarly, the relation given in Eq. 5 implies that three other
eigenvalues are zero, each with an eigenvector corresponding to a rigid rotation around
one of the three Cartesian axes. The eigenvalues of $\mathbf{A}$ being ranked by increasing
value, the first non-zero eigenvalue of $\mathbf{A}$ is $a_7$.
The eigenvectors of $\mathbf{A}$ form a complete basis set:

$$\sum_{k=1}^{3N} e_k^{\alpha}(i)\,e_k^{\beta}(j) = \delta_{ij}\,\delta_{\alpha\beta}. \tag{8}$$

The scalar product of each eigenvector of $\mathbf{A}$ with the displacements $\Delta\mathbf{R}_i$ defines
a scalar collective coordinate $q_k$:

$$q_k \equiv \sum_{i=1}^{N}\mathbf{e}_k(i)\cdot\Delta\mathbf{R}_i. \tag{9}$$

Any displacement $\Delta\mathbf{R}_i$ (including the rigid translation and rotation of the
molecule as a whole) can be expanded in collective modes:

$$\Delta\mathbf{R}_i = \sum_{k=1}^{3N} q_k\,\mathbf{e}_k(i). \tag{10}$$

The relation given in Eq. 10 follows from Eqs. 8 and 9. Using Eq. 6 in Eqs. 1 and
10, we find

$$P(\{\Delta\mathbf{R}_i\}) = P(\{q_k\}) = P(0)\exp\left[-\frac{1}{2}\sum_{k=7}^{3N} a_k q_k^2\right]. \tag{11}$$

By using Eqs. 6, 7 and 11, one deduces the normalization constant

$$P(0) \equiv \left[\prod_{k=7}^{3N} a_k\right]^{1/2}\left[2\pi\right]^{-(3N-6)/2}. \tag{12}$$

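The covariance matrix of the displacements $\boldsymbol{\sigma}$ is the generalized inverse of the
matrix $\mathbf{A}$

$$\sigma_{ij}^{\alpha\beta} = \left\langle \Delta R_i^{\alpha}\,\Delta R_j^{\beta}\right\rangle = \sum_{k=7}^{3N}\frac{1}{a_k}\,e_k^{\alpha}(i)\,e_k^{\beta}(j) = \left[\mathbf{A}^{-1}\right]_{ij}^{\alpha\beta}, \tag{13}$$

where $\langle\cdots\rangle$ means an average over all possible values of the collective coordinates.
The quantities $1/a_k$ and $\mathbf{e}_k$ represent the eigenvalues and the eigenvectors of the
covariance matrix of the displacements, respectively. From Eqs. 9 and 13, one finds
that $1/a_k$ ($k > 6$) is simply the average value of the square of the collective coordinate,
i.e.,

$$\left\langle q_k^2\right\rangle = \frac{1}{a_k}. \tag{14}$$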

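Equation 13 relates the $B_i$ factors, measured in X-ray diffraction (XRD),
to the collective modes through the following relation

$$B_i \equiv \frac{8\pi^2}{3}\left\langle\left|\Delta\mathbf{R}_i\right|^2\right\rangle = \frac{8\pi^2}{3}\sum_{k=7}^{3N}\frac{\left|\mathbf{e}_k(i)\right|^2}{a_k}. \tag{15}$$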

Because the (non-zero) eigenvalue $a_k$ appears in the denominator in Eqs. 13–15,
the collective mode with the lowest eigenvalue $a_k$ contributes the most to the MSF of
the displacements. The collective modes with the lowest eigenvalues correspond to
low-frequency acoustical modes within the harmonic approximation of the protein
energy, as shown next. It is worth emphasizing that Eqs. 1–15 are valid for any model
of A provided that the matrix obeys the conditions given in Eqs. 4 and 5.
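The chain from the matrix $\mathbf{A}$ to the covariance matrix (Eq. 13) and the $B_i$ factors (Eq. 15) can be sketched numerically. In the example below, $\mathbf{A}$ is a hypothetical stand-in built from unit springs between all pairs of eight randomly placed sites; this construction is an assumption made only to obtain a matrix that satisfies Eqs. 4 and 5, and the function names are illustrative.

```python
import numpy as np


def toy_matrix(coords: np.ndarray) -> np.ndarray:
    """Build a 3N x 3N matrix A = B^T B from unit springs between all pairs.

    Each row of B gives the stretch of one pair (i, j) per unit displacement,
    so A is symmetric, positive semi-definite, and unchanged by rigid
    translations and rotations (Eqs. 4 and 5). Illustrative stand-in only.
    """
    n = len(coords)
    rows = []
    for i in range(n):
        for j in range(i + 1, n):
            u = coords[i] - coords[j]
            u = u / np.linalg.norm(u)
            b = np.zeros(3 * n)
            b[3 * i:3 * i + 3] = u
            b[3 * j:3 * j + 3] = -u
            rows.append(b)
    B = np.array(rows)
    return B.T @ B


def covariance_and_bfactors(A: np.ndarray, n_zero: int = 6):
    """Covariance from the modes k >= 7 (Eq. 13) and B factors (Eq. 15)."""
    a, e = np.linalg.eigh(A)                 # eigenvalues in increasing order
    a_nz, e_nz = a[n_zero:], e[:, n_zero:]   # drop rigid-body modes
    sigma = (e_nz / a_nz) @ e_nz.T           # sum_k e_k e_k^T / a_k
    msf = np.diag(sigma).reshape(-1, 3).sum(axis=1)   # <|dR_i|^2> per site
    return sigma, 8.0 * np.pi**2 / 3.0 * msf


rng = np.random.default_rng(0)
coords = rng.normal(size=(8, 3))     # 8 hypothetical sites in general position
A = toy_matrix(coords)
sigma, b_factors = covariance_and_bfactors(A)
# The mode sum of Eq. 13 coincides with the generalized (pseudo-) inverse of A.
print(np.allclose(sigma, np.linalg.pinv(A, rcond=1e-10)))
print(b_factors.round(3))
```

Dropping the six zero modes before inverting is what makes the inverse "generalized": the rigid-body directions carry no restoring force and would otherwise give infinite fluctuations.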

2.3 Gaussian Model and Normal Modes

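The intra-molecular energy $E$ of a protein in its native state is expanded up to
second order in terms of the displacements $\Delta\mathbf{R}_i$ of the $N$ point masses

$$\begin{aligned}
E - E(0) &= \sum_{i=1}^{N}\left(\frac{\partial E}{\partial\mathbf{R}_i}\right)_0\cdot\Delta\mathbf{R}_i + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\frac{\partial^2 E}{\partial\mathbf{R}_i\,\partial\mathbf{R}_j}\right)_0 : \Delta\mathbf{R}_i\,\Delta\mathbf{R}_j \\
&= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\boldsymbol{\Phi}_{ij} : \Delta\mathbf{R}_i\,\Delta\mathbf{R}_j,
\end{aligned} \tag{16}$$

where the first term on the right-hand side of the first line is zero because, in
the native state, one assumes $\left(\partial E/\partial\mathbf{R}_i\right)_0 = 0$, $\forall i$.
For $E$ given by Eq. 16, the probability distribution of the displacements $\Delta\mathbf{R}_i$ in the
canonical ensemble at temperature $T$ (in the classical approximation) is a Gaussian
distribution:

$$P(\{\Delta\mathbf{R}_i\}) = P(0)\exp\left(-\left[\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\boldsymbol{\Phi}_{ij} : \Delta\mathbf{R}_i\,\Delta\mathbf{R}_j\right]\Big/k_B T\right), \tag{17}$$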

where $k_B$ is the Boltzmann constant.
By comparing Eq. 1 with Eq. 17, one deduces $\mathbf{A} = \boldsymbol{\Phi}/k_B T$. Consequently, the
eigenvalues $\lambda_k$ of $\boldsymbol{\Phi}$ are simply given by

$$\lambda_k = a_k k_B T, \tag{18}$$

and have the physical dimension of a force constant.
From Eq. 14, one deduces:

$$\left\langle q_k^2\right\rangle = \frac{k_B T}{\lambda_k}. \tag{19}$$


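By introducing the mass-weighted displacements, $\Delta\tilde{\mathbf{R}}_i \equiv \sqrt{M_i}\,\Delta\mathbf{R}_i$, in Eq. 17,
one may write the probability distribution of the mass-weighted displacements directly
in terms of the dynamical matrix $\mathbf{D}_{ij}$ [18]:

$$P(\{\Delta\tilde{\mathbf{R}}_i\}) = P(0)\exp\left(-\left[\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\tilde{\mathbf{A}}_{ij} : \Delta\tilde{\mathbf{R}}_i\,\Delta\tilde{\mathbf{R}}_j\right]\right), \tag{20}$$

with

$$\tilde{\mathbf{A}}_{ij} = \mathbf{D}_{ij}/k_B T \equiv \left(\boldsymbol{\Phi}_{ij}/k_B T\right)\Big/\sqrt{M_i M_j}. \tag{21}$$

Using the eigenvectors of the dynamical matrix:

$$\sum_{j=1}^{N}\mathbf{D}_{ij}\,\hat{\mathbf{e}}_k(j) = \omega_k^2\,\hat{\mathbf{e}}_k(i), \tag{22}$$

and

$$a_k = \frac{\omega_k^2}{k_B T}, \tag{23}$$

one can reformulate the $B_i$ factors (Eq. 15) in terms of the normal modes:

$$B_i = \frac{8\pi^2 k_B T}{3 M_i}\sum_{k=7}^{3N}\frac{\left|\hat{\mathbf{e}}_k(i)\right|^2}{\omega_k^2}. \tag{24}$$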

Another useful quantity is the influence $\upsilon_k(i)$, corresponding to the contribution
of the atom $i$ (or Cα atom in ANM) to the molecular deformation within a mode $k$:

$$\upsilon_k(i) \overset{\mathrm{def}}{=} \left|\hat{\mathbf{e}}_k(i)\right|^2. \tag{25}$$


The summation of the influence over all the atoms of a subdomain is a measure
of the contribution of this domain to the normal mode.
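The mass weighting of Eqs. 21 and 22 and the influence of Eq. 25 can be sketched as follows. The units (force constants in kJ mol−1 nm−2, masses in amu) are an assumption chosen for the example, as are the function name `normal_modes` and the two-site toy system used as a check.

```python
import numpy as np

# sqrt(kJ mol^-1 nm^-2 amu^-1) equals 1e12 rad/s; dividing by 2*pi*c (c in
# cm/s) converts an angular frequency to a wavenumber in cm^-1 (~5.3088).
TO_WAVENUMBER = 1.0e12 / (2.0 * np.pi * 2.99792458e10)


def normal_modes(force_constants: np.ndarray, masses: np.ndarray):
    """Diagonalize D_ij = Phi_ij / sqrt(M_i M_j) (Eqs. 21 and 22).

    force_constants: 3N x 3N matrix in kJ mol^-1 nm^-2 (assumed units)
    masses:          N masses in amu
    Returns the wavenumbers (cm^-1), the mass-weighted eigenvectors (columns)
    and the influence |e_k(i)|^2 of site i in mode k (Eq. 25), shape (N, 3N).
    """
    inv_sqrt_m = 1.0 / np.sqrt(np.repeat(masses, 3))
    D = force_constants * np.outer(inv_sqrt_m, inv_sqrt_m)
    omega2, vecs = np.linalg.eigh(D)
    omega2 = np.clip(omega2, 0.0, None)      # remove tiny negative round-off
    wavenumbers = np.sqrt(omega2) * TO_WAVENUMBER
    influence = (vecs**2).reshape(len(masses), 3, -1).sum(axis=1)
    return wavenumbers, vecs, influence


# Toy check: two sites joined by a single spring along x (a "diatomic").
k = 1000.0                                   # kJ mol^-1 nm^-2, hypothetical
masses = np.array([12.0, 16.0])
phi = np.zeros((6, 6))
phi[0, 0] = phi[3, 3] = k
phi[0, 3] = phi[3, 0] = -k
wn, vecs, infl = normal_modes(phi, masses)
mu = masses.prod() / masses.sum()            # reduced mass
print(wn[-1], np.sqrt(k / mu) * TO_WAVENUMBER)   # stretch vs analytic value
print(infl[:, -1])                                # site contributions
```

For each mode the influences sum to one over all sites, so summing them over the atoms of a domain gives the fraction of the mode carried by that domain.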

2.4 Classical Infrared Spectra of Proteins from Normal Modes

The classical (infrared) absorption spectrum $P(\omega)$ of a protein in an applied electric
field $\mathbf{E}(\omega)$, oscillating at frequency $\omega$, is calculated from its normal modes
by using the following formula [39]:
W
8π 4  # #2
3N
d γk ω
P(ω) ≡ ω
   #ρ k # , (26)
ω 2 2
h k7 ω2 − ω2 + ω2 γ 2
k k

where $h$ is the Planck constant, $W$ is the energy absorbed by the molecule, $\omega_k$ and $\gamma_k$
are respectively the vibrational frequency and damping of the $k$th vibrational mode,
$N$ is the total number of atoms and $\Delta\boldsymbol{\rho}_k$ is the variation of the molecular dipole
moment in the vibrational mode $k$, with

$$\Delta\boldsymbol{\rho}_k = \sum_{i=1}^{N}\frac{q_i\,\hat{\mathbf{e}}_k(i)}{\sqrt{m_i}}. \tag{27}$$

In Eq. 27, $q_i$ and $m_i$ are the charge and the mass of the atom $i$ of the protein,
respectively. The vector $\hat{\mathbf{e}}_k(i)$ is the eigenvector component of the atom $i$ of the $k$th
mode (Eq. 22). The damping factor $\gamma_k$ was arbitrarily taken to be identical
($\gamma_k = \gamma = 0.1$ cm−1) for all acoustical modes because their frequencies and the scale of their
motions are similar [39].
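Equation 27 is a charge- and mass-weighted sum of the eigenvector components over the atoms. A minimal sketch is given below, assuming that the mass-weighted eigenvectors are stored as the columns of an array; the function name and the two-site test modes are hypothetical.

```python
import numpy as np


def dipole_derivatives(charges, masses, modes):
    """Eq. 27: drho_k = sum_i q_i e_k(i) / sqrt(m_i).

    charges, masses: length-N sequences; modes: (3N, n_modes) array whose
    columns are mass-weighted eigenvectors. Returns an (n_modes, 3) array.
    """
    charges = np.asarray(charges, dtype=float)
    masses = np.asarray(masses, dtype=float)
    n = len(charges)
    per_site = np.asarray(modes, dtype=float).reshape(n, 3, -1)  # (site, xyz, mode)
    weights = charges / np.sqrt(masses)                           # q_i / sqrt(m_i)
    return np.einsum("i,iam->ma", weights, per_site)


# Hypothetical two-site stretch along x with opposite charges: infrared active.
stretch = np.array([[4.0, 0, 0, -np.sqrt(12.0), 0, 0]]).T / np.sqrt(28.0)
polar = dipole_derivatives([+0.4, -0.4], [12.0, 16.0], stretch)

# Equal charges and equal masses: the stretch leaves the dipole unchanged.
sym_stretch = np.array([[1.0, 0, 0, -1.0, 0, 0]]).T / np.sqrt(2.0)
silent = dipole_derivatives([0.4, 0.4], [12.0, 12.0], sym_stretch)

print(np.sum(polar**2, axis=1), np.sum(silent**2, axis=1))  # |drho_k|^2
```

The squared norm of each returned vector is the quantity that weights the corresponding mode in the absorption spectrum.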

2.5 Raman Activity of Vibrational Modes of Proteins from Normal Modes

Raman activity of the vibrational modes of proteins from normal mode calculations
is computed as follows. In a Raman active mode, the elastic deformation of
the molecule induces a variation of the molecular electronic polarizability $\alpha$ [56]
and the Raman intensity is proportional to the square of the derivative of the molecular
polarizability relative to the collective normal coordinate $q$ (Eq. 9). As shown
elsewhere [57], the electronic polarizability of an amino acid, computed ab initio, is
simply proportional to its number of electrons. Therefore, making the assumption of
an average electronic density for all amino acids, the polarizability of an amino acid
is simply proportional to its steric volume [39]. Using this property, the Raman activity
$A$ of each mode $k$ of frequency $\omega_k$ can be estimated by computing the following
quantity:

$$A(\omega_k) \equiv \left|\frac{\partial\alpha}{\partial q_k}\right|^2 = \left|\frac{\partial\alpha}{\partial V}\frac{\partial V}{\partial q_k}\right|^2 \sim C^2\left|\frac{\partial V}{\partial q_k}\right|^2, \tag{28}$$

where $V$ is the steric volume of the protein and the constant $C$ is 353.34 a.u./nm$^3$
(1 a.u. = $1.649 \times 10^{-41}$ C$^2$ m$^2$ J$^{-1}$). The derivative in Eq. 28 is computed by
finite difference using $\Delta q_k = \pm 0.1$ and the steric volume $V$ is computed using the
software GROMACS [58]. Finally, using the Raman activities (Eq. 28), we defined
a continuous Raman spectrum $P'(\omega)$ using a Lorentzian broadening:

$$P'(\omega) = 4\pi^2\sum_{k}\frac{A(\omega_k)}{\left(\omega_k^2 - \omega^2\right)^2 + (\gamma/2)^2}. \tag{29}$$

2.6 Coarse-Grained Anisotropic Elastic Model

A coarse-grained model widely used to study collective motions of large proteins
is ANM [7, 30, 54]. In ANM, a protein in its folded state is simply represented
by a three-dimensional elastic network of nodes at the Cα positions (Fig. 2b). The
interactions between the atoms are replaced by harmonic springs (Fig. 2c) connecting
any two nodes separated by a distance smaller than a cutoff distance $R_c$. Because ANM is
a coarse-grained model, it can only describe collective modes of long wavelength,
i.e. the low-frequency modes. Because the low-frequency modes contribute the most
to the structural fluctuations of the protein [14, 43], ANM reproduces rather well the
structural fluctuations of a protein in its native state and their directionalities. The
structural fluctuations can be decomposed into a series of $3M - 6$ modes, where $M$
is the number of residues (= nodes). The number of modes is thus reduced by one
order of magnitude compared to an all-atom approach. In ANM, the force constant
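matrix $\boldsymbol{\Phi}$ in Eq. 17 is defined as follows

$$\Phi_{ij}^{\alpha\beta} = -A\,\frac{\left(\mathbf{R}_i^0 - \mathbf{R}_j^0\right)_{\alpha}\left(\mathbf{R}_i^0 - \mathbf{R}_j^0\right)_{\beta}}{\left|\mathbf{R}_i^0 - \mathbf{R}_j^0\right|^2}\,H\!\left(R_c - R_{ij}^0\right), \tag{30}$$

where $\mathbf{R}_i^0$ is the equilibrium position of the Cα atom of residue $i$, $R_{ij}^0 = \left|\mathbf{R}_i^0 - \mathbf{R}_j^0\right|$
is the equilibrium distance between nodes $i$ and $j$ (Fig. 2d) and $H$ is
the Heaviside function. Equation 30 gives the off-diagonal blocks ($i \neq j$); the diagonal
blocks follow from Eq. 4, $\boldsymbol{\Phi}_{ii} = -\sum_{j \neq i}\boldsymbol{\Phi}_{ij}$.
There are only two parameters in ANM: the force constant $A$ and the cutoff radius
$R_c$. The model is strictly equivalent to the Born-von Karman model developed in the
early days of solid state physics to describe the phonons of crystals [59]. Indeed,
Eq. 30 is the simplest form of $\boldsymbol{\Phi}$ which is invariant under global translation and rotation
of the molecule and obeys the relations given in Eqs. 4 and 5.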
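The construction of the ANM force constant matrix can be sketched in a few lines. The example below assumes that the diagonal blocks are fixed by the translational sum rule of Eq. 4; the node coordinates (a small cubic lattice with 3.8 Å spacing) are a hypothetical test geometry, not a protein, and the function name and default spring constant are placeholders.

```python
import numpy as np


def anm_force_constants(coords, cutoff=11.0, spring=1.0):
    """Force constant matrix of Eq. 30 for nodes at `coords` (N x 3, in Å).

    Off-diagonal blocks are -spring * d d^T / |d|^2 for pairs closer than
    `cutoff`; each diagonal block is minus the sum of the off-diagonal blocks
    of its row, which enforces Eq. 4.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    phi = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[i] - coords[j]
            dist2 = d @ d
            if dist2 >= cutoff**2:
                continue                      # Heaviside factor: no spring
            block = -spring * np.outer(d, d) / dist2
            phi[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            phi[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            phi[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            phi[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    return phi


# Hypothetical test geometry: 27 nodes on a 3 x 3 x 3 lattice, 3.8 Å spacing.
grid = np.arange(3) * 3.8
coords = np.array([[x, y, z] for x in grid for y in grid for z in grid])
phi = anm_force_constants(coords)
eigvals = np.linalg.eigvalsh(phi)
print("zero modes:", int(np.sum(eigvals < 1e-8)))
print("lowest non-zero eigenvalues:", eigvals[6:9].round(4))
```

For a real protein the coordinates would be the Cα positions of the minimized structure, and the eigenvectors of `phi` beyond the six rigid-body modes would be the ANM collective modes.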

2.7 Involvement Coefficient of Collective Modes

The contribution of a given collective mode to the transition between two states of
a protein (as for example the open and closed state of Hsp70), can be defined by
an individual and a cumulative involvement coefficient adapted from Ref. [30] and
computed as follows. A “transition pathway” is determined by linearly interpolating
between two structural states of the protein (say A an initial state and B a final
state) after optimal superposition of all the Cα atoms of these two structural states
(Fig. 3a). Only the positions of the Cα atoms are considered to describe the transition
pathway in ANM whereas in the all-atom calculation, the positions of all atoms are
considered. The positions of the $i$th atom in the structural states A and B are denoted
by $\mathbf{R}_i^A$ and $\mathbf{R}_i^B$, respectively. The linear pathway followed by the $i$th atom is defined
by $\mathbf{R}_i^A - \mathbf{R}_i^B$ (Fig. 3b). The contribution of the $i$th atom in the mode $k$ to the transition
between A and B is measured by the following projection

$$\tilde{I}_i^k \equiv \frac{\mathbf{R}_i^A - \mathbf{R}_i^B}{\sqrt{\sum_j\left(\mathbf{R}_j^A - \mathbf{R}_j^B\right)^2}}\cdot\mathbf{e}_k(i), \tag{31}$$

where $\mathbf{e}_k(i)$ is the projection of the eigenvector of the mode $k$ on the site $i$.

Fig. 3 Illustrations of the linear interpolated transition pathway between the Cα atoms (in black,
panel a) and the involvement coefficients (panel b) between the open (red cartoon) and closed (blue
cartoon) states of hHsp70. The superposition of the structures in panel a was done by minimizing
the RMSD of the Cα atoms of the full-length structure


The involvement coefficient of the collective mode $k$, which describes the degree
of involvement of the $k$th mode in the conformational transition A → B, is defined
by

$$I_k = \left|\sum_{i=1}^{N}\tilde{I}_i^k\right|, \tag{32}$$

where the sum is over all the N sites considered to represent the molecule, i.e. all
atoms of the protein in aa-NMA and only the Cα atoms in ANM.
Thus, the value of the involvement coefficients Ik indicates in a semi-quantitative
way the contribution of each collective motion to a given conformational change.
The maximum value of Ik is 1 and corresponds to a situation in which a single
mode contributes to the conformational change between the states A and B. In this
case, the eigenvector components are exactly in the direction of the linear interpolated
pathway between the structures A and B. A complementary quantity is the cumulative
involvement coefficient $CI_K$, which is computed as:

$$CI_K = \sum_{k=1}^{K} I_k^2, \tag{33}$$

which measures the contribution of the K first lowest-frequency modes to the con-
formational change.
The cumulative coefficient is normalized:

$$\sum_{k=1}^{3N} I_k^2 = 1. \tag{34}$$
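Equations 31 to 34 amount to projecting the normalized displacement between the two states on each eigenvector. The sketch below assumes that the two structures have already been superposed and that the eigenvectors are stored as orthonormal columns; the stand-in modes generated by a QR factorization and the function name are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np


def involvement(coords_a, coords_b, modes):
    """Involvement coefficients I_k (Eqs. 31-32) and cumulative sum (Eq. 33).

    coords_a, coords_b: (N, 3) arrays, assumed already optimally superposed.
    modes: (3N, n_modes) array with orthonormal eigenvectors as columns.
    """
    diff = (np.asarray(coords_a) - np.asarray(coords_b)).ravel()
    direction = diff / np.linalg.norm(diff)   # normalized linear pathway
    coeff = np.abs(modes.T @ direction)       # |sum_i I~_i^k| for each mode k
    return coeff, np.cumsum(coeff**2)


rng = np.random.default_rng(2)
n = 5
# Stand-in orthonormal "modes": columns of Q from a QR factorization.
modes, _ = np.linalg.qr(rng.normal(size=(3 * n, 3 * n)))
state_a = rng.normal(size=(n, 3))
# State B differs from A by a combination of modes 2 and 5 only.
state_b = state_a - (0.8 * modes[:, 2] + 0.6 * modes[:, 5]).reshape(n, 3)
coeff, cumulative = involvement(state_a, state_b, modes)
print(coeff.round(3))
print(cumulative[-1])   # normalization of Eq. 34 when all 3N modes are used
```

In this constructed case only two coefficients are non-zero (0.8 and 0.6), and their squares sum to one, which illustrates the normalization of Eq. 34.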

2.8 Limits of the Harmonic Approximation

The starting point of the harmonic approximation is the representation of the protein
by a single structure corresponding to the minimum of the potential energy.
In practice, it is the structure obtained by energy minimization of the structure
measured by XRD, using a model of the potential energy surface. However, in solution,
a protein occurs in many conformational substates [60]. The free-energy landscape of a
protein is best regarded as a multi-dimensional surface with multiple local minima
separated by barriers. The static structure used in normal mode calculations corresponds
to only one of these minima. The conformational substates of the multi-dimensional
free-energy landscape of a protein can be projected along the amino-acid sequence [61],
showing which parts of the backbone and side chains occur in multiple substates. At
the level of one residue or bond, the protein motion within local minima corresponds
to an anomalous diffusion [62], which can be related to NMR data [63].
Because the multiple substates are separated by activation barriers, two types of
collective atomic motions are possible in the native state: either intra-minimum motions
or jumps between the minima [64]. Because jumps between the minima of the free-energy
landscape are transient events (the probability is minimal at the activation barrier),
a protein spends most of its time oscillating on a multi-dimensional parabolic
free-energy surface [65]. One therefore expects that most of the collective modes
of a protein are actually harmonic in the native state. Principal component analysis
(PCA) of the protein structural fluctuations computed in molecular dynamics (MD)
simulations indeed leads to that conclusion [43, 66]. A small fraction (12–20%) of the
lowest frequency modes (<80 cm−1) are anharmonic at room temperature according to MD
simulations [43, 66]. Strictly speaking, the (harmonic) collective modes between 20
and 80 cm−1 are only well defined in crystals and crystal powders at low hydration in
the harmonic approximation (and are actually measured in these conditions). However,
the directions of these modes computed in the harmonic approximation are, to a
first approximation, well correlated with the actual conformational changes of proteins
in solution.

3 Normal Mode Analysis in the Native State of a Protein

3.1 A Short Survey of Experimental Data of Low-Frequency Modes of Proteins in Their Native State

The frequencies of the confined acoustical phonons of proteins are much smaller
than the vibrational frequencies of the chemical groups of organic molecules and
are more difficult to measure [67]. In proteins, the lowest vibrational frequencies
of the chemical groups correspond to the librations of the methyl groups of the
side chains of the amino acids, which form a large band of modes at about 8 THz
(240 cm−1) [5, 68, 69]. Direct experimental observation of low-frequency modes in
proteins (<200 cm−1) is hampered by several factors: proximity of the frequencies
to the elastic peak, anharmonicity, which leads to asymmetric broadening, damping
of the modes by the hydration layer or by the solvent, the large density of modes and
the absence of symmetry. To the best of our knowledge, the lowest frequency of a
normal mode measured in proteins is about 0.3 cm−1 (10 GHz) and corresponds to
the frequency of a longitudinal acoustical phonon in collagen [5, 70].
The most important source of experimental data about the vibrational spectra of
proteins arises from Inelastic Incoherent Neutron Scattering (IINS) experiments [68].
In IINS, neutrons of thermal energy, i.e., with a typical incoming energy of the order
of 100 meV and a wavelength of 0.1 nm, are scattered by a protein crystal while
exciting (energy loss) or de-exciting (energy gain) a delocalized vibrational mode of
the crystal (a phonon). Because of the laws of energy and momentum conservation,
the neutrons lose or gain an energy quantum $\hbar\omega$ and exchange a momentum $\hbar Q$,
where $Q$ is the wave-vector of the phonon excited or de-excited with a wavelength
$\lambda = 2\pi/Q$ and an energy $\hbar\omega$. The scattering intensity measured in the detector is
proportional to the so-called incoherent dynamic structure factor $S_{inc}(Q, \omega)$ (where
$\hbar\omega$ and $\hbar Q$ are the energy transfer and momentum transfer of scattered neutrons).
The function $S_{inc}(Q, \omega)$ is the space-time Fourier transform of the self-correlation
function $G_s(r, t)$, which describes the correlation of the position of an atom at time 0
with its position at time $t$. Therefore, $S_{inc}(Q, \omega)$ reflects the single-particle dynamical
spectra. The number of vibrational modes (phonons) per unit frequency, named the
vibrational density of states (VDOS), can be approximately extracted from $S_{inc}(Q, \omega)$
at low temperature for small wave-vectors ($Q < 10$ nm−1) [69, 71]. The VDOS
extracted from the IINS function reflects the dynamics of the hydrogen atoms of the
protein. Scattering of neutrons is from the nuclei and all vibrational modes can be
excited/de-excited.
IINS measurements of collagen [69, 72], lysozyme [73–76] and myoglobin [71,
75] crystals at low temperatures revealed one or several peaks between about 600 GHz
(20 cm−1) and 1.2 THz (40 cm−1) in $S_{inc}(Q, \omega)$. It is worth noting that the positions
of the maxima in the VDOS of proteins do not correspond to the positions of the
peaks in $S_{inc}(Q, \omega)$. In collagen, the IINS VDOS contains only two features at
low frequencies: a broad band with a maximum around 100 cm−1 and a narrower
distribution of modes with a maximum around 250 cm−1 [69]. The low-frequency
VDOS of hydrated [71] and dry [77] myoglobin at 100 K extracted from $S_{inc}(Q, \omega)$
resembles that of collagen, with a broad band of modes and a maximum
around 80–100 cm−1. For proteins, the modes around 20 cm−1 contribute the most to
the dynamic structure factor [69, 71]. Indeed, because of the Bose-Einstein statistics,
the population of these vibrational levels is large.
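The role of the Bose-Einstein statistics can be illustrated by computing the mean occupation number $n = 1/[\exp(hc\tilde{\nu}/k_B T) - 1]$ of a mode of wavenumber $\tilde{\nu}$. The sketch below evaluates it at 100 K, the temperature quoted above for the myoglobin measurements; the three wavenumbers are taken from the features mentioned in this paragraph and the function name is illustrative.

```python
import math

HC_OVER_KB = 1.438776877  # second radiation constant h c / k_B, in cm K


def bose_einstein_occupation(wavenumber_cm1: float, temperature_k: float) -> float:
    """Mean number of quanta in a mode of given wavenumber at temperature T."""
    x = HC_OVER_KB * wavenumber_cm1 / temperature_k
    return 1.0 / math.expm1(x)


for wn in (20.0, 100.0, 250.0):
    n = bose_einstein_occupation(wn, 100.0)
    print(f"{wn:5.0f} cm^-1 at 100 K: n = {n:.2f}")
```

At 100 K a mode at 20 cm−1 carries about three quanta on average, whereas modes at 100 and 250 cm−1 carry well below one, which is consistent with the dominant weight of the lowest modes in the dynamic structure factor.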
Low-frequency modes of proteins were also studied by Raman spectroscopy
[78–84]. Raman scattering is an inelastic light scattering process in which incident
photons with an energy typically of $h\nu = 1$ eV (visible light) excite (energy loss) or
de-excite (energy gain) vibrational modes of matter. In addition to scattered light
at the same frequency as the incident light, the energy loss and gain of the photons
appear as scattered light at smaller frequency (named Stokes lines in the
spectra) and larger frequency (named anti-Stokes lines in the spectra), respectively. In
Raman, inelastic scattering is through the electronic density and only vibrational
modes which modify the electronic polarizability of the molecules are probed. Light
scattering probes the vibrational modes at long wavelength and does not require a
protein crystal but can be applied in solution. Because of the lack of symmetry in
proteins, the usual Raman selection rules are broken and most of the normal modes of
the macromolecule should contribute to the Raman intensity. Therefore, the Raman
intensity of the (slow) modes of proteins should be closely related to their weight in
the VDOS [85].
The low-frequency modes of lysozyme were extensively studied by Raman scat-
tering [79, 81–84]. A peak at 29 cm−1 was observed in the Raman spectra of powders
and crystals of α-chymotrypsin in the native state [78]. In the denatured state, this
peak disappeared and was replaced by a broad band between 20 and 150 cm−1 [78]. A
peak in the Raman spectra at frequencies below 30 cm−1 was observed in the native
state of several other proteins (powders and crystals) [80]: bovine serum albumin
(BSA) (14 cm−1), thyroglobulin (17 cm−1), pepsin and concanavalin A (20 cm−1),
insulin and ovalbumin (22 cm−1 ), lysozyme [79] and β-lactoglobulin (25 cm−1 ). For
lysozyme, the peak at 25 cm−1 was observed in protein crystals but not in solu-
tion [79]. The frequency of this peak in the Raman spectra varied with the level of
hydration: from 17 cm−1 for wet lysozyme powders to 27 cm−1 for dried lysozyme
powders [81].
The spectroscopic feature around 20–30 cm−1 observed both in Sinc (Q, ω) by
IINS and in the Raman spectra of proteins is often referred as the “boson peak”
in the literature, by analogy with the boson peak observed in disordered (glass)
materials, see for example Ref. [86]. The interpretation of this peak in glasses [87]
and biopolymers [88, 89] is still debated. In proteins, the boson peak appears at low
temperature (180 K) in protein crystals and persists at high temperature (300 K)
only in dry protein powders. In hydrated powders at high temperature (above the
so-called dynamical transition [90] 200 K) or in solution the hydration water and
side chains of the amino acids diffuse and their contribution to Sinc (Q, ω) overlap the
frequency range of the boson peak. MD simulations of protein powder in realistic
environments compare very well to the IINS data [89]. From these MD simulations,
the boson peak in proteins is believed to arise from (transverse) motions of both the
backbone and of the nonpolar buried side chains and polar hydrated side chains of
the amino acids [89]. Besides the 20–30 cm−1 peak, the Raman spectra of lysozyme
crystals show peaks at 75, 115 and 160 cm−1 [77]. The Raman spectra of dry and
wet lysozyme powders were fitted by using a Brownian oscillator model revealing
four contributions above 30 cm−1 : at 42, 83, 114 and 162 cm−1 in dry lysozyme
which are shifted to 45, 85, 112 and 183 cm−1 in wet lysozyme [79]. Because of the
solvent, one expects the lowest frequency protein modes to be overdamped in general
[91]. In lysozyme, no spectral feature was observed below 75 cm−1 in solution [79].
In solution, the motions of the atoms of the protein are stochastic due to water-
protein collisions. However, in the harmonic approximation of the potential energy
of the protein, the random motions of the atoms are oscillations along the directions
identical to those of the undamped vibrational modes of the molecule.
Infrared spectroscopy, based on the absorption of infrared electromagnetic radiation, provides
information only on normal modes which modify the dipole moment of a molecule.
Infrared, Raman and IINS are hence complementary techniques. Far-infrared absorption
using synchrotron radiation detected a band at 19 cm−1 in low-hydration
lysozyme [92], which could be an undamped vibrational protein mode. At high hydration,
the same technique only showed infrared absorption around 26 and 38 cm−1 ,
which resembles that of pure water [92]. Applications of new spectroscopic techniques
to protein samples, such as Surface Enhanced Raman Spectroscopy [93, 94],
UV resonance Raman [95] and new circular polarization Raman spectroscopies [96],
should provide more accurate vibrational spectra of proteins in the near future.
So far, the spectroscopic technique which provides the most detailed information
about low-frequency excitations of proteins (<100 GHz) is the very recent single-
molecule spectroscopy named Extraordinary Acoustic Raman (EAR) spectroscopy
[97]. In this technique, a single protein molecule is trapped in a nanohole and then
excited by two optical lasers of slightly different wavelengths which produce a beat
signal at low-frequency (<100 GHz). The beat signal corresponds to an electro-
magnetic field which can interact with the protein acoustical modes. Vibrational
resonances are then detected by measuring the increase of the molecule fluctuations
when the frequency of the beat field matches the frequency of an acoustical mode.
The mechanism of excitation of the acoustical (Raman) active modes of proteins in
EAR spectroscopy is not fully explained from experiments but is believed to be due
to the modulation of the electrostriction force at the trapping site of the molecule.
Electrostriction is a nonlinear phenomenon in which the strain induced by a static
electric field applied to a dielectric body is proportional to the square of the
applied electric field [98]. At the microscopic level, it is related to the anharmonicity
of the interaction potential between the atoms of a molecule and to the nonlinearity of
its electronic polarisability [99]. Electrostriction is a general nonlinear phenomenon
occurring for all dielectrics to which an electric field is applied.
Three proteins of different sizes and shapes were tested using EAR
[97], in particular conalbumin. In the frequency range between 0 and 2.7 cm−1 ,
a spectral feature made of three peaks around 0.9, 1.5 and 2.5 cm−1 was observed
for this protein. To the best of our knowledge, these are the lowest acoustical modes
of proteins ever detected experimentally. In addition, different fingerprints
were measured for aprotinin, carbonic anhydrase and conalbumin, showing that the
low-frequency spectra depend on the protein size and shape [97].
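Since the discussion above mixes wavenumbers (cm−1) and frequencies (GHz), a small conversion helper may be convenient. The sketch below only relies on the relation f = c·ν̃; the function names are illustrative and do not belong to any software package used in this chapter.

```python
# Speed of light in cm/s (exact value from the SI definition of the metre).
C_CM_PER_S = 2.99792458e10


def wavenumber_to_ghz(nu_cm: float) -> float:
    """Convert a wavenumber in cm^-1 to a frequency in GHz (f = c * nu)."""
    return nu_cm * C_CM_PER_S / 1.0e9


def ghz_to_wavenumber(f_ghz: float) -> float:
    """Convert a frequency in GHz to a wavenumber in cm^-1 (nu = f / c)."""
    return f_ghz * 1.0e9 / C_CM_PER_S


if __name__ == "__main__":
    # EAR peaks reported for conalbumin in Ref. [97], in cm^-1.
    for nu in (0.9, 1.5, 2.5):
        print(f"{nu:.1f} cm^-1 -> {wavenumber_to_ghz(nu):.1f} GHz")
    # Upper limit of the beat frequency quoted in the text.
    print(f"100 GHz -> {ghz_to_wavenumber(100.0):.2f} cm^-1")
```

With this exact factor, 1 cm−1 corresponds to about 30 GHz, so that the 100 GHz limit of the beat signal lies slightly above 3 cm−1.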

3.2 The Protein Conalbumin as a Model Case

To illustrate the main vibrational features of a protein in its native state, we performed
all-atom NMA of the model protein conalbumin. Conalbumin or Ovotransferrin was
identified in 1944 [100] and is well known as an iron-transport protein, which can also
bind other metal ions, including toxic ones, and is considered to play an important role
in the transportation of such metal ions. Conalbumin is a ~80 kDa single-chain protein
which is folded into two homologous lobes (N- and C-lobes) with two domains, the
two metal-binding sites being located within the inter-domain clefts of each lobe.
The structure of hydrated conalbumin used to compute its vibrational modes
was taken from the Protein Data Bank (PDB ID: 2D3I) [101]. The normal mode
calculations were performed with the GROMACS package [58, 102] using the TIP3P
water model and the AMBER99sb-ILDN force field [103]. Only the first hydration
layer of conalbumin was kept, corresponding to 1612 water molecules, all
within 3 Å of the protein atoms. More details about the simulations can be found
in Ref. [39]. The hydrated structure is shown in Fig. 4a.
After optimization of the hydrated structure, the normal modes were calculated
and the computed spectra of infrared active modes P(ω) (Eq. 26) and Raman active
modes P (ω) (Eq. 29) are shown and compared below 10 cm−1 in Fig. 4b.
First, an interesting property is that the lowest frequency acoustical modes of conalbumin
are separated by a gap from the rest of the normal modes. The first three
frequencies computed using all-atom NMA are ω7 = 1.9 cm−1 , ω8 = 2.34 cm−1 and
ω9 = 2.86 cm−1 (named modes 1, 2 and 3 in the present work), whereas the fourth one
is characterized by a frequency ω10 = 4.74 cm−1 (gap of around 2.0 cm−1 ). Moreover,
as shown in Fig. 4b, signatures of the low-frequency modes occur both in the
infrared and Raman spectra of active modes. In Fig. 4c, we compare the spectra of the
lowest-frequency modes (below 3 cm−1 ~ 100 GHz) with EAR experimental data. In
the absorption spectra P(ω), we clearly distinguish three low-frequency acoustical
modes for conalbumin. Similar signatures are found in the spectra of the Raman
active modes, P (ω). There is a striking similarity between the computed spectra of
the acoustical modes of conalbumin and those measured by EAR, except for a
shift of the whole computed vibrational spectrum to higher frequencies
compared to the experimental one. More precisely, the frequency shift Δω between
the computed and experimental spectra is around 1.0 cm−1 . This frequency
shift was shown to depend on the size of the protein structure [39]. The limits
of the harmonic approximation, as detailed in Sect. 2.7 of the present chapter, might
be the key to understanding the frequency shift between theory and experiment. Indeed,
anharmonicity is not included in the present calculations. Another hypothesis
is the possible softening of the acoustical modes of proteins by the bulk solvent,
which is ignored in the present simulations. For a hydrated biomolecule, there is a

Fig. 4 a All-atom structure of conalbumin. Structure is shown using a cartoon representation
and the first layer of water surrounding the protein is shown using a surface representation. The
figure was prepared with PyMOL (http://www.pymol.org). b Classical absorption spectra P(ω) and
Raman activity spectra P (ω) of conalbumin calculated up to 10 cm−1 from Eqs. (26) and (29),
respectively. c Comparison between calculated spectra P(ω) and P (ω) from all-atom NMA and
EAR experimental data extracted from Ref. [97]. All data are shown in the same frequency width

strong coupling between the biopolymer and water, which decreases as the particle
size increases [39]. Indeed, we observe that the hydration water contributes 40% of
the atomic displacements in the computed acoustical vibrational modes of conalbumin
(Rg = 3.0 nm), whereas it contributes 65% of the atomic
displacements in the computed acoustical vibrational modes of aprotinin (Rg = 1.1
nm).
The fact that the acoustical modes of conalbumin are infrared and Raman active
modes, as shown by the theoretical spectra in Fig. 4, means that the beat signal used
experimentally to excite the single molecule trapped in the nanohole induces two
things: first, a variation of the molecular dipole moment of the biomolecule due to
the excitation of the infrared modes which is an absorption, and second, a variation
of the real part of the molecular electronic polarizability due to the excitation of the
Raman active modes. Therefore, NMA calculations here are helpful to get insights
into experimental mechanisms of excitation in the EAR spectroscopic technique.
Finally, from the theoretical point of view, the corresponding motions of these
active acoustical modes can be depicted in order to understand the mechanisms of
excitation at the atomic level. As shown in Fig. 5a, the global motions described by
the lowest-frequency modes of conalbumin correspond to torsional motions of the N-
and C-lobes along different axes. In order to decipher the origin of the dipole moment
variations due to electric field excitation, we also computed the distribution of the
variation of the molecular dipole moment ρ k (Eq. 27) along the protein sequence.
As shown in Fig. 5b, the largest contribution to ρ k is due almost exclusively to the
positively charged residues, arginine and lysine, which are characterized by longer
side chains than negatively charged residues. A further analysis of the role of these
modes in the biological function of conalbumin will be presented by the authors
elsewhere.

4 Normal Mode Analysis for Studying Conformational Transitions of Proteins

4.1 The Protein Hsp70 as a Model Case

Hsp70 is a major molecular chaperone involved in de novo folding of proteins
in vivo and refolding of proteins in stress conditions [104]. This chaperone increases
the efficiency of protein folding and inhibits the interactions of nascent proteins
extruded from the ribosome with other proteins within the intra-cellular medium
[104–107]. The Hsp70 molecular chaperones are ubiquitous and found in eukaryotes
and prokaryotes [104–107]. Hsp70s of all species share common structural features
and it is hypothesized that they perform their biological functions in a similar manner.
Among the Hsp70s, human Hsp70 (hHsp70) has attracted great interest because
of its demonstrated implication in numerous misfolding diseases [108] (Parkinson's,
Alzheimer's, …) and in cancer [109].
The Hsp70 protein comprises two main domains: an N-terminal nucleotide
binding domain (NBD), divided into four subdomains (IA, IIA, IB, IIB), and a
C-terminal substrate protein binding domain (SBD), divided into two subdomains (an
α-helical lid named SBD-α and a peptide binding pocket named SBD-β, Fig. 2a).
Hsp70 assists the folding of other proteins through cycles of binding and release of
unfolded polypeptide chains in the SBD-β, by selectively binding short peptide
stretches within the chain [104–106]. From a physical point of view, Hsp70 is a
machine performing its biological function through a cycle in which the energy of ATP
hydrolysis is converted into large structural changes (Fig. 6). Binding and release of
peptides are indeed governed by ATP hydrolysis in the NBD and by the exchange of
the products of the hydrolysis (ADP and Pi ) [110, 111], and are promoted by the action
of co-chaperone proteins [112–114].
The position of SBD-α relative to SBD-β defines two main conformational states
of the Hsp70 chaperone [104–107]: either the lid is open and the peptide can access

Fig. 5 a Cartoon representation of the acoustic collective modes extracted from classical spectra
shown in Fig. 4 for modes 1, 2 and 3 of conalbumin. Black arrows represent the direction and the
strength of the atomic displacement vector in the corresponding mode for the Cα atoms. Colored
arrows represent the global motion of each protein in the corresponding mode. Spheres represent the
position of the Cα and the color code corresponds to the strength of the displacement per residue. The
figure was prepared with PyMOL (http://www.pymol.org). b Norm of the dipole moment variation
ρ k along the amino acid sequence of conalbumin for each normal mode k shown in panel (a).
Positively and negatively charged residues are shown with blue and red dots along the sequence,
respectively. Other residues are shown by black dots

the hydrophobic pocket within SBD-β, which we call the “open” conformation
of the chaperone (Fig. 6, conformation a), or the lid is closed and the
peptide is trapped in the pocket, which we refer to as the “closed” conformation
(Fig. 6, conformation b) [40, 41]. In ATP-bound Hsp70 (open structural
state), the SBD is open with fast binding and release of the protein substrate,
and the SBD and NBD are docked, as shown by low-resolution Small-Angle X-ray
Scattering (SAXS) data [115, 116] and suggested by the XRD structure of an
ATP-Hsp110 homologue [112–114]. In the nucleotide-free Hsp70 and in ADP-bound
Hsp70 (closed structural state), the SBD is assumed to be closed with low binding
and release rates of the protein substrate. The SBD and the NBD are undocked and

Fig. 6 Hsp70 chaperone cycle. The color code is the same as in Fig. 1. The main states of Hsp70
are named [A] for the open state, [B] for the closed state and [B*] for the intermediate state after
release of ADP. The directions of motion in the lowest frequency acoustic mode found in all-atom
calculations of the normal modes of hHsp70 in the states [A] and [B] are schematically represented

the inter-domain linker is exposed to solvent, as shown by SAXS data [115, 116]
and by the two-domain Hsp70 NMR derived-structure [117].
Although the main steps of the chaperoning cycle of Hsp70s (Fig. 6) are clearly
identified as described above, the details of the conformational changes and the mech-
anism of communication between the NBD and the SBD remain unclear. Numerical
simulations of the Hsp70 cycle could help to understand the mechanism of commu-
nication between the different subdomains of this rather large (10 nm in the closed
state) molecule. On the one hand, only coarse-grained models using realistic anharmonic
potentials are able to reach the time-scale of the conformational changes (opening of
the SBD and docking of the NBD onto the SBD) [118]. However, such a simulation
method misses the detailed interactions between the nucleotides and the NBD pocket
as well as the possible role of water. On the other hand, all-atom simulations readily
include these effects, but at the expense of long computational times; because of
that, they are still limited to the microsecond time-scale [40, 41], which is far from
the actual time-scale of the conformational changes (millisecond to second).
Another approach is to simulate not the transition itself between the
open and closed states of the Hsp70 chaperone but only its structural fluctuations in
the vicinity of the two main local minima of its free-energy landscape: the nucleotide-
free or ADP-bound Hsp70 (closed state) and the ATP-bound Hsp70 (open state). There
are indeed many indications that the low-frequency modes of proteins correlate
with their functional modes (see Sect. 1), and a normal mode analysis using an all-
atom potential model (including water and the nucleotides) may provide hints about the
mechanism of communication between the NBD and SBD domains of Hsp70 [40].
For comparison, the all-atom low-frequency motions of Hsp70 are
set against those computed with the popular Gaussian elastic model.

4.2 Acoustic Modes of Human Hsp70 and Conformational Transition Between Its Open and Closed States

4.2.1 Methods

As described in detail elsewhere [40], the initial models of human Hsp70 (hHsp70)
in an open state and in a closed state were built by homology modeling based on the
templates Hsp110 (PDB ID: 3C7N chain A) and DnaK (PDB ID: 2KHO), respec-
tively. The models were relaxed by using all-atom MD simulations in explicit water
with the GROMACS software package [58, 102] using the Simple Point Charge
(SPC) water model and the GROMOS96 ffG43a1 force field [119, 120]. The two
hydrated structures of hHsp70 used to compute the normal modes were the repre-
sentative structures extracted from the MD run APO1 of the open model and from
the MD run APO1 of the closed model [40]. Only the first hydration layer of
hHsp70 was kept, corresponding to 939 (open) and 915 (closed) water molecules, all
within 3 Å of the protein atoms, as shown in Fig. 7a.
In the all-atom normal mode analysis, the structure of the protein (including the
first hydration layer) is described by the set of point masses M1 , M2 , …, MN = {Mi }
located at R1 , R2 , …, RN = {Ri }, respectively. Each point mass Mi represents an
atomic mass and all degrees of freedom are taken into account explicitly. In this
case, E in Eq. 16 is simply the all-atom potential energy of the hydrated protein.
The harmonic vibrational modes of the open and closed structures of hHsp70 were
determined using the GROMACS [58, 102] software package and the GROMOS96
ffG43a1 force field [119, 120]. The resulting number of modes is 3N, where N, the
number of atoms, is 6265 for the protein plus 2745 (915 water molecules) and 2817
(939 water molecules) atoms for the solvent, which gives 27,030 and 27,246 modes
for the open and closed structures, respectively. The first six modes, corresponding
to global translation and rotation of the system, are not considered here and so the
index of the modes starts from 7.
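The mode counts quoted above follow from simple bookkeeping, which can be checked with the short sketch below. It assumes three atomic sites per water molecule (as in the SPC model) and takes the atom and water counts from the text; the function names are illustrative only.

```python
ATOMS_PER_WATER = 3   # three-site water model such as SPC (assumption stated above)
PROTEIN_ATOMS = 6265  # number of hHsp70 atoms quoted in the text


def total_modes(n_protein_atoms: int, n_waters: int) -> int:
    """Number of normal modes, 3N, for a protein plus its hydration layer."""
    n_atoms = n_protein_atoms + ATOMS_PER_WATER * n_waters
    return 3 * n_atoms


def first_internal_mode_index(n_rigid_body_modes: int = 6) -> int:
    """Index of the first vibrational mode once global translations and
    rotations (six modes for a non-linear system) are discarded."""
    return n_rigid_body_modes + 1


if __name__ == "__main__":
    for n_waters in (915, 939):  # sizes of the two hydration layers in the text
        print(n_waters, "waters ->", total_modes(PROTEIN_ATOMS, n_waters), "modes")
    print("first vibrational mode index:", first_internal_mode_index())
```

The same arithmetic shows that a subset of 250 low-frequency modes represents less than 1% of the total number of modes in either system.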
We also applied ANM to hHsp70 by using Eq. 34 with Rc = 11 Å [40] and with A
fitted to the Bi factors (Eq. 24) computed from the all-atom normal mode calculations
for the same systems (Fig. 8). The best values of the force constant A reproducing
the all-atom Bi factors were 4.8 and 4.2 kcal/mol/Å² for hHsp70 in the open and closed
states, respectively. With these force constants, ANM reproduces well the
structural fluctuations along the sequence of hHsp70, with a correlation coefficient ρ

Fig. 7 a Cartoon representation of the atomic structures of hydrated hHsp70 protein in the open (left
panel) and closed (right panel) conformations. Water molecules are shown as transparent spheres.
The color code is the following: subdomain IA, blue; IB, marine; IIA, lightblue; IIB, cyan; linker,
magenta; SBD-β, green; SBD-α, red and C-term, gray. These figures were prepared with PyMOL
(https://www.pymol.org). b Density of states D S (ω) of low-frequency vibrations of hHsp70 in the
open (green) and closed (blue) conformations. c Classical absorption spectra P(ω) of hHsp70 for
the open (green) and closed (blue) conformations computed from NMA using the GROMOS96
ffG43a1 force-field. Different damping γ are represented: 0.1 cm−1 (top panel), 1.0 cm−1 (middle
panel) and 10.0 cm−1 (bottom panel)

= 0.86 (open) and 0.89 (closed) between the Bi factors computed in ANM and those
computed with an all-atom force field, as shown in Fig. 8.
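To make the coarse-grained calculation more concrete, the following sketch builds the Hessian of an elastic network in its commonly used anisotropic form: identical Hookean springs of constant A connect all pairs of nodes (typically the Cα atoms) closer than the cutoff Rc. Equation 34 is not reproduced in this section, so this standard form is an assumption, as are the function name and the toy coordinates; only the value Rc = 11 Å is taken from the text.

```python
def anm_hessian(coords, cutoff=11.0, force_constant=1.0):
    """Return the 3N x 3N Hessian (list of lists) of a standard ANM.

    coords: list of (x, y, z) node positions in Angstrom (distinct points).
    cutoff: interaction cutoff Rc in Angstrom.
    force_constant: spring constant A, identical for all pairs.
    """
    n = len(coords)
    hess = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [coords[j][a] - coords[i][a] for a in range(3)]
            dist2 = sum(x * x for x in d)
            if dist2 > cutoff * cutoff:
                continue  # nodes farther apart than Rc do not interact
            for a in range(3):
                for b in range(3):
                    k = force_constant * d[a] * d[b] / dist2
                    # Off-diagonal super-elements carry a minus sign ...
                    hess[3 * i + a][3 * j + b] -= k
                    hess[3 * j + a][3 * i + b] -= k
                    # ... and diagonal ones accumulate the opposite sum.
                    hess[3 * i + a][3 * i + b] += k
                    hess[3 * j + a][3 * j + b] += k
    return hess


if __name__ == "__main__":
    # Toy geometry (hypothetical): three bonded nodes and one isolated node.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 0.0, 20.0)]
    h = anm_hessian(toy, cutoff=11.0, force_constant=4.8)
    print(len(h), "x", len(h[0]), "Hessian built")
```

Diagonalizing such a matrix with any symmetric eigensolver yields six zero eigenvalues for a connected network, followed by the nonzero "frequencies" λ7, λ8, … discussed in the text.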

4.2.2 Acoustical Modes of hHsp70

From aa-NMA calculations, there are 24 and 26 non-zero collective modes below
10 cm−1 for the open and closed conformations, respectively (Fig. 7b). The lowest
non-zero frequency mode of hHsp70, i.e. ω7 , occurs at 2.77 cm−1 for the open conformational
state and at 1.22 cm−1 for the closed conformational state. This difference

Fig. 8 B-factors computed in all-atom normal mode calculations (black line) and in the anisotropic
network model (ANM) (red line) for the open state (panel a) and for the closed state (panel b) of
hHsp70 using Eqs. 24 and 15 in the text, respectively. The values of the constants A (Eq. 33) given
in the figure are the best values giving the highest correlation ρ between the two sets of computed
B factors

of 1.5 cm−1 can be explained by the fact that the closed state has a more elongated
structure than the open state, for which the two domains are docked. Consequently,
the closed state of hHsp70 may support modes of longer wavelength than the open
state of hHsp70 and thus of lower frequency. However, these differences between
the two main conformational states of hHsp70 are not visible in their density of states
D S (ω), as shown in panel b of Fig. 7.
Therefore, at first glance, it does not seem possible to distinguish the two main conformational
states of hHsp70 based solely on the measurement of D S (ω) using for example
inelastic neutron scattering [68]. However, based on the experimental results of EAR
[97], one may expect the acoustical modes to interact with an electric field. Each
acoustical mode should have a different signature in the EAR or in the far-infrared
spectra of hHsp70 depending on its dipole moment. A variation of the molecu-
lar dipole moment at acoustical frequency is expected as shown for conalbumin in
Sect. 3.2 because hHsp70 has a strong dipolar character, as 81 residues are positively
charged and 92 are negatively charged [121].
We computed the classical infrared (absorption) spectra of hHsp70 from aa-NMA
using Eq. 26, as done for conalbumin. In contrast to conalbumin, there are no EAR
experimental data for hHsp70 and we cannot estimate the damping effects. Therefore, we decided
to study the infrared spectra of hHsp70 using different orders of magnitude for the
damping factor γ (Eq. 26), from weakly damped modes (as observed in EAR for conalbumin
[97], i.e. 0.1 cm−1 ) to overdamped modes (as observed in dielectric spectroscopy for
lysozyme [67], i.e. 10 cm−1 ). As explained above for conalbumin, the damping has
a huge impact on both the intensities and the positions of the peaks in the IR spectra.
Figure 7c shows the spectra P(ω) of hHsp70 for the open and closed conformational
states as a function of the value of the damping constant γ . First of all, it is clear
from Fig. 7c that the open and closed conformations show different spectra P(ω),
independently of the value of the damping constant γ and also independently of the
force-field used for the calculations, as detailed elsewhere [121]. In fact, the closed
conformation shows an intense peak at ω = 1.2 cm−1 whereas the same peak is
shifted to ω = 3.5 cm−1 for the open conformation. As expected, an increase of the
damping constant γ from 0.1 to 1.0 cm−1 goes together with an increase of the width
of the peaks and with a decrease of the spectral resolution but does not change the
position of the peaks because all acoustical modes have frequencies larger than γ
(regime of damped modes). By increasing the damping constant from 1.0 to 10 cm−1 ,
another phenomenon is observed in Fig. 7c. There is a shift of the most intense peak
of the closed conformation to a lower frequency, namely 0.3 cm−1 , because the most
intense and lowest frequency modes have frequencies ω7 , ω8 and ω9 that are smaller
than γ (regime of overdamped modes). Finally, as shown in Fig. 7c, even for γ as large
as 10.0 cm−1 , the two conformational states of the protein can be distinguished.
Note that the same conclusions were drawn from the NMA and IR spectra
calculations using the AMBER99sb-ILDN and the CHARMM27 force fields [121].
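The effect of the damping on the peak position can be illustrated with a single damped harmonic oscillator. The sketch below does not reproduce Eq. 26; it assumes a generic lineshape, the imaginary part of the susceptibility of a damped oscillator, γω/[(ω0² − ω²)² + γ²ω²], and locates its maximum on a frequency grid. The function names and grid parameters are illustrative.

```python
def lineshape(omega, omega0, gamma):
    """Imaginary part of a damped-oscillator susceptibility (arbitrary units).

    This generic form is an assumption; it is not Eq. 26 of the chapter.
    """
    return gamma * omega / ((omega0 ** 2 - omega ** 2) ** 2 + (gamma * omega) ** 2)


def peak_position(omega0, gamma, omega_max=10.0, n_points=100001):
    """Frequency (same units as omega0) at which the lineshape is maximal."""
    step = omega_max / (n_points - 1)
    grid = [i * step for i in range(1, n_points)]
    return max(grid, key=lambda w: lineshape(w, omega0, gamma))


if __name__ == "__main__":
    omega0 = 1.22  # lowest mode of the closed state quoted in the text, in cm^-1
    for gamma in (0.1, 1.0, 10.0):  # damping values used in Fig. 7
        print(f"gamma = {gamma:5.1f} cm^-1 -> peak at "
              f"{peak_position(omega0, gamma):.3f} cm^-1")
```

For γ much smaller than ω0 the maximum stays close to ω0, whereas for γ much larger than ω0 it moves towards ω0²/γ, i.e. to a frequency well below ω0. This single-mode picture only reproduces the trend qualitatively, since the computed spectra superpose several modes with different intensities.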

4.3 Functional Modes of hHsp70

4.3.1 Contribution of the Low-Frequency Modes to the Transition Between the Open and Closed States of hHsp70

Among the lowest-frequency acoustical modes, only some are useful to identify the
functionally important motions for the transition between the open and closed states of
the chaperone cycle (Fig. 6). As shown for a few proteins, the collective modes of the
structures in the initial and final states of a conformational change contain information
about the dynamics of the transition [24]. The relevance of the low-frequency modes
for the conformational transition of hHsp70 between its open and closed states can be
quantified by their involvement coefficients (see Sect. 2.6). In brief, a linear pathway
interpolating between the two conformations (open and closed) was built. For each
collective mode k, the projection of the atomic displacements within the mode k
on the interpolated pathway defines the involvement coefficient Ik of the mode (the
maximum value is 1, corresponding to a perfect match between the displacements
of the atoms within the mode and the interpolating pathway, see Eq. 32). The sum of
the squared involvement coefficients of the modes up to an index K is the cumulative
involvement coefficient CIK (see Eq. 33). The coefficients Ik and the cumulative
coefficients CIK for the transition from the open to the closed state (Fig. 9a, b) and
vice versa (Fig. 9c, d) were computed in the coarse-grained approach for the first 100
modes and in the all-atom approach up to 25 cm−1 (corresponding to 250 modes, i.e.
less than 1% of the total number of degrees of freedom).
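The definitions recalled above can be summarized in a few lines of code. The sketch follows the verbal description only (Eqs. 32 and 33 are not reproduced in this section): Ik is taken as the magnitude of the normalized projection of the displacement between the two structures on mode k, and CIK as the running sum of Ik². Mass weighting and the prior superposition of the two structures are ignored, and the function names are illustrative.

```python
import math


def dot(u, v):
    """Scalar product of two vectors given as sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def involvement_coefficients(modes, displacement):
    """I_k = |e_k . d| / (|e_k| |d|) for each mode vector e_k.

    modes: list of 3N-dimensional mode vectors.
    displacement: 3N-dimensional difference between final and initial structures.
    """
    d_norm = math.sqrt(dot(displacement, displacement))
    return [abs(dot(e, displacement)) / (math.sqrt(dot(e, e)) * d_norm)
            for e in modes]


def cumulative_involvement(coefficients):
    """CI_K = sum of I_k**2 over the first K modes, for K = 1, 2, ..."""
    total, out = 0.0, []
    for c in coefficients:
        total += c * c
        out.append(total)
    return out


if __name__ == "__main__":
    # Toy example (hypothetical): three orthonormal "modes" in three dimensions.
    modes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    coeffs = involvement_coefficients(modes, [3.0, 4.0, 0.0])
    print(coeffs, cumulative_involvement(coeffs))
```

For a complete orthonormal set of modes the cumulative coefficient reaches 1, which is why a value of about 0.5 obtained with less than 1% of the modes indicates a strong contribution of the slowest modes.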
In ANM, the cumulative involvement coefficient indicates that the first 10 and the
first 100 slow modes out of the 1917 modes of nonzero frequency of hHsp70
account for 45% (CI10 = 0.45) and 69% (CI100 = 0.69) of the displacement from the
open to the closed state, respectively (Fig. 9a). A similar result is observed for the
reverse transition, from the closed state to the open state (CI10 = 0.60 and CI100 =
0.73, Fig. 9c). This emphasizes the high contribution of the slowest modes to the transition.

Fig. 9 The individual (boxes) and cumulative (full line) involvement coefficients of the modes computed
in the coarse-grained (red) and all-atom (black) normal mode calculations for the transition
open → closed (panels a and b) and closed → open (panels c and d) of hHsp70

In addition, the mode contributing the most to the open → closed transition
is the mode having the lowest nonzero “frequency” λ7 , which has an involvement
coefficient I7 = 0.62 whereas, for the transition closed → open, the mode λ12 (I12 =
0.57) is the mode which contributes the most to the transition (Fig. 9c).
In the all-atom NMA calculation, the cumulative coefficient CIK for the
open → closed transition (Fig. 9b) and for the closed → open transition (Fig. 9d)
reaches about 0.35 at 5 cm−1 and increases linearly at higher frequency to reach
about 0.50 at 25 cm−1 . For the transition open → closed, the lowest nonzero frequency
mode ω7 = 2.78 cm−1 (83.4 GHz) has a large involvement coefficient I7 =
0.49 (Fig. 9b), whereas, for the transition closed → open, the mode ω11 = 3.22 cm−1
(96.6 GHz) has the largest contribution (I11 = 0.39, Fig. 9d). In addition, the modes
ω7 = 1.22 cm−1 (36.6 GHz) and ω13 = 4.19 cm−1 (125.7 GHz) have involvement
coefficients significantly larger than those of the other modes (Fig. 9d), with I7 =
0.26 and I13 = 0.31, respectively.

4.3.2 Functional Modes for the Transition Open → Closed

Functional Mode with the Highest Contribution for the Transition Open → Closed: Comparison Between All-Atom and Coarse-Grained NMA

First, we compare the collective mode of the open conformation of hHsp70 having the
largest contribution in the coarse-grained and in the all-atom NMA for the transition
from the open to the closed state, i.e. the lowest nonzero frequency modes λ7 and
ω7 (Fig. 9a, b). The global motion described by the lowest-frequency mode is the
same in ANM and in the all-atom calculation, i.e. it corresponds to the closure of the
SBD (Fig. 10a, b). Indeed, in the mode λ7 computed in ANM, the SBD is the most
mobile part, whereas the NBD moves as a rigid unit (Fig. 10a). In the global motion
described by the modes λ7 and ω7 , the helix A of the SBD-α serves as a hinge region
around which the SBD-β and the rest of the SBD-α (helices B + C + D) move toward each other
(Fig. 10a, b). The SBD-β and SBD-α move in opposite directions in
both all-atom and ANM calculations (Fig. 10a, b) [40].

Functional Mode as a Linear Combination of a Subset of All-Atom Normal Modes Weighted by Their Contribution

A linear combination of a subset of low-frequency all-atom normal modes, weighted
by their involvement coefficients, is more realistic for describing a given conformational
change between two structural states [24, 122], because modes other
than the mode ω7 contribute significantly to the transition open → closed (such as
ω11 , with I11 = 0.22, Fig. 9b). This superposition of modes is named the Involvement
Coefficient weighted mode (ICw):

$$e_i^{\mathrm{ICw}} \equiv \sum_{k,\; \omega_k < 25\ \mathrm{cm}^{-1}} I_k \, e_i^{k}. \qquad (41)$$
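Equation 41 can be sketched as follows. The weights are computed here as signed projections of the unit displacement between the two structures on unit-norm modes, so that the arbitrary sign of each eigenvector cancels in the product Ik e^k; this sign convention, the final normalization and the function name are assumptions made for the purpose of illustration.

```python
import math


def icw_mode(modes, frequencies, displacement, cutoff=25.0):
    """Involvement-coefficient-weighted (ICw) combination of low-frequency modes.

    modes: list of unit-norm 3N-dimensional mode vectors e^k.
    frequencies: mode frequencies omega_k in cm^-1, in the same order.
    displacement: 3N-dimensional difference between the two structures.
    cutoff: only modes with omega_k < cutoff (cm^-1) enter the sum.
    """
    d_norm = math.sqrt(sum(x * x for x in displacement))
    combined = [0.0] * len(displacement)
    for e, omega in zip(modes, frequencies):
        if omega >= cutoff:
            continue  # restrict the sum to omega_k < cutoff, as in Eq. 41
        # Signed projection on mode k (assumed sign convention, see lead-in).
        i_k = sum(a * b for a, b in zip(e, displacement)) / d_norm
        for idx, comp in enumerate(e):
            combined[idx] += i_k * comp
    norm = math.sqrt(sum(x * x for x in combined))
    # Normalize for display purposes; an all-zero result is returned unchanged.
    return [x / norm for x in combined] if norm > 0.0 else combined


if __name__ == "__main__":
    # Toy example (hypothetical): the third mode lies above the cutoff.
    modes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(icw_mode(modes, [2.8, 11.0, 30.0], [3.0, -4.0, 12.0]))
```

In this toy case the ICw vector reproduces the part of the displacement that lies in the subspace of the retained modes, which is the sense in which the ICw mode is the best harmonic description of the transition within that subspace.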

Because the superposition of the modes within the frequency range 0–25 cm−1
(1% of the total number of modes of hHsp70) captures more than 50% of the structural
change (Fig. 9b), we decided to build the ICw mode from the all-atom normal modes
with ω < 25 cm−1 . In all-atom NMA, the ICw mode, which best describes the transition
from the open to the closed state in the harmonic approximation, corresponds to a
global motion of the two parts of the SBD moving in opposite directions and mimicking
the closure of the SBD (Fig. 11a). In addition, deformations of the subdomains
IB, IIA and IIB of the NBD were observed: they modify the structure of the NBD.
Indeed, the subdomain IB tends to follow the motion of the SBD-β. Considering the
fact that the SBD-β is very close to the subdomain IB of the NBD and the fact that
the SBD-α is bound to the lobe I of the NBD in the open conformation of hHsp70, it
seems logical that a rearrangement of the lobe I of the NBD must be coupled to the
undocking of the SBD from the NBD. In the lobe II of the NBD, the motion in the
subdomain IIA tends to modify the surface binding cleft between the subdomains IA

Fig. 10 a Graphical representation of the collective mode λ7 of the open state of hHsp70 computed
from ANM. b Graphical representation of the collective mode ω7 of the open state of hHsp70
computed from all-atom method. c Graphical representation of the collective mode λ12 of the
closed state of hHsp70 computed from ANM. d Graphical representation of the collective mode
ω11 of the closed state of hHsp70 computed from all-atom method. Eigenvectors are represented
by gray arrows and black arrows represent the sum of the eigenvectors of the residues belonging
to the same subdomain, i.e. NBD-IA, IB, IIA, IIB, linker, SBD-β, SBD-α and C-terminal. Black
spheres represent the center of mass of each subdomain. The color code is the same as in Fig. 1.
The panels a and b (c and d) correspond to the same view. The figure was prepared with PyMOL
(http://www.pymol.org)

and IIA. It has been demonstrated that this cleft IA/IIA is crucial for conformational
dynamics of Hsp70 [123].

Fig. 11 a Representation of the collective mode ICw of hHsp70 for the transition open → closed
computed from all-atom NMA. b Representation of the collective mode ICw of hHsp70 for the
transition closed → open computed from all-atom NMA. The representation properties are the
same as in Fig. 10. Panels a and b were prepared with PyMOL (http://www.pymol.org)

4.3.3 Functional Modes for the Transition Closed → Open

Functional Mode with the Highest Contribution for the Transition Closed → Open: Comparison Between All-Atom and Coarse-Grained NMA

In ANM, as shown in Fig. 10c, the mode contributing the most to the transition
closed → open is the mode λ12, whereas in the all-atom calculation it is the mode ω11
(Fig. 9d). The global motion described by the mode λ12 corresponds to a
compression/elongation which restricts the mobility of the linker (Fig. 10c). This mode
does not correspond to a direct opening of the lid, although there are important
fluctuations in the SBD. The motion in the SBD corresponds to a sliding motion of the lid
coupled to a sideward motion of the C-terminal part (Fig. 10c), with the SBD-β tending to
perform an upward motion. In the NBD, a rotation of the subdomains IIA and IIB is observed.
In the all-atom calculation, the global motion described by the mode ω11 also corresponds
to a compression/elongation of the structure, as observed in ANM. In addition, the same
motion is observed within the SBD, i.e. a sliding motion of the lid coupled with a sideward
motion of the C-terminal part, with the SBD-β performing an upward motion (Fig. 10d). In
the NBD, we observed a rigid rotation of the complete lobe II. In the all-atom NMA, the
fluctuations are less distributed over the whole structure than in ANM (Fig. 10c). The
all-atom calculations confirm the dynamical coupling between the lobe II of the NBD and
the SBD-α observed in ANM [40]. This provides a natural mechanism of communication between
the C-terminal part of the protein and its N-terminal part, which are about 10 nm apart in
the closed state.

Functional Mode as a Linear Combination of a Subset of All-Atom Normal Modes
Weighted by Their Contribution

In all-atom normal mode analysis, the mode ICw which best describes the transition from
the closed to the open state corresponds to a coordinated motion of the NBD and of the SBD
(Fig. 11b). The global motion observed in the ICw mode for the transition closed → open is
due to a large displacement of the SBD-α (Fig. 11b), which tends to open the SBD (the SBD-β
and the SBD-α move in different directions, Fig. 11b) and which is coupled to large motions
of the lobe II and of the subdomain IB of the NBD (Fig. 11b). The deformation of the NBD in
the ICw mode corresponds to a rotation of the lobe II of the NBD: the subdomain IA is rather
immobile and the subdomain IB tends to follow the motion of the subdomain IIA (Fig. 11b).
The deformation of the SBD in the ICw mode corresponds to an upward motion of the SBD-β as
well as a sideward motion of the lid (Fig. 11b).
As observed for the transition open → closed, the motions within the NBD are
coupled to the motions of the SBD, establishing a communication channel between
the NBD and the SBD. It is very interesting to observe that the rotation of the
subdomain IIB is coupled to the motion of the SBD as observed in the ANM calcu-
lation [40]. In addition, rotation of the subdomain IIB was shown by NMR to be an
important characteristic of the structural changes induced by a nucleotide-exchange
co-chaperone and by the replacement of ADP by ATP in the NBD structures [113,
124, 125].
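
The construction of an ICw mode can be illustrated with a small numerical sketch. It assumes the common definition of the involvement coefficient as the projection of each normalized mode on the normalized displacement between two superimposed structures, and it ignores mass weighting. The function names and the toy three-coordinate system are hypothetical and are not taken from the calculations reported in this chapter.

```python
import math


def dot(a, b):
    """Scalar product of two equally long coordinate vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a coordinate vector."""
    return math.sqrt(dot(a, a))


def involvement_coefficients(modes, x_start, x_end):
    """Project every mode on the displacement x_end - x_start.

    Assumption: both structures are already superimposed and each mode is a
    flat list of Cartesian components with the same ordering as the coordinates.
    """
    delta = [e - s for s, e in zip(x_start, x_end)]
    d = norm(delta)
    return [dot(m, delta) / (norm(m) * d) for m in modes]


def icw_mode(modes, coeffs):
    """Sum of unit-normalized modes weighted by their involvement coefficients."""
    size = len(modes[0])
    combo = [0.0] * size
    for mode, c in zip(modes, coeffs):
        mode_norm = norm(mode)
        for i in range(size):
            combo[i] += c * mode[i] / mode_norm
    combo_norm = norm(combo)
    return [v / combo_norm for v in combo]


# Toy system: two orthogonal "modes" and a displacement lying in their plane.
toy_modes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x_closed = [0.0, 0.0, 0.0]
x_open = [3.0, 4.0, 0.0]

coeffs = involvement_coefficients(toy_modes, x_closed, x_open)
icw = icw_mode(toy_modes, coeffs)
print(coeffs, icw)
```

In this toy case the two modes span the displacement completely, so the weighted combination is collinear with it. For a real protein only a subset of low-frequency modes is summed, and the combination reproduces the displacement only approximately.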

4.3.4 Functional Modes and Bose-Einstein Statistics

Involvement coefficients and the ICw modes make it possible to define a few collective
coordinates describing the structural changes between two states of hHsp70. For this
purpose, the weight attributed to each low-frequency mode is its involvement coefficient,
so as to construct the best collective mode interpolating linearly between the geometries
of the two states of the molecule [24]. This standard approach does not consider the
statistics, i.e. the probability of finding the protein in a given mode at a given
temperature, which we explore here. At equilibrium, in the harmonic approximation, the
population of a mode of frequency ω is given by the Bose-Einstein statistics n_BE(ω):
$$n_{BE}(\omega) = \frac{1}{\exp\left(\frac{\hbar\omega}{kT}\right) - 1}, \tag{35}$$

where ħ is the reduced Planck constant, k is the Boltzmann constant and T is the temperature.
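
As a rough numerical illustration of the Bose-Einstein population, the sketch below evaluates it for modes whose frequencies are given as wavenumbers in cm⁻¹, using ħω = hcν̃. The temperature of 300 K and the three sample wavenumbers are assumptions chosen for illustration only; the function name is hypothetical.

```python
import math

# Second radiation constant hc/k in cm*K, used to convert a wavenumber
# (cm^-1) and a temperature (K) into the dimensionless ratio hbar*omega/kT.
C2_CM_K = 1.438776877


def bose_einstein(wavenumber_cm, temperature_k=300.0):
    """Bose-Einstein population of a harmonic mode at the given temperature."""
    x = C2_CM_K * wavenumber_cm / temperature_k
    # expm1(x) = exp(x) - 1, numerically stable for very low frequencies.
    return 1.0 / math.expm1(x)


populations = {nu: bose_einstein(nu) for nu in (5.0, 30.0, 100.0)}
for nu, n in populations.items():
    print(f"{nu:6.1f} cm^-1  n_BE = {n:.3f}")
```

The population decreases steeply with frequency, which is why weighting by n_BE(ω) emphasizes the lowest-frequency modes of the spectrum.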


In order to evaluate the importance of the statistics for the low-frequency structural
fluctuations of hHsp70, we built a superposition of all the normal modes of the protein
for ω < 100 cm⁻¹, where each mode was weighted by its population n_BE(ω) (Eq. 35).

$$e_i^{BEw} \equiv \sum_{k,\ \omega_k < 100\ \mathrm{cm}^{-1}} n_{BE}(\omega_k)\, e_i^{k}. \tag{36}$$
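
The superposition of modes weighted by their Bose-Einstein populations can be sketched as follows. This is a minimal illustration under stated assumptions: frequencies are given as wavenumbers in cm⁻¹, the temperature is set to 300 K, zero-frequency (rigid-body) modes are skipped because the population diverges there, and the three toy modes are hypothetical.

```python
import math

C2_CM_K = 1.438776877  # hc/k in cm*K


def bose_einstein(wavenumber_cm, temperature_k=300.0):
    """Bose-Einstein population of a mode given its wavenumber in cm^-1."""
    return 1.0 / math.expm1(C2_CM_K * wavenumber_cm / temperature_k)


def bew_mode(wavenumbers_cm, modes, temperature_k=300.0, cutoff_cm=100.0):
    """Sum the modes below the cutoff, each weighted by its population.

    Assumption: modes[k] is a flat list of Cartesian components of mode k
    and wavenumbers_cm[k] is its frequency in cm^-1.
    """
    size = len(modes[0])
    combo = [0.0] * size
    for nu, mode in zip(wavenumbers_cm, modes):
        if 0.0 < nu < cutoff_cm:  # skip rigid-body and high-frequency modes
            weight = bose_einstein(nu, temperature_k)
            for i in range(size):
                combo[i] += weight * mode[i]
    return combo


# Toy spectrum: two modes below the cutoff and one above it.
toy_wavenumbers = [10.0, 50.0, 150.0]
toy_modes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

bew = bew_mode(toy_wavenumbers, toy_modes)
print(bew)
```

The mode above the cutoff does not contribute, and the lowest-frequency mode dominates the sum because of its larger population.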

We compared this superposition of modes (which we named the Bose-Einstein weighted
mode, BEw) for the two structural states (open and closed) with the superposition
of modes built from the linear pathway between the open and closed states (the ICw
mode).
In the open state, the overlap between the BEw and the ICw vectors is significant
(0.64) indicating that the motions are rather similar in both calculations. The overlap
computed between the eigenvectors projected on each subdomain of the protein
(Fig. 12c) demonstrates that the similarity between the BEw and ICw modes is the
smallest for the N-terminal part of subdomain IA (residues 1–39) and for the SBD-α
(residues 508–615). The highest similarity of the subdomain motions was found for
the subdomain IIB of the NBD. In summary, for the open state of hHsp70, weighting
the modes by using the Bose-Einstein statistics does not modify significantly the
description of the structural fluctuations facilitating the transition between the two
structural states although the amplitudes of motions of the SBD are reduced in the
BEw mode compared with the ICw mode, as shown in Fig. 12a.
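
The global and per-subdomain overlaps quoted here can be illustrated with a short sketch. It assumes that the overlap is the absolute value of the normalized scalar product of the two vectors, and that each residue is represented by one site with three Cartesian components. Both are assumptions made for illustration, and the toy vectors and residue ranges are hypothetical.

```python
import math


def overlap(a, b):
    """Absolute normalized scalar product of two vectors (1 = collinear)."""
    scalar = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(scalar) / (norm_a * norm_b)


def subdomain_overlap(a, b, first_residue, last_residue):
    """Overlap restricted to residues first_residue..last_residue (1-based, inclusive).

    Assumption: three consecutive components per residue in both vectors.
    """
    start = 3 * (first_residue - 1)
    stop = 3 * last_residue
    return overlap(a[start:stop], b[start:stop])


# Toy vectors for a three-residue chain.
mode_a = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
mode_b = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]

global_overlap = overlap(mode_a, mode_b)
first_residue_overlap = subdomain_overlap(mode_a, mode_b, 1, 1)
last_two_overlap = subdomain_overlap(mode_a, mode_b, 2, 3)
print(global_overlap, first_residue_overlap, last_two_overlap)
```

As in the comparison of the BEw and ICw modes, a moderate global overlap can hide subdomains whose motions are nearly identical alongside others that differ markedly.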
For the closed state of hHsp70, the BEw mode is quite similar to the ICw mode, as
shown in Fig. 12b (the global overlap is 0.66). The overlap between the two vectors
projected on the SBD-β and SBD-α is around 0.65 (Fig. 12d). In the NBD, there are
some large differences between the two modes for the residues located in the
subdomains IB (residues 40–115) and IIB (residues 229–306, Fig. 12d). However, the
overlap between the eigenvectors ICw and BEw is quite high for these two subdomains,
being respectively 0.8 (IB) and 0.7 (IIB, Fig. 12d). The largest overlap between the
ICw and the BEw mode is observed for the linker, whereas the smallest overlap is
observed for the subdomain IIA of the NBD (Fig. 12d).
In summary, the global motion described by the BEw mode of the closed structure
is similar to the one described by the ICw mode. This property is important because,
for several proteins which show large conformational changes, only one conformation
has been solved experimentally, so that the involvement coefficient method cannot be
used. In such cases, the BEw mode could be a good substitute for the ICw mode.

5 Concluding Remarks

The low-frequency modes of proteins have been studied for about four decades.
Experimentally, low-frequency vibrational modes of proteins were measured by neutron
scattering, Raman spectroscopy and far-infrared spectroscopy. The first well-resolved
acoustical modes of several proteins at frequencies as low as 0–3.3 cm⁻¹ (0–100 GHz)
were detected very recently by using a nanobiosensing device: Extraordinary Acoustic
Raman (EAR) spectroscopy. In EAR, a single molecule is trapped and excited by a
low-frequency electric field. From all-atom normal mode calculations applied to
conalbumin, we demonstrated that the detected modes are both IR and Raman active, and we
identified the type of motions and the origin of the mechanism: they are large-scale
torsional vibrational modes producing a significant local variation of the molecular
dipole moment due to the motions of the charged residues with the longest side chains,
i.e. arginine and lysine.

[Fig. 12 panels: (a) open hHsp70, mode BEw vs. mode ICw; (b) closed hHsp70, mode BEw vs. mode ICw; (c) and (d) as described in the caption]

Fig. 12 Graphical representation of the functional mode BEw of hHsp70 in the open (panel a) and
closed (panel b) states. The sums of the eigenvectors per subdomain of the BEw and ICw modes are
represented by black and gray arrows, respectively. Overlap of the eigenvectors of the BEw and
ICw modes computed for each subdomain for the open (panel c) and for the closed (panel d) states
of hHsp70. The color code is the same as in Fig. 1. Panels a and b were prepared with PyMOL
(http://www.pymol.org)

In human Hsp70, the modes at frequencies below 30 cm⁻¹ contribute the most to
the transition between its two structural states. In fact, only a few modes are enough to
describe the motions of the protein which are the most collinear to a simplified
interpolated pathway between its two structural states. These findings are in agreement
with what has been observed so far by normal mode analysis of conformational changes in
proteins. The contribution of a low-frequency mode to a conformational transition is
generally measured by its involvement coefficient. This concept does not take into
account the probability of excitation of the mode at thermal equilibrium.
We introduced a superposition of the low-frequency modes of a protein weighted by
the Bose-Einstein statistics and have shown that this superposition covers
the same functional motions as the usual superposition of modes weighted by the
involvement coefficients. In the case of human Hsp70, using coarse-grained and
all-atom calculations, we found that the lowest non-zero frequency mode is delocalized
over the whole protein. This mode provides a means of communication between the NBD
and the SBD, which are separated by a distance as large as about 10 nm.
The combination of simulations presented here and the development of new (nonlinear)
spectroscopies dedicated to proteins should bring a revival of the normal mode
calculations of proteins of therapeutic interest. Interpretation of the vibrational
protein spectra, in particular in the frequency region including the functional modes and
high-frequency modes (up to 1000 cm⁻¹), should help to identify interesting protein
intermediates and conformational pathways. We hope that the present results will
stimulate new experimental investigations of acoustical modes of biomolecules such
as proteins, protein nanomachines and viruses.

References

1. Benedek, G., Ellis, J., Reichmuth, A., Ruggerone, P., Schief, H., Toennies, J.P.: Organ-pipe
modes of sodium epitaxial multilayers on Cu(001) observed by inelastic helium-atom scat-
tering. Phys. Rev. Lett. 69, 2951–2954 (1992)
2. Senet, P., Lambin, P., Lucas, A.A.: Standing-wave optical phonons confined in ultrathin over-
layers of ionic materials. Phys. Rev. Lett. 74, 570–573 (1995)
3. de Gennes, P.G., Papoular, M.: Polarisation, matière et rayonnement. In: Volume in honor of
Alfred Kastler, Presse Univ Fr, Paris (1969)
4. Gō, N.: Shape of the conformational energy surface near the global minimum and low-
frequency vibrations in the native conformation of globular proteins. Biopolymers 17,
1373–1379 (1977)
5. Peticolas, W.L., Dowley, M.W.: Acoustical phonon spectra of biological polymers. Nature
212, 400–401 (1966)
6. Keskin, O., Jernigan, R.L., Bahar, I.: Proteins with similar architecture exhibit similar large-
scale dynamic behavior. Biophys. J. 78, 2093–2106 (2000)
7. Atilgan, A.R., Durell, S.R., Jernigan, R.L., Demirel, M.C., Keskin, O., Bahar, I.: Anisotropy
of fluctuation dynamics of proteins with an elastic network. Biophys. J. 80, 505–515 (2001)
8. Lamb, H.: On the vibration of an elastic sphere. Proc. London Math. Soc. 13, 189–212 (1881)
9. Koizumi, H., Tachibana, M., Kojima, K.: Elastic constants in tetragonal hen egg-white
lysozyme crystals containing large amount of water. Phys. Rev. E 79, 061917 (2009)
10. Bellissent-Funel, M.-C., Teixeira, J., Chen, S.H., Dorner, B., Middendorf, H.D., Crespi, H.L.:
Low-frequency collective mode in dry and hydrated proteins. Biophys. J. 56, 713–716 (1989)
11. Edwards, C., Palmer, S.B., Emsley, P., Helliwell, J.R., Glover, I.D., Harris, G.W., Moss, D.S.:
Thermal motion in protein crystals estimated using laser-generated ultrasound and Young’s
modulus measurements. Acta Cryst. A 46, 315–320 (1990)
12. Tachibana, M., Kojima, K., Ikuyama, R., Kobayashi, Y., Ataka, M.: Sound velocity and
dynamic elastic constants of lysozyme single crystals. Chem. Phys. Lett. 332, 259–264 (2000)
13. McCammon, J.A., Gelin, B.R., Karplus, M.: The hinge-bending mode in lysozyme. Nature
262, 325–326 (1976)
14. Gō, N., Noguti, T., Nishikawa, T.: Dynamics of a small globular protein in terms of low-
frequency vibrational modes. Proc. Natl. Acad. Sci. U S A 80, 3696–3700 (1983)
15. Brooks, B., Karplus, M.: Harmonic dynamics of proteins: normal mode and fluctuations in
bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. U S A 80, 6571–6575 (1983)
16. Levitt, M., Sander, C., Stern, P.S.: Protein normal mode dynamics: trypsin inhibitor, crambin,
ribonuclease and lysozyme. J. Mol. Biol. 181, 423–447 (1985)
17. Brooks, B., Karplus, M.: Normal modes for specific motions of macromolecules: application
to hinge-bending mode of lysozyme. Proc. Natl. Acad. Sci. U S A 82, 4995–4999 (1985)
18. Dykeman, E.C., Sankey, O.F.: Normal mode analysis and applications in biological physics.
J. Phys.: Condens. Matter 22, 423202 (2010)
19. Hayward, S., Berendsen, H.J.C.: Systematic analysis of domain motions in proteins from
conformational change: New results on citrate synthase and T4 lysozyme. Proteins 30, 144–154
(1998)
20. Gerstein, M., Lesk, A.M., Chothia, C.: Structural mechanisms for domain movements in
proteins. Biochemistry 33, 6739–6748 (1994)
21. Gerstein, M., Krebs, W.A.: A database of macromolecular motions. Nucleic Acids Res. 26,
4280–4290 (1998)
22. Gō, M., Gō, N.: Fluctuations of alpha-helix. Biopolymers 15, 1119–1127 (1976)
23. Gō, N., Scheraga, H.A.: Analysis of the contribution of internal vibrations to the statistical
weights of equilibrium conformations of macromolecules. J. Chem. Phys. 51, 4751–4767
(1969)
24. Cui, Q., Li, G., Ma, J., Karplus, M.: A normal mode analysis of structural plasticity in the
biomolecular motor F1-ATPase. J. Mol. Biol. 340, 345–372 (2004)
25. Gaillard, T., Dejaegere, A., Stote, R.H.: Dynamics of beta3 integrin I-like and hybrid domains:
insight from simulations on the mechanism of transition between open and closed forms.
Proteins 76, 977–994 (2009)
26. McCammon, J.A.: Protein dynamics. Rep. Prog. Phys. 47, 1–46 (1984)
27. Bennett, W.S., Huber, R.: Structural and functional aspects of domain motions in proteins.
CRC Crit. Rev. Biochem. 15, 291–384 (1984)
28. Karplus, M., Petsko, G.A.: Molecular dynamics simulation in biology. Nature 347, 631–639
(1990)
29. Berendsen, H.J.C., Hayward, S.: Collective protein dynamics in relation to function. Curr.
Opin. Struct. Biol. 10, 165–169 (2000)
30. Tama, F., Sanejouand, Y.H.: Conformational change of proteins arising from normal mode
calculations. Protein Eng. 14, 1–6 (2001)
31. Rod, T.H., Radkiewicz, J.L., Brooks, C.L.: Correlated motion and effect of distal mutations
in dihydrofolate reductase. Proc. Natl. Acad. Sci. U S A 100, 6980–6985 (2003)
32. Tobi, D., Bahar, I.: Structural changes involved in protein binding correlate with intrinsic
motions of proteins in the unbound state. Proc. Natl. Acad. Sci. U S A 102, 18908–18913
(2005)
33. Dobbins, S.E., Lesk, V.I., Sternberg, M.J.E.: Insights into protein flexibility: the relationship
between normal modes and conformational change upon protein-protein docking. Proc. Natl.
Acad. Sci. U S A 105, 10390–10395 (2008)
34. Bakan, A., Bahar, I.: The intrinsic dynamics of enzymes plays a dominant role in determining
the structural changes induced upon inhibitor binding. Proc. Natl. Acad. Sci. U S A 106,
14349–14354 (2009)
35. Benkovic, S.J., Hammes-Schiffer, S.: Enzyme motions inside and out. Science 312, 208–209
(2006)
36. Nashine, V.C., Hammes-Schiffer, S., Benkovic, S.J.: Coupled motions in enzyme catalysis.
Curr. Opin. Chem. Biol. 14, 644–651 (2010)
37. Henzler-Wildman, K., Kern, D.: Dynamic personalities of proteins. Nature 450, 964–971
(2007)
38. Zwier, M.C., Chong, L.T.: Reaching biological timescales with all-atom molecular dynamics
simulations. Curr. Opin. Pharm. 10, 745–752 (2010)
39. Nicolaï, A., Delarue, P., Senet, P.: Theoretical insights into sub-terahertz acoustic vibrations
of proteins measured in single molecule experiments. J. Phys. Chem. Lett. 24(7), 5128–5136
(2016)
40. Nicolaï, A., Senet, P., Delarue, P., Ripoll, D.R.: Human inducible Hsp70: structures, dynamics,
and interdomain communication from all-atom molecular dynamics simulations. J. Chem.
Theory Comput. 6, 2501–2519 (2010)
41. Nicolaï, A., Senet, P., Delarue, P.: Conformational dynamics of full-length inducible human
Hsp70 derived from microsecond molecular dynamics simulations in explicit solvent. J.
Biomol. Struct. Dyn. (2012) (in press)
42. Noguti, T., Gō, N.: Structural basis of hierarchical multiple substates of a protein. IV:
rearrangements in atom packing and local deformations. Proteins 5, 125–131 (1989)
43. Hayward, S., Kitao, A., Gō, N.: Harmonicity and anharmonicity in protein dynamics: a normal
mode analysis and principal component analysis. Proteins 23, 177–186 (1995)
44. Ma, J., Karplus, M.: Ligand-induced conformational changes in ras p21, a normal mode and
energy minimization analysis. J. Mol. Biol. 274, 114–131 (1997)
45. Ma, J., Karplus, M.: The allosteric mechanism of the chaperone GroEL: a dynamic analysis.
Proc. Natl. Acad. Sci. U S A 95, 8502–8507 (1998)
46. Gaillard, T., Martin, E., San Sebastian, E., Cossio, F.P., Lopez, X., Dejaegere, A., Stote, R.H.:
Comparative normal mode analysis of LFA-1 integrin I-domains. J. Mol. Biol. 374, 231–249
(2007)
47. Houdusse, A., Karplus, M., Cecchini, M.: Allosteric communication in myosin V: from small
conformational changes to large directed movements. PLoS Comput. Biol. 4(8), e1000129
(2008)
48. Durand, P., Trinquier, G., Sanejouand, Y.: New approach for determining low-frequency
normal modes in macromolecules. Biopolymers 34, 759–771 (1994)
49. Tama, F., Gadea, F.X., Marques, O., Sanejouand, Y.H.: Building-block approach for deter-
mining low-frequency normal modes of macromolecules. Proteins 41, 1–7 (2000)
50. Tirion, M.M.: Low-amplitude elastic motions in proteins from a single-parameter atomic
analysis. Phys. Rev. Lett. 77, 1905–1908 (1996)
51. Hinsen, K.: Analysis of domain motions by approximate normal mode calculations. Proteins
33, 417–429 (1998)
52. Bahar, I., Atilgan, A.R., Erman, B.: Direct evaluation of thermal fluctuations in proteins using
a single-parameter harmonic potential. Fold Des. 2, 173–181 (1997)
53. Navizet, I., Lavery, R., Jernigan, R.L.: Myosin flexibility: structural domains and collective
vibrations. Proteins 54, 384–393 (2004)
54. Bahar, I., Rader, A.J.: Coarse-grained normal mode analysis in structural biology. Curr. Opin.
Struct. Biol. 15, 586–592 (2005)
55. Yang, L., Song, G., Jernigan, R.L.: How well we can understand large-scale protein motions
using normal modes of elastic network model. Biophys. J. 83, 1620–1630 (2007)
56. Ferraro, J.R.: Introductory Raman Spectroscopy, 2nd edn. Academic Press, Boston, Amster-
dam (2002)
57. Krishtal, A., Senet, P., Van Alsenoy, C.: Local softness, softness dipole, and polarizabilities
of functional groups: application to the side chains of the 20 amino acids. J. Chem. Phys. 131,
044312 (2009)
58. Kutzner, C., Van der Spoel, D., Lindahl, E., Hess, B.: GROMACS 4: algorithms for highly
efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4,
435–447 (2008)
59. Born, M., Huang, K.: Dynamical theory of crystal lattice. In: Texts in the Physical Sciences.
Oxford Classic (1998)
60. Frauenfelder, H.F., Parak, F., Young, R.D.: Conformational substates in proteins. Ann. Rev.
Biophys. Chem. 17, 451–479 (1988)
61. Senet, P., Maisuradze, G.G., Foulie, C., Delarue, P., Scheraga, H.A.: How main-chain of
proteins explore the free-energy landscape in native states. Proc. Natl. Acad. Sci. U S A 105,
19708–19713 (2008)
62. Cote, Y., Senet, P., Delarue, P., Maisuradze, G.G., Scheraga, H.A.: Anomalous diffusion and
dynamical correlation between the side chains and the main chain of proteins in their native
states. Proc. Natl. Acad. Sci. U S A 109, 10346–10351 (2012)
63. Cote, Y., Senet, P., Delarue, P., Maisuradze, G.G., Scheraga, H.A.: Nonexponential decay of
internal rotation correlation functions of native proteins and self-similar structural fluctuations.
Proc. Natl. Acad. Sci. U S A 107, 19844–19849 (2010)
64. Kitao, A., Hayward, S., Go, N.: Energy-landscape of a native protein: jumping-among-minima
model. Proteins 33, 496–517 (1998)
65. Wales, D.: Energy Landscapes. Cambridge University Press, Cambridge (2003)
66. Kitao, A., Go, N.: Investigating protein dynamics in collective coordinate space. Curr. Opin.
Struct. Biol. 9, 164–169 (1999)
67. Vinh, N.Q., Allen, S.J., Plaxco, K.W.: Dielectric spectroscopy of proteins as a quantitative
experimental test of computational models of their low-frequency harmonic motions. J. Am.
Chem. Soc. 133, 8942–8947 (2011)
68. Middendorf, H.D.: Biophysical applications of quasi-elastic and inelastic neutron scattering.
Ann. Rev. Biophys. Bioeng. 13, 425–451 (1984)
69. Middendorf, H.D., Hayward, R.L., Parker, S.F., Bradshaw, J., Miller, A.: Vibrational neutron
spectroscopy of collagen and model polypeptides. Biophys. J. 69, 660–673 (1995)
70. Harney, T., James, D., Miller, A., White, J.W.: Phonons and the elastic moduli of collagen
and muscle. Nature 267, 285–287 (1977)
71. Cusack, S., Doster, W.: Temperature dependence of the low-frequency dynamics of myoglobin.
Measurement of the vibrational frequency distribution by inelastic neutron scattering.
Biophys. J. 58, 243–251 (1990)
72. Berney, C.V., Renugopalakrishnan, V., Bhatnagar, R.S.: Collagen. An inelastic neutron-
scattering study of low-frequency vibrational modes. Biophys. J. 52, 343–345 (1987)
73. Bartunik, H.D.: Intramolecular low-frequency vibrations in lysozyme by neutron time-of-
flight spectroscopy. Biopolymers 21, 43–50 (1982)
74. Middendorf, H.D.: Neutron studies of the dynamics of globular proteins. Phys. B 182, 415–420
(1992)
75. Diehl, M., Doster, W., Petry, W., Schober, H.: Water-coupled low-frequency modes of myo-
globin and lysozyme observed by inelastic neutron scattering. Biophys. J. 73, 2726–2732
(1997)
76. Lushnikov, S.G., Svaindze, A.V., Sashin, I.L.: Vibrational density of states of hen egg white
lysozyme. JETP Lett. 82, 31–35 (2005)
77. Paciaroni, A., Orecchini, A., Haertlein, M., Moulin, M., Conti Nibali, V., De Francesco, A.,
Petrillo, C., Sacchetti, F.: Vibrational collective dynamics of dry proteins in the terahertz
region. J. Phys. Chem. B 116, 3861–3865 (2012)
78. Brown, K.G., Erfurth, S.C., Small, E.W., Peticolas, W.L.: Conformationally dependent
low-frequency motions of proteins by laser Raman spectroscopy. Proc. Natl. Acad. Sci. U S A 69,
1467–1469 (1972)
79. Genzel, L., Keilmann, F., Martin, T.P., Winterling, G., Yacoby, Y., Fröhlich, H., Makinen,
M.W.: Low-frequency Raman spectra of lysozyme. Biopolymers 15, 219–225 (1976)
80. Painter, P.C., Mosher, L.E., Rhoads, C.: Low-frequency modes in Raman spectra of proteins.
Biopolymers 21, 1469–1472 (1982)
81. Urabe, H., Sugawara, Y., Ataka, M., Rupprecht, A.: Low-frequency Raman spectra of
lysozyme crystals and oriented DNA films: dynamics of crystal water. Biophys. J. 74,
1533–1540 (1998)
82. Hédoux, A., Ionov, R., Willart, J.F., Lerbret, A., Affouard, F., Guinet, Y., Descamps, M.,
Prévost, D., Paccou, L., Danéde, F.: Evidence of a two-stage thermal denaturation process in
lysozyme: a Raman scattering and differential scanning calorimetry investigation. J. Chem.
Phys. 124, 014703 (2006)
83. Crupi, C., D’Angelo, G., Wanderlingh, U., Vasi, C.: Raman spectroscopic and low-temperature
calorimetric investigation of the low-energy vibrational dynamics of hen egg-lysozyme. Phi-
los. Mag. 91, 1956–1965 (2011)
84. Sassi, P., Perticaroli, S., Comez, L., Lupi, L., Paolantoni, M., Fioretto, D., Morresi, A.:
Reversible and irreversible denaturation processes in globular proteins: from collective to
molecular spectroscopic analysis. J. Raman Spectrosc. 43, 273–279 (2012)
85. Shuker, R., Gammon, R.W.: Raman-scattering selection rule breaking and the density of states
in amorphous materials. Phys. Rev. Lett. 25, 222–225 (1970)
86. Zorn, R.: The boson peak demystified? Physics 4, 44 (2011)
87. Chumakov, A.I., Monaco, G., Crichton, W.A., Bosak, A., Rüffer, R., Meyer, A., Kargl, F.,
Comez, L., Fioretto, D., Giefers, H., Roitsch, S., Wortmann, G., Manghnani, M.H., Hushur,
A., Williams, Q., Balogh, J., Parliński, K., Jochym, P., Piekarz, P.: Equivalence of the boson
peak in glasses to the transverse acoustic van Hove singularity in crystals. Phys. Rev. Lett.
106, 225501 (2011)
88. Leyser, H., Doster, W., Diehl, M.: Far-infrared emission by boson peak vibrations in a globular
protein. Phys. Rev. Lett. 82, 2987–2989 (1999)
89. Tarek, M., Tobias, D.J.: Effects of solvent packing on side chain and backbone contributions
to the protein boson peak. J. Chem. Phys. 115, 1607–1612 (2001)
90. Doster, W., Cusack, S., Petry, W.: Dynamical transition of myoglobin revealed by inelastic
neutron scattering. Nature 337, 754–756 (1989)
91. McCammon, J.A., Karplus, M., Gelin, B.R.: Dynamics of folded proteins. Nature 267,
585–590 (1977)
92. Moeller, K.D., Williams, G.P., Steinhauser, S., Hirschmugl, C., Smith, J.C.: Hydration-
dependent far-infrared absorption in lysozyme detected using synchrotron radiation. Biophys.
J. 61, 276–280 (1992)
93. Das, G.: Principal component analysis based methodology to distinguish protein SERS spec-
tra. J. Mol. Struct. 993, 500–505 (2011)
94. De Angelis, F., Gentile, F., Mecarini, F., Das, G., Moretti, M., Candeloro, P., Coluccio, M.L.,
Cojoc, G., Accardo, A., Liberale, C., Zaccaria, R.P., Perozziello, G., Tirinato, L., Toma, A.,
Cuda, G., Cingolani, R., Di Fabrizio, E.: Breaking the diffusion limit with super-hydrophobic
delivery of molecules to plasmonic nanofocusing SERS structures. Nat. Photonics 5, 682
(2012)
95. Oladepo, S.A., Xiong, K., Hong, Z.M., Asher, S.A., Handen, J., Lednev, I.K.: UV resonance
Raman investigations of peptide and protein structure dynamics. Chem. Rev. 112, 2604–2628
(2012)
96. Li, H., Nafie, L.A.: Simultaneous acquisition of all four forms of circular polarization Raman
optical activity: results for α-pinene and lysozyme. J. Raman Spectrosc. 43, 89–94 (2012)
97. Wheaton, S., Gelfand, R.M., Gordon, R.: Probing the Raman-active acoustic vibrations of
nanoparticles with extraordinary spectral resolution. Nat. Photonics 9, 68–72 (2015)
98. Li, F., Jin, L., Xu, Z., Zhang, S.: Electrostrictive effect in ferroelectrics: an alternative approach
to improve piezoelectricity. Appl. Phys. Rev. 1, 011103 (2014)
99. Achar, B.N.N., Barsch, G.R., Cross, L.E.: Static shell model calculation of electrostriction
and third order elastic coefficients of perovskite oxides. Ferroelectrics 37, 495–498 (1981)
100. Schade, A.L., Caroline, L.: Raw hen egg white and the role of iron in growth inhibition of
Shigella dysenteriae, Staphylococcus aureus, Escherichia coli, and Saccharomyces cerevisiae.
Science 100, 14–15 (1944)
101. Mizutani, K., Mikami, B., Aibara, S., Hirose, M.: Structure of aluminium-bound ovotransfer-
rin at 2.15 angstroms resolution. Acta Crystallogr. D 61, 1636–1642 (2005)
102. Lindahl, E., Hess, B., van der Spoel, D.: GROMACS 3.0: a package for molecular simulation
and trajectory analysis. J. Mol. Model. 7, 306–317 (2001)
103. Lindorff-Larsen, K., Piana, S., Palmo, K., Maragakis, P., Klepeis, J.L., Dror, R.O., Shaw, D.E.:
Improved side-chain torsion potentials for the amber Ff99SB protein force field. Proteins 78,
1950–1958 (2010)
104. Hartl, F.U., Hayer-Hartl, M.: Molecular chaperones in the cytosol: from nascent chain to
folded protein. Science 295, 1852–1858 (2002)
105. Bukau, B., Deuerling, E., Pfund, C., Craig, E.A.: Getting newly synthesized proteins into
shape. Cell 101, 119–122 (2000)
106. Young, J.C., Agashe, V.R., Siegers, K., Hartl, F.U.: Pathways of chaperone-mediated protein
folding in the cytosol. Nat. Rev. Mol. Cell Biol. 5, 781–791 (2004)
107. Saibil, H.R.: Chaperone machines in action. Curr. Opin. Struct. Biol. 18, 35–42 (2008)
108. Selkoe, D.J.: Folding proteins in fatal ways. Nature 426, 900–904 (2003)
109. Garrido, C., Brunet, M., Didelot, C., Zermati, Y., Schmitt, E., Kroemer, G.: Heat shock proteins
27 and 70: anti-apoptotic proteins with tumorigenic properties. Cell Cycle 5, 2592–2601 (2006)
110. Buchberger, A., Theyssen, H., Schröder, H., McCarty, J.S., Virgallita, G., Milkereit, P.,
Reinstein, J., Bukau, B.: Nucleotide-induced conformational changes in the ATPase and substrate
binding domains of the DnaK chaperone provide evidence for interdomain communication.
J. Biol. Chem. 270, 16903–16910 (1995)
111. Brehmer, D., Rudiger, S., Gassler, C.S., Klostermeier, D., Packschies, L., Reinstein, J., Mayer,
M.P., Bukau, B.: Tuning of chaperone activity of Hsp70 proteins by modulation of nucleotide
exchange. Nature 8, 427–432 (2001)
112. Liu, Q., Hendrickson, W.A.: Insights into Hsp70 chaperone activity from a crystal structure
of the yeast Hsp110 Sse1. Cell 131, 106–120 (2007)
113. Polier, S., Dragovic, Z., Hartl, F.U., Bracher, A.: Structural basis for the cooperation of Hsp70
and Hsp110 chaperones in protein folding. Cell 131, 106–120 (2008)
114. Schuermann, P.J., Jiang, J.W., Cuellar, J., Llorca, O., Wang, L.P., Gimenez, L.E., Jin, S.P.,
Taylor, A.B., Demeler, B., Morano, K.A., Hartl, P.J., Valpuesta, J.M., Lafer, E.M., Sousa, R.:
Structure of the Hsp110: Hsc70 nucleotide exchange machine. Mol. Cell 31, 232–243 (2008)
115. Wilbanks, S.M., Chen, L., Tsuruta, H., Hodgson, K.O., McKay, D.B.: Solution small-angle
X-ray scattering study of the molecular chaperone Hsc70 and its subfragments. Biochemistry 34,
12095–12106 (1995)
116. Shi, L., Kataoka, M., Fink, A.L.: Conformational characterization of DnaK and its complexes
by small-angle X-ray scattering. Biochemistry 35, 3297–3308 (1996)
117. Bertelsen, E.B., Chang, L., Gestwicki, J.E., Zuiderweg, E.R.P.: Solution conformation of
wild-type E. coli Hsp70 (DnaK) chaperone complexed with ADP and substrate. Proc. Natl.
Acad. Sci. U S A 106, 8471–8476 (2009)
118. Golas, E., Maisuradze, G.G., Senet, P., Oldziej, S., Czaplewski, C., Scheraga, H.A., Liwo,
A.: Simulation of the opening and closing of Hsp70 chaperones by coarse-grained molecular
dynamics. J. Chem. Theory Comput. 8, 1750–1764 (2012)
119. Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., Hermans, J.: Interaction models for
water in relation to protein hydration. In: Pullman, B. (ed.), pp. 331–338. D. Reidel
120. Scott, W.R.P., Hünenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda,
A.E., Huber, T., Krüger, P., van Gunsteren, W.F.: The GROMOS biomolecular simulation
program package. J. Phys. Chem. A 103, 3596–3607 (1999)
121. Nicolaï, A., Barakat, F., Delarue, P., Senet, P.: Fingerprints of conformational states of human
Hsp70 at sub-THz frequencies. ACS Omega 6(1), 1067–1074 (2016)
122. Cecchini, M., Houdusse, A., Karplus, M.: Allosteric communication in myosin V: from small
conformational changes to large directed movements. PLoS Comput. Biol. 4, e1000129 (2008)
123. Swain, J.F., Dinler, G., Sivendran, R., Montgomery, D.L., Stotz, M., Gierasch, L.M.: Hsp70
chaperone ligands control domain association via an allosteric mechanism mediated by the
interdomain linker. Mol. Cell 26, 27–39 (2007)
124. Bhattacharya, A., Kurochkin, A.V., Yip, G.N.B., Zhang, Y., Bertelsen, E.B., Zuiderweg,
E.R.P.: Allostery in Hsp70 chaperones is transduced by subdomain rotations. J. Mol. Biol.
388, 475–490 (2009)
125. Zhuravleva, A., Gierasch, L.M.: Allosteric signal transmission in the nucleotide-binding
domain of 70-kDa heat shock protein (Hsp70) molecular chaperones. Proc. Natl. Acad. Sci.
U S A 108, 6987–6992 (2011)
Explicit-Solvent All-Atom Molecular Dynamics of Peptide Aggregation

Maksim Kouza, Andrzej Kolinski, Irina Alexandra Buhimschi and Andrzej Kloczkowski

Abstract Recent advances in computational technology have allowed us to simulate
biomolecular processes on timescales that begin to reach those of peptide aggregation
phenomena. Molecular dynamics simulations have matured to the extent that they can be
employed as a highly productive tool to gain meaningful insights into the structure,
dynamics and molecular mechanisms of protein aggregation. In this chapter, we describe
the basics of explicit-solvent all-atom molecular dynamics simulations and their
application to the early stages of aggregation of two short pentapeptides, KLVFF and
FVFLM, related to Alzheimer’s disease and preeclampsia, respectively. We focus on
important problems in the field of protein aggregation that explicit-solvent all-atom
molecular dynamics simulations could resolve, in particular how fibril formation rates
depend on factors such as the presence of short peptides and the population of
fibril-prone conformations. Specific applications of atomistic simulations in explicit
solvent to these two issues are discussed.

M. Kouza (✉) · A. Kolinski
Faculty of Chemistry, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland
e-mail: mkouza@chem.uw.edu.pl
A. Kloczkowski
Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s
Hospital, Columbus, OH 43215, USA
I. A. Buhimschi
Center for Perinatal Research, The Research Institute at Nationwide Children’s Hospital,
Columbus, OH 43215, USA
I. A. Buhimschi · A. Kloczkowski
Department of Pediatrics, The Ohio State University College of Medicine,
Columbus, OH 43215, USA

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio- and
Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_16
542 M. Kouza et al.

1 Introduction

Proteins are biomolecules that play key roles in every cell of the human body. The
biological functions of proteins include catalyzing chemical reactions, muscle
contraction (titin), providing structural support, transport of oxygen (hemoglobin),
transmission of information between specific cells and organs (hormones), activity in
the immune system (antibodies), passage of molecules across cell membranes, etc. The
long process of biological evolution has shaped proteins in such a way that under
normal physiological conditions (pH ≈ 7, T ≈ 300 K, atmospheric pressure) most of them
(except intrinsically disordered ones) fold into unique three-dimensional structures.
Only in these native folded structures can proteins be stable and biologically active.
Proteins unfold to more extended conformations if the conditions are changed or upon
application of mechanical force or denaturing agents, such as urea or guanidinium
chloride. If the physiological conditions are restored, most proteins refold
spontaneously to their native states [2].
Protein folding and unfolding processes are of utmost importance for controlling
biological activity and targeting proteins to their cellular destinations. However,
when these processes go wrong, proteins may misfold and escape the sophisticated
system of cellular quality control. Fibril formation resulting from protein misfolding
and aggregation is a hallmark of several protein conformational disorders (also known
as protein misfolding diseases), including Alzheimer’s, Huntington’s and Parkinson’s
diseases [61, 77]. Apart from those widely known diseases, recent experimental and
theoretical evidence has shown that preeclampsia, a pregnancy-specific disorder, shares
pathophysiological features with recognized proteinopathies [15, 42, 87], and that
amyotrophic lateral sclerosis (ALS) is related to the misfolding and aggregation of the
superoxide dismutase 1 (SOD1) protein [11, 70].
Because of the enormous medical importance of aggregation phenomena, the growing
awareness that protein aggregation is linked to a number of protein conformational
disorders has attracted much attention from researchers. A better understanding of
protein aggregation processes offers an opportunity to develop medical tools to
alleviate the suffering of millions of individuals with aggregation-related diseases
[25]. Despite some progress in understanding this complicated process at the basic
level, many important questions are yet to be answered. What are the general factors
governing fibril formation rates? How does the presence of small peptides disrupt the
aggregation pathway of the proteins that are hallmarks of neurodegenerative diseases?
Can the development of effective inhibitors be facilitated? To address these questions,
we employ classical molecular dynamics simulations, which have proved to be of
paramount importance to our understanding of the structure-dynamics-function
relationship in biomolecules and molecular complexes.
Molecular Dynamics (MD) simulations are a computational method for studying atoms and
molecules that move according to the laws of classical mechanics. The energy of
interactions between atoms can be modeled with a variety of empirical force-fields,
which typically include bonded and non-bonded terms. The non-bonded terms represent the
Lennard-Jones potential, electrostatic interactions and interactions with water. The
dynamics of the system is obtained by integrating the equations of motion with a short
time step of 1–2 fs. Unlike experiments, which provide only limited information on the
properties of the system, the trajectories generated by MD simulations provide an
unprecedented level of insight into the structure, dynamics, and interactions of the
model system. The properties of the model system that are accessible through
experiment can be validated against experimental results. Experimentally inaccessible
properties, which are usually of the highest interest, are used to generate new ideas
and hypotheses. MD simulations not only complement and guide experiments, but also
have striking predictive power in deciphering the underlying mechanisms of a wide
range of biological processes including unfolding, folding, docking and aggregation of
biomolecules.
The first MD simulation, a study of the interactions of hard spheres, was reported by
Alder and Wainwright in 1957 [1]. Two decades later, the first biomolecule, the bovine
pancreatic trypsin inhibitor (BPTI) protein, was investigated using this technique
[59]. Since then, MD has been applied increasingly to biomolecules such as solvated
proteins, protein-DNA complexes and lipid systems, and to a variety of processes
including the kinetics and thermodynamics of protein folding and aggregation. MD is
now widely considered the gold standard for studying the structure, dynamics and
function of biomolecules.
It is important to note that biomolecular processes take place over a wide range of
time scales. For example, local motions, which involve atomic fluctuations, side chain
motion and loop motion, occur on length scales of 0.01–5 Å and time scales of the order
of $10^{-15}$–$10^{-12}$ s. The motion of a helix, protein domain or subunit is
classified as rigid body motion, with typical length scales between 1 and 10 Å and time
scales between $10^{-9}$ and $10^{-6}$ s. Large-scale motions consist of helix-coil
transitions or folding/unfolding transitions that involve displacements of more than
5 Å and time scales of about $10^{-7}$–$10^{1}$ s. Typical time scales for protein
folding vary from microseconds to hours [50]. To mechanically unfold a protein with a
length of $10^{2}$ nm, a time of one second is needed at a pulling speed of about
$10^{2}$ nm/s [73]. In mechanical refolding experiments, where the refolding time grows
exponentially with force, a protein refolds a few orders of magnitude more slowly than
in the absence of external force [49]. Finally, time scales typical for oligomerization
or aggregation of proteins range from milliseconds to many years, whereas length scales
vary from a nanometer-sized peptide to micrometer-long aggregated fibril structures
[63].
The most detailed characterization of unfolding, folding and aggregation pathways and
conformational ensembles requires the use of all-atom force-fields. However,
simulations of proteins in explicit solvent need an enormous computational effort due
to the size and time scales involved. Using Anton, a powerful supercomputer dedicated
to atomistic MD simulations, it is possible to reach the millisecond range and simulate
the folding process of relatively small and fast-folding proteins [52].
General-purpose computer clusters, by contrast, typically struggle to surpass the
microsecond milestone. To reach experimental time scales that are inaccessible to
conventional simulation tools, the following three approaches are commonly used:
(i) Coarse-grained (CG) models [40, 51, 75, 94], standalone or in combination with
atomic-level molecular dynamics (MD) [10, 91], are used for more efficient simulations.
CG models reduce the complexity of the system by representing each amino acid with one
pseudoatom [78] or a group of pseudoatoms [40, 41, 53]. Each pseudoatom (bead) usually
reflects the size of a specific fragment and sometimes its geometry (e.g., in the CABS
[41], UNRES [53] and MARTINI [57] models). For a more detailed account of
coarse-grained protein models and their applications, the reader is referred to recent
reviews [40, 62].
(ii) Approaches that aim to enhance the sampling of biomolecules, such as umbrella
sampling [12], metadynamics [60], the multi-canonical ensemble [6] and the replica
exchange method [32, 80]. The latter was first developed for spin systems [36] and was
subsequently applied to biomolecules in Monte Carlo and Molecular Dynamics simulations
[32, 80]. In the replica exchange method, two configurations with energies $E_1$ and
$E_2$ are exchanged between temperatures $T_1$ and $T_2$ with probability
$\min[1, \exp(\Delta\beta\,\Delta E)]$, where $\Delta\beta = \beta_2 - \beta_1$,
$\Delta E = E_2 - E_1$, and the inverse temperature is given by $\beta = 1/k_B T$.
These exchange moves generate for each replica a random walk in temperature space that
enables escape from local minima. This procedure speeds up the equilibration of the
system of interest up to ten-fold compared to the standard computation without replica
exchange. Over the last twenty years the replica exchange method has become the most
widely applied approach to enhance the sampling of biomolecules [8, 45, 46, 67, 70, 79]
and has been implemented in many of the most widely used MD packages including CHARMM,
Amber, Gromacs and NAMD [13, 17, 69, 71]. However, practical applications of the
replica exchange method in explicit-solvent simulations remain challenging due to the
high computational cost, especially for medium- and large-sized proteins. The main
difficulty is that a larger system requires more replicas to cover the necessary
temperature range. This, in turn, slows down each replica’s walk in temperature space,
so even longer simulations are required to obtain sufficient statistics for the
calculation of thermodynamic quantities. Strategies to minimize the number of required
replicas, and more efficient replica exchange schemes, have been proposed by Kouza and
Hansmann [45], Wang et al. [93] and Yasar et al. [100].
(iii) Short peptide fragments derived from parent proteins are used as models to study
fibril formation by MD simulations in explicit solvent. Such peptides form fibrils with
structural and physical properties similar to those of their parent proteins
[30, 58, 82]. Because simulations of the aggregation of full-length proteins (30–100
amino acids long) linked to aggregation diseases are prohibitively long, their short
peptide fragments have been used extensively to gain insight into the fundamental
properties of the structure of early aggregates, and to decipher the mechanism of
fibril formation [4, 29, 43, 65, 83].
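The exchange step of the replica exchange method in approach (ii) can be sketched in a few lines of Python. This is a minimal illustration of the standard Metropolis criterion, $\min[1, \exp(\Delta\beta\,\Delta E)]$ with $\Delta\beta = \beta_2 - \beta_1$ and $\Delta E = E_2 - E_1$, and not the implementation of any particular MD package. The function names, the unit choice (kJ/mol and K) and the example energies are assumptions made for illustration only.

```python
import math
import random

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def exchange_probability(e1, e2, t1, t2):
    """Probability of swapping two replicas (Metropolis criterion).

    e1 is the potential energy (kJ/mol) of the configuration currently
    simulated at temperature t1 (K); e2 is the energy of the one at t2.
    """
    delta_beta = 1.0 / (K_B * t2) - 1.0 / (K_B * t1)
    delta_e = e2 - e1
    exponent = delta_beta * delta_e
    # Swaps with a non-negative exponent are always accepted.
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_exchange(e1, e2, t1, t2, rng=random):
    """Return True if the swap is accepted in a single Metropolis trial."""
    return rng.random() < exchange_probability(e1, e2, t1, t2)


if __name__ == "__main__":
    # Hypothetical energies: the hotter replica has found a lower energy,
    # so the swap is accepted with certainty.
    print(exchange_probability(-1000.0, -1010.0, 300.0, 310.0))
    # The hotter replica has the higher energy: accepted only sometimes.
    print(exchange_probability(-1000.0, -990.0, 300.0, 310.0))
```

Because the acceptance probability decays exponentially with the energy difference, and energy fluctuations grow with system size, neighbouring temperatures must be spaced more closely for larger systems. This is the origin of the growth in the number of replicas noted in (ii).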
In this chapter, we will first describe the basics of explicit-solvent all-atom
molecular dynamics simulations. Subsequently, we will outline their application to the
oligomerization of the short peptides KLVFF and FVFLM. Finally, we will discuss the
application of all-atom MD simulations to study the stabilities of different constructs
of β-amyloid and to build plausible models of Aβ42 fibrils.

2 Molecular Dynamics Simulations

Gromacs is one of the most widely used programs for all-atom MD simulations [34]. It
is fast, flexible and free; recent versions are distributed under the GNU Lesser
General Public License (earlier versions used the GNU General Public License). Gromacs
supports a large number of publicly available force-fields including OPLS, AMBER99SB,
CHARMM and GROMOS43a1 [13, 35, 39, 76]. Before presenting results obtained for the
aggregation of amyloidogenic peptides with different force-fields available in
Gromacs, we discuss the general functional form of the all-atom force-field in more
detail.

2.1 All-Atom Models

Although force-fields differ in their parameters, the general functional form of any
force-field consists of two terms

$$E_{\mathrm{total}} = E_{\mathrm{bonded}} + E_{\mathrm{non\text{-}bonded}} \qquad (1)$$

where $E_{\mathrm{bonded}}$ describes bonded interactions that act only within
molecules and $E_{\mathrm{non\text{-}bonded}}$ involves non-bonded interactions between
and within molecules. The bonded potential term includes 2-, 3- and 4-body interactions
of covalently bonded atoms, while the non-bonded potential term involves Lennard-Jones
and Coulomb interactions. For a more detailed description of all-atom force-field
functional forms, the book of Frenkel and Smit [26] is a good reference.
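The decomposition in Eq. (1) can be illustrated with a small Python sketch. It is a simplified example under stated assumptions and does not reproduce any specific force-field: only the 2-body (harmonic bond) part of the bonded term is shown, the 3-body (angle) and 4-body (dihedral) terms are omitted, and all function names and parameter values are hypothetical. Units follow the Gromacs convention (nm, kJ/mol, elementary charges).

```python
COULOMB_CONST = 138.935458  # electric conversion factor, kJ mol^-1 nm e^-2


def bond_energy(r, r0, k_b):
    """Harmonic 2-body bonded term for bond length r (nm)."""
    return 0.5 * k_b * (r - r0) ** 2


def lennard_jones(r, sigma, epsilon):
    """12-6 Lennard-Jones energy between two atoms at distance r (nm)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q_i, q_j):
    """Coulomb energy between point charges q_i and q_j (in units of e)."""
    return COULOMB_CONST * q_i * q_j / r


def total_energy(bonds, nonbonded_pairs):
    """E_total = E_bonded + E_non-bonded for precomputed distances.

    bonds:           iterable of (r, r0, k_b)
    nonbonded_pairs: iterable of (r, sigma, epsilon, q_i, q_j)
    """
    e_bonded = sum(bond_energy(r, r0, k_b) for r, r0, k_b in bonds)
    e_nonbonded = sum(
        lennard_jones(r, sigma, eps) + coulomb(r, q_i, q_j)
        for r, sigma, eps, q_i, q_j in nonbonded_pairs
    )
    return e_bonded + e_nonbonded


if __name__ == "__main__":
    # Hypothetical toy system: one stretched bond and one charged pair.
    bonds = [(0.11, 0.10, 250000.0)]
    pairs = [(0.35, 0.30, 0.65, -0.5, 0.5)]
    print(total_energy(bonds, pairs))
```

In production codes the non-bonded sum dominates the cost, which is why cutoffs, pair lists and lattice-sum electrostatics are used.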

2.2 Water Model

Explicit solvent effects have been shown to be important in protein mechanical
unfolding, folding and aggregation [47, 48, 72, 83]. To derive the most detailed
characterization of pathways and intermediate conformations during unfolding, folding
and aggregation processes, not only all-atom protein force-fields but also explicit
solvation models have to be used. SPC (Simple Point Charge) [5] and TIP3P [38] are
examples of simple water models available for explicit-solvent all-atom MD simulations
in the Gromacs software. In these models the water molecule has three point charges:
the partial positive charges on the hydrogen atoms are balanced by a negative charge
located on the oxygen atom. The oxygen atom also carries the Lennard-Jones parameters
used to compute interactions between different molecules. Van der Waals interactions
involving hydrogen atoms are not calculated.
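The three-site picture described above can be made concrete with a short Python sketch: Coulomb interactions are summed over all nine site pairs of two water molecules, while the Lennard-Jones term acts between the two oxygen atoms only. The class and function names are hypothetical, and the numerical parameters are commonly quoted rounded values for SPC and TIP3P given here as an assumption; the force-field files of the simulation package should be treated as authoritative.

```python
import math
from dataclasses import dataclass

COULOMB_CONST = 138.935458  # kJ mol^-1 nm e^-2


@dataclass(frozen=True)
class ThreeSiteWater:
    name: str
    q_oxygen: float    # partial charge on O (e)
    q_hydrogen: float  # partial charge on each H (e)
    sigma_oo: float    # O-O Lennard-Jones sigma (nm)
    epsilon_oo: float  # O-O Lennard-Jones epsilon (kJ/mol)

    def net_charge(self):
        return self.q_oxygen + 2.0 * self.q_hydrogen


# Rounded literature values (assumption; verify against force-field files).
SPC = ThreeSiteWater("SPC", -0.82, 0.41, 0.3166, 0.650)
TIP3P = ThreeSiteWater("TIP3P", -0.834, 0.417, 0.3151, 0.636)


def water_pair_energy(model, mol_a, mol_b):
    """Interaction energy (kJ/mol) of two rigid water molecules.

    mol_a and mol_b are lists [O, H1, H2] of (x, y, z) tuples in nm.
    """
    charges = (model.q_oxygen, model.q_hydrogen, model.q_hydrogen)
    energy = 0.0
    # Coulomb term over all 3 x 3 site pairs.
    for q_a, r_a in zip(charges, mol_a):
        for q_b, r_b in zip(charges, mol_b):
            energy += COULOMB_CONST * q_a * q_b / math.dist(r_a, r_b)
    # Lennard-Jones term between the oxygen sites only.
    r_oo = math.dist(mol_a[0], mol_b[0])
    sr6 = (model.sigma_oo / r_oo) ** 6
    energy += 4.0 * model.epsilon_oo * (sr6 * sr6 - sr6)
    return energy
```

Both models are electrically neutral by construction, which `net_charge()` makes easy to check.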

2.3 Basic MD Algorithm

Once the energy function is chosen and the model system is built, the next step is the
computation of the forces exerted on the atoms. The force acting on each atom is
calculated as the negative derivative of the potential energy with respect to the atom
coordinates. Once the forces are obtained, the positions and velocities of each atom
are updated according to Newton’s equations of motion. To avoid numerically unstable
results, the equations of motion are integrated with a time step that is limited by
the fastest motions in the molecule. The small time step of 1 or 2 fs typically used
for explicit-solvent all-atom simulations constitutes the main bottleneck in practical
applications of MD simulations. To reach experimentally relevant timescales even for
proteins that fold fast (microseconds to milliseconds), the iterations of the MD
algorithm have to be repeated $10^{9}$–$10^{12}$ times, which poses a significant
challenge for explicit-solvent atomistic simulations. Coarse-grained models, which
speed up the computation at the cost of structural accuracy, achieve millisecond
simulations and beyond [40].
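The force evaluation, the update step and the step-count estimate described above can be sketched for a one-dimensional toy system. The leap-frog update is used because it is the integrator named later in this chapter; the harmonic potential, the finite-difference force and all function names are assumptions chosen to keep the example self-contained, not a description of how an MD package computes forces.

```python
def steps_needed(simulated_time_s, dt_fs):
    """Number of integration steps needed to cover simulated_time_s."""
    return simulated_time_s / (dt_fs * 1.0e-15)


def numerical_force(potential, x, h=1.0e-6):
    """Force as the negative derivative of the potential (central difference)."""
    return -(potential(x + h) - potential(x - h)) / (2.0 * h)


def leapfrog(potential, x, v_half, mass, dt, n_steps):
    """Leap-frog integration; v_half is the velocity at time t - dt/2."""
    for _ in range(n_steps):
        v_half += numerical_force(potential, x) / mass * dt  # v(t + dt/2)
        x += v_half * dt                                     # x(t + dt)
    return x, v_half


def harmonic(x, k=1.0):
    """Toy potential standing in for a full force-field."""
    return 0.5 * k * x * x


if __name__ == "__main__":
    # With a 1 fs step, 1 microsecond and 1 millisecond correspond to about
    # 1e9 and 1e12 steps; a 2 fs step halves these numbers.
    print(steps_needed(1.0e-6, 1.0), steps_needed(1.0e-3, 1.0))
    # About one oscillation period (2*pi in reduced units) of the toy system.
    print(leapfrog(harmonic, 1.0, 0.0, 1.0, 0.01, 628))
```

The time step must resolve the fastest oscillation in the system, which is why constraining bond lengths allows the step to be increased from 1 to 2 fs.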

2.4 Periodic Boundary Conditions

In all simulation systems, periodic boundary conditions are implemented to minimize
finite-size effects and to simulate a system of fixed, manageable size instead of an
infinitely large one. The original simulation box containing both the protein and
water molecules is surrounded by copies of identical boxes (called images), so that
atoms can leave the box from one side and re-enter it from the opposite side. To
compute interactions between different particles, the minimum image convention is used.
This method implies that a particle interacts only with particles that are separated by
less than half of the simulation box length. In other words, to avoid multiple counting
of the same interactions, only particle interactions with the closest image in the
system are calculated (Fig. 1). Beyond the threshold distance, the remaining forces are
ignored to reduce the computational cost of the simulation.
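The wrapping of coordinates and the minimum image convention can be sketched as follows. The example assumes a rectangular box with independent edge lengths; MD packages also support non-rectangular (triclinic) cells, which are not covered here. The function names are hypothetical.

```python
import math


def wrap(coord, box_length):
    """Put a coordinate back into the primary box [0, box_length)."""
    return coord % box_length


def minimum_image(delta, box_length):
    """Shortest displacement between a particle and the images of another."""
    return delta - box_length * round(delta / box_length)


def minimum_image_distance(p, q, box):
    """Distance between points p and q under periodic boundary conditions.

    p, q and box are sequences with one entry per dimension (nm).
    """
    return math.sqrt(
        sum(minimum_image(a - b, length) ** 2 for a, b, length in zip(p, q, box))
    )


if __name__ == "__main__":
    box = (3.0, 3.0)
    # Two particles near opposite walls are close neighbours through the
    # periodic boundary, not 2.8 nm apart.
    print(minimum_image_distance((0.1, 0.1), (2.9, 0.1), box))
    # A particle leaving through the left wall re-enters on the right.
    print(wrap(-0.1, 3.0))
```

With this convention no pair of particles is ever more than half a box length apart along any axis, which is why interaction cutoffs must not exceed half of the shortest box edge.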
Periodic boundary conditions make it possible to model the system of interest in a
reasonable amount of time. However, the choice of box size involves a trade-off. The
simulation box has to be small enough to model the system of interest in a practical
time, but large enough to avoid periodic artifacts arising from interactions of a
protein with its own image in a neighboring box. For example, if a protein undergoes
partial unfolding, image violations may occur due to detachment of the N-terminal
strand from the beta-sheet, which is followed by interactions between the C-terminal
and N-terminal portions. As a consequence, the periodic artifacts are too large to be
ignored and the underlying dynamics of the process of interest becomes physically
meaningless. Making the simulation box smaller to reduce the computational cost may
therefore come at the high price of wasted resources and rerun simulations.

Fig. 1 Outline of the periodic boundary condition in 2D. The red cell is surrounded by its replicas
to fill the space. If a particle leaves the red box from one side, it re-enters the same box from the
opposite side, so the total number of particles in the cell remains unchanged. The minimum image
convention implies that a particle interacts with the closest image of the remaining particles in the
system. An example of the smallest value of the relative distance between two particles is shown
by a black arrow

The remaining specific MD details are as follows. Electrostatic interactions were
computed using the particle mesh Ewald method [23]. The non-bonded interaction
pair-list was updated every 10 fs (every five steps) using a cutoff of 1.5 nm. All
covalent bonds were constrained by the LINCS algorithm [33] with a relative tolerance
of $10^{-4}$. Initial velocities of the atoms were generated from the Maxwell
distribution at 300 K. The temperature was maintained at 300 K using a v-rescale
thermostat [16]. The equations of motion were integrated using a leap-frog algorithm
with a time step of 2 fs.
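The generation of initial velocities from the Maxwell distribution at 300 K can be sketched as follows. Each Cartesian velocity component is drawn from a Gaussian with variance $k_B T/m$. This is a simplified illustration, not the procedure of a specific MD package: removal of centre-of-mass motion and the reduction of degrees of freedom by bond constraints are omitted, and the function names are hypothetical. With masses in g/mol and $k_B$ in kJ/(mol K), velocities come out in nm/ps.

```python
import math
import random

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def maxwell_velocities(masses, temperature, rng):
    """Draw one (vx, vy, vz) tuple in nm/ps for each mass (g/mol)."""
    velocities = []
    for mass in masses:
        sigma = math.sqrt(K_B * temperature / mass)
        velocities.append(tuple(rng.gauss(0.0, sigma) for _ in range(3)))
    return velocities


def kinetic_temperature(masses, velocities):
    """Instantaneous temperature (K) from the kinetic energy, 3N d.o.f."""
    kinetic = sum(
        0.5 * mass * sum(component * component for component in velocity)
        for mass, velocity in zip(masses, velocities)
    )
    n_dof = 3 * len(masses)
    return 2.0 * kinetic / (n_dof * K_B)


if __name__ == "__main__":
    rng = random.Random(2019)  # fixed seed for reproducibility
    masses = [15.999] * 5000   # hypothetical system of oxygen-mass particles
    velocities = maxwell_velocities(masses, 300.0, rng)
    print(kinetic_temperature(masses, velocities))
```

For a finite system the instantaneous temperature fluctuates around the target value, with relative fluctuations of the order of $\sqrt{2/N_{\mathrm{dof}}}$; the thermostat then keeps the average at 300 K during the run.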

3 Recent Application Examples of MD Simulation for Protein Aggregation

As examples of applications of all-atom MD simulations in explicit solvent, we present
our recent simulations of two short peptides, KLVFF and FVFLM. The KLVFF peptide is a
fragment of the 42-amino-acid form of the β-amyloid protein linked to Alzheimer’s
disease, while the FVFLM peptide, derived from the SERPINA1 protein, is suspected to be
involved in the pathogenesis of preeclampsia (PE) [15, 42]. Recent experiments have
demonstrated that short KLVFF-containing peptides form amyloid fibrils similar to
those formed by their full-length parent proteins [3, 30]. Protein aggregates have been
identified in the urine and placenta of women with preeclampsia; however, the structure
of these aggregates has not yet been resolved. Our recent computational analysis, using
publicly available algorithms [18, 28, 44, 81], of the collection of aggregated
proteins and peptides extracted from the urine of pregnant women diagnosed with
preeclampsia (referred to by our group as the preeclampsia misfoldome) [14, 37]
predicted the short FVFLM peptide of the SERPINA1 protein to be highly amyloidogenic
(Kouza et al., unpublished data). The small size of the KLVFF and FVFLM peptides, or of
similar peptides found in vivo in states of disturbed proteostasis, makes them good
candidates for studying the early stages of the aggregation process by
explicit-solvent all-atom MD simulations.
Oligomerization time correlates with the population of fibril-prone conformations in the monomeric state
We started our investigation of the fibrillation capacity of FVFLM and KLVFF by
performing MD simulations of monomers. Recent theoretical studies have shown that
oligomer formation times are strongly correlated with the population of the
fibril-prone conformation in the monomeric state [51, 64]. The population of
fibril-prone $N^*$ conformations in the monomeric state is defined as

$$P_{N^*} = \exp\left(-\frac{\Delta E}{k_B T}\right) / Z \qquad (2)$$

where $Z$ is the partition function and $\Delta E$ is the barrier separating the native
and $N^*$ states. The more populated the $N^*$ state is, the larger the propensity for
aggregation. For this reason the population of fibril-prone conformations in the
monomeric state is an important factor governing fibril formation rates, and it can be
used as a measure of aggregation propensity. As the population of a peptide’s
fibril-prone conformations increases, the fibril formation time decreases.
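How the Boltzmann weight in Eq. (2) translates into a population can be illustrated with a toy discrete model. The sketch below assumes a small set of conformational states with known relative energies, which is a strong simplification of a real monomer ensemble; the state energies and the function name are hypothetical.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def boltzmann_populations(energies, temperature):
    """Populations exp(-E_i / k_B T) / Z for a discrete set of states.

    energies are in kJ/mol; Z is the sum of the Boltzmann factors.
    """
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)
    # Shifting by the lowest energy avoids overflow and leaves ratios unchanged.
    weights = [math.exp(-beta * (energy - e_min)) for energy in energies]
    partition_function = sum(weights)
    return [weight / partition_function for weight in weights]


if __name__ == "__main__":
    # Hypothetical three-state monomer: ground state, fibril-prone state N*
    # lying 2 kJ/mol higher, and a third state 5 kJ/mol above the ground state.
    populations = boltzmann_populations([0.0, 2.0, 5.0], 300.0)
    print(populations)
```

Lowering the energy gap between the ground state and $N^*$ raises the population of $N^*$ exponentially, which is the reason small sequence changes can have a large effect on aggregation propensity.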
Figure 2 presents the end-to-end distance as a function of time for FVFLM and KLVFF
monomers. Using the criterion for the fibril-prone conformation, we found that the
population of peptides in the fibril-prone state was ~21% and ~13% for FVFLM and KLVFF
monomers, respectively. This difference in the populations of fibril-prone
conformations implies that the propensity of FVFLM for self-assembly is higher than
that of KLVFF, and it reflects the difference in fibril formation times between these
peptides.
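A population of this kind can be estimated from a monomer trajectory by counting the fraction of frames that satisfy a geometric criterion. The sketch below assumes, purely for illustration, that a conformation counts as fibril-prone when its normalized end-to-end distance $R/R_{\max}$ is at or above a user-supplied threshold; the actual criterion is the one defined in Ref. [42] and is not restated here. The function names and example values are hypothetical.

```python
def block_average(values, block_size):
    """Average consecutive, non-overlapping blocks of a time series."""
    n_blocks = len(values) // block_size
    return [
        sum(values[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(n_blocks)
    ]


def fibril_prone_fraction(distances, r_max, threshold):
    """Fraction of frames with R / r_max >= threshold (assumed criterion)."""
    if not distances:
        raise ValueError("empty trajectory")
    n_hits = sum(1 for r in distances if r / r_max >= threshold)
    return n_hits / len(distances)


if __name__ == "__main__":
    # Hypothetical end-to-end distances (nm) from four frames.
    trajectory = [1.2, 0.5, 1.3, 0.9]
    print(fibril_prone_fraction(trajectory, r_max=1.426, threshold=0.8))
    print(block_average([1.0, 2.0, 3.0, 4.0, 5.0], block_size=2))
```

Because the estimate depends on the chosen threshold, it is good practice to report the criterion together with the population.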

Fig. 2 Time dependence of the end-to-end distance normalized by $R_{\max}$ for FVFLM and KLVFF
monomers. Results are averaged in a 40 ps window. $R_{\max} = 1.426$ nm is the maximum end-to-end
distance obtained in simulations. The green and yellow lines refer to $R/R_{\max} = 0.9$ and 0.8,
respectively. Reproduced from Ref. [42] with permission from the PCCP Owner Societies

Subsequently, we compared the stability of the FVFLM and KLVFF peptides by generating
the free energy landscapes of the systems as a function of end-to-end distance ($R$)
and radius of gyration ($R_g$), as shown in Fig. 3. The free-energy landscape of KLVFF
shows three minima, while that of FVFLM is less complex, with one broad minimum.
Typical snapshots of representative conformations for the local minima are presented
in Fig. 3c. In contrast to FVFLM, where mainly pre-extended and extended
configurations are populated, our results indicate that the conformations of KLVFF are
more complex and diverse. Remarkably, compact conformations with small values of the
end-to-end distance, in the range of 0.4–0.8 nm, were observed for KLVFF but not for
FVFLM (Fig. 3c). The barrier-free downhill nature of the free energy profile of FVFLM
implies that fibril-prone conformations are much more easily accessible than those of
KLVFF. This result explains why the oligomer formation time for FVFLM is much shorter
than for KLVFF. The times required to form the FVFLM dimer and trimer are
$\tau_{\mathrm{dimer}}^{(\mathrm{FVFLM})} \approx 17$ ns and
$\tau_{\mathrm{trimer}}^{(\mathrm{FVFLM})} \approx 46$ ns, which are shorter than those
of KLVFF, $\tau_{\mathrm{dimer}}^{(\mathrm{KLVFF})} \approx 23$ ns and
$\tau_{\mathrm{trimer}}^{(\mathrm{KLVFF})} \approx 100$ ns. Thus, the more accessible
the fibril-prone conformations of the monomeric peptide are, the faster its oligomers
form. This result, which implies that the population of fibril-prone conformations of
a monomer can be used to predict its rate of self-assembly into higher-order
structures, is of paramount importance as it opens up new routes to understanding the
aggregation process at a single-monomer level.
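A two-dimensional free energy landscape of this kind is commonly obtained by Boltzmann inversion of a histogram, $F(R, R_g) = -k_B T \ln P(R, R_g)$. The chapter does not spell out the procedure used for Fig. 3, so the sketch below should be read as an assumption about the usual approach rather than the authors' protocol. The bin width and function name are hypothetical, and free energies are returned in units of $k_B T$ relative to the most populated bin.

```python
import math
from collections import Counter


def free_energy_landscape(r_values, rg_values, bin_width):
    """Free energy (in k_B T) per populated (R, R_g) bin.

    The most populated bin is assigned F = 0; unvisited bins are absent
    from the returned dictionary.
    """
    counts = Counter(
        (math.floor(r / bin_width), math.floor(rg / bin_width))
        for r, rg in zip(r_values, rg_values)
    )
    max_count = max(counts.values())
    return {
        bin_index: -math.log(count / max_count)
        for bin_index, count in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical frames: three in one basin, one in another.
    r = [0.51, 0.52, 0.53, 1.01]
    rg = [0.31, 0.32, 0.33, 0.41]
    for bin_index, f in sorted(free_energy_landscape(r, rg, 0.1).items()):
        print(bin_index, round(f, 3))
```

A bin visited three times less often than the global minimum lies $\ln 3 \approx 1.1\,k_B T$ higher, so well-converged sampling is needed before shallow minima can be interpreted.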
Short peptides as inhibitors of fibril formation
One of the principal goals in treating neurodegenerative diseases is to devise
strategies to inhibit fibril formation [27, 99]. One possible route to the prevention
and treatment of Alzheimer’s disease is to design molecular inhibitors of β-secretase
and γ-secretase, the enzymes responsible for the production of beta amyloid [31].
Another powerful strategy is to prevent the aggregation of Aβ proteins by adding short
peptides that occupy the self-recognition site of the parent proteins, thereby
obstructing the aggregation process [19, 85]. On the one hand, short peptides form
amyloid fibrils similar to those of their protein precursors. On the other hand, in a
mixture with their parent proteins, short peptides can block the binding sites
responsible for amyloid aggregation and thus prevent aggregation, which makes this
strategy very promising [85]. Several previous reports identified short peptides,
including KLVFF and LPFDD, that can disrupt the fibrillation of the full-length beta
amyloid protein [19, 30, 85, 88].
As FVFLM peptides have been shown to form dimers and trimers faster than KLVFF, an
intriguing question arises: can FVFLM bind to the beta amyloid protein more
effectively than KLVFF does? In other words, could we propose even more effective
inhibitors than the known KLVFF or LPFDD?

Fig. 3 Free energy landscape for monomer KLVFF (a) and FVFLM (b) as a function of radius
of gyration and end-to-end distance. Surfaces are shown with contour lines indicating the relative
$0.75\,k_B T$ slope of the surface. c Typical snapshots for local minima are marked by 1, 2, 3 and 4.
Reproduced from Ref. [42] with permission from the PCCP Owner Societies

To address this question, we studied the influence of the FVFLM peptide on the
kinetics of Aβ16–20 oligomerization using all-atom simulations. The initial
configuration of the system, consisting of two KLVFF peptides and one FVFLM peptide,
was created by randomly placing the peptides in a periodic box far enough apart that
no peptide-peptide interactions were present. Starting from this configuration, we
carried out eight independent simulations and monitored the kinetics of dimerization
and trimerization.
Fig. 4 Time dependence of the number of backbone hydrogen bonds between monomers. The
green curve represents hydrogen bonds between Aβ16–20 peptides (KLVFF), while the blue and
magenta curves show those between FVFLM and one of Aβ16–20 peptides (KLVFF). Snapshots
showing FVFLM and Aβ16–20 peptides are in blue and red colors, respectively. Reproduced from
Ref. [42] with permission from the PCCP Owner Societies

Figure 4 shows the time dependence of the number of backbone hydrogen bonds between
monomers. We defined a dimer as formed when three or more backbone hydrogen bonds are
made between monomers. Using this criterion, we found a significantly higher
probability of dimer formation between KLVFF and FVFLM peptides than between two KLVFF
peptides. The data in Fig. 4 also show that FVFLM binds to KLVFF faster than KLVFF
binds to itself. This indicates that FVFLM is capable of binding the β-amyloid
aggregation hot-spot (KLVFF). Peptides incorporating the KLVFF sequence have been
shown to bind full-length β-amyloid and block the KLVFF sequence in β-amyloid, which is
critical for amyloid aggregation [19, 74, 84, 85]. Our results suggest that FVFLM can
be used as a recognition sequence to interact not only with SERPINA1, the parent
protein of peptides aggregated in preeclampsia, but also with the KLVFF sequence in
β-amyloid. Interestingly, both β-amyloid and SERPINA1 immunoreactivity were detected in
the aggregates found in the urine of women with preeclampsia [15]. Based on these
results, we suggest that FVFLM-like peptides could be used for the efficient inhibition
of the oligomerization and aggregation of β-amyloid (or other pro-amyloidogenic
proteins).
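The dimer criterion stated above (three or more backbone hydrogen bonds between monomers) lends itself to a simple first-passage analysis of the hydrogen-bond time series. The sketch below is an illustration under stated assumptions: the hydrogen-bond counts are taken as already computed by an analysis tool, the averaging over independent runs is a plain mean over the runs in which a dimer formed, and the function names and example data are hypothetical.

```python
def first_formation_time(times, hbond_counts, min_hbonds=3):
    """First time at which the hydrogen-bond count reaches min_hbonds.

    Returns None if the criterion is never met in this trajectory.
    """
    for time, count in zip(times, hbond_counts):
        if count >= min_hbonds:
            return time
    return None


def mean_formation_time(runs, min_hbonds=3):
    """Mean first-formation time over runs given as (times, counts) pairs.

    Returns (mean_time, n_formed); mean_time is None if no run formed a dimer.
    """
    formed = []
    for times, counts in runs:
        t = first_formation_time(times, counts, min_hbonds)
        if t is not None:
            formed.append(t)
    if not formed:
        return None, 0
    return sum(formed) / len(formed), len(formed)


if __name__ == "__main__":
    # Hypothetical data: time in ns, number of inter-peptide backbone H-bonds.
    times = [0, 1, 2, 3, 4]
    runs = [
        (times, [0, 1, 3, 2, 4]),  # dimer formed at 2 ns
        (times, [0, 1, 2, 2, 1]),  # no dimer within this run
    ]
    print(mean_formation_time(runs))
```

Reporting the number of runs in which the event occurred alongside the mean time makes clear how many independent trajectories support the estimate.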
The effects of mutations in fibril formation
The amino acid sequence determines a protein’s propensity for folding and aggregation.
The role of sequence in aggregation may be better understood by studying mutations,
which can alter aggregation pathways, rates and structures [21]. Bhavaraju and
Hansmann [9] compared the stability of the wild type and four mutants (R61N, G68D,
A84T and D82I) of the immunoglobulin light-chain protein. It was shown that amyloid
formation is triggered by the dissociation of dimers and the transition of monomers
from their native state into fibril-prone states. Stabilization of the dimer by
binding to the dimer interface, or stabilization of the monomer’s ground state, has
been suggested as a strategy for drug design targeting light-chain associated systemic
amyloidosis [9].
Another important example is the β-amyloid protein. The region comprising residues 16
to 23 of β-amyloid has been shown to play a crucial role in its fibril formation.
Numerous experimental and computational studies have been performed for various
mutations such as the Flemish (A21G), Arctic (E22G), Dutch (E22Q), Italian (E22K),
Iowa (D23N) and Osaka (E22Δ) variants, among many others [7, 22, 86]. Less attention
has been paid to C- and N-terminal residues. However, recent experiments demonstrated
that mutations in those regions influence the kinetics of fibril formation. The G33A,
G33I and G37L mutants, as well as the English (H6R), Taiwanese (D7H) and Tottori (D7N)
variants of β-amyloid, can modulate protein aggregation rates and pathways
[20, 55, 66]. For example, the A2V mutation was found to greatly increase Aβ40 fibril
formation rates, whereas a mixture of Aβ40 and its A2V mutant peptides protects against
amyloidogenesis [14, 24]. Using conventional MD simulations in explicit solvent, Li and
co-workers [89, 90] reproduced and complemented experimental findings on the impact of
various mutations of Aβ peptides.
Plausible models of β-amyloid fibrils
A significant challenge for drug design is to understand which fibril structures and
pathways are more directly related to disease. A number of studies have revealed
high-resolution structures of Aβ42 and Aβ40 fibrils [54, 56, 68, 92, 98]. While the
recently resolved structure of Aβ42 fibrils (Fig. 5b) exhibits a triple-β-strand motif
[98], the previously determined structures of Aβ40 fibrils (Fig. 5a) suggested a
U-shaped β-strand-turn-β-strand motif for the fibril conformation [54, 68]. Using MD
simulations, Xi et al. studied the stability and properties of the triple-β-stranded
Aβ1–42 fibril motif [97]. The simulation results provided evidence that the stability
of the Aβ1–42 fibril structure depends on hydrophobic contacts involving the
C-terminal residues I41 and A42, while the salt bridge K28–A42 is not required for
stabilization of the fibril structure. The authors indicated that, unlike in Aβ40
fibrils, where only the U-shaped strand-bend-strand conformation is observed, each
Aβ42 molecule can also adopt the triple-stranded S-shaped geometry, which allows the
fibril to be stabilized by hydrogen bonds connecting the first β strands [97].
Subsequently, new variants of triple-stranded S-shaped Aβ42 fibrils, the ring-like and
out-of-register models (Fig. 5c, d), have been proposed [95, 96].

4 Conclusions

We presented atomistic MD simulations in explicit solvent that confirm the ability of
the KLVFF and FVFLM peptides to aggregate. Our data indicate that the FVFLM peptides
aggregate more rapidly and form more mechanically stable oligomers than the KLVFF
peptides [42, 101]. We have also presented approaches that might open up new avenues
for deciphering the aggregation process and provide quantitative building blocks for
the design of strategies to suppress aggregation. These are applicable not only to
preeclampsia but also to neurodegenerative diseases, specifically Alzheimer’s disease.
We reviewed recent important applications of standard MD simulations to novel aspects
of aggregation, such as the role of mutations and polymorphism in protein aggregation.
Because of the enormous medical importance of aggregation phenomena, protein aggregation studies are expected to be a major focus of both academic and pharmaceutical research. The ability of simplified coarse-grained models to reach longer timescales is critical for efficient simulations of complex biological systems. However, in systems in which solvent plays an important role and the highest level of atomistic detail is required, all-atom MD simulations in explicit solvent, alone or in combination with coarse-grained models implemented in multi-scale protocols, are better options for simulation. An appropriate future goal is to improve such multi-scale methods so that they expand the time and size limits of simulation and enable a better understanding of the biological functions of biomolecules. In the near future, such simulations will have the capacity and resources to tackle the dangerous and deadly structures of amyloid oligomers and aggregates.

Fig. 5 Polymorphism in β-amyloid fibrils. Experimentally resolved U-shaped Aβ1–40 (a) and S-shaped Aβ11–42 (b) fibril structures. Representative structures of the proposed out-of-register model (c) and ring-like model (d) of Aβ1–42 fibrils

Acknowledgements The authors thank Girik Malik for critical reading of the manuscript. M. K. acknowledges the Polish Ministry of Science and Higher Education for financial support through “Mobilnosc Plus” Program No. 1287/MOB/IV/2015/0. A. Kol. and M. K. would like to acknowledge support from the National Science Center grant [MAESTRO 2014/14/A/ST6/00088]. IAB acknowledges support from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) R01HD084628 and The Research Institute at Nationwide Children’s Hospital’s John E. Fisher Endowed Chair for Neonatal and Perinatal Research. A. Klo. acknowledges support from National Science Foundation grant DBI 1661391, and Bridge funds provided by The Research Institute at Nationwide Children’s Hospital. This research was supported in part by the High Performance Computing Facility at The Research Institute at Nationwide Children’s Hospital.

References

1. Alder, B.J., Wainwright, T.E.: Phase transition for a hard sphere system. J. Chem. Phys. 27(5),
1208–1209 (1957)
2. Anfinsen, C.B.: Principles that govern folding of protein chains. Science 181(4096), 223–230
(1973)
3. Balbach, J.J., Ishii, Y., Antzutkin, O.N., Leapman, R.D., Rizzo, N.W., Dyda, F., Reed,
J., Tycko, R.: Amyloid fibril formation by Abeta(16–22), a seven-residue fragment of the
Alzheimer’s beta-amyloid peptide, and structural characterization by solid state NMR. Bio-
chemistry 39(45), 13748–13759 (2000)
4. Barz, B., Wales, D.J., Strodel, B.: A kinetic approach to the sequence-aggregation relationship
in disease-related protein assembly. J. Phys. Chem. B 118(4), 1003–1011 (2014)
5. Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., Hermans, J.: Interaction models for
water in relation to protein hydration. Intermolecular Forces 14, 331–442 (1981)
6. Berg, B.A., Neuhaus, T.: Multicanonical algorithms for 1st order phase-transitions. Phys. Lett.
B 267(2), 249–253 (1991)
7. Berhanu, W.M., Alred, E.J., Hansmann, U.H.E.: Stability of Osaka mutant and wild-type fibril
models. J. Phys. Chem. B 119(41), 13063–13070 (2015)
8. Bernhardt, N.A., Xi, W.H., Wang, W., Hansmann, U.H.E.: Simulating protein fold switching
by replica exchange with tunneling (vol 12, pg 5656, 2016). J. Chem. Theory Comput. 13(1),
393–394 (2017)
9. Bhavaraju, M., Hansmann, U.H.E.: Effect of single point mutations in a form of systemic
amyloidosis. Protein Sci. 24(9), 1451–1462 (2015)
10. Blaszczyk, M., Kurcinski, M., Kouza, M., Wieteska, L., Debinski, A., Kolinski, A., Kmiecik,
S.: Modeling of protein-peptide interactions using the CABS-dock web server for binding
site search and flexible docking. Methods 93, 72–83 (2016)
11. Blokhuis, A.M., Groen, E.J.N., Koppers, M., van den Berg, L.H., Pasterkamp, R.J.: Protein
aggregation in amyotrophic lateral sclerosis. Acta Neuropathol. 125(6), 777–794 (2013)
12. Boczko, E.M., Brooks, C.L.: First-Principles calculation of the folding free-energy of a 3-helix
bundle protein. Science 269(5222), 393–396 (1995)
13. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M.:
Charmm—A program for macromolecular energy, minimization, and dynamics calculations.
J. Comput. Chem. 4(2), 187–217 (1983)
14. Buhimschi, I., Jing, H.W., Axe, M., Ray, W., Zhao, G.M., Huang, C.S., Song, Y., Wysocki, V.,
Buhimschi, C.: Shotgun proteomics of the urine misfoldome identifies molecular signatures
of preeclampsia subphenotypes. Am. J. Obstet. Gynecol. 212(1), S34 (2015)
15. Buhimschi, I.A., Nayeri, U.A., Zhao, G., Shook, L.L., Pensalfini, A., Funai, E.F., Bernstein,
I.M., Glabe, C.G., Buhimschi, C.S.: Protein misfolding, congophilia, oligomerization, and
defective amyloid processing in preeclampsia. Sci. Transl. Med. 6(245), 245ra92 (2014)
16. Bussi, G., Donadio, D., Parrinello, M.: Canonical sampling through velocity rescaling. J.
Chem. Phys. 126(1), 014101 (2007)
17. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A.,
Simmerling, C., Wang, B., Woods, R.J.: The Amber biomolecular simulation programs. J.
Comput. Chem. 26(16), 1668–1688 (2005)
18. Castillo, V., Grana-Montes, R., Sabate, R., Ventura, S.: Prediction of the aggregation propen-
sity of proteins from the primary sequence: aggregation properties of proteomes. Biotechnol.
J. 6(6), 674–685 (2011)
19. Chafekar, S.M., Malda, H., Merkx, M., Meijer, E.W., Viertl, D., Lashuel, H.A., Baas, F.,
Scheper, W.: Branched KLVFF tetramers strongly potentiate inhibition of beta-amyloid aggre-
gation. ChemBioChem 8(15), 1857–1864 (2007)
20. Chen, W.T., Hong, C.J., Lin, Y.T., Chang, W.H., Huang, H.T., Liao, J.Y., Chang, Y.J., Hsieh,
Y.F., Cheng, C.Y., Liu, H.C., Chen, Y.R., Cheng, I.H.: Amyloid-beta (Abeta) D7H mutation
increases oligomeric Abeta42 and alters properties of Abeta-zinc/copper assemblies. PLoS
ONE 7(4), e35807 (2012)

21. Chiti, F., Dobson, C.M.: Protein misfolding, amyloid formation, and human disease: a summary
of progress over the last decade. Annu. Rev. Biochem. 86, 27–68 (2017)
22. Coskuner, O., Wise-Scira, O., Perry, G., Kitahara, T.: The structures of the E22 delta mutant-
type amyloid-beta alloforms and the impact of E22 delta mutation on the structures of the
wild-type amyloid-beta alloforms. ACS Chem. Neurosci. 4(2), 310–320 (2013)
23. Darden, T., York, D., Pedersen, L.: Particle mesh Ewald—An N.log(N) method for Ewald
sums in large systems. J. Chem. Phys. 98(12), 10089–10092 (1993)
24. Di Fede, G., Catania, M., Morbin, M., Rossi, G., Suardi, S., Mazzoleni, G., Merlin, M.,
Giovagnoli, A.R., Prioni, S., Erbetta, A., Falcone, C., Gobbi, M., Colombo, L., Bastone, A.,
Beeg, M., Manzoni, C., Francescucci, B., Spagnoli, A., Cantu, L., Del Favero, E., Levy, E.,
Salmona, M., Tagliavini, F.: A recessive mutation in the APP gene with dominant-negative
effect on amyloidogenesis. Science 323(5920), 1473–1477 (2009)
25. Dobson, C.M.: Protein folding and misfolding. Nature 426(6968), 884–890 (2003)
26. Frenkel, D., Smit, B.: Understanding Molecular Simulation: From Algorithms to Applications.
Elsevier (1996)
27. Frydman-Marom, A., Rechter, M., Shefler, I., Bram, Y., Shalev, D.E., Gazit, E.: Cognitive-
performance recovery of Alzheimer’s disease model mice by modulation of early soluble
amyloidal assemblies. Angew. Chem. Int. Ed. Engl. 48(11), 1981–1986 (2009)
28. Garbuzynskiy, S.O., Lobanov, M.Y., Galzitskaya, O.V.: FoldAmyloid: a method of prediction
of amyloidogenic regions from protein sequence. Bioinformatics 26(3), 326–332 (2010)
29. Gazit, E.: Self assembly of short aromatic peptides into amyloid fibrils and related nanostruc-
tures. Prion 1(1), 32–35 (2007)
30. Gordon, D.J., Tappe, R., Meredith, S.C.: Design and characterization of a membrane per-
meable N-methyl amino acid-containing peptide that inhibits Abeta(1–40) fibrillogenesis. J.
Peptide Res. 60(1), 37–55 (2002)
31. Hamaguchi, T., Ono, K., Yamada, M.: Anti-amyloidogenic therapies: strategies for prevention
and treatment of Alzheimer’s disease. Cell. Mol. Life Sci. 63(13), 1538–1552 (2006)
32. Hansmann, U.H.E.: Parallel tempering algorithm for conformational studies of biological
molecules. Chem. Phys. Lett. 281(1–3), 140–150 (1997)
33. Hess, B., Bekker, H., Berendsen, H.J.C., Fraaije, J.G.E.M.: LINCS: a linear constraint solver
for molecular simulations. J. Comput. Chem. 18(12), 1463–1472 (1997)
34. Hess, B., Kutzner, C., van der Spoel, D., Lindahl, E.: GROMACS 4: algorithms for highly
efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3),
435–447 (2008)
35. Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., Simmerling, C.: Comparison
of multiple amber force fields and development of improved protein backbone parameters.
Proteins-Struct. Funct. Bioinf. 65(3), 712–725 (2006)
36. Hukushima, K., Nemoto, K.: Exchange Monte Carlo method and application to spin glass
simulations. J. Phys. Soc. Jpn. 65(6), 1604–1608 (1996)
37. Jing, H.W., Zhao, G.M., Axe, M., Buhimschi, C.S., Wysocki, V., Buhimschi, I.A.: Protein
enrichment using Congo red (CR) affinity enhances characterization of the urine misfoldome
in preeclampsia (PE). Am. J. Obstet. Gynecol. 214(1), S408 (2016)
38. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., Klein, M.L.: Comparison of
simple potential functions for simulating liquid water. J. Chem. Phys. 79(2), 926–935 (1983)
39. Jorgensen, W.L., Tirado-Rives, J.: The OPLS potential functions for proteins. Energy
minimizations for crystals of cyclic peptides and crambin. J. Am. Chem. Soc. 110(6), 1657–1666
(1988)
40. Kmiecik, S., Gront, D., Kolinski, M., Wieteska, L., Dawid, A.E., Kolinski, A.: Coarse-grained
protein models and their applications. Chem. Rev. 116(14), 7898–7936 (2016)
41. Kolinski, A.: Protein modeling and structure prediction with a reduced representation. Acta
Biochim. Pol. 51(2), 349–371 (2004)
42. Kouza, M., Banerji, A., Kolinski, A., Buhimschi, I.A., Kloczkowski, A.: Oligomerization of
FVFLM peptides and their ability to inhibit beta amyloid peptides aggregation: consideration
as a possible model. Phys. Chem. Chem. Phys. 19(4), 2990–2999 (2017)

43. Kouza, M., Co, N.T., Nguyen, P.H., Kolinski, A., Li, M.S.: Preformed template fluctuations
promote fibril formation: Insights from lattice and all-atom models. J. Chem. Phys. 142(14),
04B610_1 (2015)
44. Kouza, M., Faraggi, E., Kolinski, A., Kloczkowski, A.: The GOR method of protein secondary
structure prediction, and its application as protein aggregation prediction tool. In: Zhou, Y.,
Kloczkowski, A., Faraggi, E., Yang, Y. (eds.) Prediction of Protein Secondary Structure. vol.
1484, pp. 7–24. Humana Press, New York (2017)
45. Kouza, M., Hansmann, U.H.E.: Velocity scaling for optimizing replica exchange molecular
dynamics. J. Chem. Phys. 134(4), 01B630 (2011)
46. Kouza, M., Hu, C.K., Li, M.S.: New force replica exchange method and protein folding
pathways probed by force-clamp technique. J. Chem. Phys. 128(4), 01B618 (2008)
47. Kouza, M., Hu, C.K., Li, M.S., Kolinski, A.: A structure-based model fails to probe the
mechanical unfolding pathways of the titin I27 domain. J. Chem. Phys. 139(6),
08B615 (2013)
48. Kouza, M., Hu, C.K., Zung, H., Li, M.S.: Protein mechanical unfolding: Importance of non-
native interactions. J. Chem. Phys. 131(21), 12B608 (2009)
49. Kouza, M., Lan, P.D., Gabovich, A.M., Kolinski, A., Li, M.S.: Switch from thermal to force-
driven pathways of protein refolding. J. Chem. Phys. 146(13), 135101 (2017)
50. Kubelka, J., Hofrichter, J., Eaton, W.A.: The protein folding ‘speed limit’. Curr. Opin. Struct.
Biol. 14(1), 76–88 (2004)
51. Li, M.S., Co, N.T., Reddy, G., Hu, C.K., Straub, J.E., Thirumalai, D.: Factors governing
fibrillogenesis of polypeptide chains revealed by lattice models. Phys. Rev. Lett. 105(21),
218101 (2010)
52. Lindorff-Larsen, K., Maragakis, P., Piana, S., Shaw, D.E.: Picosecond to millisecond structural
dynamics in human ubiquitin. J. Phys. Chem. B 120(33), 8313–8320 (2016)
53. Liwo, A., He, Y., Scheraga, H.A.: Coarse-grained force field: general folding theory. Phys.
Chem. Chem. Phys. 13(38), 16890–16901 (2011)
54. Lu, J.X., Qiang, W., Yau, W.M., Schwieters, C.D., Meredith, S.C., Tycko, R.: Molecular
structure of beta-amyloid fibrils in Alzheimer’s disease brain tissue. Cell 154(6), 1257–1268
(2013)
55. Lu, Y., Wei, G.H., Derreumaux, P.: Effects of G33A and G33I mutations on the structures
of monomer and dimer of the amyloid-beta fragment 29–42 by replica exchange molecular
dynamics simulations. J. Phys. Chem. B 115(5), 1282–1288 (2011)
56. Luhrs, T., Ritter, C., Adrian, M., Riek-Loher, D., Bohrmann, B., Doeli, H., Schubert, D.,
Riek, R.: 3D structure of Alzheimer’s amyloid-beta(1–42) fibrils. Proc. Natl. Acad. Sci. U S
A 102(48), 17342–17347 (2005)
57. Marrink, S.J., Risselada, H.J., Yefimov, S., Tieleman, D.P., de Vries, A.H.: The MARTINI
force field: coarse grained model for biomolecular simulations. J. Phys. Chem. B 111(27),
7812–7824 (2007)
58. Mazor, Y., Gilead, S., Benhar, I., Gazit, E.: Identification and characterization of a novel
molecular-recognition and self-assembly domain within the islet amyloid polypeptide. J. Mol.
Biol. 322(5), 1013–1024 (2002)
59. McCammon, J.A., Gelin, B.R., Karplus, M.: Dynamics of folded proteins. Nature 267(5612), 585–590
(1977)
60. Micheletti, C., Laio, A., Parrinello, M.: Reconstructing the density of states by history-
dependent metadynamics. Phys. Rev. Lett. 92(17), 170601 (2004)
61. Moreno-Gonzalez, I., Soto, C.: Misfolded protein aggregates: mechanisms, structures and
potential for disease transmission. Semin. Cell Dev. Biol. 22(5), 482–487 (2011)
62. Morriss-Andrews, A., Shea, J.E.: Simulations of protein aggregation: insights from atomistic
and coarse-grained models. J. Phys. Chem. Lett. 5(11), 1899–1908 (2014)
63. Morriss-Andrews, A., Shea, J.E.: Computational studies of protein aggregation: methods and
applications. Annu. Rev. Phys. Chem. 66, 643–666 (2015)
64. Nam, H.B., Kouza, M., Hoang, Z., Li, M.S.: Relationship between population of the
fibril-prone conformation in the monomeric state and oligomer formation times of peptides: Insights
from all-atom simulations. J. Chem. Phys. 132(16), 04B613 (2010)

65. Nguyen, P.H., Li, M.S., Stock, G., Straub, J.E., Thirumalai, D.: Monomer adds to preformed
structured oligomers of Abeta-peptides by a two-stage dock-lock mechanism. Proc. Natl.
Acad. Sci. U S A 104(1), 111–116 (2007)
66. Ono, K., Condron, M.M., Teplow, D.B.: Effects of the English (H6R) and Tottori (D7N)
familial Alzheimer disease mutations on amyloid beta-protein assembly and toxicity. J. Biol.
Chem. 285(30), 23184–23195 (2010)
67. Peter, E.K., Pivkin, I.V., Shea, J.E.: A canonical replica exchange molecular dynamics imple-
mentation with normal pressure in each replica. J. Chem. Phys. 145(4), 044903 (2016)
68. Petkova, A.T., Yau, W.M., Tycko, R.: Experimental constraints on quaternary structure in
Alzheimer’s beta-amyloid fibrils. Biochemistry 45(2), 498–512 (2006)
69. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel,
R.D., Kale, L., Schulten, K.: Scalable molecular dynamics with NAMD. J. Comput. Chem.
26(16), 1781–1802 (2005)
70. Proctor, E.A., Fee, L., Tao, Y.Z., Redler, R.L., Fay, J.M., Zhang, Y.L., Lv, Z.J., Mercer,
I.P., Deshmukh, M., Lyubchenko, Y.L., Dokholyan, N.V.: Nonnative SOD1 trimer is toxic
to motor neurons in a model of amyotrophic lateral sclerosis. Proc. Natl. Acad. Sci. U S A
113(3), 614–619 (2016)
71. Pronk, S., Pall, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., Shirts, M.R., Smith,
J.C., Kasson, P.M., van der Spoel, D., Hess, B., Lindahl, E.: GROMACS 4.5: a high-throughput
and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7), 845–854
(2013)
72. Rhee, Y.M., Sorin, E.J., Jayachandran, G., Lindahl, E., Pande, V.S.: Simulations of the role
of water in the protein-folding mechanism. Proc. Natl. Acad. Sci. U S A 101(17), 6456–6461
(2004)
73. Rief, M., Gautel, M., Oesterhelt, F., Fernandez, J.M., Gaub, H.E.: Reversible unfolding of
individual titin immunoglobulin domains by AFM. Science 276(5315), 1109–1112 (1997)
74. Rojas, A.V., Liwo, A., Scheraga, H.A.: A study of the alpha-helical intermediate preceding
the aggregation of the amino-terminal fragment of the beta amyloid peptide (Abeta(1–28)).
J. Phys. Chem. B 115(44), 12978–12983 (2011)
75. Scheraga, H.A., Khalili, M., Liwo, A.: Protein-folding dynamics: overview of molecular
simulation techniques. Annu. Rev. Phys. Chem. 58, 57–83 (2007)
76. Scott, W.R.P., Hunenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda,
A.E., Huber, T., Kruger, P., van Gunsteren, W.F.: The GROMOS biomolecular simulation
program package. J. Phys. Chem. A 103(19), 3596–3607 (1999)
77. Selkoe, D.J.: Alzheimer’s disease: genes, proteins, and therapy. Physiol. Rev. 81(2), 741–766
(2001)
78. Shakhnovich, E.: Protein folding thermodynamics and dynamics: Where physics, chemistry,
and biology meet. Chem. Rev. 106(5), 1559–1588 (2006)
79. Siwy, C.M., Lockhart, C., Klimov, D.K.: Is the conformational ensemble of Alzheimer’s
Abeta 10–40 peptide force field dependent? PLoS Comput. Biol. 13(1), e1005314 (2017)
80. Sugita, Y., Okamoto, Y.: Replica-exchange molecular dynamics method for protein folding.
Chem. Phys. Lett. 314(1–2), 141–151 (1999)
81. Tartaglia, G.G., Vendruscolo, M.: The Zyggregator method for predicting protein aggregation
propensities. Chem. Soc. Rev. 37(7), 1395–1401 (2008)
82. Tenidis, K., Waldner, M., Bernhagen, J., Fischle, W., Bergmann, M., Weber, M., Merkle,
M.L., Voelter, W., Brunner, H., Kapurniotu, A.: Identification of a penta- and hexapeptide of
islet amyloid polypeptide (IAPP) with amyloidogenic and cytotoxic properties. J. Mol. Biol.
295(4), 1055–1071 (2000)
83. Thirumalai, D., Reddy, G., Straub, J.E.: Role of water in protein aggregation and amyloid
polymorphism. Acc. Chem. Res. 45(1), 83–92 (2012)
84. Tjernberg, L.O., Lilliehook, C., Callaway, D.J.E., Naslund, J., Hahne, S., Thyberg, J., Tere-
nius, L., Nordstedt, C.: Controlling amyloid beta-peptide fibril formation with protease-stable
ligands (vol 272, pg 12601, 1997). J. Biol. Chem. 272(28), 17894–17895 (1997)

85. Tjernberg, L.O., Naslund, J., Lindqvist, F., Johansson, J., Karlstrom, A.R., Thyberg, J., Tere-
nius, L., Nordstedt, C.: Arrest of beta-amyloid fibril formation by a pentapeptide ligand. J.
Biol. Chem. 271(15), 8545–8548 (1996)
86. Tomiyama, T., Nagata, T., Shimada, H., Teraoka, R., Fukushima, A., Kanemitsu, H., Takuma,
H., Kuwano, R., Imagawa, M., Ataka, S., Wada, Y., Yoshioka, E., Nishizaki, T., Watanabe, Y.,
Mori, H.: A new amyloid beta variant favoring oligomerization in Alzheimer’s-type dementia.
Ann. Neurol. 63(3), 377–387 (2008)
87. Tong, M., Cheng, S.B., Chen, Q., DeSousa, J., Stone, P.R., James, J.L., Chamley, L.W.,
Sharma, S.: Aggregated transthyretin is specifically packaged into placental nano-vesicles in
preeclampsia. Sci. Rep. 7, 6694 (2017)
88. Viet, M.H., Ngo, S.T., Lam, N.S., Li, M.S.: Inhibition of aggregation of amyloid peptides by
beta-sheet breaker peptides and their binding affinity. J. Phys. Chem. B 115(22), 7433–7446
(2011)
89. Viet, M.H., Nguyen, P.H., Derreumaux, P., Li, M.S.: Effect of the English familial disease
mutation (H6R) on the monomers and dimers of Abeta40 and Abeta42. ACS Chem. Neurosci.
5(8), 646–657 (2014)
90. Viet, M.H., Nguyen, P.H., Ngo, S.T., Li, M.S., Derreumaux, P.: Effect of the Tottori familial
disease mutation (D7N) on the monomers and dimers of Abeta40 and Abeta42. ACS Chem.
Neurosci. 4(11), 1446–1457 (2013)
91. Wabik, J., Kmiecik, S., Gront, D., Kouza, M., Kolinski, A.: Combining coarse-grained protein
models with replica-exchange all-atom molecular dynamics. Int. J. Mol. Sci. 14(5), 9893–9905
(2013)
92. Walti, M.A., Ravotti, F., Arai, H., Glabe, C.G., Wall, J.S., Bockmann, A., Guntert, P., Meier,
B.H., Riek, R.: Atomic-resolution structure of a disease-relevant Abeta(1–42) amyloid fibril.
Proc. Natl. Acad. Sci. U S A 113(34), E4976–E4984 (2016)
93. Wang, J.N., Zhu, W.L., Li, G.H., Hansmann, U.H.E.: Velocity-scaling optimized replica
exchange molecular dynamics of proteins in a hybrid explicit/implicit solvent. J. Chem. Phys.
135(8), 084115 (2011)
94. Wu, C., Shea, J.E.: Coarse-grained models for protein aggregation. Curr. Opin. Struct. Biol.
21(2), 209–220 (2011)
95. Xi, W.H., Hansmann, U.H.E.: Ring-like N-fold models of Abeta(42) fibrils. Sci. Rep. 7, 40787
(2017)
96. Xi, W.H., Vanderford, E.K., Hansmann, U.H.E.: Out-of-register Abeta(42) assemblies as
models for neurotoxic oligomers and fibrils. J. Chem. Theory Comput. 14(2), 1099–1110
(2018)
97. Xi, W.H., Wang, W.H., Abbott, G., Hansmann, U.H.E.: Stability of a recently found triple-
beta-stranded Abeta 1–42 fibril motif. J. Phys. Chem. B 120(20), 4548–4557 (2016)
98. Xiao, Y.L., Ma, B.Y., McElheny, D., Parthasarathy, S., Long, F., Hoshi, M., Nussinov, R.,
Ishii, Y.: Abeta(1–42) fibril structure illuminates self-recognition and replication of amyloid
in Alzheimer’s disease. Nat. Struct. Mol. Biol. 22(6), 499 (2015)
99. Yan, L.M., Velkova, A., Tatarek-Nossol, M., Andreetto, E., Kapurniotu, A.: IAPP mimic
blocks Abeta cytotoxic self-assembly: cross-suppression of amyloid toxicity of Abeta and
IAPP suggests a molecular link between Alzheimer’s disease and type II diabetes. Angew.
Chem. Int. Ed. 46(8), 1246–1252 (2007)
100. Yasar, F., Bernhardt, N.A., Hansmann, U.H.E.: Replica-exchange-with-tunneling for fast
exploration of protein landscapes. J. Chem. Phys. 143(22), 224102 (2015)
101. Kouza, M., Co, N.T., Li, M.S., Kmiecik, S., Kolinski, A., Kloczkowski, A., Buhimschi, I.A.:
Kinetics and mechanical stability of the fibril state control fibril formation time of polypeptide
chains: A computational study. J. Chem. Phys. 148, 215106 (2018)
Part IV
Use of Structural Database or Experimental Information in Modeling Protein Structure and Dynamics

Bioinformatical Approaches to Unstructured/Disordered Proteins and Their Complexes

Bálint Mészáros, Zsuzsanna Dosztányi, Erzsébet Fichó, Csaba Magyar and István Simon

Abstract Intrinsically Unstructured/Disordered Proteins (IUPs/IDPs) exist as highly flexible conformational ensembles without adopting a stable three-dimensional structure. Experimental and bioinformatical studies in the past two decades have shown that these proteins play a central role in various signaling and regulatory processes. Accordingly, their frequency in higher eukaryotes reaches high proportions and their malfunction can be connected to a wide variety of diseases. Recognizing the biological importance of these proteins motivated researchers to understand various aspects of disordered proteins and protein segments from the viewpoint of biochemistry, molecular biology and pharmacology. In general, IDPs are difficult to study experimentally because of the lack of a unique structure in their isolated form. Nevertheless, taking advantage of ongoing efforts in the collection, cataloguing, and annotation of known IDPs in publicly available databases, various bioinformatics tools were developed over the last few years. These methods enable the further identification and characterization of IDPs using only the amino acid sequence. In this chapter, after a brief introduction to IDPs in general, we present a small survey of current methods aimed at identifying disordered proteins or protein segments, focusing on those that are publicly available as web servers. We also discuss in more detail approaches that predict disordered regions and specific regions involved in protein binding by modeling the physical background of protein disorder. Furthermore, we argue that the heterogeneity of disordered segments needs to be taken into account for a better understanding of protein disorder and the correct use and interpretation of the output of disorder prediction algorithms.

B. Mészáros · Z. Dosztányi
MTA-ELTE Momentum Bioinformatics Research Group, Eötvös Loránd University,
Budapest, Hungary
e-mail: bmeszaros@caesar.elte.hu
Z. Dosztányi
e-mail: dosztanyi@caesar.elte.hu
E. Fichó · C. Magyar · I. Simon (✉)
Institute of Enzymology, RCNS, HAS, Budapest, Hungary
e-mail: simon.istvan@ttk.mta.hu
E. Fichó
e-mail: ficho.erzsebet@ttk.mta.hu
C. Magyar
e-mail: magyar.csaba@ttk.mta.hu
© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics of Biomolecules and Biomolecular Processes, Springer Series on Bio- and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_17

1 Introduction to Disordered Proteins

In approximately the first 40 years of structural biology, the central model underlying all biochemical studies was that a well-formed structure is a prerequisite for a protein to carry out its function. This notion motivated a large number of structure-function studies and has led to the structure determination of over 100,000 proteins to date. Although some proteins and protein segments were known that either did not lend themselves to structure determination or had sequence features that were seemingly incompatible with a folded structure (e.g. highly charged, repetitive sequence regions), these were regarded as consequences of imperfect experimental conditions or as exotic rarities of nature.

1.1 Re-assessing the Structure-Function Paradigm

With the explosion of available genome sequences during the 1990s, the known number of these ‘rarities’ and ‘experimental errors’ grew steadily to the point where they could no longer be relegated to a side note. This forced molecular biologists to reassess the structure-function paradigm [1]. The world of proteins was extended to include proteins that do not require a stable three-dimensional structure, even under physiological conditions, in order to fulfill their biological role [2–4]. These Intrinsically Unstructured/Disordered Proteins (IUPs/IDPs) lack a well-defined tertiary structure in isolation and fluctuate between a multitude of conformations over time and population. The importance of protein disorder is underlined by the abundance of partially or fully disordered proteins encoded in higher eukaryotic genomes [5]. Using bioinformatics methods (discussed in later sections), it was estimated that 30–50% of eukaryotic proteins contain at least one long disordered segment. That protein disorder is not merely tolerated but provides an evolutionary advantage is reflected by studies showing a steady increase in the percentage of disordered proteins in proteomes as organism complexity increases [6, 7]. Furthermore, disordered proteins are involved in many critical processes [3] such as transcription, translation, regulation, signal transduction and stress response, complementing the functional repertoire of globular proteins [8].
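Estimates of this kind rest on scanning per-residue disorder predictions for long uninterrupted disordered stretches. As a rough illustration only, the following Python sketch computes the fraction of proteins that contain at least one such segment. The score cutoff of 0.5, the minimum length of 30 residues, and all function names are assumptions made for this example; they are not parameters taken from the cited studies.

```python
from typing import Dict, List, Sequence, Tuple


def long_disordered_segments(
    scores: Sequence[float],
    cutoff: float = 0.5,   # assumed threshold: score >= cutoff means "disordered"
    min_len: int = 30,     # assumed minimum length of a "long" segment
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (0-based, end exclusive) of disordered runs
    that are at least min_len residues long."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= cutoff:
            if start is None:
                start = i          # a new disordered run begins here
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None           # the run (if any) has ended
    # Handle a run that extends to the C-terminus.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


def fraction_with_long_disorder(
    proteome: Dict[str, Sequence[float]],
    cutoff: float = 0.5,
    min_len: int = 30,
) -> float:
    """Fraction of proteins with at least one long disordered segment."""
    if not proteome:
        return 0.0
    hits = sum(
        1 for scores in proteome.values()
        if long_disordered_segments(scores, cutoff, min_len)
    )
    return hits / len(proteome)
```

The per-residue scores would come from any disorder predictor that outputs one value per residue; the resulting fraction depends strongly on the chosen cutoff and minimum length.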
Characterization of IDPs based on their functions shows that disorder can help these proteins to fulfill their functions in various ways [9, 10]. In accord with the wide variety of functions associated with it, protein disorder also comes in a variety of forms. In some cases disordered regions are short and can be found at the terminal regions of globular domains, such as the disordered N-terminal region of the eIF4E protein. Similarly, globular domains can also harbor flexible loops that appear as missing regions in solved structures. Flexible linkers that connect globular domains, such as zinc fingers, represent another type of localized disorder. In another scenario, especially in complex organisms, protein disorder often encompasses larger, domain-sized regions. These regions can exhibit different degrees of flexibility, ranging from the near-random conformation of the ACTR domain of the p160 protein, through the presence of local transient secondary structural elements (such as in the N-terminal region of p27), to compact molten globule regions with a considerable amount of secondary structure but without stable tertiary structure, such as the nuclear coactivator binding domain of the CBP protein.
Given the functional importance of disordered protein regions, their malfunction is expected to have serious biological consequences. IDPs have been implicated in various diseases, including neurodegenerative diseases, amyloidosis, diabetes, cardiovascular diseases and cancer [11–14]. Although proteins involved in these diseases were shown to have a higher disorder content, the exact role of protein disorder in the diseases themselves is not fully understood. Most results published to date probably concern the involvement of IDPs in cancer [15]. BRCA1, p27, p21 and CBP are examples of proteins with a significant amount of disorder that have been associated with various forms of cancer. One of the best characterized disordered proteins, p53, is known to be directly inactivated in more than 50% of cancers. At a more general level, a higher proportion of disordered proteins among cancer-associated proteins was also observed [15]. However, it has been shown that the link between protein disorder and involvement in cancer is not causative. In fact, both are strongly correlated with protein function, which links them together [16]. This clearly calls for a more detailed understanding of the role of protein disorder in various diseases.
Apart from basic research interests, the connection between protein disorder and its role in diseases has implications for therapeutics as well. The pharmaceutical industry is currently struggling to find promising new drug targets, despite substantial increases in research funding. Drug discovery rates seem to have reached a plateau or are perhaps even declining, suggesting the need for new strategies. Until recently, it was unclear whether targeting proteins without a well-defined structure was feasible for drug development [17]. There is now, however, a newly sparked interest in IDPs as potential drug targets [18]. This is supported by the discovery of specific inhibitors that block the interaction between a disordered region of p53 and the folded MDM2, or between the disordered helix-loop-helix type transcription factors c-Myc and Max. Recognizing the relevance of these proteins stimulated more systematic efforts aimed at their structural characterization and the determination of their mechanisms of action.

1.2 Coupled Folding and Binding of IDPs

As the above pharmaceutical examples show, the study of the interactions involving
IDPs is of special interest that has relevance not only from a therapeutic viewpoint
but also from a basic research perspective as well. With the exception of a few
known disordered proteins, such as entropic chains (where the biological function is
directly mediated by disorder without interaction, as in the case of the MAP2 projec-
tion domain, titin’s PEVK domain and the nucleoporin complex), most disordered
proteins function by binding specifically to other proteins, DNA or RNA. The lack
of structure in the unbound form has a profound effect on both the binding process
and the properties of the resulting complex [19, 20], albeit this effect varies depend-
ing on the type and structural features of the partner molecule. In the most studied
scenario, the IDP interacts with ordered (globular) protein partner(s) via a process
termed coupled folding and binding. In these cases, the flexibility of the disordered
partner decreases due to the binding. Consequently, the resulting complex usually lends
itself to traditional structure determination. The main source of protein structures,
the Protein Data Bank (PDB) [21], contains several hundred (or possibly even more)
such cases. These examples demonstrate clear differences between complexes
involving disordered proteins and complexes formed exclusively by ordered
globular proteins. Although in both cases the proteins have a stable structure in the
complex form, many of the distinct properties of bound IDPs give away their inherent
flexibility in their free form [22, 23].
In most cases, disordered segments adopt a largely extended and open conforma-
tion in the complex. Probably one of the most characteristic features of disordered
binding regions is that they are usually well localized in the sequence—in about 70%
of the cases the interacting residues can be mapped to a single continuous region of
residues. These localized interacting regions allow IDPs to have an increased modu-
larity as different binding regions can be incorporated into the same protein without
excessively increasing protein length. These binding regions can be close to each
other or can form mutually exclusive overlapping sites creating molecular switches.
The distinct binding mode of IDPs is also reflected in the physico-chemical nature
of their interfaces. These interfaces are more hydrophobic, and the preferred interac-
tion contacts are also significantly different compared to ordered proteins. As opposed
to the large number of polar-polar interactions at globular interfaces, IDPs tend to
favor hydrophobic-hydrophobic contacts with the partner protein. The increased
importance of hydrophobic interactions during binding is a hallmark of the com-
plexes involving IDPs [24].
Figure 1 shows a complex involving three proteins. On the one hand, it illustrates
a typical interaction between ordered proteins; on the other hand, it
shows an interaction between ordered proteins and a disordered protein. The
solved structure is that of the complex between the ordered cyclinA and cyclin dependent
kinase 2 (CDK2) proteins, inhibited by the disordered p27 protein. The interaction
between cyclinA and CDK2 plays an essential role in the control of the transition
between the S and G2 cell-cycle phases of eukaryotic cells.

Bioinformatical Approaches to Unstructured/Disordered Proteins … 565

Fig. 1 Example of interfaces between two ordered proteins and a disordered protein. The ordered
CDK2 and cyclinA are shown in blue and purple surface representations, respectively. The disordered
p27 is shown in golden cartoon representation. The figure was generated from the 1jsu PDB file

The specific interaction between the two proteins enables CDK2 to bind ATP due to slight structural
rearrangements emerging during the binding. The interaction surface is dominated
by polar and charged residues and is relatively planar. The interface is of moderate
size compared to the size of the proteins with about 13–14% of the residues of both
cyclin A and CDK2 visible in the structure being involved directly in the binding.
A strikingly different molecular recognition scenario is presented by the disordered
p27 in the complex. The segment of p27 involved in the binding shows only a slight
helical preference in the unbound form. However, several regions adopt a well-defined
α-helix upon binding. The group of the most strongly interacting residues of
p27 is dominated by hydrophobic/aromatic residues that fit into hydrophobic clefts
and grooves on the surface of the cyclinA-CDK2 complex. The structure shows that
the interacting region of p27 forms a largely linear binding site in the sense that all
residues of p27 interacting with the ordered partner complex are sequentially close.
This enables p27 to incorporate a significantly larger fraction of its residues into the
interaction, and accordingly over two-thirds of the visible residues of p27 are directly
involved in the binding.
In the case of disordered proteins, the coupling between folding and binding
is not only apparent in the structural properties of the resulting complex, but also
in the thermodynamics and energetics of the binding. Following the basic rules of
thermodynamics, the resulting protein complex corresponds to the state with an
energy minimum. However—as opposed to the interaction of globular proteins—in
complexes involving IDPs, the loss of entropy during the folding of the disordered
partner plays a much larger role, which results in a weaker overall binding compared
to that of globular proteins. This way the specificity, which is basically independent
of the entropic terms, is uncoupled from binding strength [20]. This enables IDPs
to form specific, yet transient interactions, which are indispensable for regulatory
and signaling processes [3, 9, 10]. The increased rates of association and dissociation
of disordered proteins increase their temporal binding capacity. Furthermore,
disordered proteins are able to incorporate a higher fraction of their surface in the
binding interface, which increases their interaction capacity in a spatial sense as well
[25]. Consequently, disordered proteins in general can mediate a large number of
interactions thus serving as hubs of protein-protein interaction networks [26].
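The thermodynamic argument above can be made concrete with a small numerical sketch. It assumes the standard relations ΔG = ΔH − TΔS and Kd = exp(ΔG/RT) with a 1 M standard state; the function names and all numerical values are hypothetical and chosen only for illustration, not taken from measurements on any real complex.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(delta_h, t_delta_s):
    """Return dG = dH - T*dS, with all terms given in kcal/mol."""
    return delta_h - t_delta_s

def dissociation_constant(delta_g, temperature=298.15):
    """Return Kd (mol/L) for a binding free energy in kcal/mol (1 M standard state)."""
    return math.exp(delta_g / (R_KCAL * temperature))

# Hypothetical values: identical favourable enthalpy, but the disordered
# partner pays a larger entropic penalty because it folds upon binding.
dg_globular = binding_free_energy(delta_h=-20.0, t_delta_s=-8.0)
dg_idp = binding_free_energy(delta_h=-20.0, t_delta_s=-13.0)

kd_globular = dissociation_constant(dg_globular)
kd_idp = dissociation_constant(dg_idp)
```

With these illustrative inputs the same interface energetics give a dissociation constant that is weaker by several orders of magnitude for the disordered partner, which is the sense in which specificity and binding strength can be uncoupled.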

1.3 Mutual Synergistic Folding

In the previously described scenario of coupled folding and binding, the interacting
IDP partner reaches the ordered state by using the ordered protein partner(s) as a
template that drives the folding process. However, some IDPs are able to adopt stable
structures during interactions without a pre-existing folded partner. As a counterpart
to coupled folding and binding, in the case of Mutual Synergistic Folding (MSF)
[27] all interacting partners are IDPs without stable tertiary structures outside of
the complex. During MSF the folding of all participating protein partners happens
at the same time coupled to the interaction in a synergistic manner. The first well-
documented case of MSF was presented in the early 2000s describing the co-folding
of ACTR and CBP [27]. This study presented multiple lines of evidence from structural and
thermodynamic analysis concerning both the unstructured state of the interactors, and
the fact that they combine with high affinity to form a cooperatively folded helical
heterodimer. The complex structure arising from this specific MSF interaction is
shown in Fig. 2.
The inherent difficulty of the experimental analysis of IDPs had a profound effect
on the advancement of the field of MSF studies. While interest was sparked early on
during the development of the IDP field in general, targeted analyses had been scarce
and typically considered only a handful of known examples [22, 28–30]. However,
recent efforts have produced a comprehensive and systematic catalogue of MSF
complexes, serving as grounds for future analysis of this binding mode [31]. While
the detailed understanding of the biophysical and biochemical features of MSF is
yet to be achieved, complex structures consisting exclusively of IDPs offer deep
insights into the underlying mechanisms even at a first glance.
As all constituent proteins in an MSF complex lack a stable structure in their
unbound form, there is no single ordered template upon which the folding occurs.
Instead, the participating IDPs form a stable hydrophobic core together, with typically
all partners donating a sufficient number of hydrophobic residues. The emerging
structures often closely resemble single domain ordered protein structures in terms
of hydrophobic core size, secondary structure content or average contact numbers,
albeit being composed of several chains.

Fig. 2 Example of a protein complex formed via mutual synergistic folding. ACTR and CBP are shown
in green and blue cartoon representations, respectively. The figure was generated from the 1kbh PDB file

In accord, individual IDPs involved in MSF
have fairly high hydrophobic contents, on par with that of ordered proteins. Due to
this property, IDPs undergoing MSF present a special class of disorder. The most
universal hallmark of IDPs is their lack of hydrophobic residues and high net charge,
and IDPs capable of MSF defy this quasi-ubiquitous feature. Instead, their disordered
nature stems from the special sequential arrangement of their hydrophobic residues.
Furthermore, while their composition is compatible with folding, their sizes are not;
the sheer number of hydrophobic residues in a single IDP partner is insufficient for the
independent adoption of a folded structure. This shows that the deep understanding
and future possible modulation of the whole spectrum of IDP-mediated interactions
require targeted, systematic analyses in years to come.

1.4 Experimental Techniques and the Need for Bioinformatics

The detailed structural and functional characterization of disordered proteins and
their complexes is a challenging task [32]. On one hand, as disordered proteins are
generally involved in regulatory functions, their expression levels are relatively low
on average, making them more difficult to isolate. On the other hand, disordered
regions are more prone to degradation by proteolytic enzymes than well folded pro-
teins. Furthermore, the existing experimental procedures are highly biased towards
ordered proteins, and most techniques provide only indirect information about disor-
der [3]. Consequently, the current list of experimentally verified disordered proteins
is rather limited with numbers in the hundreds or low thousands. This is especially
alarming in light of the fact that about half of human proteins are estimated to contain
at least one longer disordered segment. This discrepancy faithfully reflects the dif-
ficulties of the experimental identification of disordered proteins. Because of these
difficulties, bioinformatics tools that target the prediction of protein disorder from the
sequence play a very important role in the identification and characterization of IDPs
as only these tools can give us information about their basic properties, evolution
and functions on a large scale.

2 Resources for Disordered Proteins

The experimental difficulties often hindering efficient analyses of IDPs called for the
development of bioinformatics/theoretical approaches. Possibly the most prominent
computational task addressed almost instantly in the field of IDPs is the development
of efficient algorithms for the prediction of protein disorder from the amino acid
sequence. As with all bioinformatics prediction algorithms, IDP predictions present
issues at several different levels. These include the buildup of the prediction algorithm
itself; but the proper choice of training and testing databases and the correct evaluation
of the resulting method are equally important. In the following section we give a
brief summary of the resources enabling the development of IDP-focused prediction
methods; while in the next chapter we give an overview of the basic concepts and
techniques of disorder prediction methods.

2.1 Basic Sequence Properties of IDPs

Disordered proteins have very distinct sequence properties compared to globular
proteins. These differences were already apparent when only a handful of examples
for protein disorder were known. The first analyses of sequences of disordered pro-
teins revealed that in comparison to globular proteins, these proteins are generally
enriched in polar and charged amino acids at the expense of aliphatic and aromatic
amino acids. At closer inspection, however, various subsets of disordered protein
sequences exhibited further variations in their sequential biases. Differences in the
amino acid composition could be observed depending on the experimental method
used to identify disordered regions (e.g. CD, NMR, or X-ray crystallography) [33], or
on the location in the sequence (N- and C-terminal, middle regions) [34]. Shorter and
longer segments of protein disorder also exhibited slightly different amino acid pref-
erences [35]. For example, short disordered regions were more depleted in Ile, Val,
and Leu, while long disordered regions were more enriched in Lys, Glu, and Pro but
were less enriched in Gln. In addition, long disordered regions were depleted in Gly
and Asn, while short disordered regions were enriched in Gly and Asp [36]. Although
these differences were smaller compared to the differences observed between ordered
and disordered proteins in general, they highlighted significant heterogeneity within
the class of disordered proteins.
Beside amino acid compositional bias, another indication of the unusual sequence
properties of disordered proteins is the presence of low complexity regions. These
regions often stand out even by simple visual inspection of the sequence, as they
usually appear as long stretches containing only one or a few amino acids. This is
an indication of low compositional complexity and can be characterized using the
concept of sequence entropy. Compositional complexity measures were introduced
first for the purpose of sequence alignments and searches, and can be viewed as
the earliest attempt to identify non-globular proteins [37, 38]. Globular proteins
have high compositional complexity, very similar to random sequences. In contrast,
certain disordered proteins often contain low complexity segments [26], and the more
biased the amino acid composition of disordered segments, the more likely it is to be
also of low complexity [39]. Nevertheless, the overlap between disordered and low
complexity regions is far from complete: many disordered proteins are practically
indistinguishable from ordered proteins based on their sequence complexity alone,
while low complexity regions can also include ordered structural proteins or proteins
with strong structural propensity, like collagens, coiled coils or other fibrous proteins.
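The notion of compositional complexity described above can be illustrated with a minimal sketch based on the Shannon entropy of the residue composition within a sliding window. This is not the algorithm of any published complexity filter; the function names, the window length of 12 and the cutoff of 2.2 bits are assumptions made for the example.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def windowed_entropy(sequence, window=12):
    """Entropy of every full-length window along the sequence."""
    return [shannon_entropy(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)]

def low_complexity_windows(sequence, window=12, cutoff=2.2):
    """Start indices of windows whose entropy falls below the (assumed) cutoff."""
    return [i for i, h in enumerate(windowed_entropy(sequence, window))
            if h < cutoff]

# A poly-Q stretch followed by a compositionally diverse stretch.
example = "Q" * 12 + "ACDEFGHIKLMN"
flagged = low_complexity_windows(example)
```

A homopolymeric window has an entropy of zero, while a window of twelve different residues reaches log2(12) bits, so only windows dominated by the poly-Q stretch are flagged in this toy sequence.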

2.2 Databases

For a more detailed understanding of protein disorder, comprehensive databases are
needed. This motivated the establishment of the DisProt database [40], which aims to
collect disordered proteins and protein regions characterized by various experimental
techniques. Entries in this database are collected from the literature and contain at
least one experimentally verified disordered region. Detection methods include X-
ray crystallography, NMR spectroscopy, CD spectroscopy (both far and near UV)
and protease sensitivity, in addition to several other less frequently used experimental
techniques. Apart from the location of the collected disordered regions, the database
also contains annotations of the used detection methods and crosslinks to other
databases. The current version of DisProt [41] contains 2167 IDP regions in 803
proteins.
Another source of disordered proteins and protein regions is the depository for
high resolution structures, the PDB database [21, 42]. Although this database is
expected to be dominated by ordered proteins, indirectly it also contains information
about protein disorder. In protein structures solved by X-ray crystallography, disor-
der is defined by missing electron density. In NMR structures, high conformational
variability across different NMR models is considered as an indication of disorder.
In both cases, disordered residues usually appear within the context of ordered struc-
tures, either as terminal regions or short loops within an otherwise ordered protein.
The length of these disordered regions ranges from a single residue to hundreds, but
most are shorter than 30 residues, in contrast to disordered regions in the
DisProt database, which are generally longer. Various comparisons indicate that the
two databases differ not only in the length of these segments, but encompass two
different flavors of protein disorder.
While data from DisProt and the PDB can highlight disordered protein regions in
general, there are other databases that have a more specialized focus, concentrating on
the interacting segments of IDPs. Recent efforts in the systematic collection of inter-
actions between IDPs and protein partners have produced two distinct, yet closely
related databases. The Disordered Binding Sites (DIBS) [43] database collects cases
where an IDP interacts with an ordered protein partner via coupled folding and
binding. In contrast, the Mutual Folding Induced by Binding (MFIB) [31] database
is the repository of complexes formed exclusively by IDPs via mutual synergistic
folding. While the target interactions differ for the two databases, their underlying
approach, their architecture, and the information provided by them are highly sim-
ilar. Both databases contain the complex structures of the listed interactions. These
entries are manually inspected by database curators with a focus on the validity of
the experimental evidence for the ordered/disordered state of constituent proteins
to ensure reliability. DIBS and MFIB also provide structural and functional annotations
of the complexes and crosslinks to other databases. In addition, DIBS
also collects the dissociation constants of the interactions where available, as well as
the description of potential post-translational modifications modulating the binding
strength.
In contrast to the databases discussed so far, IDEAL (Intrinsically Disordered
proteins with Extensive Annotations and Literature) [44] is a collection of both
generic IDP regions and also disordered binding segments (albeit with a lower
stringency considering experimental verification). The database contains manual
annotations on IDP regions, and also contains annotations about interacting regions,
post-translational modification sites, and structural domain assignments. While the
primary focus of IDEAL used to be the general collection of disordered protein
regions, the newer incarnations of the database feature a new functional class of
IDP regions, called protean segments (ProS). A ProS is defined as a region of the
sequence, which is suspected to be disordered in isolation (although at times lacking
experimental proof) but which is known to be ordered when bound to a protein partner.
Such defined protein regions coincide with regions undergoing coupled folding and
binding/mutual synergistic folding; thus, with the introduction of ProS, conceptually
IDEAL can be considered to lie between generic disorder databases and disordered
binding site repositories.
All above databases that incorporate IDP-mediated interactions require a stable
bound structure for IDPs. However, several protein- and nucleic acid-interacting IDPs
were found to retain varying degrees of flexibility in their complex form. In recent
years the number of these so-called fuzzy complexes [45] steadily grew and a dedi-
cated database, FuzDB was established to collect experimentally verified instances of
such interactions [46]. FuzDB currently contains over 100 fuzzy complexes, together
with their structural and biochemical evidence for disorder. The database also pro-
vides interpretation of experimental results, together with additional information
about the interactors (such as regulatory sites generated by alternative splicing or
post-translational modifications).
The foundation of databases discussed so far is provided by the experimental
evidence for the disordered nature of catalogued proteins. However, as experimen-
tal characterization of IDPs is inherently difficult, the number of validated IDPs (in
the low thousands) lags far behind the number of known protein sequences (in the
hundreds of millions). To bridge this gap, several IDP-focused databases opted to include
automatically generated predictions of intrinsic disorder (see the next chapters for
an overview of used prediction methods). Arguably the most central such resource
is the MobiDB database [47]. MobiDB combines experimental information and pre-
diction results, and is able to provide disorder annotations for all protein sequences
included in UniProt. On the experimental information side, MobiDB incorporates
annotations of disorder collected from DisProt and PDB, and derived from publicly
available NMR chemical shift data [48]. Furthermore, MobiDB also integrates infor-
mation of binding regions inside IDPs, taken from previously discussed repositories.
These binding regions are termed linear interacting peptides (LIPs), which concep-
tually correspond to DIBS/MFIB sites or to the protean segments in IDEAL. On
the prediction annotation side, MobiDB includes putative disorder generated by ten
selected disorder prediction methods [49].
Another central resource for protein disorder annotations heavily utilizing prediction
algorithms is the Database of Disordered Protein Prediction (D2P2) [50].
The main purpose of D2P2 is to enable the quick and efficient disorder analysis of
complete proteomes. It provides previously computed disorder prediction outputs for
more than 1000 whole proteomes using 9 algorithms. As we will show in the next
chapter, various prediction methods utilize a wide range of computational approaches
and as a result their run times vary widely. For some slower methods, evaluating
whole proteomes would simply not be feasible and, as a result, these
methods would be practically absent from large scale analyses. This hindrance is
addressed by D2P2, enabling such studies without the need for unrealistic computing
capacity on behalf of users.
The various datasets including experimental evidence for disorder are essential
components of disorder prediction methods for both optimization and evaluation.
During the development of methods, various sequence properties of a compiled
dataset of disordered proteins are contrasted with those of a dataset of globular proteins. It is
worth noting that existing datasets of experimentally verified ordered and disordered
regions can contain many mis-classified segments. The source of misclassification
can be crystal contacts, complex formations or binding of cofactors, all of which can
force regions that are flexible in isolation to become structured. Many disordered
regions are characterized by semi-quantitative experiments only, lacking position
specific information, therefore they are even more prone to misclassification. Fur-
thermore, the order/disorder status can also be sensitive to various environmental
conditions [51, 52]. The number of known disordered segments is still relatively
low and sequence databases are likely to contain many more disordered proteins
that are yet uncharacterized. The lack of sufficiently large datasets and the noise in
the assignment of order and disorder represent a serious limitation in developing
accurate prediction methods for protein disorder.

3 Overview of Protein Disorder Prediction Techniques

The compositional bias of disordered proteins suggests that protein disorder is
encoded in the amino acid sequence similarly to the way the folded structure of
globular proteins is encoded. This enables the prediction of protein disorder from the
amino acid sequence. Currently, more than 60 prediction methods have been pub-
lished. Some methods utilize machine learning approaches while others are based
on simple biophysical considerations. The simplest methods, however, rely on a
single amino acid scale [19, 53, 54]. In general, properties strongly correlating with
hydrophobicity, such as flexibility and coordination number, had the highest discrim-
inatory power among various amino acid properties [55, 56]. Another property, the
tendency of each amino acid to participate in regular secondary structure elements
as opposed to being in coil structures, indirectly also correlates with hydrophobicity and
is utilized in the Globplot method [57]. The increase in the size of datasets allowed
the application of a brute-force approach to directly optimize a specific amino acid
scale to discriminate between the two classes [56]. Although in some cases a single
effect captured by the amino acid scale is sufficient to explain disorder, generally
more sophisticated methods are needed to account for this complex phenomenon.
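As a minimal sketch of a single-scale predictor of the kind described above, the following example smooths a per-residue amino acid scale over a sliding window and flags residues below a cutoff. The Kyte-Doolittle hydropathy scale is used here only as a familiar stand-in for a hydrophobicity-related scale; the window length, the cutoff and the function names are assumptions for illustration and do not reproduce any of the cited methods.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def smoothed_scale(sequence, scale=KYTE_DOOLITTLE, window=21):
    """Average the scale over a window centred on each residue.

    Windows are truncated at the termini; residues missing from the
    scale (e.g. 'X') are skipped.
    """
    half = window // 2
    profile = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        values = [scale[aa] for aa in sequence[lo:hi] if aa in scale]
        profile.append(sum(values) / len(values) if values else 0.0)
    return profile

def predict_disorder(sequence, cutoff=-1.0, window=21):
    """Flag residues whose smoothed hydropathy is below the assumed cutoff."""
    return [value < cutoff for value in smoothed_scale(sequence, window=window)]
```

In practice the scale and the cutoff would be chosen, or directly optimized, on datasets of ordered and disordered proteins rather than fixed by hand as in this sketch.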
The field of predicting protein disorder has benefited from the experience of
earlier prediction methods developed for various problems in structural biology. In
the algorithmic sense, the prediction of protein disorder can be viewed as a classic
binary classification problem. Several standard machine learning techniques have
been developed and applied for similar problems, such as the prediction of secondary
structure, solvent accessibility, functional sites, or transmembrane helices. The most
commonly used techniques are support vector machines (SVMs) and neural net-
works. The advantage of machine learning approaches is that they can automatically
distill some basic relationships between the input sequence features and the output
property. In the specific case of the prediction of protein disorder, the novelty of
most methods based on machine learning approaches lies in the representation of
input information, rather than in the algorithms themselves. As an input, usually the
amino acid sequence within a local sequence window is used. In some cases the
amino acid composition or an amino acid propensity within a given window is cal-
culated instead, to reduce the dimensionality of the input data. Some methods also
incorporate information about low complexity segments as it can be an important
component of a certain type of disorder [26, 39].
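The window-based input representation mentioned above can be sketched as follows: for each residue, the amino acid composition of the surrounding window is computed, giving a 20-dimensional feature vector instead of a full encoding of every position in the window. The window length of 21 and the function names are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_composition(sequence, center, window=21):
    """Fraction of each of the 20 standard amino acids around one residue.

    The window is truncated at the termini. Non-standard residues count
    towards the window length but not towards any fraction.
    """
    half = window // 2
    lo, hi = max(0, center - half), min(len(sequence), center + half + 1)
    segment = sequence[lo:hi]
    return [segment.count(aa) / len(segment) for aa in AMINO_ACIDS]

def composition_features(sequence, window=21):
    """One composition vector per residue, usable as input to a classifier."""
    return [window_composition(sequence, i, window)
            for i in range(len(sequence))]

features = composition_features("ACDEACDEAC", window=5)
```

Each row of `features` could then be passed, together with an order/disorder label for the central residue, to a standard classifier such as a neural network or an SVM.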
Additional predicted properties, including secondary structure or solvent acces-
sibility can be also plugged into machine learning techniques [58]. However, the
benefit from these predictions seems to be much smaller than in some other areas
of structure predictions. The likely reason for this is that these methods have been
exclusively trained on ordered proteins, and should be used only with caution for
disordered proteins. For example, predicted secondary structure does not necessarily
contradict protein disorder. Often these regions correspond to transient secondary
structural elements, or—in the case of disordered binding regions—to the confor-
mation adopted in the complex form [59]. In the isolated form, with the exception
of highly specific scenarios [60], predicted secondary structures are not expected to
be stable for disordered proteins.
The incorporation of sequence profiles calculated from evolutionarily related
sequences is also more problematic in the case of disordered proteins. The strong
sequence bias present in these proteins, especially in low complexity segments can
distort the result of sequence similarity searches. Generally, disordered proteins are
evolutionarily less conserved [61], but the dynamic behavior and the associated
molecular function can be preserved even in the absence of apparent sequence con-
servation [62]. As a result, alignments are a less reliable source of information for
disordered protein segments. Although several methods use evolutionary information
in the prediction, it leads to a smaller boost in the performance of disorder prediction
methods than observed for example in the case of secondary structure prediction
methods [63].
Most prediction methods provide predictions on a per-residue basis. The performance
of disorder predictions can be evaluated using the Matthews correlation
coefficient (MCC), balanced accuracy (ACC) that weighs the performance on the
positive and negative datasets based on the respective size of the datasets, and the
area under the receiver operating characteristic (ROC) curve (AUC, with possible
values ranging from 0.5 for random predictions to 1.0 for perfect predictors). Since
2002, the performance of various disorder prediction methods has been critically
assessed at the CASP (Critical Assessment of Protein Structure Prediction) experi-
ments [64–69]. CASP evaluations are restricted to residues with missing X-ray coor-
dinates and there is no similar blind testing for long disordered regions. According to
the CASP10 assessment, top disorder prediction methods can reach 0.90 AUC [70]
and around 70% ACC (evaluation of disorder prediction methods was discontinued
after CASP10 due to an insufficient amount of data on disordered residues). Tests on
disordered regions culled from DisProt usually place different methods at the top.
On these datasets, methods can discriminate between ordered and disordered segments
with around 80% accuracy on a per-residue basis [63, 71–73]. A recent novel
benchmarking dataset collected in the new release of the DisProt database confirmed
that disorder predictors work quite well, especially for long disordered segments.
However, a large fraction of such regions still goes virtually undetected [73]. Gen-
erally, the performance of disorder predictors critically depends on the dataset used
for testing, or more generally, the type of disorder studied. It is also influenced by
the evaluation criteria. Nevertheless, modern disorder prediction methods can be
considered quite reliable in general.
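The three evaluation measures named above can be computed from per-residue labels and predictions as in the following sketch. It uses the standard textbook definitions (balanced accuracy as the mean of sensitivity and specificity, and a rank-based AUC); the function names and the toy data are assumptions, and the exact conventions used in CASP or other benchmarks may differ in detail.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, 1 = disordered."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (both classes must be present)."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if the denominator vanishes."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    """Probability that a disordered residue outscores an ordered one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 3 disordered and 5 ordered residues.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
calls = [1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.1, 0.4]
```

Because ordered residues usually outnumber disordered ones, balanced accuracy and MCC are more informative than plain accuracy, which a predictor could inflate simply by calling every residue ordered.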

3.1 Machine Learning Methods

Comprehensive reviews of published methods appeared in the literature recently [70,
73, 74]. The exhaustive enumeration of all present algorithms is beyond the scope
of this chapter; instead, our aim is to cover the basic approaches in this field. We
focus on those methods which are publicly available via web servers or standalone
programs, and provide residue based predictions. A summary of these methods can
be found in Table 1 at the end of the section.
The first member of the PONDR-family of disorder prediction methods was
PONDR VL-XT [39]. The training set of this method was composed of variously
characterized long (>30 residues) disordered regions [75], and two additional training
sets of X-ray-characterized terminal regions, one for the amino-terminus and one for
the carboxy-terminus [34]. The method uses the amino acid compositions, attributes
derived from compositions such as sequence complexity, and attributes derived from
compositions via some function or scale such as hydropathy, net charge, etc. The
attributes were selected by analyzing their discriminatory power, their orthogonality,
and based on their effect on the performance. Then, the various types of attributes
were weighted and combined via artificial neural networks (ANNs). The resulting
method was found particularly useful to pinpoint certain regions that are candidates
for undergoing disorder-to-order transitions [76, 77].
PONDR-VL3-BA [78, 79] also uses an artificial neural network but the training
dataset was much larger compared to that of VL-XT. The input is formed by 18
amino acid frequencies, the average flexibility and sequence complexity, calculated
within a window of 41 residues. Sequence profiles generated by PSI-BLAST [80] can
also be added as an input attribute to improve the accuracy of predicting disordered
regions. Similarly to VL-XT, a neural network with a fully connected hidden layer of
ten neurons was trained on the specific datasets and it outputs a value for the central
amino acid in the window. These predictions are augmented by a specific predictor
that was trained to recognize the boundary between ordered and disordered regions.
Based on this, the closest maximum prediction from the boundary predictor became
the new boundary between the ordered and disordered regions.
DisEMBL, another computational tool for predicting disordered/unstructured
regions, was developed by Linding et al. [81]. Because of some uncertainties in the
definition of protein disorder, they developed three separate neural network based
predictors using alternative definitions of disorder. These correspond to missing residues
indicated by REMARK 465 in the PDB files, residues with high B-factor (hot loops)
and residues within loops and coils. The differences in these three predictors under-
lined the distinct features of each group. By investigating the relationships between
the different disorder definitions, they found that hot loops showed less correlation
with coils and more with the missing residues.
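The first of these definitions can be made concrete with a short sketch that collects the residues listed under REMARK 465 in a PDB-format file. This is only an illustration, not part of DisEMBL: it assumes the usual REMARK 465 layout in which a column header line (`M RES C SSSEQI`) precedes the residue list, and the function name `missing_residues` is hypothetical.

```python
def missing_residues(pdb_text):
    """Return (chain, residue name, residue number) tuples from REMARK 465.

    Residue numbers are kept as strings because they may carry an
    insertion code. Records without a chain identifier are skipped.
    """
    missing = []
    in_table = False
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        fields = line[10:].split()
        if not in_table:
            # The residue list starts after the column header line.
            in_table = fields[:3] == ["M", "RES", "C"]
            continue
        if len(fields) == 4:
            # A leading model number is present (e.g. multi-model entries).
            fields = fields[1:]
        if len(fields) == 3:
            res_name, chain, number = fields
            missing.append((chain, res_name, number))
    return missing
```

Residues returned by such a routine would be labeled as disordered in a training set of this type, while all residues with coordinates would be labeled as ordered.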
Using an original approach, RONN (Regional Order Neural Network) [82] recog-
nizes disordered segments based on their similarity to well-characterized prototype
sequences with known disordered status. In this method, sub-sequences of a query
sequence are aligned to all prototype segments, and the similarity to these sequence
fragments is calculated using a standard mutation matrix. The resulting homology
scores are converted into distances and are used to train a modified version of radial
basis function networks called a bio-basis function neural network.
Along with artificial neural networks, support vector machines (SVMs) are the most
widely used class of standard machine learning algorithms. SVMs have several
advantages over neural networks as they are less prone to overfitting, can be
trained more efficiently and handle noisy datasets better. SVMs can also handle
unbalanced datasets, which is the case for disordered residues defined based on
missing residues, as these usually comprise only 10% of all residues. The first method
utilizing SVMs for the prediction of disorder was implemented in DISOPRED2 [7].
This method was trained on a large dataset of missing residues of high resolution
structures. Separate models were created for N- and C-terminal regions besides the
model for the middle regions of the sequences. The input of the predictions is a
sequence profile for each protein, generated using a PSI-BLAST search [80] against
a filtered sequence database. One of the keys to the high accuracy of DISOPRED2
was that it was trained by placing a larger cost on false positive predictions. The latest
version of this method is DISOPRED3 [83], which has a two-layer design. The first
layer uses three models to predict disorder: DISOPRED2, a neural network based
method trained on long disordered regions, and a model based on nearest neighbour
prediction of disorder. These predictions are combined by a second layer using a
neural network that helps to increase the accuracy of the predictions. DISOPRED3
is also capable of predicting disordered binding sites using SVM based techniques.

Table 1 Summary of the 16 analyzed disorder prediction methods

Name of method | Training dataset for disorder | Algorithm | Input data of the algorithm
PONDR VL-XT [39] | XT: Missing residues in X-ray structures (terminal regions); VL: Variously characterized long disordered segments | Neural network | Amino acid frequencies, amino acid propensities
PONDR-VL3-BA [88] | Variously characterized long disordered segments | Neural network | Amino acid frequencies, amino acid propensities, sequence complexity
DisEMBL [80] | Missing residues in X-ray structures | Neural network | Single sequence
RONN [82] | Missing residues in X-ray structures | Bio-basis function neural network | Single sequence
DISOPRED3 [83] | Missing residues in X-ray structures | SVM and neural network | PSI-BLAST PSSM
DeepCNF-D [87] | X-ray structures | Conditional neural fields | Amino acid propensities, sequence complexity
DISpro [58] | Missing residues in X-ray structures | 1D recursive neural network | PSI-BLAST PSSM, secondary structure, solvent accessibility prediction
OnD-CRF [86] | Missing residues in X-ray structures | Conditional random fields | Single sequence, secondary structure prediction
PONDR-VSL2B [63, 88] | Missing residues in X-ray structures and DisProt | SVM | Amino acid propensities, sequence complexity, (PSI-BLAST PSSM), (secondary structure prediction)
ESpritz [89] | Missing residues in X-ray structure, NMR mobile regions and DisProt | Neural network | Single sequence
PONDR-FIT [90] | X-ray structures and DisProt | Meta-server | Predicted disorder
MFDp2 [97] | Structures from PDB and DisProt | Meta-server | Predicted disorder, various sequence features, disorder content
disCoP [94] | Ordered/disordered annotations from PDB and DisProt | Regression | Predicted disorder
DISOclust3 [84, 99] | Structures from PDB | Consensus | Structural models, predicted disorder
IUPred [92] | None | Biophysical model | Amino acid composition
IsUnstruct [100] | | Biophysical model | Amino acid propensities

Column 2 shows the dataset on which the methods were trained, column 3 shows the basic
implemented algorithm and column 4 shows the quantities the algorithm uses to calculate the final
prediction score. Abbreviations: SVM Support vector machine; PSSM Position specific scoring
matrix

DISOclust3 also relies on the DISOPRED3 predictions but it also incorporates
structural information for the prediction of disorder. The main premise of this
approach is that structured residues are conserved in three-dimensional space across
multiple structural models. Residues missing or exhibiting high variations in certain
positions across the models are highly likely to be disordered. These predictions are
combined with the results generated by DISOPRED3. DISOclust3 is now part of the
IntFOLD [84] platform.
In the case of feed forward neural networks and SVMs, the prediction for each
residue is independent of the prediction for other residues. In contrast, recurrent
networks can also propagate data from later processing stages to earlier stages. Such
a technique is used in DISpro [58]. It employs a one-dimensional recursive neural
network that combines the flexibility of a Bayesian model with the fast and convenient
parameterization of neural networks. The method also incorporates evolutionary
information as well as predicted secondary structure and solvent accessibility. Instead
of using a fixed window size, the prediction at each position depends on the entire
sequence through a recursive network of neighboring positions. DISpro is part of the
SCRATCH server [85], a protein structure and structural feature prediction server.
Another approach that can take into account the predicted disorder tendency of
neighboring positions is called OnD-CRF [86], which utilizes conditional random
fields for the prediction of protein disorder. The method relies on features generated
from the amino acid sequence and from secondary structure prediction. The training
data set was derived from high-resolution crystal structures that lack coordinates for
those amino acids that are considered to be disordered, and the performance was
optimized with respect to the area under the ROC curve.
Conceptually, deep networks (DNs) are similar to neural networks but contain
more layers and are trained in a slightly different manner. DeepCNF-D [87] com-
bines the advantages of both conditional neural fields (CNF) and deep convolutional
neural networks. It captures long-range sequence information by CNFs and exploits
interdependency between adjacent order/disorder labels, but also assigns different
weights for each label during training and prediction to solve the label imbalance
issue that was known as a long-standing problem in order/disorder prediction.
The methods described so far are all specific to one type of protein disorder
only, represented either by the DisProt [40] dataset or missing residues of X-ray
structures. When tested on the other type of dataset, their performance was significantly
lower. This problem was first addressed by the PONDR VSL2 method
[63, 88]. It is composed of two separate predictors optimized for short and long (>30
residues) disordered regions that are combined by an independent meta-predictor.
Linear SVM was chosen as the learning algorithm, because it has similar performance
but better generalization ability compared to other techniques. The inputs of all three
methods are composed of various amino acid propensities, sequence complexity, and
optionally sequence profiles and secondary structure predictions, calculated within
a sliding local window. At the first level, the two methods predict short and long
disordered segments. A third predictor then determines the optimal weight to combine
the output of the two composite predictors. This architecture ensured that PONDR
VSL2 has a more balanced performance on disordered segments of various lengths.
The machine learning predictor of ESpritz [87, 89] also combines different
prediction methods trained on three different flavors of protein disorder. The three
training sets come from DisProt, missing residues of X-ray structures and mobile
regions from NMR ensembles, and are used to optimize a bidirectional recursive
neural network (BRNN) for the prediction of protein disorder. ESpritz can produce fast
and accurate sequence-only predictions, annotating entire genomes in the order of
hours, which can be especially useful for high-throughput processing.
Several meta-servers rely on already existing approaches, which capture different
features of protein disorder. The PONDR-FIT [90] method combines the outputs
of six predictors, PONDR VL-XT, PONDR-VSL2 [88], PONDR-VL3-BA [78, 79],
FoldIndex [91], IUPred [92] and TopIDP [56], for a single, improved prediction using
a neural network based approach.
MFDp2 [93] and disCoP [94] are meta-predictors that were developed by the same
research group. MFDp2 improves on the earlier version of the method by combining
it with two additional features, an alignment engine of annotated disordered proteins
and a predictor of disorder content, DisCon [95]. DisCoP uses a regression model to
produce a new disorder prediction from seven methods (DisProt and X-ray versions
of ESpritz, CSpritz [96], SPINE-D [97], DISOPRED2, MD [72] and DISOclust)
selected empirically to maximize predictive performance. It was shown that the
consensus-based method offers a better performance compared to other predictors.
Meta approaches that integrate the results of several prediction methods have
been very successful in various areas of structure predictions [98] and appeared for
the prediction of protein disorder as well. These methods achieve improved perfor-
mance by decreasing the noise of individual predictors. Since individual disorder
prediction methods are often specific to certain types of protein disorder, their
combination could cover more aspects of disorder. The last rounds of the CASP experiment
were clearly dominated by meta-predictors [69]. Nevertheless, there is still an urgent
need for specialized predictors that can accurately capture certain types of disorder.
Although these predictors might be inferior to meta-predictors in certain evaluations,
they provide more insights into the structural and even the functional properties of
disordered regions.
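The noise-reduction argument for meta approaches can be illustrated with the simplest possible combination scheme, a (weighted) average of per-residue scores. The sketch below is a generic illustration and does not reproduce the combination step of any of the meta-predictors named above, which rely on trained neural networks or regression models. The function name `consensus_profile` and the weights are assumptions.

```python
def consensus_profile(profiles, weights=None):
    """Weighted per-residue average of several disorder score profiles.

    `profiles` is a list of equally long lists of scores, one list per
    predictor; `weights` defaults to equal weighting.
    """
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must cover the same residues")
    if weights is None:
        weights = [1.0] * len(profiles)
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, profiles)) / total
        for i in range(length)
    ]
```

Averaging only helps when the individual predictors make partly independent errors and their scores are on comparable scales, which is one reason why published meta-predictors learn the combination from data.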

3.2 Incorporating Physical Principles into Disorder Prediction

As opposed to the application of various ‘black box-like’ machine learning
algorithms, the prediction of protein disorder can be approached with the direct
implementation of physical principles governing the process of protein folding. It was
suggested that disordered proteins can be identified based on the combination of low
hydrophobicity and high net charge [19, 53]. The rationale behind this approach is
that high net charge leads to charge-charge repulsion and low hydrophobicity means
less driving force for a compact structure. This principle was implemented in the
FoldIndex algorithm [91] to provide a position specific prediction. A similar concept
is behind the FoldUnfold method [54]. It predicts protein disorder based on the
expected average number of contacts per residue. These values are taken from a
single amino acid propensity scale that encodes the average number of contacts for the
20 amino acid residues in a dataset of globular proteins. Another
physicochemical-based approach is IsUnstruct [100, 101]. In this model, the energy of each residue
depends on the type of the residue, its state and the states of its neighbors. The approach
is based on the Ising model combined with dynamic programming, in which
the interaction term between neighbors is replaced by a penalty for a state change (the energy
of the border). The method was trained on short disordered regions from the PDB; however,
it can find long regions too [100].
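The charge-hydropathy idea described at the beginning of this section can be sketched in a few lines of Python. The numerical details are not given in this chapter and are assumptions of the example: the Kyte-Doolittle hydropathy scale rescaled to the 0 to 1 range, a net charge counting only Lys/Arg as positive and Asp/Glu as negative, and the boundary coefficients 2.785 and 1.151 as commonly quoted for this type of analysis. They should be checked against the original publications [19, 91] before any real use.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def charge_hydropathy_index(sequence):
    """Positive values suggest a folded, negative values a disordered chain.

    index = 2.785 * <H> - |<R>| - 1.151, where <H> is the mean hydropathy
    rescaled to 0..1 and <R> is the mean net charge per residue.
    """
    n = len(sequence)
    mean_hydropathy = sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in sequence) / n
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    mean_net_charge = (positive - negative) / n
    return 2.785 * mean_hydropathy - abs(mean_net_charge) - 1.151
```

Applying the same function to a sliding window instead of the whole chain gives a position specific profile. The sketch raises a `KeyError` for non-standard residue symbols.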
Taking one step further, modeling of residue-residue interactions can be incorpo-
rated into the prediction of protein disorder. A prime example of the more sophisti-
cated physics-based methods is the IUPred algorithm [92, 102]. This method captures
the essential cause of protein non-folding: if a residue in a protein is not able to form
enough favorable intrachain contacts, it will not adopt a stable position in the 3D
structure of the chain. If such residues are clustered along a segment of a protein or
the whole protein, then this segment or the entire protein will be disordered.
The implementation of the above principle in IUPred is done taking an energetics
point of view. For globular proteins, the contribution of interresidue interactions
to total energy is often approximated by low-resolution force fields, or statistical
potentials, which are energy-like quantities derived from globular proteins based on
the observed amino acid pairing frequencies [103]. In deriving the actual potentials,
different principles have been applied. The resulting empirical energy functions are
well suited to assess the quality of structural models and have been used for fold
recognition or threading but also in docking, ab initio folding, or predicting protein
stability. Their success in a wide range of applications suggests the existence of a
common set of interactions, simultaneously favored in all native—as opposed to
alternate—structures.
In the case of IUPred, a dedicated statistical potential is optimized to estimate
the pairwise interaction energies between residues. The total pairwise energy E of
a protein in its native state is the sum of the energies of all the pairwise
residue-residue interactions in the protein. E is a function of the conformation as well
as the amino acid sequence, as they define the list of residue-residue interactions
that have a contribution to the total energy. This total energy can be calculated by
taking all contacts in the protein, and weighting them by the corresponding interaction
energies. The interaction energy between any two types of amino acids can be inferred
by calculating the frequency of interactions between these two types in a dataset of
known protein structures. These frequencies are transformed into interaction energies
using the Boltzmann hypothesis [104] and are described by the 20 by 20 interaction
energy matrix of amino acid pairs, M. Hence, the pairwise energy content calculated
based on the structure can be written as:

$$E_{\mathrm{calculated}} = \sum_{i,j} M_{ij} C_{ij} \tag{1}$$

where $M_{ij}$ is the interaction energy between amino acid types i and j, and $C_{ij}$ is the
number of interactions between residues of types i and j in the given conformation.
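A minimal sketch of Eq. (1) is given below. The contact list, the three-entry matrix `TOY_M` and the function name are made up for the illustration and are not the statistical potential used in IUPred. Type pairs are treated as unordered here, so each contact contributes once; this is an assumption about how the double sum is counted.

```python
from collections import Counter


def pairwise_energy(residue_types, contacts, M):
    """Eq. (1): sum of M[type pair] times the number of such contacts.

    `residue_types` is the sequence, `contacts` a list of (a, b) index
    pairs that are in contact in the structure, and `M` maps a sorted
    pair of residue types to an interaction energy.
    """
    contact_counts = Counter(
        tuple(sorted((residue_types[a], residue_types[b]))) for a, b in contacts
    )
    return sum(M[pair] * count for pair, count in contact_counts.items())


# Toy values only (NOT a real statistical potential).
TOY_M = {("K", "K"): 1.0, ("K", "L"): 0.5, ("L", "L"): -1.0}
energy = pairwise_energy("LLK", [(0, 1), (0, 2)], TOY_M)
```

In this toy case one Leu-Leu and one Leu-Lys contact are counted, so the total is the sum of the two corresponding matrix elements.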
This energy calculation, however, assumes the knowledge of the 3D structure of
the protein and as such, is not directly applicable to proteins whose structure cannot
be determined. To get around this problem, a novel estimation scheme was
established and implemented in IUPred to enable the estimation of the interaction
energy E without the structure, using the protein sequence alone. The rationale behind
this approach is that the energy contribution of a residue depends not only on its
amino acid type, but also on its potential partners in the sequence. It is assumed that
if the sequence contains more amino acid residues that can form favorable contacts
with the given residue, its expected energy contribution will be more favorable. The
simplest approximating formula for the specific estimated pairwise energy can be
expressed with a quadratic formula as:

$$E_{\mathrm{estimated}} = L \sum_{i,j} P_{ij} f_i f_j \tag{2}$$

where L is the length of the protein, $f_i$ is the normalized frequency of residues of
type i and P is the energy estimator matrix. The elements of P are optimized on
a set of globular proteins using the least squares method in order to minimize the
difference between $E_{\mathrm{calculated}}$ and $E_{\mathrm{estimated}}$. Equation (2) gives an estimate for the
energy of the whole protein, however, it can be naturally modified to calculate the
pairwise energy of single residues as well. For this, it has to be considered that in
multi-domain proteins the residues belonging to different domains do not interact.
For this reason, for each residue the amino acid frequencies are only calculated in
the sequential neighborhood roughly corresponding to the average domain size. The
width of this sequence window is denoted by $w_0$ and is set to 100 residues on each side.
To estimate the interaction energy of residue k (of type j), Eq. (2) can be modified:


$$E_j^k = \sum_{i=1}^{20} P_{ij} f_i^k(w_0) \tag{3}$$

where $f_i^k(w_0)$ is the fraction of residues of type i in the $w_0$ neighborhood of residue
k. (Note that lower indices stand for amino acid type, while upper indices stand for
position in the chain.) Formula (3) enables the estimation of the intrachain interaction
energies of each residue directly from the amino acid sequence. Generally, residues
with less favorable predicted energies are more likely to be disordered. Testing on 559
globular and 129 disordered proteins [92] showed that this energy estimation scheme
is accurate enough to achieve a high true positive rate (fraction of disordered residues
correctly predicted) of 76% while maintaining a sufficiently low false positive rate
(fraction of ordered residues incorrectly predicted) of 5%, a standard choice of
type I error rate in prediction methods. The strength of the construction of the method
is that its parameters are derived from a globular protein dataset without the use of
specific datasets of disordered proteins. As globular protein datasets are considerably
larger than that of disordered proteins, this grants the method substantial stability
compared to methods where a large number of parameters are trained on a limited
and sometimes ambiguous disordered protein dataset.
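Equations (2) and (3) translate into a few lines of code. The sketch below is not the IUPred implementation: the matrix `TOY_P` contains invented values rather than the fitted parameters, the window simply includes residue k itself and is truncated at the termini, and all names are hypothetical. With these choices, summing the per-residue energies of Eq. (3) over a window that spans the whole chain reproduces the whole-protein estimate of Eq. (2).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("ACFILMVW")

# Toy estimator matrix (NOT the fitted IUPred parameters): hydrophobic
# pairs get a favorable (negative) value, all other pairs a small
# unfavorable one.
TOY_P = {
    a: {b: (-1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.2)
        for b in AMINO_ACIDS}
    for a in AMINO_ACIDS
}


def total_energy(sequence, P):
    """Eq. (2): L * sum_ij P_ij f_i f_j from the overall composition."""
    length = len(sequence)
    f = {aa: sequence.count(aa) / length for aa in AMINO_ACIDS}
    return length * sum(
        P[a][b] * f[a] * f[b] for a in AMINO_ACIDS for b in AMINO_ACIDS
    )


def residue_energies(sequence, P, w0=100):
    """Eq. (3): estimated energy of every residue k from its w0 neighborhood."""
    energies = []
    for k, res_type in enumerate(sequence):
        start = max(0, k - w0)
        end = min(len(sequence), k + w0 + 1)
        window = sequence[start:end]
        # f_i^k(w0): fraction of residues of type i within the window.
        energy = sum(
            P[aa][res_type] * window.count(aa) / len(window)
            for aa in AMINO_ACIDS
        )
        energies.append(energy)
    return energies
```

Residues with less favorable (higher) values in the resulting profile would be the candidates for disorder. In practice such raw profiles are usually smoothed before being converted into a score.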
The above energy estimation method is implemented in IUPred. The method is
accessible via a web server [105] (http://iupred.enzim.hu or http://iupred.elte.hu).
For the ease of interpretation, the calculated energies are converted into probability
values, indicating the probability of each residue being disordered. Figure 3 shows an
example output of the IUPred server for the human Wiskott-Aldrich protein (WASp).
WASp is a 502 residue long protein that is entirely disordered with the exception of
the ordered WH1 domain spanning the 39–148 region. The assigned probabilities are
in accordance with the known structural information as the calculated probabilities on
the ordered domain lie below 0.5 marking order (low probability of being disordered)
and above 0.5 for the rest of the protein (high probability of being disordered).

Fig. 3 Screenshot of the IUPred server output for the human Wiskott-Aldrich protein. The horizon-
tal axis represents the protein chain and the vertical axis represents the probability of each residue
to be disordered. Residues with values above 0.5 are predicted to be disordered and values below
0.5 indicate an ordered structure
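Reading such a profile programmatically amounts to applying the 0.5 cutoff and collecting consecutive residues above it. The following sketch assumes that the per-residue scores have already been obtained as a list of numbers; the function name and the optional `min_length` filter are assumptions for the illustration and not features of the IUPred server.

```python
def disordered_segments(scores, threshold=0.5, min_length=1):
    """Return 1-based (start, end) ranges where the score exceeds `threshold`."""
    segments = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = pos  # a new disordered segment begins here
        elif start is not None:
            if pos - start >= min_length:
                segments.append((start, pos - 1))
            start = None
    # Close a segment that runs to the C-terminus.
    if start is not None and len(scores) - start + 1 >= min_length:
        segments.append((start, len(scores)))
    return segments
```

Setting `min_length` to a larger value (for example 30) restricts the output to long disordered regions.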

4 Prediction of Disordered Binding Regions

As discussed in Sects. 1.2 and 1.3, many disordered proteins carry out important
functions via binding to other proteins that involves coupled folding and binding or
mutual synergistic folding. Due to their specific functional and structural properties,
these binding regions have distinct properties compared to both globular proteins and
disordered proteins in general, and these properties—in principle—enable the con-
struction of prediction algorithms to recognize them from the protein sequence. While
there are many algorithms for predicting IDPs, apparently the choice of methods for
predicting regions undergoing disorder-to-order transition upon protein binding is
rather limited.
The first publicly available method for the prediction of disordered binding regions
undergoing coupled folding and binding was ANCHOR [6]. ANCHOR aims to cap-
ture the basic biophysical properties of disordered binding segments. The essential
feature of these regions is that they exist in a disordered state in isolation, but they can
favorably interact with a globular protein and adopt a rigid conformation upon bind-
ing. In this model the combination of the high disordered tendency of the sequential
environment, and high energetic gain by interacting with a globular protein partner
indicates the presence of a disordered binding region. The implementation of these
principles follows the basic idea behind IUPred, and these criteria for the presence
of a disordered binding region are quantified with the use of estimated energies.
The testing of ANCHOR showed that the predictor recognizes around 70% of
disordered binding regions, while falsely predicting only 5% of residues in ordered
proteins. As the available dataset for experimentally verified disordered protein com-
plexes is limited in size, the benefit of using physical models instead of machine learn-
ing algorithms is evident. Another strength of ANCHOR comes from the fact that
the efficiency of the prediction is largely independent of the amino acid composition
of the query protein. For example, acidic binding regions, such as certain calmod-
ulin binding sites, are recovered with approximately the same success rate as proline
rich binding regions, such as SH2 and SH3 domain binding sites, or hydrophobic
sites, such as the MDM2 binding region of p53. Furthermore, the quality of the
prediction is also independent of the conformation the binding region adopts in the
bound state. This independence also shows the generality of ANCHOR.
The method combines the transparency of simplified biophysical models with the
usability of bioinformatical approaches.
The predictions obtained with IUPred and ANCHOR are demonstrated through
the example of the human calcium/calmodulin-dependent protein kinase IV (UniProt
ID: Q16566), shown in Fig. 4. The plot was generated with the online version
of ANCHOR [106], available at http://anchor.enzim.hu/ and http://anchor.elte.hu/.
Calcium/calmodulin-dependent kinase IV binds to calmodulin near its C-terminal
end (residues 322–341). This patch is correctly identified using ANCHOR as shown
in the figure. The binding region can also be identified based on one of the subclasses
of calmodulin binding motifs, namely the basic 1-8-14 binding motif consisting of
three positively charged residues followed by three hydrophobic ones in the 1st, 8th
and 14th position C-terminal from the positive sequence patch. The location of this
motif is also indicated on the figure.
Although IUPred and ANCHOR rely on the same approach and use the same
interaction energy prediction scheme, their outputs are distinctively different. How-
ever, IUPred also reacts to the presence of disordered binding regions: as can be seen
from the example presented in Fig. 4, disordered binding regions tend to appear
more ordered than their surrounding disordered protein segments. This tendency
is not exclusive to IUPred; many other disorder prediction outputs reflect binding
regions in a similar way. In the case of PONDR VL-XT the presence of these ‘dips’
in the prediction profile was exploited to construct a disordered binding region pre-
diction algorithm [76, 77]. In this framework, regions undergoing a coupled folding
and binding process adopting an α-helical conformation in their bound form were
targeted. These regions, termed α-MoRFs (molecular recognition features) were pre-
dicted using the local drops in the prediction score as an input to a neural network that
was trained on known examples of α-helical binding sites. The neural network then
tries to discriminate the potential binding regions using various sequence features,
including disorder, secondary structure predictions and amino acid indices.

Fig. 4 Output of the ANCHOR prediction server for calcium/calmodulin-dependent protein kinase
IV. The plot shows the predicted disordered binding regions in blue with the output of the general
disorder prediction method IUPred in red and the location of the calmodulin binding motif in orange

The construction of the PONDR-based α-MoRF prediction algorithm marked
the introduction of machine learning approaches into the field of disordered bind-
ing site prediction. This line of research has been actively pursued in recent years
yielding novel prediction algorithms with increasing efficiency. The first publicly
available MoRF prediction algorithm was MoRFpred, which is able to predict bind-
ing regions of IDPs regardless of their bound structures [107]. MoRFpred utilizes an
SVM architecture with various sequence features—evolutionary profiles, predicted
disorder, relative solvent accessibility, physicochemical properties—as input. The
latest incarnation of the MoRF family of disordered binding site prediction methods
is available at the MoRFchibi SYSTEM site [108]. The basis of the suite is the MoR-
Fchibi method, which is a significantly improved version of MoRFpred and can also
be easily integrated into custom bioinformatics analysis pipelines. The suite also
offers MoRFchibiWeb, which utilizes a meta design, meaning it predicts putative
MoRF annotation computed by MoRFchibi, while improving the predictive
performance. The server offers a third variant of the method, MoRFchibiLIGHT, which is
a more lightweight and run-time optimized version of the algorithm, best suited for
large-scale computation tasks.

5 Linear Motifs

As discussed in the previous section and Sects. 1.2 and 1.3, the study of protein-
protein interactions formed by disordered proteins is based on structural considera-
tions. However, the study of interactions between protein domains and short, linear
protein regions—a description which fits most cases of IDPs undergoing coupled
folding and binding—has a distinctly separate approach as well, with the use of
linear motifs.

5.1 Defining and Using Linear Motifs

Linear motifs, also referred to as short linear motifs (SLiMs) or minimotifs, are
short functional sites typically found in disordered protein regions [109]. In the
framework of linear motifs, the interaction is not described focusing on the short
disordered partner, but the larger one, which is usually a protein domain. It was
found for many domains such as SH2/SH3, 14-3-3, WW and kinase domains that their
interacting partners—albeit in many cases not being homologues—share a limited
number (typically between 2 and 10) of common residues in the short interaction
region [110, 111]. Apart from these residues, the binding region also incorporates
other, flexible positions that can contain various amino acids without disrupting
the binding [112]. Figure 5 shows the example of nuclear receptors that are able
to bind a large variety of protein partners. Although most partner proteins are not
homologues, they all share three key leucine residues at their interacting sites. During
the interaction, the region that binds to the receptor forms an α-helix and the three
leucines form a hydrophobic patch on the surface of the helix. This patch in turn
recognizes the appropriate complementary hydrophobic region of the interface of
the receptor, and anchors the helix to the binding groove. The consensus sequence
of the binding region is xLxxLLx, where x can stand for any amino acid, except for
proline, as it would disrupt the helix formation. This motif is called LIG_NRBOX and
ligands of many nuclear receptors are able to recognize their receptor partners via this
sequence pattern. The theory of linear motifs, used to describe such interactions, is
based on the assumption that these common residues (constituting the motif) mediate
the binding largely independently of the other regions of the protein they are embedded
in, functioning autonomously. However, in many cases the role of the context was
shown to be larger than originally expected [113].

Fig. 5 The figure shows the known interaction partners of nuclear receptors that all bind using the
same binding mode. The upper left structure shows a solved complex structure (based on PDB entry
1m2z) between a small region of the human NCOA2 nuclear receptor coactivator (shown in red and
yellow) and a glucocorticoid receptor (shown in blue). Although the actual sequences around the
binding region do not share a high level of similarity, they all contain three key leucine residues.
These three amino acids interspersed and flanked by flexible positions constitute the consensus
LIG_NRBOX motif (shown in red in the structure and the partner sequences)

The majority of protein-protein interaction mediating linear motifs were described
in eukaryotes. Currently the largest and most comprehensive available database of
these motifs is the Eukaryotic Linear Motif (ELM) database [114]. Motifs are cat-
egorized according to the type of interaction partners and functions (cleavage sites,
degradation sites, docking sites, ligand binding sites, post-translational modification
sites and targeting signals). Although the majority of these motifs were described in
eukaryotic proteins, some of them can be expected to occur in proteins of bacteria
and archaea too. Furthermore, instances of the retinoblastoma protein-, the SH3- and
the 14-3-3 interacting motifs, among others, were identified in various viruses as
well [115].
Linear motifs not only serve as a simplified description of a protein-protein inter-
action mode, but serve as a prediction algorithm too. Consensus motifs can be readily
used to search for binding partners of a given domain in unknown protein sequences
using basic pattern matches. The strength of this method—besides its simplicity—is
that it automatically gives information about the possible interacting partner. How-
ever, these patterns usually consist of only a few fixed residues, and therefore most
motifs are weakly defined, meaning that matches can arise purely by chance with a
relatively high probability [116]. As a result, naïve motif searches are hindered by
the massive amount of false positive hits. This is partially the result of the incomplete
description sequence patterns offer. Inside a living cell, the functionality of linear
motifs is modulated by structural, spatial and temporal control [117]. Furthermore,
the proper structural context of a motif (such as being accessible, flexible and capable
of forming the secondary structure necessary to fit into the binding cleft of the target
domain) is crucial for its biological relevance and motif definitions do not include
any such information.
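Both the simplicity of pattern-based motif searches and their false positive problem can be illustrated with the xLxxLLx consensus discussed above. The sketch below transcribes the consensus directly into a regular expression and estimates how often it would match by chance. It assumes independent positions with a uniform residue frequency of 1/20 by default, which is a deliberately crude background model. Leucine is typically more frequent than this in real proteomes, so passing measured frequencies yields a higher chance rate. All function names are hypothetical.

```python
import re

# xLxxLLx with x = any residue except proline. The lookahead reports
# overlapping matches as well.
NR_BOX = re.compile(r"(?=([^P]L[^P][^P]LL[^P]))")


def find_motif(sequence):
    """Return (1-based start position, matched segment) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in NR_BOX.finditer(sequence)]


def chance_match_probability(p_leu=0.05, p_pro=0.05):
    """Probability that a random 7-residue stretch matches xLxxLLx."""
    return p_leu ** 3 * (1.0 - p_pro) ** 4


def expected_chance_matches(n_residues, motif_length=7, p_leu=0.05, p_pro=0.05):
    """Expected number of chance hits in a random sequence of given length."""
    positions = max(0, n_residues - motif_length + 1)
    return positions * chance_match_probability(p_leu, p_pro)
```

With the uniform background the per-position probability is on the order of one in ten thousand, which is small for a single protein but adds up to a large number of spurious hits when millions of residues are scanned. This is why pattern matches are usually filtered by structural context such as predicted disorder.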

5.2 Linear Motifs and Disordered Binding Regions

The disordered binding region and the linear motif concepts describe molecular
interactions on different bases: the former focusing on the structure (or the initial
lack and the formation of it) and the latter approaching the problem through the
sequence. However, the interactions described by the two concepts share a high
degree of similarity. In both cases the interaction is confined to a relatively short, linear
sequence region in one of the partners. Furthermore, most experimentally described
linear motif instances were found in disordered protein regions. Accordingly, in many
cases, such as the binding of p53 to MDM2 and the N terminal region of p27 binding
to the cyclinA-CDK2 complex, the same interaction was categorized as an example
of both linear motif mediated binding and of disordered binding regions. Through
many common examples, both the binding of disordered proteins and linear motifs
have been shown to play vital roles in eukaryotic regulation and signaling [117], and
to serve as target points for viruses [115]. Apart from individual examples,
the connection between protein disorder and motif regulation has also been shown
at a more general level [118].
Despite the very different approaches used to describe interactions via disordered
binding and linear motifs, the two fields not only share a large number of com-
mon examples but also struggle with essentially the same problems. Probably the
most serious bottleneck in both cases is the low number of experimentally verified
examples. About 50% of human proteins are predicted to contain at least one larger
disordered region, and it was shown that the primary reason for the emergence of
586 B. Mészáros et al.

these regions is to harbor binding regions [6]. In contrast, the number of experimentally
verified disordered regions collected in the DisProt database is in the low
thousands [41] and the number of known disordered binding regions is even lower
[43]. Similarly, a moderate estimate places the number of individual motif-mediated
interactions in the human proteome alone above 35,000 [119]. Despite this high estimated
occurrence, the number of experimentally verified, true motif instances in all
eukaryotic proteins described in the ELM database has only reached 3000. While it
is clear that the two concepts, linear motifs and disordered binding regions, could
be used in combination to strengthen each other’s predictions, this connection between
the two fields is yet to be established in detail.

6 Using Predictions on Disordered Proteins: A Practical Guide

6.1 How to Use Disorder Prediction Methods

Disorder prediction methods can be used in two different ways. On the one hand, they can
be used in large-scale studies where many proteins are analyzed. These projects usually
aim to uncover statistically meaningful differences between classes of proteins,
for example the proteomes of different organisms, with regard to disorder
content. In this scenario usually only longer, contiguous disordered segments are
considered, and short runs of residues predicted to be disordered (typically fewer
than 20 or 30 residues) are filtered out. In this setup, methods that are trained to
recognize longer stretches of disordered residues, such as IUPred, PONDR VL3-BA or
MFDp2, clearly have an advantage. Practically all state-of-the-art methods assign to each
residue a continuous score, which represents the probability of it being disordered. However,
in such studies this score is converted to a binary classification: residues
with scores above a predetermined threshold are classified as disordered, and residues
with lower scores are assigned an ordered status. It is worth noting, however, that
various methods are optimized for different false prediction rates (usually in the
2–15% range) and the predetermined cutoff is set accordingly. In comparative studies,
where the basic questions are similar to “which of these groups of
proteins contains more disorder” or “how does the disorder content of proteomes
change during evolution”, this does not affect the final results to a great extent, but it
should be kept in mind that the actual numbers depend on the choice of algorithm.
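
The binarization and length filtering described above can be sketched as follows.
This is a minimal illustration and not the procedure of any particular predictor:
the function names, the default cutoff of 0.5 and the default minimum length of 30
are assumptions chosen to mirror the text.

```python
def disordered_segments(scores, threshold=0.5, min_length=30):
    """Return 0-based, half-open (start, end) segments of consecutive residues
    scoring above `threshold` that span at least `min_length` residues."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new run of disordered residues begins
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        segments.append((start, len(scores)))
    return segments


def disorder_content(scores, threshold=0.5, min_length=30):
    """Fraction of residues located in sufficiently long disordered segments."""
    if not scores:
        return 0.0
    kept = disordered_segments(scores, threshold, min_length)
    return sum(end - start for start, end in kept) / len(scores)


# Made-up scores: a short run of 3 and a longer run of 5 disordered residues.
toy_scores = [0.9] * 3 + [0.1] * 2 + [0.8] * 5
toy_segments = disordered_segments(toy_scores, threshold=0.5, min_length=4)
toy_content = disorder_content(toy_scores, threshold=0.5, min_length=4)
```

With a minimum length of 4, only the second run of the toy profile is kept, so the
reported disorder content depends on both the cutoff and the length filter.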
The other typical use of disorder prediction methods is the analysis of individual
proteins. In these cases the different false positive rates of various methods present a
problem that needs to be addressed, as the choice of method clearly affects the results.
Although this in theory can be circumvented by re-calibrating various methods on
a standardized dataset, this solution is not feasible for casual users. Furthermore,
the fact that various methods are optimized for various typical lengths of disorder
presents an additional level of difficulty when choosing a single method to use.
These considerations point towards the combined use of disorder prediction methods
when investigating individual protein sequences. A good starting point can be the
application of methods sensitive to larger, contiguous regions of order/disorder to
establish the basic structural composition of the protein in question. As a next step,
methods capable of detecting more localized disorder regions—such as OnD-CRF,
DisEMBL or DISOPRED3—can be applied.
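
The re-calibration idea mentioned above can be sketched in a few lines: given the
scores a method assigns to residues known to be ordered, a cutoff is chosen so that
the false positive rate stays at or below a common target. The function names and
the synthetic scores are assumptions made for illustration; a real calibration would
require a curated, standardized dataset.

```python
def false_positive_rate(ordered_scores, cutoff):
    """Fraction of known ordered residues that would be called disordered."""
    return sum(score > cutoff for score in ordered_scores) / len(ordered_scores)


def cutoff_for_target_fpr(ordered_scores, target_fpr):
    """Lowest observed score that can serve as a cutoff without exceeding
    `target_fpr` on the supplied ordered residues."""
    if not ordered_scores:
        raise ValueError("need at least one ordered residue score")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must lie between 0 and 1")
    ranked = sorted(ordered_scores, reverse=True)
    # At most this many ordered residues may score strictly above the cutoff.
    allowed = min(int(target_fpr * len(ranked)), len(ranked) - 1)
    return ranked[allowed]


# Synthetic scores for 100 ordered residues (0.00, 0.01, ..., 0.99).
ordered = [i / 100 for i in range(100)]
cutoff = cutoff_for_target_fpr(ordered, target_fpr=0.05)
achieved = false_positive_rate(ordered, cutoff)
```

Applying the same target rate to every method would put their binary predictions on
a comparable footing, at the cost of needing reference data for each method.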
Probably one of the most difficult tasks from the viewpoint of successful disor-
der prediction is presented by partial or transient structural elements. In the case of
stable, globular domains, or highly flexible disordered regions without a strong struc-
tural preference, most methods tend to show good agreement. However, considering
regions with partial or transient structure, such as molten globules, coiled-coil regions
or some disordered binding regions, almost all methods react to the underlying struc-
tural preferences with a lowered prediction score [120]. This type of behavior and
the resulting lack of a clear consensus prediction is highly characteristic of these
structurally ambiguous regions and for the experienced researcher these can serve
as dead giveaways. However, in the successful identification of the nature of the
underlying structural reasons, dedicated predictions—such as ANCHOR for identi-
fying disordered binding regions or COILS [121] for the identification of coiled-coil
regions—are indispensable.
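
The lack of a clear consensus described above can be made visible with a simple
per-residue vote. The sketch below assumes that all methods have been rescaled to a
common cutoff of 0.5; the method names and scores are made up, and the margin
parameter is an arbitrary choice for illustration.

```python
def disorder_vote_fraction(predictions, threshold=0.5):
    """For each residue, the fraction of methods scoring it above `threshold`."""
    tracks = list(predictions.values())
    if not tracks or len({len(track) for track in tracks}) != 1:
        raise ValueError("need one or more score lists of equal length")
    return [
        sum(score > threshold for score in column) / len(tracks)
        for column in zip(*tracks)
    ]


def ambiguous_residues(predictions, threshold=0.5, margin=0.25):
    """0-based indices of residues whose vote lies within `margin` of a tie."""
    fractions = disorder_vote_fraction(predictions, threshold)
    return [i for i, frac in enumerate(fractions) if abs(frac - 0.5) <= margin]


# Hypothetical outputs of four methods for a three-residue toy sequence.
toy_predictions = {
    "method_a": [0.90, 0.10, 0.60],
    "method_b": [0.80, 0.20, 0.40],
    "method_c": [0.95, 0.05, 0.45],
    "method_d": [0.70, 0.30, 0.70],
}
votes = disorder_vote_fraction(toy_predictions)
unclear = ambiguous_residues(toy_predictions)
```

Residues flagged in this way are candidates for follow-up with dedicated predictors
rather than regions to be classified by majority vote.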
In the next section we present a case study in which the reactions of various prediction
methods are demonstrated for ordered, disordered and disordered binding regions of
the human p53 protein.

6.2 Bringing It All Together—An Application to p53

In this section we show an application of the principles described in the previous
section through the example of human p53. p53 is a 393-residue-long tumor suppressor
protein involved in the control of the cell cycle and apoptosis. The protein has a
relatively complex architecture containing a central, ordered DNA binding domain
(DBD) and two long disordered regions, one on each side of the DBD, harboring several
binding regions and a tetramerization region. As both the binding regions and
the tetramerization region are disordered in isolation but can adopt a structure upon
binding, there is no single good answer for these regions from the perspective of
disorder predictions.
Figure 6 shows the output of 15 selected prediction methods on the full length of
p53. In the central, ordered domain (spanning residues 102–292, marked with a red
box) virtually all methods agree, assigning a relatively low score to the majority of
the domain, indicating the presence of a long region with high structural content. This
prediction is in accordance with the results obtained from the secondary structure
predictor PSIPRED [122], indicating numerous β-strands in the domain region.
The validity of the predicted ordered region and the type of assigned secondary
structures can be ascertained through the solved structure of the DBD. The predicted
secondary structures correspond to the experimentally determined structure with a
relatively high precision.

Fig. 6 Predictions for human p53 (UniProt AC: P04637). In the case of OnD-CRF and
DISOPRED3 the original prediction scores were rescaled linearly to be directly comparable with other
methods. Disordered predictions were sorted top to bottom by decreasing average predicted disorder
tendency. The central, ordered DNA binding domain (DBD) is shown in red and experimentally
verified disordered binding regions (TAD1/2, tetramerization- and C-terminal regulatory domains)
are shown in green. The rest of the protein is disordered and is shown in white. Underneath the
disorder prediction outputs, the known biologically relevant linear motifs are shown with black and
grey boxes for ligand binding and sub-cellular localization target motifs, respectively. The middle
line (Predicted secondary structure) shows the secondary structure prediction by PSIPRED, with
black and striped boxes indicating predicted α-helical and β structures, respectively. The bottom
line shows the disordered binding site prediction by ANCHOR. Shading of the boxes corresponds
to the overall confidence of the predicted binding region, with darker shades indicating a higher
confidence

The outputs of various disorder prediction methods on the N-terminal disordered
region (encompassing residues 1–101) are much more heterogeneous in comparison.
This region is primarily occupied by the transactivation domain (TAD) that is
essential for the activation of various p53 target proteins, as well as for the proper
regulation of p53 turnover. Accordingly, the TAD (which, despite its name, is not a
structural domain but is fully disordered in isolation) mediates interactions with a
large number of partner proteins. Based on these interactions, the TAD can be further
subdivided into two distinct regions, TAD1 and TAD2. TAD1 is responsible for
the controlled degradation of p53 by mediating attachment to the SWIB domain of
ubiquitin ligases. Furthermore, TAD1 can also bind to the Taz2 domain of p300.
While p300 can also recognize TAD2 via an independent interaction, TAD2 can
also bind to and activate a range of other protein partners, including RNA polymerase
II, HMGB1 or GTF2H1. Although TAD1 and TAD2 are thought to operate largely
independently, recognizing various domains, they can also form a single joint interaction
surface, which binds to a disordered segment of CBP via mutual synergistic
folding. Essentially all methods react to the presence of the binding regions with a
lower score, although to a varying degree. Some methods, such as VSL2B, VL3-BA,
PONDR-FIT, DISOclust or DISOPRED3, essentially predict the whole region to be
disordered. At the other extreme, DisEMBL predicts the majority of the TAD region
to be ordered. However, some methods, such as IUPred, RONN, ESpritz and OnD-CRF,
react to the presence of transient structure by assigning a score very close to
0.5, which effectively corresponds to a ‘non-prediction’: these methods indicate that
in the binary framework of ‘ordered or disordered’ they cannot correctly classify
these regions. Some methods, such as VSL2B or MFDp2, give a general indication
of the underlying structural tendency by giving one extended dip covering the whole
interacting region. Others, such as DISOPRED3, DISOclust or IUPred, give two
distinct dips corresponding to the TAD1 and TAD2 regions. This behavior is also
characteristic of VL-XT, which is highly sensitive to local structure and therefore
assigns TAD1 a significantly lower score than TAD2, similarly to IsUnstruct
and DeepCNF-D. It is worth noting that the MDM2 binding site of TAD1 has a
slight α-helical tendency even in the unbound form, and this helix is stabilized via
the interaction. This structural tendency is also shown by the secondary structure
prediction by PSIPRED [123]. Furthermore, the MDM2 binding region also contains
the MDM2 interaction linear motif, giving further support to the predictions and
hinting at the interaction partner. However, the strongest prediction-level evidence
hinting at the presence of binding regions (as opposed to a coiled-coil region or a
short collapsed structure) is provided by the high-confidence predictions of ANCHOR
covering the core of the whole TAD region.
The C-terminal disordered region (residues 293–393) is structurally reminiscent of
the N-terminal region. It is generally disordered and contains multiple, overlapping
binding regions. As in the case of the N-terminal region, there is a high consensus
between different prediction methods concerning the non-interacting disordered
regions. In the tetramerization region (residues 325–356) all methods exhibit a lower
score, but again, similarly to the N-terminal binding regions, to a highly varying
degree. The assigned scores range from the clearly disordered predictions of VSL2 to
the low scores of DeepCNF-D predicting the region to be ordered. However, IUPred,
RONN, ESpritz, OnD-CRF, DISOclust and DISOPRED3 again give a score close to
0.5, indicating their justified inability to give a definite prediction. The presence of a
binding region is again supported by the high-confidence ANCHOR prediction, and
the PSIPRED prediction gives an indication of the mainly α-helical structure adopted
in the bound form. Apart from the tetramerization site, the C-terminal region also
contains a regulatory binding region that is able to bind to a multitude of different
partners, acting as a molecular switch. The prediction algorithms consistently react to
this region in a fashion similar to the previous binding sites, albeit to a lesser extent.
The more pronounced structural preference of the cyclin binding site (embedded in
this binding region) can also be seen in certain prediction outputs; the presence of the
cyclin binding linear motif and the positive ANCHOR prediction both provide further
support to the presence of this interaction. However, this combined binding region
lacks any predicted secondary structures, which faithfully reflects the fact that this
region is able to bind to a large number of partner proteins in all three basic secondary
structures (α, β and irregular).
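
A per-region summary of this kind can be sketched as below. The region boundaries
are the residue ranges quoted in the text; the score profile is synthetic and was
made up purely to exercise the code, so it is not the output of any predictor. The
function name is likewise an assumption.

```python
from statistics import mean

# Region boundaries (1-based, inclusive) as quoted in the text for human p53.
P53_REGIONS = {
    "N-terminal disordered region": (1, 101),
    "DNA binding domain": (102, 292),
    "C-terminal disordered region": (293, 393),
    "tetramerization region": (325, 356),
}


def region_means(scores, regions):
    """Average per-residue score for each named region."""
    summary = {}
    for name, (first, last) in regions.items():
        if not 1 <= first <= last <= len(scores):
            raise ValueError(f"region {name!r} lies outside the sequence")
        # Convert 1-based inclusive residue numbers to a Python slice.
        summary[name] = mean(scores[first - 1:last])
    return summary


# Synthetic profile for a 393-residue sequence: high scores in the termini,
# low scores in the central domain, intermediate scores in residues 325-356.
synthetic = [0.9] * 101 + [0.1] * 191 + [0.8] * 32 + [0.5] * 32 + [0.8] * 37
summary = region_means(synthetic, P53_REGIONS)
```

Running the same summary for several methods gives a compact way to compare how
strongly each one reacts to a given binding region.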
The example of p53 shows that the outputs of individual disorder prediction methods
can be misleading or difficult to interpret on their own. However, the combination
of various methods coupled with other types of structural/functional predictions
(such as secondary structure prediction, linear motif searches or disordered binding
site prediction by ANCHOR) can give a detailed and reliable profile for proteins
with even highly complex structural features. This example illustrates that
when studying a single protein, the combination and proper interpretation of various
predictors can go a long way.

Acknowledgements This work was supported by grants from the Hungarian Research and Developments
Fund (OTKA K108798 for Z.D. and K115698 for I.S.) and the “Momentum” grant from the Hungarian
Academy of Sciences (LP2014-18) for Z.D. The János Bolyai Research Scholarship of the Hungarian
Academy of Sciences for C.M. is also gratefully acknowledged. We would like to thank
Mark Adamsbaum for his critical reading of the manuscript.

References

1. Wright, P.E., Dyson, H.J.: Intrinsically unstructured proteins: re-assessing the protein
structure-function paradigm. J. Mol. Biol. 293, 321–331 (1999). https://doi.org/10.1006/jmbi.
1999.3110
2. Dunker, A.K., Lawson, J.D., Brown, C.J., et al.: Intrinsically disordered protein. J. Mol. Graph.
Model. 19, 26–59 (2001)
3. Dyson, H.J., Wright, P.E.: Intrinsically unstructured proteins and their functions. Nat. Rev.
Mol. Cell Biol. 6, 197–208 (2005). https://doi.org/10.1038/nrm1589
4. Tompa, P.: Intrinsically unstructured proteins. Trends Biochem. Sci. 27, 527–533 (2002)
5. Dunker, A.K., Obradovic, Z., Romero, P., et al.: Intrinsic protein disorder in complete
genomes. Genome Inform Ser Workshop Genome Inform 11, 161–171 (2000)
6. Mészáros, B., Simon, I., Dosztányi, Z.: Prediction of protein binding regions in disordered pro-
teins. PLoS Comput. Biol. 5, e1000376 (2009). https://doi.org/10.1371/journal.pcbi.1000376
7. Ward, J.J., Sodhi, J.S., McGuffin, L.J., et al.: Prediction and functional analysis of native
disorder in proteins from the three kingdoms of life. J. Mol. Biol. 337, 635–645 (2004).
https://doi.org/10.1016/j.jmb.2004.02.002
8. Xie, H., Vucetic, S., Iakoucheva, L.M., et al.: Functional anthology of intrinsic disorder. 1.
Biological processes and functions of proteins with long disordered regions. J. Proteome Res.
6, 1882–1898 (2007). https://doi.org/10.1021/pr060392u
9. Tompa, P.: The interplay between structure and function in intrinsically unstructured proteins.
FEBS Lett. 579, 3346–3354 (2005). https://doi.org/10.1016/j.febslet.2005.03.072
10. Galea, C.A., Wang, Y., Sivakolundu, S.G., Kriwacki, R.W.: Regulation of cell division by
intrinsically unstructured proteins: intrinsic flexibility, modularity, and signaling conduits.
Biochemistry 47, 7598–7609 (2008). https://doi.org/10.1021/bi8006803
11. Uversky, V.N., Oldfield, C.J., Dunker, A.K.: Intrinsically disordered proteins in human dis-
eases: introducing the D2 concept. Annu Rev Biophys 37, 215–246 (2008). https://doi.org/
10.1146/annurev.biophys.37.032807.125924
12. Cheng, Y., LeGall, T., Oldfield, C.J., et al.: Abundance of intrinsic disorder in protein
associated with cardiovascular disease. Biochemistry 45, 10448–10460 (2006).
https://doi.org/10.1021/bi060981d
13. Uversky, V.N.: Intrinsic disorder in proteins associated with neurodegenerative
diseases. Front Biosci. 14, 5188 (2009). https://doi.org/10.2741/3594
14. Uversky, V.N., Oldfield, C.J., Midic, U., et al.: Unfoldomics of human diseases: linking
protein intrinsic disorder with diseases. BMC Genom. 10(Suppl 1), S7 (2009). https://doi.
org/10.1186/1471-2164-10-S1-S7
15. Iakoucheva, L.M., Brown, C.J., Lawson, J.D., et al.: Intrinsic disorder in cell-signaling and
cancer-associated proteins. J. Mol. Biol. 323, 573–584 (2002)
16. Pajkos, M., Mészáros, B., Simon, I., Dosztányi, Z.: Is there a biological cost of protein
disorder? Analysis of cancer-associated mutations. Mol. BioSyst. 8, 296–307 (2012). https://
doi.org/10.1039/c1mb05246b
17. Cheng, Y., LeGall, T., Oldfield, C.J., et al.: Rational drug design via intrinsically disordered
protein. Trends Biotechnol. 24, 435–442 (2006). https://doi.org/10.1016/j.tibtech.2006.07.
005
18. Metallo, S.J.: Intrinsically disordered proteins are potential drug targets. Curr. Opin. Chem.
Biol. 14, 481–488 (2010). https://doi.org/10.1016/j.cbpa.2010.06.169
19. Uversky, V.N.: Natively unfolded proteins: a point where biology waits for physics. Protein
Sci. 11, 739–756 (2002). https://doi.org/10.1110/ps.4210102
20. Dyson, H.J., Wright, P.E.: Coupling of folding and binding for unstructured proteins. Curr.
Opin. Struct. Biol. 12, 54–60 (2002)
21. Berman, H.M.: The protein data bank. Nucleic Acids Res. 28, 235–242 (2000). https://doi.
org/10.1093/nar/28.1.235
22. Gunasekaran, K., Tsai, C.-J., Nussinov, R.: Analysis of ordered and disordered protein com-
plexes reveals structural features discriminating between stable and unstable monomers. J.
Mol. Biol. 341, 1327–1341 (2004). https://doi.org/10.1016/j.jmb.2004.07.002
23. Mészáros, B., Tompa, P., Simon, I., Dosztányi, Z.: Molecular principles of the interactions of
disordered proteins. J. Mol. Biol. 372, 549–561 (2007). https://doi.org/10.1016/j.jmb.2007.
07.004
24. Wright, P.E., Jane Dyson, H.: Linking folding and binding. Curr. Opin. Struct. Biol. 19, 31–38
(2009). https://doi.org/10.1016/j.sbi.2008.12.003
25. Uversky, V.N., Oldfield, C.J., Dunker, A.K.: Showing your ID: intrinsic disorder as an ID for
recognition, regulation and cell signaling. J. Mol. Recognit. 18, 343–384 (2005). https://doi.
org/10.1002/jmr.747
26. Dosztányi, Z., Chen, J., Dunker, A.K., et al.: Disorder and sequence repeats in hub proteins
and their implications for network evolution. J. Proteome Res. 5, 2985–2995 (2006). https://
doi.org/10.1021/pr060171o
27. Demarest, S.J., Martinez-Yamout, M., Chung, J., et al.: Mutual synergistic folding in recruit-
ment of CBP/p300 by p160 nuclear receptor coactivators. Nature 415, 549–553 (2002). https://
doi.org/10.1038/415549a
28. Rumfeldt, J.A.O., Galvagnion, C., Vassall, K.A., Meiering, E.M.: Conformational stability
and folding mechanisms of dimeric proteins. Prog. Biophys. Mol. Biol. 98, 61–84 (2008).
https://doi.org/10.1016/j.pbiomolbio.2008.05.004
29. Tsai, C.-J., Nussinov, R.: Hydrophobic folding units at protein-protein interfaces: implications
to protein folding and to protein-protein association. Protein Sci. 6, 1426–1437 (1997). https://
doi.org/10.1002/pro.5560060707
30. Nussinov, R., Xu, D., Tsai, C.-J.: Mechanism and evolution of protein dimerization. Protein
Sci. 7, 533–544 (1998). https://doi.org/10.1002/pro.5560070301
31. Fichó, E., Reményi, I., Simon, I., Mészáros, B.: MFIB: a repository of protein complexes
with mutual folding induced by binding. Bioinformatics 33, 3682–3684 (2017). https://doi.
org/10.1093/bioinformatics/btx486
32. Bracken, C., Iakoucheva, L.M., Romero, P.R., Dunker, A.K.: Combining prediction, compu-
tation and experiment for the characterization of protein disorder. Curr. Opin. Struct. Biol.
14, 570–576 (2004). https://doi.org/10.1016/j.sbi.2004.08.003
33. Garner, E., Cannon, P., Romero, P., et al.: Predicting disordered regions from amino acid
sequence: common themes despite differing structural characterization. Genome Inform Ser
Workshop Genome Inform 9, 201–213 (1998)
34. Li, X., Romero, P., Rani, M., et al.: Predicting protein disorder for N-, C-, and internal regions.
Genome Inform Ser Workshop Genome Inform 10, 30–40 (1999)
35. Radivojac, P., Obradovic, Z., Smith, D.K., et al.: Protein flexibility and intrinsic disorder.
Protein Sci. 13, 71–80 (2004). https://doi.org/10.1110/ps.03128904
36. He, B., Wang, K., Liu, Y., et al.: Predicting intrinsic disorder in proteins: an overview. Cell
Res. 19, 929–949 (2009). https://doi.org/10.1038/cr.2009.87
37. Wootton, J.C.: Non-globular domains in protein sequences: automated segmentation using
complexity measures. Comput. Chem. 18, 269–285 (1994)
38. Wootton, J.C., Federhen, S.: Analysis of compositionally biased regions in sequence
databases. Methods Enzymol. 266, 554–571 (1996)
39. Romero, P., Obradovic, Z., Li, X., et al.: Sequence complexity of disordered protein. Proteins
Struct. Funct. Genet. 42, 38–48 (2000). https://doi.org/10.1002/1097-0134(20010101)42:1%
3c38:aid-prot50%3e3.0.co;2-3
40. Vucetic, S., Obradovic, Z., Vacic, V., et al.: DisProt: a database of protein disorder. Bioinfor-
matics 21, 137–140 (2005). https://doi.org/10.1093/bioinformatics/bth476
41. Piovesan, D., Tabaro, F., Mičetić, I., et al.: DisProt 7.0: a major update of the database of
disordered proteins. Nucleic Acids Res. 45, D219–D227 (2017). https://doi.org/10.1093/nar/
gkw1056
42. Dutta, S., Burkhardt, K., Young, J., et al.: Data deposition and annotation at the worldwide
protein data bank. Mol. Biotechnol. 42, 1–13 (2009). https://doi.org/10.1007/s12033-008-
9127-7
43. Schad, E., Fichó, E., Pancsa, R., et al.: DIBS: a repository of disordered binding sites mediating
interactions with ordered proteins. Bioinformatics 34, 535–537 (2018). https://doi.org/10.
1093/bioinformatics/btx640
44. Fukuchi, S., Sakamoto, S., Nobe, Y., et al.: IDEAL: intrinsically disordered proteins with
extensive annotations and literature. Nucleic Acids Res. 40, D507–D511 (2012). https://doi.
org/10.1093/nar/gkr884
45. Tompa, P., Fuxreiter, M.: Fuzzy complexes: polymorphism and structural disorder in protein-
protein interactions. Trends Biochem. Sci. 33, 2–8 (2008). https://doi.org/10.1016/j.tibs.2007.
10.003
46. Miskei, M., Antal, C., Fuxreiter, M.: FuzDB: database of fuzzy complexes, a tool to develop
stochastic structure-function relationships for protein complexes and higher-order assemblies.
Nucleic Acids Res. 45, D228–D235 (2017). https://doi.org/10.1093/nar/gkw1019
47. Piovesan, D., Tabaro, F., Paladin, L., et al.: MobiDB 3.0: more annotations for intrinsic disor-
der, conformational diversity and interactions in proteins. Nucleic Acids Res. 46, D471–D476
(2017). https://doi.org/10.1093/nar/gkx1071
48. Ulrich, E.L., Akutsu, H., Doreleijers, J.F., et al.: BioMagResBank. Nucleic Acids Res. 36,
D402–D408 (2007). https://doi.org/10.1093/nar/gkm957
49. Necci, M., Piovesan, D., Dosztányi, Z., Tosatto, S.C.E.: MobiDB-lite: fast and highly specific
consensus prediction of intrinsic disorder in proteins. Bioinformatics 33, 1402–1404 (2017).
https://doi.org/10.1093/bioinformatics/btx015
50. Oates, M.E., Romero, P., Ishida, T., et al.: D2P2: database of disordered protein predictions.
Nucleic Acids Res. 41, D508–D516 (2013). https://doi.org/10.1093/nar/gks1226
51. Mohan, A., Uversky, V.N., Radivojac, P.: Influence of sequence changes and environment on
intrinsically disordered proteins. PLoS Comput. Biol. 5, e1000497 (2009). https://doi.org/10.
1371/journal.pcbi.1000497
52. De Biasio, A., Guarnaccia, C., Popovic, M., et al.: Prevalence of intrinsic disorder in the
intracellular region of human single-pass type I proteins: the case of the notch ligand Delta-4.
J. Proteome Res. 7, 2496–2506 (2008). https://doi.org/10.1021/pr800063u
53. Uversky, V.N., Gillespie, J.R., Fink, A.L.: Why are “natively unfolded” proteins unstructured
under physiologic conditions? Proteins 41, 415–427 (2000)
54. Galzitskaya, O.V., Garbuzynskiy, S.O., Lobanov, M.Y.: FoldUnfold: web server for the pre-
diction of disordered regions in protein chain. Bioinformatics 22, 2948–2949 (2006). https://
doi.org/10.1093/bioinformatics/btl504
55. Xie, Q., Arnold, G.E., Romero, P., et al.: The sequence attribute method for determining
relationships between sequence and protein disorder. Genome Inform Ser Workshop Genome
Inform 9, 193–200 (1998)
56. Campen, A., Williams, R.M., Brown, C.J., et al.: TOP-IDP-scale: a new amino acid scale
measuring propensity for intrinsic disorder. Protein Pept. Lett. 15, 956–963 (2008)
57. Linding, R., Russell, R.B., Neduva, V., Gibson, T.J.: GlobPlot: exploring protein sequences
for globularity and disorder. Nucleic Acids Res. 31, 3701–3708 (2003)
58. Cheng, J., Sweredoski, M.J., Baldi, P.: Accurate prediction of protein disordered regions by
mining protein structure data. Data Min. Knowl. Discov. 11, 213–222 (2005). https://doi.org/
10.1007/s10618-005-0001-y
59. Fuxreiter, M., Simon, I., Friedrich, P., Tompa, P.: Preformed structural elements feature in part-
ner recognition by intrinsically unstructured proteins. J. Mol. Biol. 338, 1015–1026 (2004).
https://doi.org/10.1016/j.jmb.2004.03.017
60. Süveges, D., Gáspári, Z., Tóth, G., Nyitray, L.: Charged single alpha-helix: a versatile protein
structural motif. Proteins 74, 905–916 (2009). https://doi.org/10.1002/prot.22183
61. Brown, C.J., Takayama, S., Campen, A.M., et al.: Evolutionary rate heterogeneity in proteins
with long disordered regions. J. Mol. Evol. 55, 104–110 (2002). https://doi.org/10.1007/
s00239-001-2309-6
62. Daughdrill, G.W., Narayanaswami, P., Gilmore, S.H., et al.: Dynamic behavior of an intrinsi-
cally unstructured linker domain is conserved in the face of negligible amino acid sequence
conservation. J. Mol. Evol. 65, 277–288 (2007). https://doi.org/10.1007/s00239-007-9011-2
63. Peng, K., Radivojac, P., Vucetic, S., et al.: Length-dependent prediction of protein intrinsic
disorder. BMC Bioinform. 7, 208 (2006). https://doi.org/10.1186/1471-2105-7-208
64. Melamud, E., Moult, J.: Evaluation of disorder predictions in CASP5. Proteins 53(Suppl 6),
561–565 (2003). https://doi.org/10.1002/prot.10533
65. Jin, Y., Dunbrack Jr., R.L.: Assessment of disorder predictions in CASP6. Proteins 61(Suppl
7), 167–175 (2005). https://doi.org/10.1002/prot.20734
66. Bordoli, L., Kiefer, F., Schwede, T.: Assessment of disorder predictions in CASP7. Proteins
69(Suppl 8), 129–136 (2007). https://doi.org/10.1002/prot.21671
67. Noivirt-Brik, O., Prilusky, J., Sussman, J.L.: Assessment of disorder predictions in CASP8.
Proteins 77(Suppl 9), 210–216 (2009). https://doi.org/10.1002/prot.22586
68. Monastyrskyy, B., Fidelis, K., Moult, J., et al.: Evaluation of disorder predictions in CASP9.
Proteins 79(Suppl 10), 107–118 (2011). https://doi.org/10.1002/prot.23161
69. Monastyrskyy, B., Kryshtafovych, A., Moult, J., et al.: Assessment of protein disorder region
predictions in CASP10. Proteins 82(Suppl 2), 127–137 (2014). https://doi.org/10.1002/prot.
24391
70. Liu, Y., Wang, X., Liu, B.: A comprehensive review and comparison of existing computational
methods for intrinsically disordered protein and region prediction. Brief. Bioinform. (2017).
https://doi.org/10.1093/bib/bbx126
71. Dosztányi, Z., Sándor, M., Tompa, P., Simon, I.: Prediction of protein disorder at the domain
level. Curr. Protein Pept. Sci. 8, 161–171 (2007)
72. Schlessinger, A., Punta, M., Yachdav, G., et al.: Improved disorder prediction by combination
of orthogonal approaches. PLoS ONE 4, e4433 (2009). https://doi.org/10.1371/journal.pone.
0004433
73. Necci, M., Piovesan, D., Dosztányi, Z., et al.: A comprehensive assessment of long intrinsic
protein disorder from the DisProt database. Bioinformatics 34, 445–452 (2018). https://doi.
org/10.1093/bioinformatics/btx590
74. Meng, F., Uversky, V.N., Kurgan, L.: Comprehensive review of methods for prediction of
intrinsic disorder and its molecular functions. Cell. Mol. Life Sci. 74, 3069–3090 (2017).
https://doi.org/10.1007/s00018-017-2555-4
75. Romero, Obradovic, Dunker, K.: Sequence data analysis for long disordered regions prediction
in the calcineurin family. Genome Inform Ser Workshop Genome Inform 8, 110–124 (1997)
76. Oldfield, C.J., Cheng, Y., Cortese, M.S., et al.: Coupled folding and binding with alpha-helix-
forming molecular recognition elements. Biochemistry 44, 12454–12470 (2005). https://doi.
org/10.1021/bi050736e
77. Cheng, Y., Oldfield, C.J., Meng, J., et al.: Mining alpha-helix-forming molecular recognition
features with cross species sequence alignments. Biochemistry 46, 13468–13477 (2007).
https://doi.org/10.1021/bi7012273
78. Radivojac, P., Obradović, Z., Brown, C.J., Dunker, A.K.: Prediction of boundaries between
intrinsically ordered and disordered protein regions. Pac. Symp. Biocomput. 216–227 (2003)
79. Obradovic, Z., Peng, K., Vucetic, S., et al.: Predicting intrinsic disorder from amino acid
sequence. Proteins 53(Suppl 6), 566–572 (2003). https://doi.org/10.1002/prot.10532
80. Schaffer, A.A.: Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements. Nucleic Acids Res. 29, 2994–3005 (2001).
https://doi.org/10.1093/nar/29.14.2994
81. Linding, R., Jensen, L.J., Diella, F., et al.: Protein disorder prediction. Structure 11, 1453–1459
(2003). https://doi.org/10.1016/j.str.2003.10.002
82. Yang, Z.R., Thomson, R., McNeil, P., Esnouf, R.M.: RONN: the bio-basis function neural
network technique applied to the detection of natively disordered regions in proteins. Bioin-
formatics 21, 3369–3376 (2005). https://doi.org/10.1093/bioinformatics/bti534
83. Jones, D.T., Cozzetto, D.: DISOPRED3: precise disordered region predictions with anno-
tated protein-binding activity. Bioinformatics 31, 857–863 (2015). https://doi.org/10.1093/
bioinformatics/btu744
84. McGuffin, L.J., Atkins, J.D., Salehe, B.R., et al.: IntFOLD: an integrated server for mod-
elling protein structures and functions from amino acid sequences. Nucleic Acids Res. 43,
W169–W173 (2015). https://doi.org/10.1093/nar/gkv236
85. Cheng, J., Randall, A.Z., Sweredoski, M.J., Baldi, P.: SCRATCH: a protein structure and
structural feature prediction server. Nucleic Acids Res. 33, W72–W76 (2005). https://doi.org/
10.1093/nar/gki396
86. Wang, L., Sauer, U.H.: OnD-CRF: predicting order and disorder in proteins using
conditional random fields. Bioinformatics 24, 1401–1402 (2008).
https://doi.org/10.1093/bioinformatics/btn132
87. Wang, S., Weng, S., Ma, J., Tang, Q.: DeepCNF-D: predicting protein order/disorder regions
by weighted deep convolutional neural fields. Int. J. Mol. Sci. 16, 17315–17330 (2015).
https://doi.org/10.3390/ijms160817315
88. Obradovic, Z., Peng, K., Vucetic, S., et al.: Exploiting heterogeneous sequence properties
improves prediction of protein disorder. Proteins Struct. Funct. Bioinf 61, 176–182 (2005).
https://doi.org/10.1002/prot.20735
89. Walsh, I., Martin, A.J.M., Di Domenico, T., Tosatto, S.C.E.: ESpritz: accurate and fast
prediction of protein disorder. Bioinformatics 28, 503–509 (2012). https://doi.org/10.1093/
bioinformatics/btr682
90. Xue, B., Dunbrack, R.L., Williams, R.W., et al.: PONDR-FIT: a meta-predictor of intrinsically
disordered amino acids. Biochim. Biophys. Acta 1804, 996–1010 (2010). https://doi.org/10.
1016/j.bbapap.2010.01.011
91. Prilusky, J., Felder, C.E., Zeev-Ben-Mordehai, T., et al.: FoldIndex: a simple tool to predict
whether a given protein sequence is intrinsically unfolded. Bioinformatics 21, 3435–3438
(2005). https://doi.org/10.1093/bioinformatics/bti537
92. Dosztányi, Z., Csizmók, V., Tompa, P., Simon, I.: The pairwise energy content estimated from
amino acid composition discriminates between folded and intrinsically unstructured proteins.
J. Mol. Biol. 347, 827–839 (2005). https://doi.org/10.1016/j.jmb.2005.01.071
93. Mizianty, M.J., Peng, Z., Kurgan, L.: MFDp2. Intrinsically Disordered Proteins 1, e24428
(2013). https://doi.org/10.4161/idp.24428
94. Fan, X., Kurgan, L.: Accurate prediction of disorder in protein chains with a comprehensive
and empirically designed consensus. J. Biomol. Struct. Dyn. 32, 448–464 (2014). https://doi.
org/10.1080/07391102.2013.775969
95. Mizianty, M.J., Zhang, T., Xue, B., et al.: In-silico prediction of disorder content using
hybrid sequence representation. BMC Bioinform. 12, 245 (2011). https://doi.org/10.1186/
1471-2105-12-245
96. Walsh, I., Martin, A.J.M., Di Domenico, T., et al.: CSpritz: accurate prediction of protein dis-
order segments with annotation for homology, secondary structure and linear motifs. Nucleic
Acids Res. 39, W190–W196 (2011). https://doi.org/10.1093/nar/gkr411
97. Zhang, T., Faraggi, E., Xue, B., et al.: SPINE-D: accurate prediction of short and long disor-
dered regions by a single neural-network based method. J. Biomol. Struct. Dyn. 29, 799–813
(2012). https://doi.org/10.1080/073911012010525022
98. Bujnicki, J.M., Elofsson, A., Fischer, D., Rychlewski, L.: LiveBench-2: large-scale automated
evaluation of protein structure prediction servers. Proteins Suppl. 5, 184–191 (2001)
99. McGuffin, L.J.: Intrinsic disorder prediction from the analysis of multiple protein fold recog-
nition models. Bioinformatics 24, 1798–1804 (2008). https://doi.org/10.1093/bioinformatics/
btn326
100. Lobanov, M.Y., Galzitskaya, O.V.: The Ising model for prediction of disordered residues from
protein sequence alone. Phys. Biol. 8, 035004 (2011). https://doi.org/10.1088/1478-3975/8/
3/035004
101. Lobanov, M.Y., Sokolovskiy, I.V., Galzitskaya, O.V.: IsUnstruct: prediction of the residue
status to be ordered or disordered in the protein chain by a method based on the Ising model. J.
Biomol. Struct. Dyn. 31, 1034–1043 (2013). https://doi.org/10.1080/07391102.2012.718529
102. Dosztányi, Z.: Prediction of protein disorder based on IUPred. Protein Sci. 27, 331–340
(2018). https://doi.org/10.1002/pro.3334
103. Thomas, P.D., Dill, K.A.: An iterative method for extracting energy-like quantities from
protein structures. Proc. Natl. Acad. Sci. U S A 93, 11628–11633 (1996)
104. Shortle, D.: Propensities, probabilities, and the Boltzmann hypothesis. Protein Sci. 12,
1298–1302 (2003). https://doi.org/10.1110/ps.0306903
105. Dosztanyi, Z., Csizmok, V., Tompa, P., Simon, I.: IUPred: web server for the prediction of
intrinsically unstructured regions of proteins based on estimated energy content. Bioinfor-
matics 21, 3433–3434 (2005). https://doi.org/10.1093/bioinformatics/bti541
106. Dosztányi, Z., Mészáros, B., Simon, I.: ANCHOR: web server for predicting protein binding
regions in disordered proteins. Bioinformatics 25, 2745–2746 (2009). https://doi.org/10.1093/
bioinformatics/btp518
107. Disfani, F.M., Hsu, W.-L., Mizianty, M.J., et al.: MoRFpred, a computational tool for
sequence-based prediction and characterization of short disorder-to-order transitioning
binding regions in proteins. Bioinformatics 28, i75–i83 (2012). https://doi.org/10.1093/
bioinformatics/bts209
108. Malhis, N., Jacobson, M., Gsponer, J.: http://www.chibi.ubc.ca/faculty/joerg-gsponer/gsponer-lab/software/morf_chibi/
(2016). Accessed 31 Jan 2018
109. Fuxreiter, M., Tompa, P., Simon, I.: Local structural disorder imparts plasticity on linear
motifs. Bioinformatics 23, 950–956 (2007). https://doi.org/10.1093/bioinformatics/btm035
110. Diella, F., Haslam, N., Chica, C., et al.: Understanding eukaryotic linear motifs and their role
in cell signaling and regulation. Front Biosci. 13, 6580–6603 (2008)
111. Sigrist, C.J.A., Cerutti, L., Hulo, N., et al.: PROSITE: a documented database using patterns
and profiles as motif descriptors. Brief. Bioinform. 3, 265–274 (2002)
112. Neduva, V., Russell, R.B.: Linear motifs: evolutionary interaction switches. FEBS Lett. 579,
3342–3345 (2005). https://doi.org/10.1016/j.febslet.2005.04.005
113. Stein, A., Aloy, P.: Contextual specificity in peptide-mediated protein interactions. PLoS ONE
3, e2524 (2008). https://doi.org/10.1371/journal.pone.0002524
114. Dinkel, H., Van Roey, K., Michael, S., et al.: ELM 2016—Data update and new functionality
of the eukaryotic linear motif resource. Nucleic Acids Res. 44, D294–D300 (2016). https://
doi.org/10.1093/nar/gkv1291
115. Davey, N.E., Travé, G., Gibson, T.J.: How viruses hijack cell regulation. Trends Biochem.
Sci. 36, 159–169 (2011). https://doi.org/10.1016/j.tibs.2010.10.002
116. Davey, N.E., Edwards, R.J., Shields, D.C.: Estimation and efficient computation of the true
probability of recurrence of short linear protein sequence motifs in unrelated proteins. BMC
Bioinform. 11, 14 (2010). https://doi.org/10.1186/1471-2105-11-14
117. Gibson, T.J.: Cell regulation: determined to signal discrete cooperation. Trends Biochem. Sci.
34, 471–482 (2009). https://doi.org/10.1016/j.tibs.2009.06.007
118. Stein, A., Pache, R.A., Bernadó, P., et al.: Dynamic interactions of proteins in complex net-
works: a more structured view. FEBS J. 276, 5390–5405 (2009). https://doi.org/10.1111/j.
1742-4658.2009.07251.x
119. Weatheritt, R.J., Luck, K., Petsalaki, E., et al.: The identification of short linear motif-mediated
interfaces within the human interactome. Bioinformatics 28, 976–982 (2012). https://doi.org/
10.1093/bioinformatics/bts072
120. Dosztányi, Z., Mészáros, B., Simon, I.: Bioinformatical approaches to characterize intrinsi-
cally disordered/unstructured proteins. Brief. Bioinform. 11, 225–243 (2010). https://doi.org/
10.1093/bib/bbp061
121. Lupas, A., Van Dyke, M., Stock, J.: Predicting coiled coils from protein sequences. Science
252, 1162–1164 (1991). https://doi.org/10.1126/science.252.5009.1162
122. McGuffin, L.J., Bryson, K., Jones, D.T.: The PSIPRED protein structure prediction server.
Bioinformatics 16, 404–405 (2000)
123. Jones, D.T.: Protein secondary structure prediction based on position-specific scoring matri-
ces. J. Mol. Biol. 292, 195–202 (1999). https://doi.org/10.1006/jmbi.1999.3091
Theoretical and Computational Aspects
of Protein Structural Alignment

Paweł Daniluk and Bogdan Lesyng

Abstract Computing alignments of proteins based on their structure is one of the
fundamental tasks of bioinformatics. It is crucial in all kinds of comparative analysis
as well as in performing evolutionary and functional classification. Whereas deter-
mination of sequence relationships is well founded in statistical models, there is
still considerable uncertainty over how to describe geometric relationships between
proteins. Continuous growth of structural databases calls for fast and reliable algo-
rithmic methods, enabling one to effectively compute alignments of pairs and larger
sets of protein molecules. Although such methodologies have been developed over
the past two decades, there exist so-called “difficult similarities” which may include
repeats, insertions or deletions, permutations and conformational changes. A brief
overview of existing methodologies with emphasis on the different approaches to
decomposition of structures into smaller fragments is followed by a presentation of a
formalism of local descriptors of protein structures. A formal definition of the prob-
lem of computing optimal alignments accommodating aforementioned difficulties is
presented along with an analysis of the computational complexity of its important
variants. Examples of “difficult similarities” and practical aspects of protein structure
comparison are discussed.

B. Lesyng
Faculty of Physics, Department of Biophysics, University of Warsaw,
Żwirki i Wigury 93, 02-892 Warsaw, Poland
e-mail: lesyng@imdik.pan.pl
P. Daniluk (B) · B. Lesyng
Bioinformatics Laboratory, Mossakowski Medical Research Centre,
Pawinskiego 5, 02-106 Warsaw, Poland
e-mail: pdaniluk@imdik.pan.pl

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_18
1 Introduction

Proteins are biopolymers comprising one or more polypeptide chains. There exist
twenty amino-acid residues which occur in proteins encountered in living organisms.
Thus a first approximation (primary structure) of a protein is its sequence, normally
represented as a string of letters from a 20-letter alphabet. Sequences may be compared
to reveal genetic and evolutionary relationships between proteins. Sequence comparison
is a variant of the well-researched string matching problem, which is usually
solved with the ubiquitous Needleman-Wunsch algorithm [35] or its heuristic
counterparts [2, 39]. A polypeptide chain of a protein after synthesis undergoes a process
of folding in which it obtains a well defined characteristic spatial conformation (ter-
tiary structure). Structure is instrumental to the role a given protein performs in a
living organism. With some simplification, one may assume that a residue sequence
determines a spatial structure, which in turn determines a function. Due to the nature
of evolutionary processes it can be observed that structure is a much more conserved
property than sequence. Even remotely homologous proteins usually have similar
tertiary structure. Therefore, comparison of structures, although more difficult, may
provide more information on evolutionary and functional relationships than sequence
analysis alone.
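As a point of reference for the sequence case, the following is a minimal sketch of Needleman-Wunsch global alignment. The scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions, not values taken from [35]; practical tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)

    def sub(i, j):
        # Score of placing s[i-1] against t[j-1] in one column.
        return match if s[i - 1] == t[j - 1] else mismatch

    # score[i][j] is the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))
```

Note that the dynamic programming recurrence only ever moves forward in both sequences, which is why alignments of this kind necessarily preserve the order of residues.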
Although several methods for protein structure comparison have been developed
during the past two decades, no single “best of all” method exists, and there are many
known cases of so-called difficult similarities, which cannot be correctly solved by
most methods. Relatively little effort has been put into development of formal theories
of this problem, which would enable a thorough analysis of its properties.
The purpose of this study is to give a brief overview of the existing approaches
and methodologies followed by a formal analysis of several variants of the prob-
lem of computing alignments based on a set of local similarities. Description of a
method based on presented theoretical principles along with a few practical aspects
of comparing protein structures are also provided.
This study is organized as follows. The introduction covers basic definitions,
contains a brief overview of popular methods, outlines potential pitfalls and gives a
short introduction into theory of computational complexity. In the following section
the most popular approaches to defining and comparing structural fragments are
presented. The third and fourth sections are devoted to the problems of computing
an optimal alignment of two or more protein structures and include an analysis of the
computational complexity of several variants of these problems. In the fifth section
we present practical but rarely used techniques which may be useful in similarity
analysis, as well as several case studies.

1.1 Alignments and Superpositions

The notion of an alignment in the context of biological sequences originates from the
concept of introducing gaps into sequences written one below the other, to maximize
the number of columns with identical or similar residues. Alternatively, one may
view an alignment as a renumbering of residues such that equivalent residues have
the same number. In actual applications an alignment with sequence identity of 30%
may be considered significant. Thus, under typical conditions it is impossible to
consider alignments which do not preserve order of residues. They would introduce
more noise (false positives) than can be coped with. It is well known, however,
that structures containing segment swaps or circular permutations may have similar
shapes and perform related functions [18, 30]. Similarities of this kind are virtually
undetectable by conventional sequence analysis.
In this study we place emphasis on this particular issue, and diverge from the
traditional understanding of an alignment. We assume that an alignment may be
any correspondence between residues of aligned structures. We will avoid the use
of terms such as “sequence” or “structure” alignment, which may indicate the basis
of the similarity which an alignment is supposed to maximize, but which have no
connection to the mapping of residues once it has been computed.
Whenever spatial objects like protein structures are being compared, visual representation
of similarity plays a significant role for the end user. Similar molecules
can usually be superimposed by an isometric transformation that minimizes the distances
between corresponding residues. In this respect superposition is secondary
to the alignment, since it may be computed only after corresponding residues have
been identified. In extreme cases it may even happen that a biologically correct
alignment results in a visually poor superposition.

1.2 Existing Methods

Methodologies of protein structure comparison may be classified into two major cate-
gories – global and local. In the first one an alignment and superposition of molecules
are iteratively improved. Starting with a given alignment, an optimal superposition
is computed, then a new alignment is extracted from the superposition by identifying
pairs of residues spatially close to each other. Such methods are effective, assum-
ing conformational variability is limited and similarity is significant enough for the
process to converge quickly.
Alternatively, computing an alignment may start with identifying a set of local sim-
ilarities, which afterwards serve as building blocks for the global alignment. There are
several methods of decomposing structures into smaller fragments. The most popular
are inter-residue distances (SSAP [37], DALI [21], PAUL [50]), single continuous
segments of the main chain (CE [46]) or secondary structure elements (SSEs) (VAST
[17], SARF [1], MATRAS [26], GANGSTA [19]). Less popular include Delaunay
triangulation (TOPOFIT [22]), spherical polar Fourier representations (3D-BLAST
[31]), and geometric hashing (Cα-match [4]). Local descriptors of protein structures
(see Sect. 2.2) have also been successfully applied (DEDAL [10]). Global alignment
is computed by selecting the largest consistent set of local similarities. Definitions of
consistency and methods for searching the solution space vary. Usually it is required
that correspondences between residues given by two consistent alignments have to
agree on all residues common to both of them. Sometimes additional criteria are used,
such as the similarity of transformations required to superimpose fragments [4] or
the ordering in the protein sequence. The search of the solution space is
performed using algorithms for finding isomorphic subgraphs or cliques, clustering,
dynamic programming or other techniques. Some methods use a one-dimensional
representation of structure – where each residue is substituted with a characteriza-
tion of its local features – and use dynamic programming to align such artificial
sequences (e.g. SHEBA [23]). Due to the computational complexity caused by the
combinatorial size of the solution space, solutions containing circular permutations or
segment swaps are often disregarded even if the method could find them in theory. Such a
situation takes place with the DALI method and its publicly available implementation
DaliLite [20, 21]. Sometimes spatial distortions are accommodated by introducing
“hinges” (FATCAT [52], FlexProt [43], ProtDeform [41], FlexSnap [42]) (Table 1).
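The consistency requirement described above can be sketched as follows. This is one possible reading, under the assumption that each local similarity is stored as a dictionary mapping residue numbers of the first structure to residue numbers of the second, and that each dictionary is itself one-to-one; the function name `consistent` is hypothetical.

```python
def consistent(f, g):
    """Check whether two local alignments can be part of one global alignment.

    f and g map residue numbers of structure A to residue numbers of
    structure B. They are consistent if they agree on every residue they
    share, looking from either structure.
    """
    merged = dict(f)
    for res_a, res_b in g.items():
        # A residue of A must not be mapped to two different residues of B.
        if merged.setdefault(res_a, res_b) != res_b:
            return False
    # A residue of B must not be the image of two different residues of A.
    return len(set(merged.values())) == len(merged)
```

A global alignment is then a union of pairwise consistent local similarities; selecting the largest such union is the hard combinatorial part that the clique, clustering and dynamic programming techniques mentioned above address.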
Table 1 Selected methods for computing alignments of two protein structures

Method name       Year  Authors                 Flexible  Segment swaps
SSAP [37]         1989  Orengo and Taylor       No        No
Cα-match [4]      1993  Bachar et al.           No        Yes
DALI [21]         1993  Holm and Sander         No        No (a)
VAST [17]         1996  Gibrat et al.           No        No
SARF [1]          1996  Alexandrov              No        Yes
CE [46]           1998  Shindyalov and Bourne   No        No
SHEBA [23]        2000  Jung and Lee            No        No
MATRAS [26]       2000  Kawabata and Nishikawa  No        No
FATCAT [52]       2003  Ye and Godzik           Yes       No
TOPOFIT [22]      2004  Ilyin et al.            No        No
FlexProt [43]     2004  Shatsky et al.          Yes       No
GANGSTA [19]      2008  Guerler and Knapp       No        Yes
ProtDeform [41]   2009  Rocha et al.            Yes       No
3D-BLAST [31]     2010  Mavridis and Ritchie    No        No
FlexSnap [42]     2010  Salem et al.            Yes       Yes
PAUL [50]         2010  Wohlers et al.          No        No
DEDAL [10]        2011  Daniluk and Lesyng      Yes       Yes
(a) DALI is in principle capable of computing alignments with segment swaps, but the publicly
available implementation (DaliLite) lacks this feature

The problem of computing multiple alignments of protein structures is much
harder and has received less attention. There are two basic approaches to defining and computing
a multiple alignment: searching for a substructure common to all structures compared,
or searching for all similarities as long as equivalences between residues are
unambiguous (see Sect. 4.1). Existing methods are often generalizations of methods
for computing pairwise alignments. Based on the similarity of all pairs, a binary tree
is built. Its leaves correspond to structures, while its internal nodes correspond to multiple
alignments of the structures in their descendants, which are computed in a manner similar to aligning
two structures. When computation ends, the root node contains a multiple alignment
of all structures (MUSTANG [28], POSA [53]). Sometimes a strategy similar to
hierarchical clustering is used. Starting with single structures, at each step the two
most similar multiple alignments (or structures) are combined (Matt [33]). There
also exist methods where all structures are considered at the same time. MASS [13]
is based on searching for maximal correspondences between SSEs assuming rigid
global superpositions. On the other hand, MultiProt [44] attempts to align a chosen
pivot structure with all others. This process is repeated for all selections of pivot, and
the best multiple alignment is returned. DAMA [9], an extension of the DEDAL
method employing an evolutionary algorithm, is currently under development.

1.3 Difficult Similarities

In many cases, the similarity between protein structures is either obvious or non-
existent. Nevertheless, there exists a “grey area” of so-called difficult similarities. It
comprises cases where similarity between sequences cannot be detected or is mis-
leading, the evolutionary relationship is not obvious, or where there exist significant
distortions that obscure the similarity. These distortions may include repeats, inser-
tions or deletions, permutations or substantial conformational changes.
Repeating motifs involve a significant combinatorial burden, because in principle
all assignments between occurrences of such a motif should be assessed. This is par-
ticularly challenging in case of the so-called propeller folds, which contain structures
similar to a marine propeller. They are composed of 4–8 blades, resulting in up to 8!
(see footnote 1) possible assignments of blades and at least 8 equivalent alignments.
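The two counts quoted for propeller folds can be checked with a few lines of code. The sketch below assumes that the "equivalent alignments" are the cyclic rotations of the blade order, which is our reading of the text rather than something it states explicitly; the helper `rotations` is hypothetical.

```python
from math import factorial


def rotations(blades):
    """All cyclic rotations of a tuple of blade labels."""
    n = len(blades)
    return [tuple(blades[(i + k) % n] for i in range(n)) for k in range(n)]


blades = tuple(range(8))                    # an 8-bladed propeller
all_assignments = factorial(len(blades))    # every blade-to-blade assignment
cyclic = rotations(blades)                  # assignments that keep blade order
```

Here `all_assignments` is 8! = 40320, while `cyclic` holds only 8 assignments. A method that enumerates blade assignments naively therefore explores a space several thousand times larger than the set of order-preserving solutions.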
Insertions and deletions may be a result of genomic rearrangements. After losing
a segment of a significant length a protein may retain its conformation. Nevertheless
the similarity is obfuscated by size differences, and the fact that some fragments
of the smaller structure usually have a different conformation to fill the gap after
missing residues.
Permutations probably pose the most fundamental challenge since the whole con-
cept of an alignment has to be readjusted. Circular permutations are the most common
example. They may be caused by gene duplication or rearrangements of the protein
chain during folding [49]. Two protein chains are circular permutations of each other
if they can be divided into two subunits (A1 − B1 and A2 − B2 respectively), such
that structures A1 − B1 and B2 − A2 are similar in the traditional sense (without
permutations). More complex rearrangements (e.g. caused by changes of the num-
ber of residues in loop regions) have been observed [18]. Oligomeric structures are
another example of sequence rearrangements. Sometimes proteins composed of sev-
eral chains are similar despite the fact that chain boundaries are placed differently or
that numbers of chains differ (in such a case, chains cannot be compared separately).
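For intuition, a circular permutation can be illustrated on strings, where "similar" is replaced by "identical". The sketch below is only an analogy under that simplifying assumption: real structures require a similarity measure rather than equality, and the function name is hypothetical.

```python
def circular_permutation_cut(s, t):
    """Return k such that t == s[k:] + s[:k], or None if no such k exists.

    Splitting s into A = s[:k] and B = s[k:] gives s = A-B and t = B-A,
    which is the string analogue of a circular permutation.
    """
    if len(s) != len(t):
        return None
    for k in range(len(s)):
        if s[k:] + s[:k] == t:
            return k
    return None
```

For example, with one letter per secondary structure element, "ABCDE" and "CDEAB" are circular permutations of each other with the cut after the second element, whereas "ABC" and "ACB" are not related by any cut.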

1 Factorial: 8! = 1 · 2 · …· 8 = 40320.
Finally, protein structures are not rigid. Many functions they perform involve conformational
changes [12, 16]. Furthermore, experimental methods used to determine
tertiary structure usually involve changing environmental conditions to nonphysiological ones,
which may distort the studied structure (see Sect. 5.3 for an example).
Conformational variability is especially difficult, because assessing structural sim-
ilarity relies on geometrical data. Distinguishing between “natural” flexibility and
dissimilarity may be challenging even to experts.

1.4 Computational Complexity

This study presents several results concerning the computational complexity of pro-
tein structure alignment. In this section we provide a brief introduction to the theory
of computational complexity.
Traditionally computational complexity theory is applied to so-called decision
problems, which originate from the formalism of recognizing languages by finite-
state automata or Turing machines. In this formalism, instances of a problem are
encoded as words over a certain alphabet, and words corresponding to instances
with positive answers belong to a language recognized by a machine. Decision problems
have a strict form, “For a given instance I, determine whether I satisfies a
predicate P(I)”, which is quite different from the open form of optimization problems,
which can be stated as follows: “For a given instance I, find a solution S which has the
maximal value of property p among all valid solutions of I”. Any optimization problem,
however, can be transformed into a decision problem of the form “For a given
instance I and a value v, does there exist a solution with a value of property p greater
than or equal to v?”
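The relationship between the two forms can be sketched in code. Assuming the property p takes integer values in a known range and the decision oracle is monotone (if a solution of value v exists, so does one of any smaller value), the optimal value can be recovered with a logarithmic number of oracle calls. The toy problem and the names `optimum_from_decision` and `fits` are illustrative assumptions.

```python
def optimum_from_decision(decide, lo, hi):
    """Largest v in [lo, hi] with decide(v) true.

    Assumes decide(lo) is true and that decide is monotone: once it
    becomes false it stays false for all larger v.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if decide(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def fits(weights, capacity, v):
    """Toy decision problem: can at least v items (non-negative weights)
    be chosen with total weight not exceeding capacity?"""
    return sum(sorted(weights)[:v]) <= capacity


weights, capacity = [4, 2, 7, 1], 7
best = optimum_from_decision(lambda v: fits(weights, capacity, v), 0, len(weights))
```

In this toy instance `best` is 3. The point of the sketch is only that a polynomial-time decision procedure yields a polynomial-time way of finding the optimal value, so studying the decision form loses little generality.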
There are two fundamental classes of decision problems. The first one (P) contains
problems which can be solved by a Turing machine in polynomial time. This in
practice means, that a polynomial time algorithm for solving such a problem exists,
and can be implemented on any computer. Such problems are considered tractable or
efficiently solvable since computation time for any instance is limited by a polynomial
function of its size. The second major class (NP) comprises problems which can be
solved in polynomial time by a non-deterministic Turing machine. This informally
means, that given a potential solution to the problem, it is possible to check if it is
valid in polynomial time. All problems from P belong to NP, because if the solution
can be computed in polynomial time, it can also be efficiently checked. There are,
however, problems in NP for which a polynomial time algorithm is not known. Some
of them belong to a subclass of NP-complete problems, which may be deemed as
a collection of the “hardest” problems in NP. It can be proven that, if there exists
a polynomial time solution for any NP-complete problem, all problems in NP also
have a polynomial time solution, and thus P = NP. Until now finding such a solution,
or proving that it does not exist, remains an open problem.
Problems in NP can be “ranked” by their “difficulty”, with NP-complete problems
being the hardest. In order to prove that a given problem P1 is NP-complete, it is
enough to prove that it belongs to NP and that it is at least as “hard” as a known NP-complete
problem (P2). This is performed by constructing a so-called reduction of
P2 into P1. A reduction is a recipe for converting all instances of P2 into instances
of P1 preserving the decision result (i.e. accepted instances of P2 are converted to
accepted instances of P1, and rejected instances to rejected ones). The reduction has to be performed in
polynomial time. This proves that if P1 is tractable, any instance of P2 can also be
solved in polynomial time by converting it into an instance of P1 and applying an
algorithm for P1. Therefore, if P1 belongs to P, so does P2, along with all problems
in NP.
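A classic textbook reduction, not taken from this chapter, illustrates the recipe: Independent Set (playing the role of P2) reduces to Clique (P1) by complementing the graph. In the sketch below graphs are assumed to be dictionaries mapping each vertex to the set of its neighbours, and the clique solver is a brute-force stand-in for "an algorithm for P1".

```python
from itertools import combinations


def complement(adj):
    """The reduction itself: build the complement graph in polynomial time."""
    nodes = set(adj)
    return {u: nodes - adj[u] - {u} for u in adj}


def is_clique(adj, nodes):
    """Polynomial-time check of a proposed solution (a certificate)."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def has_clique(adj, k):
    """Stand-in solver for P1; brute force, exponential in general."""
    return any(is_clique(adj, c) for c in combinations(adj, k))


def has_independent_set(adj, k):
    """Solve P2 by converting the instance and calling the solver for P1."""
    return has_clique(complement(adj), k)
```

A set of vertices is independent in a graph exactly when it is a clique in the complement, so accepted instances map to accepted instances and rejected ones to rejected ones. The function `is_clique` also shows why Clique belongs to NP: verifying a proposed solution is cheap even though finding one may not be.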
More information on computational complexity may be found in the seminal book
[15].

2 Fragment-Based Methods

2.1 Continuous Segments, Segment Pairs

Continuous backbone segments are tremendously popular in all computational applications
of protein structure analysis. They have been successfully applied in protein
structure prediction [8] and are instrumental in structure comparison. They are small
and easy to compare, and thus are a good choice for detecting local similarities which
might serve as starting points for a global alignment. Generally only segments of a
certain length (varying from 5 to 15) are considered.
As long as two segments of the same length are compared, there is no need to con-
sider any non-trivial mapping between their residues. It is sufficient to apply a distance
measure defined on sets of points. Root Mean Square Distance (RMSD) is the metric
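of choice. For two sets of points $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$ of
size $n$ ($a_i, b_i \in \mathbb{R}^3$) it is defined as:

\[
\mathrm{RMSD}(A, B) = \min_{\substack{R\ \text{a rotation in}\ \mathbb{R}^3 \\ T \in \mathbb{R}^3}} \sqrt{\frac{\sum_{i=1}^{n} \left| a_i - (R b_i + T) \right|^2}{n}}
\]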

This formula corresponds to the process of isometric transformation of B such
that the sum of squares of distances between respective points in A and transformed
points in B is minimized. Although it seems at first glance that computing RMSD is
a difficult optimization problem, there exists an efficient algorithm for this (the Kabsch
algorithm [24, 25]). Although reproducing it in detail is beyond the scope of this
study, we would like to identify one of its relevant features. The preliminary step of
the Kabsch algorithm involves computing the geometrical centers of A and B as well as
a matrix M:

\[
M_{ij} = \sum_{k=1}^{n} a_{ki} b_{kj}
\]
where $a_{ki}$ ($b_{ki}$) denotes the ith element of vector $a_k$ ($b_k$). One can easily see that M and
the geometrical centers can be recycled. Extending sets A and B after computing their
RMSD can be easily implemented, greatly reducing computational complexity (e.g.
computing RMSD of segments of length n requires O(n) time, just like computing
distances between all prefixes of A and B). A pair of similar segments is usually
called an aligned fragment pair (AFP).
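The recycling of M and the geometrical centers can be sketched as a set of running sums that are updated in constant time whenever the fragments are extended by one residue pair. This is a minimal illustration, not the authors' implementation: the class name is hypothetical, and it assumes that the centered form of the matrix is what the rotation step needs, obtained from the raw sums via the identity sum_k (a_k − ā)_i (b_k − b̄)_j = sum_k a_ki b_kj − n ā_i b̄_j. The rotation step itself is omitted.

```python
class RunningSuperpositionSums:
    """Accumulates the quantities needed by the preliminary Kabsch step."""

    def __init__(self):
        self.n = 0
        self.sum_a = [0.0, 0.0, 0.0]
        self.sum_b = [0.0, 0.0, 0.0]
        self.sum_ab = [[0.0] * 3 for _ in range(3)]   # raw M_ij = sum_k a_ki * b_kj

    def add(self, a, b):
        """Extend both point sets by one corresponding pair in O(1) time."""
        self.n += 1
        for i in range(3):
            self.sum_a[i] += a[i]
            self.sum_b[i] += b[i]
            for j in range(3):
                self.sum_ab[i][j] += a[i] * b[j]

    def centers(self):
        """Geometrical centers of the points accumulated so far."""
        return ([x / self.n for x in self.sum_a],
                [x / self.n for x in self.sum_b])

    def centered_matrix(self):
        """Matrix M for the centered point sets, derived from the raw sums."""
        ca, cb = self.centers()
        return [[self.sum_ab[i][j] - self.n * ca[i] * cb[j] for j in range(3)]
                for i in range(3)]
```

Calling `add` once per residue pair while scanning two segments yields the sums for every prefix along the way, which is what makes it cheap to grow candidate fragments during a search.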
To the best of our knowledge, all alignment methods using AFPs employ some sort of
a global similarity measure. It is necessary because the fact that an alignment is built
from AFPs does not imply actual similarity of aligned substructures. The inability to
capture spatial relationships between residues distant in the sequence but neighboring
in space is the main drawback of continuous segments. It can be amended by using
fragments encompassing at least two disjoint pieces of backbone.
DALI [21], a popular and highly regarded method, uses pairs of continuous segments
of length 6. A similarity measure is based on the distances between points representing
residues. If $A = \{a_1, \ldots, a_{12}\}$ and $B = \{b_1, \ldots, b_{12}\}$ are residues belonging
to certain pairs of hexapeptides in structures A and B, similarity is computed as
follows:

\[
S = \sum_{i=1}^{12} \sum_{j=1}^{12} \left( \theta - \left| d(a_i, a_j) - d(b_i, b_j) \right| \right)
\]

where d(a, b) is the Euclidean distance between points a and b, and θ is the parameter
determining the zero level of similarity. The distance-based approach is appealing
because distance maps are invariant under isometric transformations, hence there is
no need to search for a transformation giving the optimal superposition. It is also easy
to implement and fast for small fragments (although its computational complexity is
bounded by O(N^2) in the number N of residues compared).
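The distance-based score and its invariance under isometric transformations can be sketched as follows. The function name `distance_score` is hypothetical, the value of θ used in the usage note is an arbitrary illustration rather than DALI's parameterization, and the absolute difference of distances is our reading of the formula above.

```python
from math import dist


def distance_score(A, B, theta):
    """Sum over all residue pairs of theta - |d(a_i, a_j) - d(b_i, b_j)|.

    A and B are equally long lists of 3D points; A[i] corresponds to B[i].
    """
    if len(A) != len(B):
        raise ValueError("point sets must have the same size")
    n = len(A)
    return sum(theta - abs(dist(A[i], A[j]) - dist(B[i], B[j]))
               for i in range(n) for j in range(n))


# Twelve points, and a copy rotated by 90 degrees about the z axis and shifted.
A = [(float(i), float((i * i) % 5), float((3 * i) % 7)) for i in range(12)]
B = [(-y + 1.5, x - 2.0, z + 0.25) for (x, y, z) in A]
```

Because only intra-structure distances enter the score, `distance_score(A, B, theta)` reaches its maximum of 144·θ for these two sets without any superposition being computed, while displacing a single point of B lowers it.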

2.2 Local Descriptors of Protein Structure

A Local Descriptor of Protein Structure (see footnote 2) is a small fragment of a protein structure
which encompasses the physico-chemical environment of a given residue. In principle
it can be defined for any residue in a protein molecule. It is built by identifying residues
the selected (central) residue is in contact with. Then, for each of the identified
residues a 5-residue continuous piece of backbone (element) is added to a descriptor.
Overlapping elements are combined into segments (see Fig. 1).
Substructures chosen with this method may comprise several disjoint segments.
Thus, a descriptor may be viewed as a subset of residues enclosed in an irregular
surface corresponding to the range of physico-chemical interactions of the central
residue with its molecular environment. The radius of a descriptor approximates the
range of residue-residue interactions. In contrast to continuous segments, which are
limited to one-dimensional neighborhoods along the protein sequence, local descriptors
contain information about the spatial environments of residues. They are complete,
in the sense that they contain all interacting residues, not some arbitrarily chosen
ones. The actual shape (content) of a descriptor depends on the definition of a contact.

Fig. 1 A descriptor built for residue MET70 of an ASTRAL domain d1lg7a_ comprises 9 contacts (dotted lines) between the central residue (red) and residues being centers of elements. Some of the 5-residue elements overlap and constitute longer segments (two β-strands and an α-helix)

2 In the following text we will simply call it a descriptor.

2.2.1 Definition

We consider two residues to be in contact if one of the following conditions is
satisfied:
1. dα ≤ 6.5 Å,
2. dβx ≤ 8 Å and dα − dβx ≥ 0.75 Å.
In the above, dα denotes the distance between Cα atoms, and dβx the distance between
Cβx points (see Fig. 2), which are computed by extending a Cα − Cβx vector by 1 Å (see footnote 3).
Such a definition of a contact favors residues whose sidechains “point” towards each
other, and is convenient to compute, as it does not depend on sidechain atoms, which
may often be missing.
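The contact criterion translates directly into code. The sketch below makes one assumption that should be checked against the original method: the Cβx point is taken to lie 1 Å beyond Cβ on the line from Cα through Cβ. The function names are hypothetical. Glycine is handled by passing no Cβ, in which case Cβx coincides with Cα; the alanine convention from footnote 3 corresponds to `extension=0.0`.

```python
from math import dist


def cbx_point(ca, cb=None, extension=1.0):
    """Extended side-chain point for a residue (assumed construction).

    ca, cb are 3D coordinates of the C-alpha and C-beta atoms in Å.
    With cb=None (glycine) the C-alpha position is returned.
    """
    if cb is None:
        return tuple(ca)
    length = dist(ca, cb)
    return tuple(cb[i] + extension * (cb[i] - ca[i]) / length for i in range(3))


def in_contact(ca1, cbx1, ca2, cbx2):
    """Contact test from conditions 1 and 2 above (distances in Å)."""
    d_alpha = dist(ca1, ca2)
    d_beta_x = dist(cbx1, cbx2)
    return d_alpha <= 6.5 or (d_beta_x <= 8.0 and d_alpha - d_beta_x >= 0.75)
```

Two residues whose Cα atoms are 7 Å apart fail the first condition, but are still reported as in contact when their side chains point towards each other, because the Cβx points are then much closer than the Cα atoms. If the side chains point away from each other, neither condition holds.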
A protein structure can be viewed as a sequence of residues ($a^{(1)} a^{(2)} \ldots a^{(N-1)} a^{(N)}$).
For a given residue $a^{(p)}$, a descriptor element for $a^{(p)}$ ($\mathrm{El}(a^{(p)})$) is a 5-residue-long
segment of the backbone around $a^{(p)}$:

\[
\mathrm{El}(a^{(p)}) = a^{(p-2)} a^{(p-1)} a^{(p)} a^{(p+1)} a^{(p+2)}
\]

Fig. 2 Histidine with point Cβx (orange); remaining atoms: Cα – green, Cβ – yellow, O – red, N – blue, remaining carbon atoms – white

3 For glycine we assume Cβx = Cα; for alanine Cβx = Cβ.

It is convenient to view a descriptor of a residue a as a triple $\langle a, C, R \rangle$, where
C is a set of residues which are in contact with a, and R is the set-theoretic union of
descriptor elements for a and the residues in C. We will say that C is the contact pattern
of a descriptor. One should note that, according to this definition, residues which are
located close to backbone termini or gaps do not have descriptor elements, and thus
cannot belong to the contact pattern of a valid descriptor. Nevertheless, such residues
may belong to the set R.
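The construction of the triple can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: residues are 0-based integer indices of a single gap-free chain of length `n`, `contact` is any user-supplied predicate (for example the contact test defined above), and the helper names are hypothetical.

```python
def element(p, n):
    """Indices of the 5-residue element centred at p, or None close to a chain terminus."""
    if p - 2 < 0 or p + 2 >= n:
        return None
    return list(range(p - 2, p + 3))

def build_descriptor(p, n, contact):
    """Return (p, C, R) for central residue p, or None if p has no element."""
    centre = element(p, n)
    if centre is None:
        return None
    # Only residues that have their own element may enter the contact pattern C.
    C = {q for q in range(n)
         if q != p and element(q, n) is not None and contact(p, q)}
    # R is the union of the central element and the elements of all contacts.
    R = set(centre)
    for q in C:
        R.update(element(q, n))
    return p, C, R
```

Note that a terminal residue can still appear in R when it lies within the element of a contact, even though it cannot be a member of C.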

2.2.2 Similarity of Local Descriptors

Let $D_1 = \langle a_1, C_1, R_1 \rangle$ and $D_2 = \langle a_2, C_2, R_2 \rangle$ be two descriptors. We will call any
partial function $\varphi : C_1 \to C_2$ a mapping of contact patterns $C_1$ and $C_2$. A mapping of
contact patterns is valid if it can be unambiguously extended to a function $\psi : R_1 \to R_2$ such that:
1. If $\varphi(a^{(i)}) = b^{(j)}$, then

$$\psi(El(a^{(i)})) = \psi(a^{(i-2)})\,\psi(a^{(i-1)})\,\psi(a^{(i)})\,\psi(a^{(i+1)})\,\psi(a^{(i+2)}) = b^{(j-2)} b^{(j-1)} b^{(j)} b^{(j+1)} b^{(j+2)} = El(b^{(j)})$$

2. $\psi(El(a_1)) = El(a_2)$
In simple terms, a mapping contains pairs of corresponding contacts. It does not
necessarily cover all contacts in both descriptors, but each contact may have only
one corresponding counterpart in the other descriptor. To be valid a mapping has to
preserve overlapping of elements. Contacts with overlapping elements can be mapped
only to contacts with the same overlap, while non-overlapping contacts may have
only non-overlapping counterparts. We will say that a valid mapping constitutes an
alignment of descriptors. One should note that under this definition an alignment may

contain so-called segment swaps (i.e. aligned segments may have different order in
structures they originate from). This is a fundamental difference between traditional
understanding of alignment and our definition.
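The validity condition can be checked mechanically. The sketch below is only an illustration: it assumes residues are integer indices within a single chain per descriptor, represents the mapping of contacts as a dictionary, and additionally assumes that the extension ψ must be one-to-one (which is how the requirement that overlaps be preserved is encoded here). The function name `extend_mapping` is hypothetical.

```python
def extend_mapping(phi, centre1, centre2):
    """Extend a contact mapping phi (dict i -> j) to a residue mapping psi.

    Returns psi as a dict, or None when the mapping is not valid.
    """
    psi = {}
    # The central residues are always paired (condition 2).
    pairs = [(centre1, centre2)] + sorted(phi.items())
    for i, j in pairs:
        for off in range(-2, 3):
            src, dst = i + off, j + off
            if psi.setdefault(src, dst) != dst:
                return None   # one residue would need two different images
    if len(set(psi.values())) != len(psi):
        return None           # disjoint elements mapped onto overlapping ones
    return psi
```

Because no ordering of the contacts is imposed, a mapping accepted by this check may contain segment swaps.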
For two descriptors to be similar, an alignment between them has to exist and
satisfy requirements imposed on its size and the spatial similarity of aligned sub-
structures. The size can be measured with the number of aligned residues, elements
or segments, while spatial similarity may be assessed using a Root Mean Square
Distance (RMSD). This is a two-objective optimization problem, since extending
alignment will most likely increase RMSD between substructures and vice versa.
To reliably solve this problem, we use an extensive search algorithm that finds
alignments satisfying the following conditions:
1. the RMSD of aligned elements must not exceed 1.5 Å,
2. for each pair of aligned elements, the RMSD of substructures consisting of these
elements and respective central elements must not exceed 2.5 Å (i.e. elements
should have the same position relative to the central element),
3. at least half of the segments must be aligned,
4. the RMSD of aligned residues must not exceed 2.5 Å.
The algorithm searches through all alignments satisfying the above conditions.
First, all pairs of elements satisfying conditions 1 and 2 are identified. Then, all
possible assemblies of those pairs are checked for condition 4. If it is not met, they
are reduced by removing the least fitting pairs of elements, until either condition 4
is met or condition 3 is no longer satisfied.

2.2.3 Computational Complexity of Descriptors Comparison

In Sect. 1.4 we have briefly explained the main ideas behind the theory of compu-
tational complexity. We will demonstrate that the problem of assessing descriptor
similarity is NP-complete. We start by providing a formal definition of the decision
problem for finding an optimal descriptor alignment. The definition will be slightly
simpler than the one used in the previous section in order to avoid technical difficul-
ties.

Definition 1 For two descriptors D1 , D2 and constants n and T the Optimal Align-
ment of Descriptors (OAD) problem is to determine whether there exists an align-
ment of D1 and D2 covering no less than n residues such that the RMSD between
aligned residues is not greater than T .

Theorem 1 OAD is NP-complete.

Proof First we notice that it is enough to prove that OAD is NP-complete for one
particular value of T , since, if a problem contains an NP-complete sub-problem it is
NP-complete itself. Thus we will assume that T is large enough for any alignment
to be structurally acceptable (e.g. T = ∞).

The most common way of proving NP-completeness is to define a so-called reduction
of a known NP-complete problem to the one being considered. In this case we
will use the well known 3-PARTITION problem [15, problem SP15].

Definition 2 For a given set A containing 3m elements, a positive integer B and a
function $s : A \to \mathbb{N}$ such that:

$$\forall_{a \in A} \quad \tfrac{1}{4} B < s(a) < \tfrac{1}{2} B,$$

$$\sum_{a \in A} s(a) = mB,$$

the problem of 3-PARTITION is to determine whether there exists a partition of A
into m disjoint subsets $A_1, A_2, \ldots, A_m$ such that:

$$\forall_{1 \le i \le m} \quad \sum_{a \in A_i} s(a) = B$$

It is easy to see that any subset in such partition must contain exactly three
elements. Our reduction will assign an instance of OAD to any instance of 3-
PARTITION. We will show a method of constructing descriptors D1 and D2 for any
m, B and s. Because we have assumed that the threshold for the RMSD of aligned
residues is infinitely large, we don’t have to deal with providing coordinates. It is
enough to give contact patterns.
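For very small instances, 3-PARTITION can be explored by exhaustive search, which may help in following the reduction. The sketch below is illustrative only (its running time is exponential, as expected for an NP-complete problem); the function names are hypothetical, and `oad_target_size` simply evaluates the alignment size threshold n = m(2B + 9) + 5 used in the reduction.

```python
from itertools import combinations

def three_partition(sizes, B):
    """Return a list of index triples, each summing to B, or None if none exists."""
    if len(sizes) % 3 or sum(sizes) != (len(sizes) // 3) * B:
        return None

    def solve(remaining):
        if not remaining:
            return []
        first, rest = remaining[0], remaining[1:]
        # The first unused element must share a triple with two other elements.
        for x, y in combinations(rest, 2):
            if sizes[first] + sizes[x] + sizes[y] == B:
                left = [k for k in rest if k not in (x, y)]
                tail = solve(left)
                if tail is not None:
                    return [(first, x, y)] + tail
        return None

    return solve(list(range(len(sizes))))

def oad_target_size(m, B):
    """Size threshold n of the OAD instance built from a 3-PARTITION instance."""
    return m * (2 * B + 9) + 5
```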
A comb of length k (Fig. 3a) is a contact pattern which contains k residues such
that subsequent residues lie one residue apart in the sequence. Let D1 contain 3m
combs (one for each element of A) of lengths given by values of s for corresponding

Fig. 3 a A comb of length 6. b An alignment of three combs from D1 to a comb in D2. c If one comb is aligned to two combs, at least one residue remains unaligned

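elements of A. Let D2 contain m combs of length B + 4. Finally, let n be equal to
the number of residues in D1, that is, m(2B + 9) + 5: a comb of length k spans 2k + 3
residues (its contacts and the residues between them, plus two flanking residues at
each end), so the 3m combs contribute 2mB + 9m residues, and the central element adds 5.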
To prove that this reduction is correct, we have to show that D1 and D2 have a
sufficiently large alignment, if and only if there exists a 3-partition of A. Indeed, if
such a partition exists, each subset of A can be mapped to a set of three combs in
D1 , which can be aligned to a comb of length B + 4 in D2 (see Fig. 3b). Conversely,
if there exists an alignment of D1 and D2 , such that all residues in D1 are aligned,
each comb in D1 is aligned to a part of exactly one comb in D2 . If a comb is aligned
to two separate combs, at least one residue in the first comb has to remain unaligned
(see Fig. 3c), which leads to a contradiction.
The presented reduction can be computed in polynomial time with respect to mB,
which is acceptable because 3-PARTITION is strongly NP-complete (it remains
NP-complete regardless of the method of encoding numbers, including unary encoding).
This, and the fact that given an alignment of two descriptors it is possible to
compute its size and RMSD in polynomial time, proves that OAD is NP-complete.
The significance of the presented theorem should be properly understood. The
NP-completeness of aligning two descriptors means that most likely (see footnote 4) any algorithm
will require time exponential in the size of the descriptors being compared.
Due to limits imposed by the physics of protein molecules the size of a descriptor is
of course strictly limited. Descriptors having more than 15 elements do not exist or
are extremely rare. Therefore, computation time is limited by a constant. We have
also omitted structural aspects of the comparison by choosing an infinite RMSD
threshold. This, to the best of our knowledge, was required to formulate a theorem which
could be proven. Encoding the property of being a protein (see footnote 5) in strict mathematical
terms is beyond our capacity. Nevertheless this theorem is useful because it reflects
the fact that without identifying a more complex internal structure for a descriptor
the extensive search is justified.

3 From Local Similarities to Global Alignments

In this section we describe a generic paradigm of computing alignments of protein
structures based on a set of local similarities. We assume that local similarities,
regardless of their particular definition, constitute mappings between the residues
they encompass. As in the previous chapter we apply the term “alignment” to any
such mapping, despite the fact that it may not preserve ordering of residues in their
respective sequences. We also assume that local alignments should be treated as
indivisible and immutable blocks and may only be included in the resulting solution
as a whole. All mappings in the solution have to belong to at least one block included
in it.
By these assumptions we simplify the problem, without risking loss of generality.
In all actual applications one can safely assume that the size of a local similarity

4 Unless P = NP.
5 A protein is a polypeptide which under physiological conditions assumes and maintains a certain
native conformation.

is bounded by a constant. The number of subsimilarities which would have to be
considered if local alignments were divisible is therefore also bounded. The same
applies to the majority of useful mutations which one might care to explore. Thus
the problem with mutable, divisible blocks can be converted to a problem where
blocks are immutable, multiplying their number by a constant factor.
The difficulty in building a global alignment in this paradigm lies in the fact that
local similarities cannot be used in the same alignment, if there exists a residue which
they map differently. An alignment may be built from local alignments, which are
all pairwise consistent. This brings to mind the well known NP-complete clique
finding problem, but a careful analysis taking into account actual properties of local
similarities used is required to correctly assess the computational complexity.
The score of a resulting global alignment may solely depend on its size (quality
being assured by local similarities), or there may be some more complex scoring
function (e.g. RMSD). We examine the case where alignment size is to be maximized
regardless of its quality.
For the purpose of this and subsequent sections let S1 and S2 be certain protein
structures, and Φ be the set of local similarities represented by mappings of residues.
Let ξ : S1 → S2 be a mapping of residues. A support of ξ in Φ (Supp(ξ, Φ)) is
a subset of mappings from Φ which are completely included in ξ .

Definition 3 An alignment of S1 and S2 with support in Φ is a mapping of S1 and
S2, which has a support in Φ covering all its residues.

According to this definition, every element of Φ is an alignment of structures itself.
A support of the alignment contains the set of local similarities, which have to
be combined to build this alignment. They are all consistent, because they are all
completely covered by the same alignment.

Definition 4 For given structures S1 and S2 , set Φ and number k the Optimal Struc-
ture Alignment Problem (OSA) is to determine, whether there exists an alignment
of S1 and S2 with support in Φ covering at least k residues.
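The notions of consistency and support translate directly into operations on dictionaries. In the following sketch, which is an illustration rather than the authors' code, a mapping is a Python `dict` from residues of S1 to residues of S2, Φ is a list of such dictionaries, and all function names are hypothetical. The sketch also assumes that a mapping must be one-to-one, so two blocks sending different residues to the same target are treated as inconsistent.

```python
def is_consistent(f, g):
    """Two mappings are consistent if their union is still a one-to-one mapping."""
    merged = dict(f)
    for a, b in g.items():
        if merged.setdefault(a, b) != b:
            return False          # a residue is mapped differently by f and g
    return len(set(merged.values())) == len(merged)

def support(xi, phi):
    """Local similarities from phi that are completely included in xi."""
    return [f for f in phi if all(xi.get(a) == b for a, b in f.items())]

def has_support(xi, phi):
    """True if every aligned pair of xi belongs to some block of its support."""
    covered = set()
    for f in support(xi, phi):
        covered.update(f.items())
    return covered == set(xi.items())
```

OSA then asks whether some mapping accepted by `has_support` covers at least k residues; the difficulty lies in the number of candidate mappings, not in verifying any single one.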

3.1 Computational Complexity of Optimal Structure Alignment Problem

Theorem 2 OSA is NP-complete.

Proof To prove that OSA is NP-complete we will provide a reduction of another
NP-complete problem to OSA. There is a well known NP-complete problem called
3-DIMENSIONAL MATCHING (3DM) [15, problem SP16].

Definition 5 For a given set $M \subseteq W \times X \times Y$, where W, X and Y are disjoint sets of
size q, the 3-DIMENSIONAL MATCHING problem is to determine whether there
exists a subset $M' \subseteq M$ such that $|M'| = q$ and the elements of $M'$ are disjoint.

In other words, the task is to choose from a given set of triples (M) a subset in
which every element from sets W , X and Y occurs exactly once. There also exists a
two dimensional version of this problem – 2-DIMENSIONAL MATCHING (2DM,
also called the marriage matching problem), where M ⊆ X × Y. Although very similar,
it can, surprisingly, be solved in polynomial time. We will use a slightly modified
version of this problem.
Definition 6 For given sets $M \subseteq X \times Y$ and $G \subseteq P(M)$ (see footnote 6), where X and Y are
disjoint sets of size q, the RESTRICTED 2-DIMENSIONAL MATCHING (R2DM)
problem is to determine whether there exists a subset $M' \subseteq M$ such that $|M'| = q$,
the elements of $M'$ are disjoint, and there exists $G' \subseteq G$ such that $\bigcup G' = M'$.
R2DM may be viewed as a case of the marriage matching where would-be wives
set conditions of the kind: “I will marry you, if Mr. X marries Ms. W and Mr. Q
marries Ms. S”. Such conditions are encoded as elements of the set G. To prove,
that R2DM is NP-complete, we may use a simple reduction of 3DM. Each triple
from M in 3DM is encoded as two pairs in M and a set containing these pairs in G.
Figure 4 contains an example of such transformation. One can easily establish that
a solution of an instance of R2DM obtained from 3DM can always be converted to
a solution of the original problem. Furthermore, if the original 3DM instance has a
valid solution, the corresponding instance of R2DM is always solvable.
R2DM is very convenient for proving that OSA is intractable. In the conversion
of an R2DM instance to OSA, sets X and Y will correspond to the sets of residues of S1
and S2; set M to the set of all pairs of mapped residues in alignments from Φ; and
finally set G to Φ itself. Elements of G are sets of pairs from M which have to
be picked together. In the case of alignment with support in Φ, each pair of aligned
residues has to belong to a local alignment from Φ. Let S be an element of G, i.e. a set
of pairs. S is converted to an alignment, which for each pair in S maps together residues
corresponding to its elements. A subset $M'$ being a solution of R2DM corresponds
to the alignment covering whole structures. The set $G'$ corresponds to the support of
this alignment in Φ.
To make the reduction possible, local similarities have to allow for sequence
swaps. Otherwise, instances of R2DM for which sets X and Y cannot be ordered
in such a way that all elements in G are ordered on both positions, could not be
converted to an instance of OSA.

3.2 Important Variants of Optimal Structure Alignment Problem and Their Complexity

In the previous section we have proven the intractability of the Optimal Structure
Alignment problem. The proof presented applies to the most generic version of OSA
where local similarities may be any arbitrary mappings between residues. All “real”
approaches known to us employ local similarities having a well defined structure
(e.g. continuous segments, pairs of segments, local descriptors). If we look back to

6 P(M) denotes a power set of M, i.e. a family of all subsets of M.



Fig. 4 a Sample instance of 3DM. b The same instance converted to R2DM. Each triple is converted
to two pairs and an element in set G. Primes are added to element names to comply with the
requirement that sets X and Y are to be disjoint

the proof of NP-completeness of OSA, it is evident that it cannot be applied in the
case of local similarities in the form of single continuous segments. It also remains
to be seen whether OSA would remain intractable if it were restricted to alignments
without permutations.

Definition 7 An Optimal Straight Structure Alignment problem (OSSA) is a variant
of OSA in which an alignment satisfying the size threshold cannot contain segment
swaps.

Theorem 3 If the set of local similarities in an instance of OSSA contains only
matchings of single continuous segments, the computation required to solve OSSA
can be performed in polynomial time.

Proof Cases of this sort can easily be solved with a modified Smith-Waterman
algorithm. Algorithms of this kind, based on dynamic programming, usually have
polynomial complexity. In this particular case a pessimistic estimate of the computation
time is proportional to the product of the numbers of residues in the aligned structures
and the size of Φ, i.e. O(|S1| |S2| |Φ|).

Theorem 3 encourages further questions regarding the intractability of OSSA.
Perhaps it is also easy for more complex elements of Φ.

Theorem 4 A variant of OSSA where the set of local similarities may contain match-
ings of three separate continuous segments is NP-complete.

Proof We will prove NP-completeness by reduction of the popular 3SAT problem
[15, problem LO01].

Definition 8 For a given collection of clauses C on a set of variables U, where each
clause is a disjunction of exactly three literals from U (positive or negative) and
each variable is used exactly three times, the 3SAT problem is to determine whether
there exists a truth assignment for U which satisfies all clauses in C.

3SAT is one of the oldest known NP-complete problems. It is useful in proving
the intractability of problems where one can identify a set of “switches” and a certain
pattern they have to achieve. The fact that each variable appears no more than three
times in the collection of clauses is instrumental to our proof. We begin with the
observation that, without any loss of generality, we may assume that every variable
appears at least once in a positive and negative literal. Otherwise (in the case of only
positive or only negative appearances), such variables can be easily eliminated by
setting their value to true or false respectively. Let k be the number of variables and
l be the number of clauses in a certain instance of 3SAT. Below, we demonstrate a
conversion of such an instance to an instance of OSSA.
Let S1 be a protein structure which contains segments $v_1, v_2, \ldots, v_k, c_1, c_2, \ldots, c_l$;
and S2 a protein structure containing segments $t_1, f_1, t_2, f_2, \ldots, t_k, f_k, c'_1, c'_2, \ldots, c'_l$.
In both these structures segments appear in the order given above. All
segments in S1 are disjoint, whereas in S2 $t_i$ overlaps $f_i$, and all other segment pairs
are disjoint. All segments are of equal length.
Segments $v_i$ correspond to variables, segments $t_i$ and $f_i$ correspond to the assignment
of true and false values to the respective variables, and segments $c_i$ and $c'_i$ correspond to
clauses. For each variable we define two local similarities $\varphi_i$ and $\bar{\varphi}_i$:

$$\varphi_i(s) = \begin{cases} t_i & \text{if } s = v_i \\ c'_p & \text{if } s = c_p \text{ and clause } p \text{ contains a positive appearance of the } i\text{th variable} \end{cases}$$

$$\bar{\varphi}_i(s) = \begin{cases} f_i & \text{if } s = v_i \\ c'_p & \text{if } s = c_p \text{ and clause } p \text{ contains a negative appearance of the } i\text{th variable} \end{cases}$$

Example 1 Let C be a collection of clauses on U = {u 1 , u 2 , u 3 }:

C = {{¬u 1 , u 2 , u 3 } , {u 1 , ¬u 2 , u 3 } , {u 1 , u 2 , ¬u 3 }}

equivalent to the following formula:

(¬u 1 ∨ u 2 ∨ u 3 ) ∧ (u 1 ∨ ¬u 2 ∨ u 3 ) ∧ (u 1 ∨ u 2 ∨ ¬u 3 )

This instance of 3SAT could be converted to the following instance of OSSA (assuming
that all segments are of length 5; ϕ(a) = ⊥ means that a ∉ Dom(ϕ)):

$S_1 = a^{(1)} a^{(2)} \ldots a^{(30)}$, with segments
$v_1 = a^{(1)} \ldots a^{(5)}$, $v_2 = a^{(6)} \ldots a^{(10)}$, $v_3 = a^{(11)} \ldots a^{(15)}$,
$c_1 = a^{(16)} \ldots a^{(20)}$, $c_2 = a^{(21)} \ldots a^{(25)}$, $c_3 = a^{(26)} \ldots a^{(30)}$.

$S_2 = b^{(1)} b^{(2)} \ldots b^{(33)}$, with segments
$t_1 = b^{(1)} \ldots b^{(5)}$, $f_1 = b^{(2)} \ldots b^{(6)}$,
$t_2 = b^{(7)} \ldots b^{(11)}$, $f_2 = b^{(8)} \ldots b^{(12)}$,
$t_3 = b^{(13)} \ldots b^{(17)}$, $f_3 = b^{(14)} \ldots b^{(18)}$,
$c'_1 = b^{(19)} \ldots b^{(23)}$, $c'_2 = b^{(24)} \ldots b^{(28)}$, $c'_3 = b^{(29)} \ldots b^{(33)}$.

Each local similarity maps whole segments residue by residue; every residue of S1 in a segment not listed is mapped to ⊥:

$\varphi_1$: $v_1 \mapsto t_1$, $c_2 \mapsto c'_2$, $c_3 \mapsto c'_3$
$\bar{\varphi}_1$: $v_1 \mapsto f_1$, $c_1 \mapsto c'_1$
$\varphi_2$: $v_2 \mapsto t_2$, $c_1 \mapsto c'_1$, $c_3 \mapsto c'_3$
$\bar{\varphi}_2$: $v_2 \mapsto f_2$, $c_2 \mapsto c'_2$
$\varphi_3$: $v_3 \mapsto t_3$, $c_1 \mapsto c'_1$, $c_2 \mapsto c'_2$
$\bar{\varphi}_3$: $v_3 \mapsto f_3$, $c_3 \mapsto c'_3$

All clauses are satisfied by the assignment:

u 1 → 0, u 2 → 1, u 3 → 1

Therefore, there exists an alignment with support $\{\bar{\varphi}_1, \varphi_2, \varphi_3\}$:

$\xi$: $v_1 \mapsto f_1 = b^{(2)} \ldots b^{(6)}$, $v_2 \mapsto t_2 = b^{(7)} \ldots b^{(11)}$, $v_3 \mapsto t_3 = b^{(13)} \ldots b^{(17)}$,
$c_1 \mapsto c'_1 = b^{(19)} \ldots b^{(23)}$, $c_2 \mapsto c'_2 = b^{(24)} \ldots b^{(28)}$, $c_3 \mapsto c'_3 = b^{(29)} \ldots b^{(33)}$

$\xi$ is a straight alignment of size 30.

An alignment covering at least (k + l)r residues, where r is the length of segments,
corresponds to an assignment for which all clauses are true. To prove this, we note
that no alignment with support in $\Phi = \{\varphi_1, \bar{\varphi}_1, \ldots, \varphi_k, \bar{\varphi}_k\}$ can contain both $\varphi_i$ and
$\bar{\varphi}_i$, because segment $v_i$ cannot be aligned to both $t_i$ and $f_i$ at the same time. This, in
the context of the 3SAT problem, establishes that in no assignment can a variable be
both true and false. If an alignment is supposed to cover (k + l)r residues it has to
cover all segments in S1. This means that all variables will have an assignment.
Each clause is a disjunction of three literals. It is sufficient for one of them to
be true for the clause to be satisfied. In our reduction this is expressed by
aligning segments $c_p$ and $c'_p$. All clauses have to be satisfied for the alignment to be
sufficiently large.
The remaining technical details are omitted for reasons of brevity.
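The correspondence between truth assignments and full-size alignments can be made concrete at the level of whole segments. The sketch below is an illustration under simplifying assumptions, not part of the original proof: segments are represented by hypothetical string labels (`"v1"`, `"t1"`, `"c'1"`, ...) rather than by residues, literals are encoded as signed integers (+i for the ith variable, −i for its negation), and the search over supports is a plain exhaustive enumeration.

```python
from itertools import product

def build_similarities(clauses, k):
    """Build the 2k local similarities as segment-level mappings."""
    phi = {}
    for i in range(1, k + 1):
        pos = {f"c{p}": f"c'{p}" for p, cl in enumerate(clauses, 1) if i in cl}
        neg = {f"c{p}": f"c'{p}" for p, cl in enumerate(clauses, 1) if -i in cl}
        phi[(i, True)] = {f"v{i}": f"t{i}", **pos}    # variable i set to true
        phi[(i, False)] = {f"v{i}": f"f{i}", **neg}   # variable i set to false
    return phi

def full_alignment(clauses, k, r=5):
    """Return (assignment, size) for a support covering all k + l segments, or None."""
    phi = build_similarities(clauses, k)
    l = len(clauses)
    # Exactly one of the two similarities per variable can be used.
    for values in product([False, True], repeat=k):
        xi = {}
        for i, val in enumerate(values, 1):
            xi.update(phi[(i, val)])
        if len(xi) == k + l:          # every v_i and every c_p is aligned
            return dict(enumerate(values, 1)), len(xi) * r
    return None
```

For the instance of Example 1 the support consisting of the similarities for u1 = false, u2 = true and u3 = true aligns all six segments, in agreement with the alignment ξ of size 30 shown above; the search may return a different satisfying assignment, since several exist.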

To conclude we demonstrate that OSA is intractable if local similarities are
allowed to comprise at least two continuous segments.

Theorem 5 A variant of OSA where the set of local similarities may contain match-
ings of two separate continuous segments is NP-complete.

Proof To prove this theorem it suffices to note that a variant of R2DM in which each
element of the set G contains exactly two pairs from M is NP-complete. This follows
directly from the reduction we have used to prove the intractability of R2DM. All
instances of R2DM resulting from it have this feature. Therefore, if we assume that
structures are divided into non-overlapping segments of the same length, any local
similarity in the reduction from R2DM to OSA will consist of no more than two
segments.

This series of theorems is briefly summed up in Fig. 5. Computing straight alignments
is easier than computing alignments with permutations. Nevertheless, in the
case of moderately complex local similarities with three non-overlapping segments,
both problems are intractable. The case of OSSA with 2-segment similarities is par-
ticularly interesting since it is the basis of a popular structure alignment method DALI
[21]. Unfortunately, we are not aware of any results concerning its computational
complexity.
OSA and OSSA are theoretical approximations of actual problems. As in the
previous section, we have disregarded the fact that protein structures have several
properties, which are hard or impossible to describe with a mathematical formalism.
Therefore, results shown here may not reflect the actual complexity of these problems
applied to real proteins. Nevertheless, they are useful, since if the fact of a polypeptide
being a protein cannot be described formally, it cannot be used in the process of
designing and proving properties of an algorithm. Knowledge that a certain variant
of OSA is NP-complete indicates that it is very unlikely for an accurate, polynomial
time algorithm to exist and thus favors the application of extensive search or heuristic
solutions.
Among numerous simplifications, we have also disregarded the issue of assessing
alignment quality. Introducing a restriction that an alignment has to satisfy some
quality requirement (e.g. RMSD) apart from the size criteria, doesn’t necessarily
make the problem harder. For example, if aligned parts were to have identical shapes

Fig. 5 Computational complexity of OSA and OSSA depending on the maximal number of seg-
ments in local similarities s

(RMSD equal to zero), only local similarities with zero RMSD could be used, thereby
drastically limiting the size of set Φ. Nevertheless, theoretical complexity, which
should normally be assessed for all threshold values, would not change.

3.3 Solving Optimal Structure Alignment Problem with Local Descriptors

Local descriptors are particularly useful in computing protein structure alignments
[10]. Local similarities defined by pairs of descriptors are usually significant, so that
alignments with support in the set of such similarities do not require further verifi-
cation of quality. Therefore, computing an alignment of structures using descriptors
amounts to solving an instance of OSA. Since the problem is NP-complete, there
is no disadvantage in applying formalism and experience from another well-known
NP-complete problem.
All similarities in the support of an alignment are required to be consistent. Con-
versely, if all local similarities in a certain set are consistent with each other, such a
set is a support of some valid alignment. The consistency of local similarities can be
described by a graph, with nodes representing similarities. Nodes in this graph are
connected by an edge, if corresponding similarities are consistent. A clique7 in such
a graph may be interpreted as a valid alignment between the structures. As long as
the function used to score the alignments does not decrease with the clique growth,
maximal alignments can be found by looking for the maximal cliques.
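The consistency graph can be built directly from the local similarities. The following sketch is illustrative only: each similarity is assumed to be a `dict` from residues of one structure to residues of the other, graph nodes are the list indices of the similarities, mappings are assumed to be one-to-one, and the function names are hypothetical.

```python
from itertools import combinations

def consistent(f, g):
    """True if the union of f and g is still a one-to-one mapping."""
    merged = dict(f)
    for a, b in g.items():
        if merged.setdefault(a, b) != b:
            return False
    return len(set(merged.values())) == len(merged)

def consistency_graph(phi):
    """Adjacency sets: nodes i and j are connected when phi[i] and phi[j] are consistent."""
    adj = {i: set() for i in range(len(phi))}
    for i, j in combinations(range(len(phi)), 2):
        if consistent(phi[i], phi[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def clique_to_alignment(clique, phi):
    """Merge the similarities of a clique into a single residue mapping."""
    xi = {}
    for i in clique:
        xi.update(phi[i])
    return xi
```

Pairwise consistency is sufficient here because any conflict in a merged mapping involves only two aligned pairs, and hence at most two similarities.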
Accurate Solution – Extensive Search
Clique searching algorithms are usually designed to find the largest clique in terms
of the number of nodes. In solving the OSA, this might not be enough. The goal is
to find an alignment covering as many residues as possible. The largest clique does
not necessarily have this property, since local similarities in a clique usually overlap,
contributing to the “thickness” of coverage instead of its “breadth”. Most likely the
“thickest” coverages should correspond to the best alignments; nevertheless, an algorithm
that guarantees the optimal solution has to compute and assess all maximal cliques.
Applying a branch-and-bound strategy to build all possible cliques, while preserv-
ing a required number of the highest scoring alignments, is a viable solution. Each
node either belongs to the clique or not. Such decisions can be made separately for
each node in an arbitrary order assuming that choices which would violate the clique
condition are disallowed. Thus, building all maximal cliques can be performed by
traversing a decision tree in which nodes at the kth level correspond to the decision
of including the kth graph node in the subset. It is well known, that such an approach
can be vastly improved by employing a branch-and-bound strategy. In order to make
this computation feasible we introduced two optimizations (cuts).

7 Clique in a graph is a subset of nodes such that every two nodes in the subset are connected by an
edge.

If a clique in a given branch can be unequivocally expanded with a previously
rejected node, it is abandoned, because it does not contain any maximal cliques
(maximal cliques containing those abandoned belong to another branch of a tree).
This ensures that only maximal cliques are obtained and each is constructed precisely
once.
Since the goal is to compute the largest alignments, branches which do not con-
tain cliques corresponding to sufficiently large alignments may be abandoned. The
lower bound of the size of alignments of interest may be given as a parameter or
gradually increased as alignments are being computed. An upper bound of the size
of a maximal alignment in a given branch can be computed as a sum of the size of
the alignment being constructed, and a number of residues outside this alignment
covered by descriptor pairs which are yet to be considered. This is not an exact size
of the best solution in the branch, because some descriptor pairs are contradictory
and cannot be combined in one alignment, but still such an upper bound is frequently
low enough to discard significant portions of a decision tree.
This strategy can be modified to search for alignments which have a certain
property. For example, in order to find optimal alignments comprising only one
structurally continuous fragment a variant may be used which extends the clique only
if the subalignment which is being added has common residues with the alignment
being extended.
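The two cuts can be illustrated with a compact enumeration of maximal cliques. The sketch below does not reproduce the authors' implementation: it uses a standard Bron-Kerbosch-style recursion (without pivoting) as a stand-in for the decision tree, `adj` is assumed to map each node to the set of its neighbours, `cover` is assumed to map each node to the set of aligned residue pairs it contributes, and all names are hypothetical.

```python
def best_cliques(adj, cover, min_size=0):
    """Return (size, cliques): the widest coverage and the maximal cliques achieving it."""
    best = {"size": min_size, "cliques": []}

    def expand(clique, covered, candidates, rejected):
        # Cut 1: a rejected node compatible with every remaining candidate could
        # extend any clique found in this branch, so none of them is maximal.
        if any(candidates <= adj[x] for x in rejected):
            return
        # Cut 2: upper bound on coverage reachable in this branch.
        reachable = set(covered)
        for v in candidates:
            reachable |= cover[v]
        if len(reachable) < best["size"]:
            return
        if not candidates:
            if len(covered) > best["size"]:
                best["size"], best["cliques"] = len(covered), []
            best["cliques"].append(frozenset(clique))
            return
        for v in sorted(candidates):
            expand(clique | {v}, covered | cover[v],
                   candidates & adj[v], rejected & adj[v])
            candidates = candidates - {v}   # v is now decided: excluded
            rejected = rejected | {v}

    expand(set(), set(), set(adj), set())
    return best["size"], best["cliques"]
```

The bound in the second cut ignores conflicts among the remaining candidates, so it overestimates the attainable coverage, exactly as noted above; it is nevertheless safe, because a branch is discarded only when even this optimistic estimate falls short of the best size found so far.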
Monte-Carlo Approximation
In certain cases extensive search even with the cuts described above is infeasible.
This is caused by a large number of suboptimal alignments which cannot be pruned
from the decision tree. Large structures with a high degree of self-similarity (i.e.
recurring structural motifs) are especially affected. Nevertheless, in these cases cor-
rect alignments are most likely easily identifiable by visual inspection. Therefore,
one might speculate it should be possible to easily detect them without a system-
atic search of the overwhelming solution space. Monte-Carlo methods [34] have a
huge potential in finding most probable states of complex systems. In particular, a
widely recognized Replica Exchange Monte Carlo algorithm [48] can be used to
search for high score alignments. Here we will mention the algorithm for generating
transitions between states, and the energy function. Let Cn = {d1 , d2 , . . . , dn } be the
clique defining a state at the nth step. The clique Cn+1 describing the state in the next
step is generated as follows:
1. randomly pick a graph node d not belonging to Cn ,
2. take a set Cn+1 containing d and elements from Cn which are connected to d
(one sees it is a clique),
3. if there are graph nodes which belong to every maximal clique containing Cn+1 ,
add them to Cn+1 .
Such parameters as number of steps, number of replicas, their temperatures, and
exchange frequency should be adjusted to reproduce accurate results in the shortest
time.
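The transition rule (steps 1 to 3) can be sketched as a single proposal function. This is an illustration under stated assumptions: `adj` maps each graph node to the set of its neighbours, `rng` is any object with a `choice` method (the standard `random` module by default), and the acceptance test and energy function of the Replica Exchange scheme are deliberately left out. Step 3 relies on the observation that a node belongs to every maximal clique containing the new clique exactly when it is adjacent to all members of that clique and to all of their other common neighbours.

```python
import random

def propose(clique, adj, rng=random):
    """Generate the next clique from the current one (steps 1 to 3)."""
    outside = sorted(set(adj) - clique)
    if not outside:
        return set(clique)
    d = rng.choice(outside)                 # step 1: pick a node outside the clique
    new = {d} | (clique & adj[d])           # step 2: keep only members connected to d
    # Common neighbours of all members of the new clique.
    common = set(adj) - new
    for v in new:
        common &= adj[v]
    # Step 3: nodes adjacent to every other common neighbour are forced.
    forced = {w for w in common if common - {w} <= adj[w]}
    return new | forced
```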

4 Alignments of Multiple Structures

In contrast to alignments of pairs of structures, which can be defined as mappings
between respective sets of residues, alignments of multiple structures do not have one
canonical definition. Intuitively multiple alignment is some sort of correspondence
between residues in several structures. The fundamental question is whether such
correspondence is transitive, i.e. if residue a is structurally equivalent to b, which
is equivalent to c, does it imply that a is equivalent to c. There are three common
strategies which are used to compute a score of the multiple alignment of sequences:
• sum of pairs (SP-score),
• star alignment,
• alignment according to a given phylogenetic tree (tree alignment).
Each of the above has different properties and assumptions. Star alignment, which
aims at finding a sequence most similar to a given set (which may be viewed as their
average) may be interpreted in the context of structure alignment as searching for
a core common to all structures. Maximization of the SP-score, on the other hand,
may be understood as searching for similarities within all subsets of structures. Tree
alignment is somewhat in between and can be applied, if there exists a hypothetical
phylogenetic hierarchy based on some external knowledge. No matter which of these
three strategies is used, the problem of multiple alignment of sequences is NP-hard
[14].
When comparing protein structures it is easier to detect a conserved core common
to all structures (if only it exists) than to identify all similarities. Nevertheless, we
focus on the sum of pairs strategy, since it gives more complete information. We
assume that a multiple alignment can be described with a set of alignments of all
pairs of structures. Naturally, not every set of pairwise alignments describes a multiple
alignment. We set out to establish the necessary and sufficient condition for this.

4.1 Optimal Structural Multiple Alignment Problem

We will define a multiple alignment in a similar fashion to a pairwise alignment.
First, we develop a notion of multiple mapping. Let S = {S1 , S2 , . . . , S N } be a
set of structures. A multiple mapping of structures in S is a symmetric, reflexive
relation between all residues belonging to structures in S in which under no
circumstances can two different residues in one structure be transitively aligned with
each other. This property is crucial, since it distinguishes the problem of computing
an optimal multiple alignment from computing optimal alignments between all pairs
of structures. It may happen that residue a1 ∈ S1 is mapped to b ∈ S2 in the optimal
alignment between these structures, b ∈ S2 is mapped to c ∈ S3 , which in turn is
equivalent to a2 ∈ S1 . In such cases pairwise alignments cannot be directly merged
into a valid multiple alignment, because both a1 and a2 would have to be aligned with
b and c.
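The transitivity conflict described above can be detected mechanically. The following is a minimal sketch, not the authors' implementation: it assumes that residues are identified by hypothetical (structure, index) tuples and that the pairwise alignments are given as a flat list of aligned residue pairs. Aligned residues are merged into equivalence classes with a union-find structure, and a conflict is reported when one class contains two different residues of the same structure.

```python
def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def is_consistent(aligned_pairs):
    """Check whether pairwise alignments can induce a multiple mapping.

    aligned_pairs: iterable of ((structure, index), (structure, index)).
    Returns (True, None), or (False, (res1, res2)) where res1 and res2 are
    two residues of one structure that are transitively aligned.
    """
    parent = {}
    for a, b in aligned_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb

    seen = {}  # (class representative, structure) -> residue
    for res in parent:
        key = (find(parent, res), res[0])
        if key in seen and seen[key] != res:
            return False, (seen[key], res)
        seen[key] = res
    return True, None


# The situation from the text: a1 -> b -> c -> a2 with a1 != a2.
pairs = [(("S1", 1), ("S2", 1)),
         (("S2", 1), ("S3", 1)),
         (("S3", 1), ("S1", 2))]
ok, conflict = is_consistent(pairs)
```

Detecting a conflict in this way is cheap; the difficulty discussed in the remainder of this section lies in choosing which local similarities to drop so that the remaining alignment is as large as possible.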
620 P. Daniluk and B. Lesyng

Definition 9 A multiple alignment of structures in set S with support in Φ is a
multiple mapping of structures in S , such that pairwise alignments derived from
it for each pair of structures in S have support in Φ covering all their residues.
We assert that the aforementioned set of pairwise alignments induces the multiple
alignment.
This definition is a direct generalization of Definition 3. It is easy to see that if S
contains two structures, their multiple alignment is equivalent to pairwise alignment.8
For a given multiple alignment there exists exactly one set of pairwise alignments
inducing it. If a set of pairwise alignments induces a multiple alignment, we will
say that it is consistent. The size of a multiple alignment is the average number of
residues covered by the pairwise alignments which induce it.
Definition 10 For a given set of structures S , set Φ, and number k the Optimal
Structure Multiple Alignment Problem (OSMA) is to determine whether there exists
a multiple alignment in S with support in Φ of size no less than k.

4.2 Analysis of Computational Complexity

Theorem 6 Optimal Structure Multiple Alignment Problem is NP-complete.


Proof It is easy to see that OSMA contains the problem of computing optimal
pairwise alignments (OSA). Therefore it is NP-hard. It belongs to the NP class
because computing the size of a given multiple alignment can be readily performed
in polynomial time.
Theorem 6 is rather obvious and unfortunately gives little insight into the complexity
of the problem. We have already established that although OSA is NP-complete
it can be effectively solved either by accurate algorithms or Monte-Carlo
approximations. If computing multiple alignments were not harder than OSA, most likely
we would be able to use similar approaches to solve it. On the other hand, computing
multiple sequence alignments is NP-complete, although pairwise sequence
alignments can be performed in polynomial time. Intuitively, one might argue that
“multiplicity” of the alignment introduces intractability. If this were the case,
computing multiple structural alignments would require a different approach. It may seem
that assuming that OSA belongs to P and analyzing the complexity of OSMA under
this assumption would be the simplest way to check this hypothesis. Unfortunately,
assuming that a certain NP-complete problem can be solved in polynomial time is
equivalent to assuming that P equals NP (i.e. all NP problems can be solved in
polynomial time). Therefore, we must analyze the complexity of such variants of OSMA
for which all relevant OSA can be solved in constant time.

8 We deliberately skip over the fact that multiple alignment is a relation while a pairwise alignment
is a function. The property of being a multiple alignment guarantees that it can be converted to a
function in a trivial way.
Theorem 7 For any value of k, a variant of OSMA in which the number of local
similarities in Φ for each pair of structures does not exceed k is NP-complete.
Before we prove this theorem, let us note that if the size of set Φ in OSA is limited by
a constant k, then the number of possible alignments with support in Φ is limited by
2^k . This means that an exhaustive search algorithm for finding an optimal alignment
will have a computational complexity of O(2^k ) = O(1) (because k is constant). In
layman’s terms, Theorem 7 establishes that the pessimistic computation time for OSMA
is exponential with respect to the number of structures.9
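The 2^k bound can be illustrated with a brute-force search over all subsets of Φ. This is only a sketch under simplifying assumptions: each local similarity is represented as a hypothetical Python dict mapping residue indices of one structure to residue indices of the other, a subset of Φ is accepted when the union of its mappings is one-to-one, and the size of an alignment is taken to be the number of residue pairs it contains.

```python
from itertools import combinations


def merge(similarities):
    """Union of partial residue mappings, or None if it is not one-to-one."""
    fwd, bwd = {}, {}
    for sim in similarities:
        for a, b in sim.items():
            if fwd.get(a, b) != b or bwd.get(b, a) != a:
                return None  # a or b is already aligned to something else
            fwd[a], bwd[b] = b, a
    return fwd


def best_alignment(phi):
    """Exhaustive search: inspects all 2**len(phi) subsets of phi."""
    best = {}
    for r in range(len(phi) + 1):
        for subset in combinations(phi, r):
            mapping = merge(subset)
            if mapping is not None and len(mapping) > len(best):
                best = mapping
    return best


# Three local similarities; the first two are mutually exclusive.
phi = [{1: 1, 2: 2}, {2: 3, 3: 4}, {3: 3}]
result = best_alignment(phi)
```

For a fixed bound on the size of Φ the loop runs a constant number of times, which is the sense in which OSA becomes a constant-time problem in the restricted variant considered in Theorem 7.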
Proof We once more use 3SAT (see Definition 8). Let U = {u_1 , . . . , u_k }, C =
{C_1 , . . . , C_l } be an instance of 3SAT. We will construct three sets of structures
corresponding to: variables (set V ), clauses (set K ) and assignment of values (set
L):
(1) $V = \{V_1, \ldots, V_k\}$, where $V_i = a_1^{V_i} a_2^{V_i} \ldots a_{19}^{V_i} a_{20}^{V_i}$ (footnote 10)
(2) $K = \{K_1, \ldots, K_l\}$, where $K_i = a_1^{K_i} a_2^{K_i} \ldots a_{14}^{K_i} a_{15}^{K_i}$
(3) $L = \{L_0\}$, where $L_0 = a_1^{L_0} a_2^{L_0} \ldots a_{20}^{L_0} a_{21}^{L_0}$
To simplify the notation we will define the following segments:
(1) $v_i = a_1^{V_i} \ldots a_5^{V_i}$, $\bar{v}_i = a_6^{V_i} \ldots a_{20}^{V_i}$
(2) $k_{i1} = a_1^{K_i} \ldots a_5^{K_i}$, $k_{i2} = a_6^{K_i} \ldots a_{10}^{K_i}$, $k_{i3} = a_{11}^{K_i} \ldots a_{15}^{K_i}$
(3) $t = a_1^{L_0} \ldots a_5^{L_0}$, $f = a_2^{L_0} \ldots a_6^{L_0}$, $\bar{l} = a_7^{L_0} \ldots a_{21}^{L_0}$
Having structures defined we construct four sets of local similarities corresponding
to: assignment of values to variables (sets Φ^T and Φ^F ), occurrences of variables in
clauses (set Φ^K ) and kinds of literals (positive vs. negative) occurring in clauses (set
Φ^L ):
(1) $\Phi^T = \left\{ \varphi_i^T : V_i \to L_0 \;\middle|\; 1 \le i \le k \,\wedge\, \varphi_i^T(v_i) = t \,\wedge\, \varphi_i^T(\bar{v}_i) = \bar{l} \right\}$
(2) $\Phi^F = \left\{ \varphi_i^F : V_i \to L_0 \;\middle|\; 1 \le i \le k \,\wedge\, \varphi_i^F(v_i) = f \,\wedge\, \varphi_i^F(\bar{v}_i) = \bar{l} \right\}$
(3) $\Phi^K = \left\{ \varphi_{ij}^K : K_i \to V_p \;\middle|\; 1 \le i \le l \,\wedge\, 1 \le j \le 3 \,\wedge\, 1 \le p \le k \,\wedge\, \varphi_{ij}^K(k_{ij}) = v_p \,\wedge\, u_p \text{ or } \neg u_p \text{ occurs at the } j\text{th position in clause } C_i \right\}$
(4) $\Phi^L = \left\{ \varphi_{ij}^L : K_i \to L_0 \;\middle|\; 1 \le i \le l \,\wedge\, 1 \le j \le 3 \,\wedge\, \varphi_{ij}^L(k_{ij}) = \begin{cases} t & j\text{th position in clause } C_i \text{ is a positive literal} \\ f & j\text{th position in clause } C_i \text{ is a negative literal} \end{cases} \right\}$
Finally, a set of local similarities Φ is the union of the following:

$\Phi = (\Phi^T \cup \Phi^F \cup \Phi^K \cup \Phi^L) \cup (\Phi^T \cup \Phi^F \cup \Phi^K \cup \Phi^L)^{-1}$

where Φ^{−1} denotes the set of alignments inverse to those in Φ.11

9 Unless P = NP.
10 In this proof we abstain from giving residue numbers in upper index.
11 If ϕ : A → B is an alignment between structures A and B, an inverse alignment is a function
ϕ^{−1} : B → A derived from the same mapping of residues.



An assignment of values which satisfies all clauses exists if and only if there
exists an alignment of $S = \{L_0, V_1, \ldots, V_k, K_1, \ldots, K_l\}$ with support in Φ of size
$\frac{2(20k+10l)}{(k+l)(k+l+1)}$. Furthermore, if Φ is a support of such an alignment, for every i either
$\varphi_i^T \in \Phi$ or $\varphi_i^F \in \Phi$ and an assignment:

$u_i \mapsto \begin{cases} 1 & \varphi_i^T \in \Phi \\ 0 & \varphi_i^F \in \Phi \end{cases}$

satisfies all clauses.
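The target size used in the reduction follows from the definition of the size of a multiple alignment: a solution covers 20 residues for each of the k variables and 10 residues for each of the l clauses, and this total is averaged over the (k+l)(k+l+1)/2 pairs of structures. The sketch below evaluates this expression exactly and checks a truth assignment against a 3SAT instance. The function names and the encoding of literals as signed integers (i for u_i, −i for ¬u_i) are assumptions made for illustration only.

```python
from fractions import Fraction


def target_size(k, l):
    """Required alignment size for k variables and l clauses."""
    covered = 20 * k + 10 * l              # residues covered by a solution
    pairs = Fraction((k + l) * (k + l + 1), 2)  # pairs among k + l + 1 structures
    return covered / pairs                 # equals 2(20k+10l)/((k+l)(k+l+1))


def satisfies(clauses, assignment):
    """clauses: lists of signed ints; assignment: dict variable -> bool."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses
    )


# A hypothetical instance with three variables and three clauses.
clauses = [[-1, 2, 3], [1, -2, 3], [1, 2, -3]]
assignment = {1: False, 2: True, 3: True}
size = target_size(3, 3)
is_solution = satisfies(clauses, assignment)
```

Using exact fractions avoids rounding when the computed size of a candidate alignment is compared with the threshold.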

Example 2 Let C be a set of clauses over U = {u 1 , u 2 , u 3 }:

C = {{¬u 1 , u 2 , u 3 } , {u 1 , ¬u 2 , u 3 } , {u 1 , u 2 , ¬u 3 }}

corresponding to the formula:

(¬u 1 ∨ u 2 ∨ u 3 ) ∧ (u 1 ∨ ¬u 2 ∨ u 3 ) ∧ (u 1 ∨ u 2 ∨ ¬u 3 )

The reduction presented above yields the following instance of OSMA:

$L_0 = a_1^{L_0} a_2^{L_0} \ldots a_{21}^{L_0}$, with $t = a_1^{L_0} \ldots a_5^{L_0}$, $f = a_2^{L_0} \ldots a_6^{L_0}$, $\bar{l} = a_7^{L_0} \ldots a_{21}^{L_0}$

$V_1 = \underbrace{a_1^{V_1} \ldots a_5^{V_1}}_{v_1}\ \underbrace{a_6^{V_1} \ldots a_{20}^{V_1}}_{\bar{v}_1}$

$V_2 = \underbrace{a_1^{V_2} \ldots a_5^{V_2}}_{v_2}\ \underbrace{a_6^{V_2} \ldots a_{20}^{V_2}}_{\bar{v}_2}$

$V_3 = \underbrace{a_1^{V_3} \ldots a_5^{V_3}}_{v_3}\ \underbrace{a_6^{V_3} \ldots a_{20}^{V_3}}_{\bar{v}_3}$

$K_1 = \underbrace{a_1^{K_1} \ldots a_5^{K_1}}_{k_{1,1}}\ \underbrace{a_6^{K_1} \ldots a_{10}^{K_1}}_{k_{1,2}}\ \underbrace{a_{11}^{K_1} \ldots a_{15}^{K_1}}_{k_{1,3}}$

$K_2 = \underbrace{a_1^{K_2} \ldots a_5^{K_2}}_{k_{2,1}}\ \underbrace{a_6^{K_2} \ldots a_{10}^{K_2}}_{k_{2,2}}\ \underbrace{a_{11}^{K_2} \ldots a_{15}^{K_2}}_{k_{2,3}}$

$K_3 = \underbrace{a_1^{K_3} \ldots a_5^{K_3}}_{k_{3,1}}\ \underbrace{a_6^{K_3} \ldots a_{10}^{K_3}}_{k_{3,2}}\ \underbrace{a_{11}^{K_3} \ldots a_{15}^{K_3}}_{k_{3,3}}$
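$\varphi_i^T(e) = \begin{cases} t & e = v_i \\ \bar{l} & e = \bar{v}_i \end{cases}$ for i = 1, 2, 3

$\varphi_i^F(e) = \begin{cases} f & e = v_i \\ \bar{l} & e = \bar{v}_i \end{cases}$ for i = 1, 2, 3

$\varphi_{1,1}^K(e) = v_1,\ e = k_{1,1}$;  $\varphi_{1,2}^K(e) = v_2,\ e = k_{1,2}$;  $\varphi_{1,3}^K(e) = v_3,\ e = k_{1,3}$
$\varphi_{2,1}^K(e) = v_1,\ e = k_{2,1}$;  $\varphi_{2,2}^K(e) = v_2,\ e = k_{2,2}$;  $\varphi_{2,3}^K(e) = v_3,\ e = k_{2,3}$
$\varphi_{3,1}^K(e) = v_1,\ e = k_{3,1}$;  $\varphi_{3,2}^K(e) = v_2,\ e = k_{3,2}$;  $\varphi_{3,3}^K(e) = v_3,\ e = k_{3,3}$

$\varphi_{1,1}^L(e) = f,\ e = k_{1,1}$;  $\varphi_{1,2}^L(e) = t,\ e = k_{1,2}$;  $\varphi_{1,3}^L(e) = t,\ e = k_{1,3}$
$\varphi_{2,1}^L(e) = t,\ e = k_{2,1}$;  $\varphi_{2,2}^L(e) = f,\ e = k_{2,2}$;  $\varphi_{2,3}^L(e) = t,\ e = k_{2,3}$
$\varphi_{3,1}^L(e) = t,\ e = k_{3,1}$;  $\varphi_{3,2}^L(e) = t,\ e = k_{3,2}$;  $\varphi_{3,3}^L(e) = f,\ e = k_{3,3}$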

The following assignment satisfies all clauses:

u 1 → 0, u 2 → 1, u 3 → 1

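Therefore, there exists an alignment with support containing $\varphi_1^F$, $\varphi_2^T$, $\varphi_3^T$ and size
$\frac{2(20 \cdot 3 + 10 \cdot 3)}{(3+3)(3+3+1)} = \frac{180}{42}$. The support of this alignment also contains $\varphi_{1,1}^K$, $\varphi_{2,3}^K$, $\varphi_{3,2}^K$,
$\varphi_{1,1}^L$, $\varphi_{2,3}^L$, $\varphi_{3,2}^L$, and an alignment is induced by the following (see also Fig. 6)
($\varphi(a) = \bot$ means that $a \notin \mathrm{Dom}(\varphi)$):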

In each line the images are listed in the order of the residues of the argument structure, and braces name the target segments.

$\xi_{V_1 L_0}(V_1) = \underbrace{a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0} a_6^{L_0}}_{f}\ \underbrace{a_7^{L_0} \ldots a_{21}^{L_0}}_{\bar{l}}$

$\xi_{V_2 L_0}(V_2) = \underbrace{a_1^{L_0} a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0}}_{t}\ \underbrace{a_7^{L_0} \ldots a_{21}^{L_0}}_{\bar{l}}$

$\xi_{V_3 L_0}(V_3) = \underbrace{a_1^{L_0} a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0}}_{t}\ \underbrace{a_7^{L_0} \ldots a_{21}^{L_0}}_{\bar{l}}$

$\xi_{K_1 V_1}(K_1) = \underbrace{a_1^{V_1} a_2^{V_1} a_3^{V_1} a_4^{V_1} a_5^{V_1}}_{v_1}\ \bot \bot \bot \bot \bot\ \bot \bot \bot \bot \bot$

$\xi_{K_2 V_3}(K_2) = \bot \bot \bot \bot \bot\ \bot \bot \bot \bot \bot\ \underbrace{a_1^{V_3} a_2^{V_3} a_3^{V_3} a_4^{V_3} a_5^{V_3}}_{v_3}$

$\xi_{K_3 V_2}(K_3) = \bot \bot \bot \bot \bot\ \underbrace{a_1^{V_2} a_2^{V_2} a_3^{V_2} a_4^{V_2} a_5^{V_2}}_{v_2}\ \bot \bot \bot \bot \bot$

$\xi_{K_1 L_0}(K_1) = \underbrace{a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0} a_6^{L_0}}_{f}\ \bot \bot \bot \bot \bot\ \bot \bot \bot \bot \bot$

$\xi_{K_2 L_0}(K_2) = \bot \bot \bot \bot \bot\ \bot \bot \bot \bot \bot\ \underbrace{a_1^{L_0} a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0}}_{t}$

$\xi_{K_3 L_0}(K_3) = \bot \bot \bot \bot \bot\ \underbrace{a_1^{L_0} a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0}}_{t}\ \bot \bot \bot \bot \bot$

Fig. 6 Multiple alignment of structures corresponding to the solution of an instance of 3SAT in
Example 2

As can be seen in this example, an alignment corresponding to a 3SAT solution
contains an assignment of value to each variable ($\varphi_i^T$ or $\varphi_i^F$ for every i), a selection
of a variable used to satisfy each clause ($\varphi_{ij}^K$ for all i and j = 1, 2, 3), and selection
of one of three literals used to satisfy a clause ($\varphi_{ij}^L$ for all i and j = 1, 2, 3). One can
easily see that for each clause only one of $\varphi_{ij}^L$ can be used. The same applies to $\varphi_{ij}^K$ if all
variables have values assigned. Otherwise each of residues $a_2^{L_0} a_3^{L_0} a_4^{L_0} a_5^{L_0}$ would be
aligned to more than one residue in $K_i$. If an alignment contains local similarities
encoding assignment of value for each variable and this assignment satisfies all
clauses, it can easily be extended with respective local similarities encoding choices
of variables and literals for each clause. Elements of $\Phi^L$ guarantee that a variable
can be used to satisfy a clause only if it is assigned 1 and occurs in a positive literal,
or is assigned 0 and occurs in a negative literal. Such an alignment has the required size.
It remains to be proven that every alignment of size $\frac{2(20k+10l)}{(k+l)(k+l+1)}$ contains either $\varphi_i^T$
or $\varphi_i^F$ for each variable. We begin with an observation that local similarities related
to different variables are independent in a way that they cannot cause contradiction
in an alignment. This means that when searching for an optimal alignment one can
independently deal with sets $\{\varphi_i^T, \varphi_i^F, \varphi_{ip}^K, \varphi_{iq}^K, \varphi_{ir}^K, \varphi_{ps}^L, \varphi_{qt}^L, \varphi_{ru}^L\}$, where p, q, r are
the numbers of clauses containing variable i, and s, t, u are positions of variable i
in these clauses. If an alignment contains neither $\varphi_i^T$ nor $\varphi_i^F$ it may contain
all of $\varphi_{ip}^K$, $\varphi_{iq}^K$, $\varphi_{ir}^K$ (only one similarity from $\Phi^L$ may be picked for each clause).
Similarities $\varphi_{ip}^K$, $\varphi_{iq}^K$, $\varphi_{ir}^K$ contribute $\frac{30}{(k+l)(k+l+1)}$ to the alignment size, while each of
$\varphi_i^T$, $\varphi_i^F$ contributes $\frac{40}{(k+l)(k+l+1)}$. Therefore any alignment which does not contain
assignments of value for all variables is suboptimal. This concludes the proof and
explains the introduction of seemingly unnecessary segments $\bar{v}_i$ and $\bar{l}$. The remaining
technical details have been left to the reader.
With the theorem above we have established, that (provided P is not equal to
NP) all algorithms performing multiple alignment of structures have exponential
computational complexity with respect to the number of structures. This is due to the
fact that although every multiple alignment can be described with a set of pairwise
alignments, not every set of pairwise alignments induces a proper multiple alignment.
Conflicts that prevent inducing a multiple alignment may involve alignments of any
number of structures and thus cannot be efficiently resolved by performing clean-ups
on subsets of structures. The last theorem in this section formalizes this observation.
Let us assume that S is divided into two subsets, S1 and S2 , and we have already
calculated optimal multi-alignments of these sets. We will consider computing a
multiple alignment of S which contains alignments of S1 and S2 .

Definition 11 For a given set of structures S , multiple alignments of its disjoint
subsets S1 and S2 (S1 ∪ S2 = S ), set Φ and number k, the Optimal Alignment of
Multiple Alignments problem (OAMA) is to determine whether there exists a multiple
alignment in S with support in Φ of size no less than k containing the given alignments
of S1 and S2 .

OAMA is a natural simplification of OSMA. Very often problems requiring
computations on a given set are solved with a “divide and conquer” paradigm in
which data is divided into two or more parts, which are solved independently and
their solutions are merged to achieve a solution for the whole dataset. This step is
repeated recursively. OAMA would occur in the merging stage of such an approach.

Theorem 8 OAMA is NP-complete.

This theorem does not require a proof nor a comment, since OAMA contains OSA.

Theorem 9 OAMA is NP-complete even if its input is extended with optimal align-
ments of all pairs of structures from S1 × S2 .

Due to lack of space we will leave this theorem without proof, since it is similar to
the proof of Theorem 7.
In this section we have shown that computing optimal multiple alignments adds
one more level of intractability to an already difficult problem of computing optimal
alignments of two structures. Conflicts that may occur between alignments of pairs
prevent them from being merged into a multiple alignment. Resolving these conflicts
is intractable by itself.

5 Practical Aspects of Computing Structural Alignments

In this section we describe certain issues which may arise in comparison of protein
structures. The content of this section pertains mostly to application of local descrip-
tors to computing alignments, but may be treated as a collection of tips and tricks
to be used elsewhere. We will begin with a fundamental problem of setting correct
thresholds.

5.1 Dealing with Uncertainty

Setting thresholds is a common practice in all kinds of parameter dependent models
and theories. Threshold values are usually carefully estimated or derived from
theoretical principles. Nevertheless, whenever we are dealing with physical objects,
which are always measured with some uncertainty, values are rounded, and cannot
be treated as indisputable truth. Therefore, it is in principle impossible to set a
threshold which could be used to classify data into discrete categories. Whenever a
value is close to a threshold, classification cannot be trusted. This issue is usually
dismissed, because a mistake in classifying values close to the threshold is rather
insignificant, once we realize that all thresholds are somewhat arbitrary.
However, the situation changes whenever a significant result is dependent on an
apparently insignificant difference. This does happen when two local descriptors
are compared. As described in Sect. 2.2.2, there are two thresholds to be satisfied –
alignment quality (RMSD) and alignment size (number of aligned residues, contacts,
and segments). RMSD does not cause any doubts, since it is a real number and any
uncertainty may be blamed on inaccuracies of coordinates in experimentally obtained
protein structures. Alignment size, on the other hand, is discrete. It may happen that
descriptors are considered dissimilar because one of them lacks an element the other
one may have. This would depend on the thresholds used to define contacts. It is
quite probable that dissimilar descriptors could be made similar by adding a contact
to one of them which just slightly exceeds the threshold.
This issue can be solved using a formalism of rough sets [38]. The idea is to
introduce a third logical value, “maybe”. If distances between residues fall within
contact thresholds by only a small margin, such a contact is flagged as optional. When
descriptors are aligned, optional contacts which have their counterparts in the other
descriptor are treated normally, while non-aligned optional contacts are disregarded
(i.e. not counted in computation of the descriptor size). This is equivalent to giving
such contacts the benefit of the doubt: they might have a counterpart in the other
protein with a distance slightly too large to qualify as a contact.
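The three-valued treatment of contacts can be sketched as follows. This is an illustration rather than the DEDAL implementation: a single distance criterion stands in for the actual contact definition, and the threshold of 6.5 Å and margin of 0.5 Å are hypothetical values chosen only to make the example concrete.

```python
from enum import Enum


class Contact(Enum):
    NO = 0
    MAYBE = 1   # optional contact: satisfies the threshold only narrowly
    YES = 2


def classify(distance, threshold=6.5, margin=0.5):
    """Three-valued contact test (threshold and margin are assumed values)."""
    if distance > threshold:
        return Contact.NO
    if distance > threshold - margin:
        return Contact.MAYBE
    return Contact.YES


def effective_size(contacts, aligned):
    """Number of contacts counted towards the descriptor size.

    contacts: dict contact id -> Contact
    aligned:  set of contact ids that have a counterpart in the other descriptor
    Optional contacts count only when they are aligned.
    """
    return sum(
        1
        for cid, kind in contacts.items()
        if kind is Contact.YES or (kind is Contact.MAYBE and cid in aligned)
    )


contacts = {"c1": classify(5.0), "c2": classify(6.3), "c3": classify(6.4)}
size = effective_size(contacts, aligned={"c1", "c2"})
```

In this example the unaligned optional contact c3 is ignored, so a size threshold expressed as a fraction of the descriptor is evaluated against two contacts instead of three.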
It should be noted that equivalent results cannot be achieved by adjusting thresh-
olds. No matter how conservative or liberal the definition of a contact is, pairs of
residues with distances near to the threshold will always exist and descriptors con-
taining them will be susceptible to the described issue.

5.2 Similarity Measure

5.2.1 Segment Swaps

In Chap. 3 we considered methods of computing alignments with or without segment
swaps. However, we did not focus on the nature of segment swaps. Instead we
assumed that any mapping (with a support in a given set) is a valid alignment. In
some cases it may be useful to restrict the number of swaps to a certain value (e.g.
so-called circular permutations contain exactly one swap).
We say that a pair of aligned residues $(a^{(k)}, b^{(m)})$ is a swap site, if residues $a^{(l)}$
and $b^{(n)}$, which are the lowest numbered residues in the alignment following $a^{(k)}$ and
$b^{(m)}$ in respective structures, are not aligned together (see Fig. 7).
It is easy to compute the number of swap sites in a given alignment. If it exceeds
the preset limit it is possible to find the largest subalignment with a desired number
of swaps.
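Counting swap sites according to the definition above can be sketched as follows. The alignment is represented as a hypothetical dict from residue numbers in the first structure to residue numbers in the second. One boundary case is not fixed by the definition and is handled here by assumption: a pair is counted as a swap site when a following aligned residue exists in the first structure but none exists in the second, and it is not counted when it is the last aligned pair in the first structure. With this convention a circular permutation yields exactly one swap site.

```python
from bisect import bisect_right


def count_swap_sites(alignment):
    """alignment: dict residue number in A -> residue number in B (one-to-one)."""
    a_sorted = sorted(alignment)
    b_sorted = sorted(alignment.values())
    swaps = 0
    for k, m in alignment.items():
        i = bisect_right(a_sorted, k)
        if i == len(a_sorted):
            continue  # no aligned residue follows a(k) in A
        l = a_sorted[i]  # lowest numbered aligned residue after a(k)
        j = bisect_right(b_sorted, m)
        n = b_sorted[j] if j < len(b_sorted) else None  # same for b(m)
        if alignment[l] != n:
            swaps += 1  # a(l) and b(n) are not aligned together
    return swaps


sequential = {1: 1, 2: 2, 3: 4}
circular = {1: 3, 2: 4, 3: 1, 4: 2}
```

Sorting dominates the cost, so the count is obtained in O(n log n) time for n aligned residue pairs.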

5.2.2 Alignment Quality

Usually, finding the best structural alignment is a problem of optimization of two
variables: alignment size and its quality. Root Mean Square distance (RMSD) [24, 25]
is the most popular measure due to its simplicity, as it has a compact mathematical
solution. Other methods (e.g. MaxSub [47]) exist, but did not manage to achieve
popularity. All these methods, however, rely on superimposing aligned residues and assessing
the quality of the superposition based on distances between aligned residues.
Structures are therefore treated as rigid objects. Nevertheless, it is commonly accepted
that proteins are flexible to some extent. In Sect. 1 we have suggested that methods of
aligning protein structures should take such flexibility into account, and thus quality
measures overcoming the rigid-body limitation should be used. The simplest way is
to introduce explicit “hinges” which connect rigid fragments [52].

Fig. 7 Example of a swap site. Axes correspond to residue sequences. Aligned
segments are plotted with diagonal lines
In this section we will present a solution used in the DEDAL method, which was
designed to allow for flexible rearrangements of loosely coupled substructures (e.g.
domains) and small local distortions, while penalizing deformations significantly
changing the arrangement of interactions which stabilize structures. As in the case
of local descriptors, where inter-residue contacts are used to define the structural
neighborhood of a chosen residue, contacts may be used to detect a network of
interactions responsible for the rigidity of a protein structure. One may imagine that
residues in contact are connected with springs, which have to be somewhat extended
or compressed, if these residues were to be superimposed onto their counterparts
in the other structure. Degree of deformations of such springs may be treated as an
indicator of the structural similarity (Fig. 8).
   
Definition 12 An aligned contact is a pair $\left((a^{(k)}, b^{(m)}), (a^{(l)}, b^{(n)})\right)$, such that $a^{(k)}$ is
aligned to $b^{(m)}$, $a^{(l)}$ is aligned to $b^{(n)}$, and at least one of the pairs $(a^{(k)}, a^{(l)})$, $(b^{(m)}, b^{(n)})$
is in contact. We call an aligned contact proper if it exists in both structures.

A distortion of a single aligned contact is called local tension and is expressed as
an RMS distance between descriptor elements:

$\mathrm{tens}\left((a^{(k)}, b^{(m)}), (a^{(l)}, b^{(n)})\right) = \mathrm{RMSD}\left(El(a^{(k)}) \cup El(a^{(l)}),\ El(b^{(m)}) \cup El(b^{(n)})\right)$

Fig. 8 Similar structures (ASTRAL domains a d1d5fa_ and b d1nd7a_) comprise two differently
arranged subdomains. Properly aligned contacts are marked with green lines. Yellow lines denote
aligned contacts which are not preserved in the other structure. Red lines mark residue pairs not in
contact, which are aligned with residues in contact. In order to superimpose these structures it is
necessary to extend springs corresponding to yellow contacts to lengths of respective red lines

A tension of the alignment ξ is a root mean square of its local tensions, computed first for
each residue, and then for the whole structure:

$\mathrm{tens}(\xi) = \sqrt{\frac{1}{|\mathrm{Dom}(\xi)|} \sum_{a^{(i)} \in \mathrm{Dom}(\xi)} \frac{\sum_{a^{(j)} \in T_{a^{(i)}}} \left[\mathrm{tens}\left((a^{(i)}, \xi(a^{(i)})), (a^{(j)}, \xi(a^{(j)}))\right)\right]^2}{|T_{a^{(i)}}|}}$

where $a^{(j)} \in T_{a^{(i)}}$, if $\left((a^{(i)}, \xi(a^{(i)})), (a^{(j)}, \xi(a^{(j)}))\right)$ is an aligned contact in ξ.
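The two-level averaging can be sketched in a few lines. The sketch assumes that the local tensions have already been computed and are supplied as a hypothetical dict which maps every aligned residue to the list of local tensions of the aligned contacts it takes part in. Squared local tensions are averaged per residue, and the per-residue means are then averaged over all aligned residues. Residues without any aligned contact are assumed to contribute zero, which the definition leaves open.

```python
from math import sqrt


def alignment_tension(local_tensions):
    """Root mean square tension of an alignment.

    local_tensions: dict aligned residue -> list of local tensions of the
    aligned contacts involving that residue (possibly empty).
    """
    if not local_tensions:
        return 0.0
    total = 0.0
    for tensions in local_tensions.values():
        if tensions:
            # mean squared local tension for this residue
            total += sum(t * t for t in tensions) / len(tensions)
    # mean over all aligned residues, then square root
    return sqrt(total / len(local_tensions))


example = {"a1": [3.0, 4.0], "a2": [0.0]}
value = alignment_tension(example)
```

Averaging per residue first prevents residues with many contacts from dominating the measure.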

5.3 Case Studies

To illustrate the issues arising in computing alignments of protein structures we
present three cases of difficult structure alignments not handled effectively by methods
limited by the rigid-body or sequence-dependency constraints.
Saposins
Similarity between saposin and saposin-like “swaposin” domains is one of the first
circular permutations discovered. It was first indicated by sequence analysis [40],
and verified when the crystal structures became available. NK-lysin (SCOP domain
d1nkla_) is composed of five α-helices arranged in the “folded leaf” architecture
[29]. The “swaposin” domain (d1qdma1) of aspartic proteinase prophytepsin has the
same architecture, but the helices are in a different order [27] (Fig. 9). Nevertheless,
despite the obvious similarity most of the structure comparison methods align the
helices in agreement with their order along the sequence, which results in a visually
poor superposition (Fig. 9a). The similarity of continuous segments commonly used
does not provide enough information concerning the arrangement of helices and at
the same time supports an alignment without swaps (Fig. 9c). It should be emphasized
that, apart from the worse RMS distance, alignments without swaps incorrectly
match cysteine residues forming the disulfide bonds. FlexSnap [42] and DEDAL [10]
identify the similarity correctly (Fig. 9b).

Fig. 9 The Saposin domain of NK-lysin (SCOP domain d1nkla_) and the “swaposin” domain of
prophytepsin (d1qdma1). Despite differing topologies these two domains have the same
architecture and identical disulfide bonds. a Methods incapable of handling segment swaps wrongly align
cysteine residues (figure shows alignment computed by DALI). b DEDAL correctly identifies the
best superposition and the disulfide bond network. c Alignments shown in a (red) and b (green)
are plotted against local similarity of single segments of length 5 (yellow). It can be observed that
similarity of continuous segments is insufficient to discover the correct alignment
GTPases
Guanine nucleotide-binding proteins (G proteins) are important cellular regulators.
They act as binary switches, and use the GTP-GDP-GTP cycle to flip between the
on and off states. GTPase domains they contain are responsible for the GTP/GDP
binding. The GTPase activity depends on the set of five conserved sequence motifs
[36]. An alternative circularly permuted GTPase structure (cpGTPase) [45] which
contains all five motifs in a different order also exists (Fig. 11a and b). Despite a
different topology the cpGTPase domains retain the GTP binding activity, and have
the same architecture as GTPases. Although the crucial motifs are highly conserved
and identifiable by sequence analysis [3], many structure comparison methods are
unable to correctly align residues which form the GTP/GDP binding site. CE [46]
and DALI [21] yield 36% accuracy, while FlexSnap [42] and Cα-match [4] have
90% accuracy (reference alignment contains residues responsible for GTP binding).
In contrast, DEDAL [10] yields an entirely accurate superposition in this region
(Fig. 11c and d).
Cyanovirin-N
Cyanovirin-N is a potent HIV-inactivating protein, which exists in both monomeric
and domain-swapped dimeric forms. Although the monomeric form is predominant
in solution, and was determined first [7], the metastable dimeric form is also present.
The dimeric form is stabilized in the crystalline state [51] and eventually its struc-
ture was also obtained by NMR [5]. For the dimeric form, it can be observed that
the X-ray (SCOP domain d1l5ba_) and NMR (d1l5ea_) structures exhibit a
slightly different arrangement of subdomains (Fig. 10a and b), and that the local con-
formations of all residues except for the hinge region (PRO51-ASN53, Fig. 10c) are
identical. Nevertheless, the similarity between the two structures cannot be easily
determined by the rigid-body techniques, which are capable of aligning only one
subdomain. Surprisingly FlexSnap [42], although in principle capable of handling
conformational variability, gives only 50% accuracy with the reference alignment.

6 Conclusions

Computing protein structure alignments is one of the most commonly performed
tasks in computational structural biology. Most knowledge about proteins and their
functions which is not gathered experimentally is inferred from properties of
sequentially or structurally analogous proteins. Therefore, despite the maturity of this field,
similarity measures and efficient methods of computing alignments are subjects of
ongoing research.

Fig. 10 Conformation of the Cyanovirin-N dimeric form depends on the molecular environment. a
X-ray (d1l5ba_) and b NMR (d1l5ea_) structures have different conformations of the “hinge”
region (PRO51-ASN53) c. To fully analyze the similarity of the two structures it is necessary to
abandon the rigid-body approach. The regions on both sides of the “hinge” have to be superimposed
separately. DEDAL accomplishes this by extending local similarities in both regions and effectively
defining the “hinge” as the boundary between them

Fig. 11 Topologies of a the Dynamin A GTPase (SCOP domain d1jwyb_) and b cpGTPase
domain from the YjeQ protein (d1u0la2). Conserved motifs (G1, G4, G5) and the N and C termini are labelled. Aligned SSEs are indicated by lighter colors. c DEDAL
superposition of the GTPase and the cpGTPase domains (yellow and blue, respectively). For clarity,
only the aligned parts of the structures are shown. d View of the binding site in the same superposition
showing residues participating in the GDP/GTP binding (red) and the GDP molecule. Despite
significant topological differences, DEDAL effectively handles all alignable SSEs and correctly
superimposes the active sites. The sequence identity of the superimposed regions is 24.2%
In this study we have concentrated on structure alignment methods based on so-
called local fragments. This is probably the most prominent approach. Being based
on local similarities, it also has the capability to build alignments containing seg-
ment swaps and deal with spatial distortions. These features are crucial in detecting
“difficult” similarities, some of which were presented.
One of the reasons for the lack of a “gold standard” structure alignment method
is the fact that under most circumstances the problem of computing the optimal align-
ment is intractable. Therefore, all computationally feasible methods have either to be
heuristic or be based on a simplified similarity measure. In particular, we have shown
that the problem of computing an optimal alignment based on similarities compris-
ing three disjoint segments is NP-complete even if segment swaps are forbidden.
If segment swaps are allowed, the problem is also intractable for two segmented
similarities. We have described algorithms that could be used for computing such
alignments, and presented a particularly effective formalism of local similarities –
Local Descriptors of Protein Structure. This method has been implemented, tested,
and is publicly available (DEDAL [10]).
Computing multiple alignments is even harder. It is NP-complete since it contains
an intractable problem of pairwise alignment, but just like in the case of sequence
alignments, even if computing pairwise alignments was effectively solved, comput-
ing an optimal multiple alignment would require time exponential with respect to
the number of structures.
Nevertheless, despite all theoretical difficulties there exists a plethora of successful
methods. Most of them overcome inherent intractability by disallowing segment
swaps, using a rigid-body quality measure, and using simple local similarities (e.g.
single segments). One of the reasons for their success is that reference databases
of structural alignments which could be used to benchmark them have not yet been
developed. At the same time, they generally performed well enough to allow for
a proper similarity analysis by a human expert. Traditionally, alignment methods
were ranked by the number of residues they aligned under certain quality thresholds.
Although this seems to be a valid and fully objective method, it by no means takes
into account the biological meaning of the computed similarities. This is akin to the
concern frequently raised in the protein structure prediction community, that models
with correct prediction of an overall topology (fold) may be less relevant than models
correctly predicting conformation of an active site.
In recent years, studies evaluating alignment algorithms on human curated bio-
logically significant alignments have emerged [6, 32] giving a basis for more relevant
benchmarks. DEDAL [10] – a method based on the principles presented in this review
and employing local descriptors – outperforms established methods despite its sim-
plicity, especially when tested on the most difficult cases. DAMA [9] – its prototype
extension carrying out multiple structural alignments, has already been announced.
One of the underexplored directions of research (in the authors’ opinion) is heuristics
based on relaxation. Computing an alignment is by its nature a discrete combinatorial
problem. Nevertheless, there are successful applications of techniques converting a
discrete problem into an easier continuous one with the aim of obtaining an approximate
solution. This technique might be of use in efficiently computing multiple
alignments (private communications, unpublished work).
As the number of known protein structures and high quality models increases,
computing biologically relevant alignments is becoming a serious option in the area
traditionally reserved to genome-wide sequence searches. It is generally accepted
that a sequence of residues implies a spatial structure, which in turn determines
atomic functional motions and other properties of a molecule. Therefore conclusions
inferred from the structure comparison are in general more reliable than ones based
on sequence alignments.
One should note that although causal relations between sequence, structure,
atomic motions and function are often discussed in the biological literature, such
relations still lack a formal, consistent mathematical framework. Nevertheless,
during the past few years, based on methodologies developed for complex systems
in economics and neurophysiology, a prototype of causal analysis for biomolecular
systems has been proposed (see footnote 12). In particular, applying this
methodology to trajectories obtained from molecular dynamics simulations can help
to elucidate the actual logic of a biomolecule's functioning. The development of
such a formalism for causal relations is one of the challenging tasks in structural
biology and bioinformatics.

Acknowledgements This study was supported by the Biocentrum-Ochota Project (POIG.02.03.00-00-003/09),
the research grant (DEC-2011/03/D/NZ2/02004) of the National Science Centre, and
partially by BST/BF funds of the University of Warsaw. Figures 1, 10 and 11 are reproduced from
an earlier study by the same authors [10].

References

1. Alexandrov, N.: SARFing the PDB. Protein Eng. 9(9), 727 (1996)
2. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25(17), 3389–402 (1997)
3. Anand, B., Verma, S.K., Prakash, B.: Structural stabilization of GTP-binding domains in circu-
larly permuted GTPases: implications for RNA binding. Nucleic Acids Res. 34(8), 2196–205
(2006)
4. Bachar, O., Fischer, D., Nussinov, R., Wolfson, H.: A computer vision based technique for 3-D
sequence-independent structural comparison of proteins. Protein Eng. 6(3), 279–88 (1993)
5. Barrientos, L.G., Louis, J.M., Botos, I., Mori, T., Han, Z., O’Keefe, B.R., Boyd, M.R., Wlo-
dawer, A., Gronenborn, A.M.: The domain-swapped dimer of cyanovirin-N is in a metastable
folded state: reconciliation of X-ray and NMR structures. Structure 10(5), 673–86 (2002)
6. Berbalk, C., Schwaiger, C.S., Lackner, P.: Accuracy analysis of multiple structure alignments.
Protein Sci. 18(10), 2027–35 (2009)
7. Bewley, C.A., Gustafson, K.R., Boyd, M.R., Covell, D.G., Bax, A., Clore, G.M., Gronenborn,
A.M.: Solution structure of cyanovirin-N, a potent HIV-inactivating protein. Nat. Struct. Biol.
5(7), 571–8 (1998)

12 For an overview and references related to this topic see [11].


8. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-
structure motifs. J. Mol. Biol. 281(3), 565–77 (1998). https://doi.org/10.1006/jmbi.1998.1943
9. Daniluk, P., Lesyng, B.: DAMA: a novel method for aligning multiple protein structures. In:
Multi-Pole Approach to Structural Biology Conference. Warsaw, Poland (2011a)
10. Daniluk, P., Lesyng, B.: A novel method to compare protein structures using local descriptors.
BMC Bioinformatics 12(1), 344 (2011b). https://doi.org/10.1186/1471-2105-12-344
11. Daniluk, P., Dziubiński, M., Hallay-Suszek, M., Rakowski, F., Walewski, L., Lesyng, B.: From
experimental structural probability distributions to the theoretical causality analysis of molec-
ular changes. CAMES (In press) (2012)
12. Dobbins, S., Lesk, V., Sternberg, M.: Insights into protein flexibility: the relationship between
normal modes and conformational change upon protein–protein docking. Proc. National Acad.
Sci. 105(30), 10390 (2008)
13. Dror, O., Benyamini, H., Nussinov, R., Wolfson, H.: MASS: multiple structural alignment by
secondary structures. Bioinformatics 19(Suppl 1), i95–104 (2003)
14. Elias, I.: Settling the intractability of multiple alignment. J. Comput. Biol. 13(7), 1323–39
(2006). https://doi.org/10.1089/cmb.2006.13.1323
15. Garey, M.R., Johnson, D.S.: Computers and intractability: a guide to the theory of
NP-completeness. A Series of books in the mathematical sciences, W. H. Freeman, San Francisco
(1979)
16. Gerstein, M., Echols, N.: Exploring the range of protein flexibility, from a structural proteomics
perspective. Curr. Opin. Chem. Biol. 8(1), 14–19 (2004)
17. Gibrat, J.F., Madej, T., Bryant, S.H.: Surprising similarities in structure comparison. Curr. Opin.
Struct. Biol. 6(3), 377–85 (1996)
18. Grishin, N.V.: Fold change in evolution of protein structures. J. Struct. Biol. 134(2–3), 167–85
(2001)
19. Guerler, A., Knapp, E.W.: Novel protein folds and their nonsequential structural analogs. Pro-
tein Sci. 17(8), 1374–82 (2008)
20. Holm, L., Park, J.: DaliLite workbench for protein structure comparison. Bioinformatics 16(6),
566–7 (2000)
21. Holm, L., Sander, C.: Protein structure comparison by alignment of distance matrices. J. Mol.
Biol. 233(1), 123–38 (1993)
22. Ilyin, V.A., Abyzov, A., Leslin, C.M.: Structural alignment of proteins by a novel TOPOFIT
method, as a superimposition of common volumes at a topomax point. Protein Sci. 13(7),
1865–74 (2004)
23. Jung, J., Lee, B.: Protein structure alignment using environmental profiles. Protein Eng. 13(8),
535–43 (2000)
24. Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. Sect.
A 32(5), 922–923 (1976)
25. Kabsch, W.: A discussion of the solution for the best rotation to relate two sets of vectors. Acta
Crystallogr. Sect. A 34(5), 827–828 (1978)
26. Kawabata, T., Nishikawa, K.: Protein structure comparison using the Markov transition model
of evolution. Proteins 41(1), 108–22 (2000)
27. Kervinen, J., Tobin, G.J., Costa, J., Waugh, D.S., Wlodawer, A., Zdanov, A.: Crystal structure
of plant aspartic proteinase prophytepsin: inactivation and vacuolar targeting. EMBO J. 18(14),
3947–55 (1999)
28. Konagurthu, A.S., Whisstock, J.C., Stuckey, P.J., Lesk, A.M.: MUSTANG: a multiple structural
alignment algorithm. Proteins 64(3), 559–74 (2006). https://doi.org/10.1002/prot.20921
29. Liepinsh, E., Andersson, M., Ruysschaert, J.M., Otting, G.: Saposin fold revealed by the NMR
structure of NK-lysin. Nat. Struct. Biol. 4(10), 793–5 (1997)
30. Lindqvist, Y., Schneider, G.: Circular permutations of natural protein sequences: structural
evidence. Curr. Opin. Struct. Biol. 7(3), 422–7 (1997)
31. Mavridis, L., Ritchie, D.: 3D-blast: 3D protein structure alignment, comparison, and
classification using spherical polar Fourier correlations. Pacific Symp. Biocomputing 2010, 281–292
(2010)
32. Mayr, G., Domingues, F.S., Lackner, P.: Comparative analysis of protein structure alignments.
BMC Struct. Biol. 7, 50 (2007)
33. Menke, M., Berger, B., Cowen, L.: Matt: local flexibility aids protein multiple structure align-
ment. PLoS Comput. Biol. 4(1), e10 (2008). https://doi.org/10.1371/journal.pcbi.0040010
34. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equation of state calcu-
lations by fast computing machines. J. Chem. Phys. 21(6), 1087 (1953)
35. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in
the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–53 (1970)
36. Niemann, H.H., Knetsch, M.L., Scherer, A., Manstein, D.J., Kull, F.J.: Crystal structure of a
dynamin GTPase domain in both nucleotide-free and GDP-bound forms. EMBO J. 20(21),
5813–21 (2001)
37. Orengo, C.A., Taylor, W.R.: SSAP: sequential structure alignment program for protein structure
comparison. Methods Enzymol 266, 617–35 (1996)
38. Pawlak, Z.: Rough Sets: theoretical aspects of reasoning about data. System theory, knowledge
engineering, and problem solving, Kluwer Academic Publishers, Theory and decision library
(1991)
39. Pearson, W., Lipman, D.: Improved tools for biological sequence comparison. Proc. National
Acad. Sci. 85(8), 2444 (1988)
40. Ponting, C.P., Russell, R.B.: Swaposins: circular permutations within genes encoding saposin
homologues. Trends Biochem Sci. 20(5), 179–80 (1995)
41. Rocha, J., Segura, J., Wilson, R.C., Dasgupta, S.: Flexible structural protein alignment by a
sequence of local transformations. Bioinformatics 25(13), 1625–31 (2009)
42. Salem, S., Zaki, M., Bystroff, C.: FlexSnap: flexible non-sequential protein structure alignment.
Algorithms for Mol. Biology 5(1), 12 (2010)
43. Shatsky, M., Nussinov, R., Wolfson, H.J.: FlexProt: alignment of flexible protein structures
without a predefinition of hinge regions. J. Comput. Biol. 11(1), 83–106 (2004a)
44. Shatsky, M., Nussinov, R., Wolfson, H.J.: A method for simultaneous alignment of multiple
protein structures. Proteins 56(1), 143–56 (2004b). https://doi.org/10.1002/prot.10628
45. Shin, D.H., Lou, Y., Jancarik, J., Yokota, H., Kim, R., Kim, S.H.: Crystal structure of YjeQ
from Thermotoga maritima contains a circularly permuted GTPase domain. Proc. Natl. Acad.
Sci. U S A 101(36), 13198–13203 (2004)
46. Shindyalov, I.N., Bourne, P.E.: Protein structure alignment by incremental combinatorial exten-
sion (CE) of the optimal path. Protein Eng. 11(9), 739–47 (1998)
47. Siew, N., Elofsson, A., Rychlewski, L., Fischer, D.: MaxSub: an automated measure for the
assessment of protein structure prediction quality. Bioinformatics 16(9), 776–785 (2000)
48. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo simulation of spin glasses. Phys. Rev. Lett.
57(21), 2607–2609 (1986)
49. Vogel, C., Morea, V.: Duplication, divergence and formation of novel protein topologies. Bioes-
says 28(10), 973–8 (2006). https://doi.org/10.1002/bies.20474
50. Wohlers, I., Domingues, F.S., Klau, G.W.: Towards optimal alignment of protein structure
distance matrices. Bioinformatics 26(18), 2273–80 (2010)
51. Yang, F., Bewley, C.A., Louis, J.M., Gustafson, K.R., Boyd, M.R., Gronenborn, A.M., Clore,
G.M., Wlodawer, A.: Crystal structure of cyanovirin-N, a potent HIV-inactivating protein,
shows unexpected domain swapping. J. Mol. Biol. 288(3), 403–12 (1999)
52. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing
twists. Bioinformatics 19(Suppl 2), ii246–55 (2003)
53. Ye, Y., Godzik, A.: Multiple flexible structure alignment using partial order graphs. Bioinfor-
matics 21(10), 2362–9 (2005). https://doi.org/10.1093/bioinformatics/bti353
Fuzzy Oil Drop Model
Application—From Globular Proteins
to Amyloids

M. Banach, L. Konieczny and I. Roterman

Abstract The fuzzy oil drop model asserts the presence of a monocentric
hydrophobic core in a protein, generated by the influence of water which directs
hydrophobic residues towards the center, while exposing hydrophilic molecules on
the surface. Applying the model to a range of proteins which vary in terms of structure
and function reveals globally accordant structures and locally discordant fragments
which disrupt the hydrophobic core and appear to mediate the protein’s biological
function. Solenoids provide an example of structural elements which diverge from
the fuzzy oil drop model by adopting a linear distribution of hydrophobicity. Such lin-
ear propagation, while unbounded in principle, is arrested by terminal “caps”, which
mediate contact with water and therefore prevent the solenoid from growing indef-
initely. Amyloids—a group of misfolding proteins—follow the same principles but
lack suitable “caps” and may propagate without bound. In light of the fuzzy oil drop
model, the factor most directly responsible for this phenomenon is anomalous inter-
action with the aqueous environment, where the expected monocentric distribution
of hydrophobicity is replaced by a distribution based on the intrinsic hydrophobicity
of each residue, thus preventing a hydrophobic core from emerging. In this work we
present a set of proteins which represent progressive departures from the fuzzy oil
drop model (i.e. from the theoretical distribution of hydrophobicity expressed by a
3D Gaussian). We also discuss the biological function and/or disfunction of each
protein.

M. Banach · I. Roterman (✉)
Department of Bioinformatics and Telemedicine, Jagiellonian University—Medical College,
Łazarza 16, 31-530 Krakow, Poland
e-mail: myroterm@cyf-kr.edu.pl
M. Banach
Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University,
Łojasiewicza 11, 30-348 Krakow, Poland
L. Konieczny
Chair of Medical Biochemistry, Jagiellonian University—Medical College,
Kopernika 7, 31-034 Krakow, Poland

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_19

1 Extension of Fuzzy Oil Drop Applicability

The fuzzy oil drop model, introduced in [1], enables us to study how proteins vary
with respect to size and biological role. This chapter discusses a broad spectrum
of proteins, from small globular molecules (represented by 1UCS, an antifreeze
protein [2]), through large globular examples (Exonuclease III from Escherichia coli;
PDB ID 1AKO [3]), dimeric proteins which exhibit enzymatic activity and can bind
ligands (Class II fructose-1,6-bisphosphate aldolase; PDB ID 1B57 [4]), solenoid-containing
proteins (another antifreeze protein, isoform 501 from spruce budworm;
PDB ID 1Z2F [5]), and amyloids (amyloid β-peptide (Aβ) fibrils of the Aβ1-40 peptide
with the Osaka mutation (E22Δ); PDB ID 2MVX [6]). Processing this heterogeneous
set of proteins with the fuzzy oil drop model reveals strong directionality: from
structures highly consistent with the theoretical core structure represented by a 3D
Gaussian, through local deviations related to biological function (ligand binding;
protein complexation), all the way to global discordance, evident e.g. in solenoids
and amyloids which expose a linear arrangement of alternating bands of high and low
hydrophobicity (propagating along their axis of elongation) in place of a monocentric
hydrophobic core. The key difference between solenoids and amyloids is that the
former group is equipped with special “stoppers” (or “caps”), preventing unchecked
growth of the solenoid structure. In amyloids no such stoppers are present.
A novel addition to the model as presented in [1] is the so-called relative distance
(RD) coefficient, given as:
$$\mathrm{RD} = \frac{O/T}{O/T + O/R}$$

This value determines whether the observed (O) distribution more closely approx-
imates the theoretical (T) or the unified (R) boundary case. It is worth noting that RD
can also be computed for a different set of boundary distributions. Thus far we have
focused on the theoretical distribution (T), which perfectly matches the 3D Gaus-
sian, and the unified distribution (R), where each residue is assigned a hydrophobicity
of 1/N (N being the number of residues in the polypeptide chain). In some cases,
however, it is useful to replace the unified distribution with the so-called intrinsic
distribution (H) based on the individual intrinsic hydrophobicity of each residue.
This approach results in two distinct values of RD: one for the T-O-R variant and
another for the T-O-H variant. Notably, RD can also be computed for specific frag-
ments of the input chain, such as selected secondary folds or fragments which meet
some arbitrary criterion.
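A minimal numerical sketch of the RD coefficient is given below. It assumes that O/T and O/R denote the Kullback-Leibler divergence (base-2 logarithm) of the normalized observed distribution from the respective reference distribution; this reading is not spelled out in this section and should be checked against [7]. The function names and the five-residue profiles are hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("profile must have a positive sum")
    return [v / total for v in values]


def divergence(observed, reference):
    """Kullback-Leibler divergence of observed from reference (assumed metric).

    Reference values must be positive wherever observed values are positive.
    """
    return sum(o * math.log2(o / r) for o, r in zip(observed, reference) if o > 0)


def unified(n):
    """Unified distribution R: every residue is assigned 1/N."""
    return [1.0 / n] * n


def relative_distance(observed, theoretical, other):
    """RD = O/T / (O/T + O/R); 'other' is R (unified) or H (intrinsic)."""
    o, t, r = normalize(observed), normalize(theoretical), normalize(other)
    o_t = divergence(o, t)
    o_r = divergence(o, r)
    # Undefined (division by zero) if the observed profile equals both references.
    return o_t / (o_t + o_r)


# Invented profiles for a five-residue chain.
t_profile = [0.1, 0.2, 0.4, 0.2, 0.1]
o_profile = [0.15, 0.2, 0.3, 0.2, 0.15]
rd_tor = relative_distance(o_profile, t_profile, unified(5))
print(round(rd_tor, 3))
```

Passing an intrinsic profile H instead of `unified(n)` as the third argument gives the T-O-H variant under the same assumptions.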
A low value of RD in the T-O-R variant indicates the presence of a well-formed
hydrophobic core, consistent with the Gaussian distribution. This type of distribu-
tion can be called “cooperative” since it relies on cooperation of individual residues
in adopting a common conformation. On the other hand, a high value of RD (T-O-
R) means that the protein lacks a prominent hydrophobic core. Regarding T-O-H,
high RD shows that the hydrophobicity distribution is dominated by the individual
properties of each residue. This is a “selfish” state where the placement of residues
is dictated by their interaction with close neighbors rather than by the general ten-
dency to produce a common core. Such distributions are commonly found in linear
structures, particularly amyloids (or solenoids) comprising sequences of identical (or
similar, from the point of view of hydrophobicity) fragments. This often leads to lin-
ear arrangement of alternating hydrophobicity maxima and minima, which propagate
along the fibril’s axis.
Figure 1 illustrates the RD parameter in both variants (T-O-R and T-O-H).
All further references to RD which do not specify a variant will indicate T-O-R.
Whenever T-O-H is considered instead, this will be clearly stated.
A thorough description of the fuzzy oil drop model can be found in [7].

Fig. 1 Distributions of hydrophobicity calculated for a sample protein and compared with reference
distributions (T, R and H). a—theoretical distribution (T); c—unified distribution (R); b—observed
distribution (O), which is assessed as accordant with the theoretical distribution (RD = 0.26, as
shown on axis d). Substituting the intrinsic distribution (h) for the unified distribution results
in an RD value of 0.58, indicating that the structure in question is dominated by the intrinsic
hydrophobicity of each participating residue (axis e). For the sake of clarity, the presentation has
been restricted to a single dimension. The illustrated fragment is composed of residues 104–112
in transthyretin (1DVQ). f—theoretical distribution for the selected fragment; g—corresponding
observed distribution; h—corresponding intrinsic distribution

2 Fuzzy Oil Drop Model Revealing the Degree
of Accordance/Discordance of the Observed Distribution
Versus the Theoretical Distribution

As stated above, we will discuss individual proteins in the order of increasing dis-
cordance versus the theoretical distribution, which is expressed by the 3D Gaussian.

2.1 Small Globular Protein Consistent with the Fuzzy Oil
Drop Model

The highly ordered antifreeze protein 1UCS is an example of a small globular
molecule which remains highly consistent with the theoretical hydrophobic core
structure as predicted by the fuzzy oil drop model. This protein has a chain length of
64 aa and its source organism is Lycodichthys dearborni, the Antarctic eel pout (a bony
fish). According to [2], this protein belongs to type III of the four types of antifreeze
proteins found in marine fishes living at subzero temperatures.
From the perspective of the fuzzy oil drop model the tertiary conformation of this
protein involves a central hydrophobic core, with hydrophobicity values decreas-
ing along with distance from the center, reaching almost 0 on the protein surface.
This view is supported by comparing theoretical (T) and observed (O) distributions
(Fig. 2). In line with the model, the location of residues which comprise the core (peak
hydrophobicity) as well as those which are found on the surface (low hydrophobicity)
is consistent with the Gaussian form. The corresponding 3D representation (Fig. 3)
reveals certain exposure (albeit not direct) of some hydrophobic residues.
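The theoretical distribution can be illustrated with a short sketch. It assumes that each residue is represented by a single point (for example an effective atom), that the Gaussian is centered at the geometric center of these points, and that the three sigma values are supplied by the caller; how the model orients the molecule and chooses the sigma values is described in [7] and is not reproduced here. All names and numbers below are hypothetical.

```python
import math


def theoretical_hydrophobicity(points, sigmas):
    """Normalized 3D Gaussian values at the given residue positions.

    points: list of (x, y, z) tuples, one per residue (assumed representation).
    sigmas: (sigma_x, sigma_y, sigma_z), assumed to be chosen by the caller.
    """
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    raw = []
    for p in points:
        exponent = sum((p[k] - center[k]) ** 2 / (2.0 * sigmas[k] ** 2) for k in range(3))
        raw.append(math.exp(-exponent))
    total = sum(raw)
    # Normalize so that the values over all residues sum to 1.
    return [v / total for v in raw]


# Invented positions: one central residue and four peripheral ones.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (-3.0, 0.0, 0.0), (0.0, 6.0, 0.0), (0.0, -6.0, 0.0)]
t_values = theoretical_hydrophobicity(points, (5.0, 5.0, 5.0))
print([round(v, 3) for v in t_values])
```

In this toy case the central point receives the highest value and the most distant points the lowest, mirroring the decrease of expected hydrophobicity from the core towards the surface.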
Comparing both distributions reveals only minimal differences, as evidenced by
the low values of RD: 0.358 for T-O-R and 0.276 for T-O-H. Note that RD
expresses only the relative distance of the observed distribution from the two
reference distributions. The corresponding interpretation is as follows: RD < 0.5
indicates that the observed distribution approximates the theoretical distribution,
while RD > 0.5 signals a distribution which is more closely aligned with the unified (R)

Fig. 2 T(blue), O(red) and H(green) distributions in 1UCS—antifreeze protein


Fig. 3 3D presentation of 1UCS. a globular conformation; b minimal exposure of hydrophobic
residues (red)

distribution. In all cases, RD takes values in the 0–1 range. Here, both RD values
lie well below 0.5, suggesting that the Gaussian distribution dominates in this
molecule.
The molecule under consideration is an antifreeze protein which works by dis-
rupting the natural organization of water, required for an ice crystal to emerge. The
action is quite similar to de-icing pavement by scattering salt. The mere presence of
the protein triggers structural rearrangement in the aqueous medium. By adjusting
to the distribution of charge (polar groups) on the protein surface, water molecules
adopt an ordering which disfavors the formation of ice crystals. The effect is not
limited to the layer directly adjacent to the protein but likely propagates to further
layers and can be felt at some distance from the surface.

2.2 Large Globular Protein Consistent with the Fuzzy Oil
Drop Model

Exonuclease III from Escherichia coli (1AKO) provides an example of a large
protein which is nevertheless consistent with the fuzzy oil drop model [3]. This
protein has a chain length of 268 aa. Reviewers who comment on publications related
to the fuzzy oil drop model often remark that the theoretical and observed
distributions can remain consistent only for small proteins. The presented case
contradicts this opinion. Figure 4 illustrates both distributions, showing that they
remain consistent in spite of small local deviations. The corresponding RD value
(T-O-R) is 0.441. The diagram also reveals catalytic residues and shows that strongly
polar residues (note that

Fig. 4 T(blue), O(red) and H(green) distributions in exonuclease III from Escherichia coli (1AKO).
The catalytic residues are marked with orange stars

catalytic reactions are based mainly on electrostatic interactions) are located in areas
where high hydrophobicity is expected. Due to the need to interact with the substrate
(in this case—DNA), catalytic residues are commonly found in binding pockets.
Consequently, their neighborhood is characterized by high expected hydrophobicity.
When such residues are omitted from calculations, the remainder of the protein
typically conforms to the theoretical hydrophobicity profile with greater accuracy (in
this case, RD = 0.419 for T-O-R). In contrast, the catalytic residues themselves diverge
from the model (RD = 0.715 for T-O-R). This phenomenon reflects the encoding of
information in the structure of the active site, where hydrophilic residues located
in close proximity to the hydrophobic core create suitable conditions for catalytic
reactions.
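The effect of omitting selected residues can be sketched as follows. The code assumes that RD is built from Kullback-Leibler divergences and that each profile is renormalized over the retained residues; both points are assumptions made for this illustration, and the six-residue profiles are invented (residue index 3 plays the role of a polar residue placed where high hydrophobicity is expected).

```python
import math


def rd_for_residues(observed, theoretical, reference, keep):
    """RD restricted to the residue indices in 'keep' (hypothetical helper).

    Each profile is cut down to the retained residues and renormalized to
    sum to 1 before the divergences are computed (an assumed convention).
    """
    keep = list(keep)

    def restrict(profile):
        selected = [profile[i] for i in keep]
        total = sum(selected)
        return [v / total for v in selected]

    def kl(p, q):
        # Kullback-Leibler divergence with base-2 logarithm.
        return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

    o, t, r = restrict(observed), restrict(theoretical), restrict(reference)
    o_t, o_r = kl(o, t), kl(o, r)
    return o_t / (o_t + o_r)


theoretical = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]
observed = [0.05, 0.15, 0.30, 0.05, 0.15, 0.05]
unified = [1.0 / 6] * 6

rd_all = rd_for_residues(observed, theoretical, unified, range(6))
rd_rest = rd_for_residues(observed, theoretical, unified, [0, 1, 2, 4, 5])
print(rd_all > rd_rest)
```

In this toy profile the single discordant residue accounts for the whole deviation, so removing it brings the remainder into full agreement with the theoretical profile; real proteins show a much smaller shift, as the values quoted above indicate.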
The beta-sandwich structure which forms part of the protein is characterized by
high RD for T-O-R (0.576). One of its constituent beta sheets is consistent with the
theoretical distribution (RD = 0.487 for the 53–59 fragment for T-O-R), while the
other sheet houses three out of five catalytic residues and diverges
from the model (RD = 0.681 for the beta sheet which includes the 75–79 fragment
for T-O-R). Such discordance of the “catalytic” beta sheet suggests a certain degree of
instability and capacity for structural rearrangements which may be required during
catalysis.
The entire molecule remains highly stable owing to the presence of a well-ordered
outer layer composed of helical folds (RD = 0.440 for T-O-R).
Figure 5 depicts the 3D structure of this protein. We can observe that the surface
is composed of hydrophilic residues, while the polar residues forming the active site
are housed in a pocket.
The presented protein includes a region dominated by positive electrostatic
potential, with numerous strongly conserved residues (identified in many endonucleases
from bacteria to man). This region participates in cleaving a phosphate group (via
nucleophilic attack), which also requires the presence of a metal ion. Our analysis
based on the fuzzy oil drop model is consistent with the above properties.

Fig. 5 3D structure of exonuclease III: a globular form of the protein; b hydrophilic surface (gray),
catalytic residues (orange), local exposure of hydrophobic residues (red). Note that the red
residues lie at some distance from the exposed surface

2.3 Dimer—Ligand-Binding Enzyme: Protein Exhibiting
Localized, Function-Related Deviations
from the Theoretical Hydrophobic Core Structure

Class II fructose-1,6-bisphosphate aldolase in complex with phosphoglycolohydroxamate
(1B57) [4] provides an example of a complex structure, which is (1) a homodimer;
(2) an enzyme; (3) capable of binding ligands. This protein has a chain length of
346 aa. Owing to its conformational diversity and biological activity it represents an
interesting study subject in the context of the fuzzy oil drop model. Here, we attempt
to relate the structure of its hydrophobic core to specific functional properties.
In its monomeric form, 1B57 has an RD value of 0.562 (T-O-R), indicating that
it lacks a clearly defined hydrophobic core in the sense of the fuzzy oil drop model
(Fig. 6).

Fig. 6 T(blue), O(red) and H(green) distributions in Class II fructose-1,6-bisphosphate aldolase
(1B57) in complex with phosphoglycolohydroxamate. The chart marks the positions of residues
involved in P-P complexation (yellow squares), residues engaged in ligand binding (turquoise
circles) and catalytic residues (orange stars)

The presence of a ligand typically distorts the protein’s hydrophobic core structure
due to the need for a suitable binding cavity. In the presented case, elimination of
residues responsible for interaction with the ligand lowers RD to 0.551 (T-O-R),
while the ligand-binding fragment itself strongly diverges from the model (RD = 0.701,
T-O-R). Clearly, a protein which includes a binding cavity does not follow
the theoretical distribution with the same accuracy as proteins which lack such a
cavity. Comparing RD values for the entire molecule and for its remainder following
elimination of catalytic residues indicates that deviations from the theoretical model
are concentrated in areas where substrate binding occurs.
Similar conditions are observed in areas responsible for protein-protein
interaction. The interface, when analyzed on its own, has RD = 0.584 (T-O-R), while
the remainder of the molecule exhibits a lower value of RD (0.560, T-O-R). This
suggests that local deviations from the theoretical hydrophobicity distribution are
associated with external factors, such as the presence of a ligand or complexation
partner.
Comparing T and O profiles reveals discordances in areas responsible for lig-
and binding and polymerization. Of note is the 275–340 fragment, which houses
a catalytic residue, as well as residues which mediate contact with the ligand and
dimerization. A local excess of hydrophobicity, if present on the surface, typically
enables contact with another protein chain, while local hydrophobicity deficiencies
usually correspond to catalytic active sites. This is further visualized in Fig. 7.
In terms of its tertiary conformation, the monomeric form of 1B57 presents a cen-
trally located beta sheet surrounded by helical folds. The beta sheet itself exhibits low
RD (0.354, T-O-R), indicating high structural stability (under the assumption that a
prominent hydrophobic core stabilizes the protein’s tertiary conformation). An
additional hairpin, comprising two separate beta folds, is characterized by RD = 0.242
(T-O-R). The helices, considered as a single unit, diverge from the theoretical
distribution (RD = 0.521, T-O-R), with two fragments (288–306 and 307–310) regarded
as particularly divergent. Eliminating these two short folds from the “sheath” which
surrounds the central beta sheet lowers its RD value to 0.482 (T-O-R). This means
that at least part of the sheath also contributes to structural stabilization by ensuring

Fig. 7 3D presentation of Class II fructose-1,6-bisphosphate aldolase in complex with
phosphoglycolohydroxamate. Red helix: unstable fragment at 288–310; orange spheres: catalytic
residues. Gray and blue colours: two chains

entropically advantageous contact with water. The two discordant helices are located
in close proximity to the catalytic residue (N286), which suggests that they form part
of the catalytic active site and undergo conformational changes during catalysis.
Our presentation of Class II fructose-1,6-bisphosphate aldolase in complex with
phosphoglycolohydroxamate is intended as an example of a molecule where various
functional factors (complexation, ligand binding) result in localized deviations from
the theoretical hydrophobic core structure.
With regard to the dimeric structure, the computed RD value of 0.662 (T-O-R)
indicates significant departure from the model, with the terminal fragments of each
chain (288-terminus) seen as particularly discordant (RD = 0.691, T-O-R). Such
discordance suggests high flexibility of folds which bracket the active site. In contrast,
stabilization is provided by the interface helices at 61–70, 43–51 and 79–101, with
the corresponding RD value of 0.485 (T-O-R). More broadly, the following helical
folds taken together—47–53, 61–70, 79–83, 95–101, 112–134, 159–164, 270–279,
292–307 and 340–354—produce an RD of 0.487 (T-O-R), indicating their stabilizing
role.
When dealing with a complex molecule, it is often useful to consider each frag-
ment separately in order to determine whether it contributes to structural stability.
Locally unstable fragments encode information which is related to the protein’s func-
tional profile. Note that a perfectly accordant protein would be highly soluble but
incapable of any form of activity. Such conditions are approximated by antifreeze
proteins which perform their intended function merely by being present in the aque-
ous environment—such proteins benefit from excellent solubility and inability to
interact with any external molecules. Consequently, localized departures from the
3D Gaussian should be treated as a means of encoding information in a way which
enables the protein to recognize its intended ligand and undergo specific conforma-
tional changes—as is indeed the case with 1B57.

2.4 Protein Which Includes a Solenoid Fragment—Linear
Distribution of Hydrophobicity

Solenoid-containing proteins are represented by 1Z2F: antifreeze protein isoform
501 from Choristoneura fumiferana (spruce budworm) [5]. For this protein, fuzzy
oil drop analysis produces an RD value of 0.713 (T-O-R).
The diagrams shown in Fig. 8 reveal that, contrary to the fuzzy oil drop model,
no concentration of hydrophobicity can be observed in the central part of the pro-
tein. Instead, the observed distribution is sinusoidal, with alternating local minima
and maxima distributed along the chain. This kind of distribution is often associated
with solenoids, and may—in principle—propagate indefinitely. In order to prevent
unchecked growth, the protein includes special “caps” (fragments with green back-
ground) which conform to the theoretical distribution (note that a single cap is suffi-
cient to protect against dimerization). In the presented case, the RD value for the 1–8

Fig. 8 T(blue), O(red) and H(green) distributions in an antifreeze protein (1Z2F)

fragment is 0.670 (T-O-R), while the C-terminal section is highly consistent with the
model (0.398 for the 111–121 fragment and 0.456 for the longer fragment at 103–121,
both T-O-R). The role of caps is to prevent unrestricted elongation of the solenoid and
formation of fibrillar structures. They perform their function by mediating
entropically advantageous contact with water (Fig. 9). In 1Z2F a single C-terminal cap
appears to be present.
The RD value computed for the whole molecule is 0.713 (T-O-R), which, as
already stated, indicates that no concentration of hydrophobicity exists at the center
and that the protein is not stabilized by hydrophobic effects. Instead, the stabilizing
effect appears to be generated by a system of five disulfide bonds. A discussion of
the stabilizing role of disulfides in the context of the fuzzy oil drop model can be
found in [8].
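How an RD value of this kind can be obtained is sketched below. The sketch assumes the usual fuzzy oil drop definition, in which RD is the Kullback-Leibler divergence of O from T divided by the sum of the divergences of O from T and of O from a reference distribution: the uniform distribution R for the T-O-R variant, or the intrinsic hydrophobicity H for T-O-H. The function names are hypothetical, and all profiles are assumed to be strictly positive.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for positive profiles."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

def relative_distance(observed, theoretical, reference=None):
    """RD = D(O||T) / (D(O||T) + D(O||ref)).

    reference=None uses the uniform distribution (T-O-R variant);
    passing an intrinsic-hydrophobicity profile gives the T-O-H variant.
    Undefined (0/0) when O coincides with both T and the reference.
    """
    o = np.asarray(observed, dtype=float)
    t = np.asarray(theoretical, dtype=float)
    o = o / o.sum()
    t = t / t.sum()
    if reference is None:
        r = np.full(o.size, 1.0 / o.size)
    else:
        r = np.asarray(reference, dtype=float)
        r = r / r.sum()
    d_ot = kl_divergence(o, t)
    d_or = kl_divergence(o, r)
    return d_ot / (d_ot + d_or)

# Toy profiles: O identical to T gives RD = 0; a flat O gives RD = 1.
rd_core = relative_distance([1, 2, 4, 2, 1], [1, 2, 4, 2, 1])
rd_flat = relative_distance([1, 1, 1, 1], [1, 4, 4, 1])
```

Under this definition, values below 0.5 mean that O lies closer to T than to the reference, which is how the 0.713 reported for 1Z2F is read as the absence of a central hydrophobic core.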
The status of the solenoid part of 1Z2F, expressed by RD = 0.760 (T-O-R) and
RD = 0.803 (T-O-H), suggests the absence of a unicentric hydrophobic core.
Moreover, the very high value of RD for T-O-H reveals a strong influence of the intrinsic
hydrophobicity of individual residues on the final structure of the solenoid in this
molecule. The resultant overall distribution of hydrophobicity in the solenoid appears
to represent a linear, band-like propagation of low/high hydrophobicity. This peculiar
distribution of hydrophobicity in solenoids seems important from the point of view
of the structuralization of water (Fig. 10). In 1UCS (an antifreeze protein), where the

Fig. 9 T(blue), O(red) and H(green) distributions for the N- and C-terminal fragments in 1Z2F (an
antifreeze protein)

Fig. 10 3D structure of 1Z2F, antifreeze protein isoform 501. a: Fragments identified as “caps”
(left: N-terminal and C-terminal fragments in green; right: C-terminal fragment in green). a and
b: Linear propagation of alternating hydrophobicity bands

entire surface consists of polar residues, water is naturally repelled by the surface. In
the case of a solenoid, however, interaction between the protein and water becomes
far more nuanced. Electrostatic effects are believed to result in the emergence of
aqueous “bands” (with differing structural properties) in the neighborhood of the
protein. Clearly, exposed hydrophobicity has a markedly different effect on water
than any hydrophilic residues present on the surface. Some reports even suggest that
water may “levitate” above hydrophobic patches [9]. Such radical alteration of the
natural ordering of water molecules may explain the observed action of solenoid-
containing antifreeze proteins [10, 11], which prevent water from freezing even at
subzero temperatures.

2.5 Amyloid: Strong Discordance Caused by Unrestricted Propagation of Linear Bands of High/Low Hydrophobicity

Under the fuzzy oil drop model, amyloids are regarded as strongly discordant versus the
theoretical distribution of hydrophobicity. In place of a monocentric hydrophobic
core we observe linear propagation of alternating maxima and minima. These bands
propagate along the axis of the emerging fibril and typically result in identical residues
being found in close proximity to one another. Note that such a structure is not
optimal in terms of charge distribution—this proves that hydrophobic interactions
play a dominant role in determining the tertiary conformation of the amyloid.
All the above properties are exemplified by 2MVX—Aβ1-40 peptide with the
Osaka mutation (E22D) [6].
Our analysis concerns amyloid structures which emerge via complexation of 40-aa
polypeptides. The reference amyloid is an elongated fibril consisting of two identical

Fig. 11 3D presentation of the reference amyloid, 2MVX. Residues engaged in inter-protofilament
interaction

subfibrils (Fig. 11a—chains A–E and F–J). Each subfibril can be further divided into
two distinct beta sheets. The amyloid as a whole is characterized by RD = 0.591
(T-O-R), which indicates the lack of a prominent hydrophobic core.
Figure 12 provides a comparison of T and O profiles for the amyloid fibril.
As seen in Fig. 12, the T and O distributions for peptides located at the end of the
fibril differ somewhat from those found in the central part of the chain.
Given that the presented amyloid consists of two identical subfibrils, we have
singled out residues responsible for complexation of the opposite subunit. The status
of this interface section is described by RD = 0.479 (T-O-R), whereas the remaining
section of the amyloid (minus the interface) gives RD = 0.654 (T-O-R). This means
that the interface is hydrophobically optimized while the remainder of the structure
diverges from the theoretical distribution. This phenomenon may also explain the
moderate RD value calculated for the complex as a whole.
Analysis of a single chain, as illustrated by the profile in Fig. 13, reveals local
accordance with the theoretical distribution, along with certain fragments for which
the observed distribution appears to correlate negatively with theoretical values
(which is typical for amyloids). These fragments are further described in Table
1. More specifically, the fragments at 5–11, 11–15 and 21–27 seem to follow the
intrinsic hydrophobicity of each residue (high values of the H/O correlation coeffi-
cient, with the T/O coefficient adopting negative values and RD remaining high for
both T-O-R and T-O-H). This shows that in amyloids the observed distribution is
not merely divergent from its theoretical counterpart, but—in some areas—a polar
opposite of T.
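The selection criteria described above can be sketched as follows, using the fragment values listed in Table 1. The Pearson coefficient is assumed as the correlation measure, and both cutoffs (RD above 0.5, H-O correlation above 0.7) are illustrative thresholds chosen for this sketch rather than values stated in the text. All function and variable names are hypothetical.

```python
import numpy as np

def profile_correlations(t, o, h):
    """Pearson correlation coefficients between the T, O and H profiles."""
    t, o, h = (np.asarray(x, dtype=float) for x in (t, o, h))
    return {
        "H-T": float(np.corrcoef(h, t)[0, 1]),
        "T-O": float(np.corrcoef(t, o)[0, 1]),
        "H-O": float(np.corrcoef(h, o)[0, 1]),
    }

# Values copied from Table 1: RD (T-O-R), RD (T-O-H), then H-T, T-O, H-O.
TABLE_1 = {
    "1-4":   (0.413, 0.188,  0.586,  0.691, 0.801),
    "5-11":  (0.737, 0.687, -0.303, -0.309, 0.745),
    "11-15": (0.730, 0.721, -0.268, -0.375, 0.867),
    "16-21": (0.407, 0.282,  0.484,  0.620, 0.945),
    "21-27": (0.668, 0.813,  0.124, -0.214, 0.911),
    "27-40": (0.480, 0.429,  0.583,  0.649, 0.761),
}

def follows_intrinsic_hydrophobicity(rd_tor, rd_toh, cc_to, cc_ho,
                                     rd_cutoff=0.5, ho_cutoff=0.7):
    """True when O opposes T while tracking H (cutoffs are assumptions)."""
    return (rd_tor > rd_cutoff and rd_toh > rd_cutoff
            and cc_to < 0.0 and cc_ho > ho_cutoff)

flagged = [
    name
    for name, (rd_tor, rd_toh, _cc_ht, cc_to, cc_ho) in TABLE_1.items()
    if follows_intrinsic_hydrophobicity(rd_tor, rd_toh, cc_to, cc_ho)
]
```

With these thresholds, `flagged` contains the fragments 5-11, 11-15 and 21-27, the same three fragments singled out in the text.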
Values given in bold are the parameters supporting the interpretation that intrinsic
hydrophobicity strongly influences the status of the amyloid fibril.

Fig. 12 Hydrophobicity distribution in 2MVX: blue, theoretical distribution; red, observed
distribution; green, intrinsic hydrophobicity of residues. Fragments marked in pink are
engaged in inter-protofilament interaction. Red dots mark residues whose observed
hydrophobicity shows the opposite tendency to the expected hydrophobicity

High values of RD in both variants (T-O-R and T-O-H) mean that the amyloid does
not produce a monocentric hydrophobic core. For certain fragments the observed dis-
tribution correlates negatively with the theoretical distribution. The location of such
fragments is highlighted by yellow lines in Fig. 13, and also in Fig. 11b. Figure 11c
reveals linear propagation of hydrophobic bands in place of a monocentric core. Such
propagation, occurring along the axis of elongation, is commonplace in amyloids and
distinguishes them from globular proteins—as noted in other publications [12, 13].
The status of the beta sheet (11–18) in chains B-D and G-I (except for terminal
peptides) exhibits strong divergence between T and O, with a very high value of the

Fig. 13 Theoretical (T—blue), observed (O—pink) and intrinsic (H—green) hydrophobicity dis-
tribution. Areas where O diverges from T in favor of H are marked in pink

Table 1 RD parameters and correlation coefficients for fragments selected according to the profile
shown in Fig. 13, and in beta sheets. Edge chains (A, E, F and J) have been eliminated from
calculations in order to more accurately represent an unbounded fibril

Fragment                              RD                Correlation coefficient
                                      T-O-R    T-O-H    H-T       T-O       H-O
1–4                                   0.413    0.188    0.586     0.691     0.801
5–11                                  0.737    0.687    −0.303    −0.309    0.745
11–15                                 0.730    0.721    −0.268    −0.375    0.867
16–21                                 0.407    0.282    0.484     0.620     0.945
21–27                                 0.668    0.813    0.124     −0.214    0.911
27–40                                 0.480    0.429    0.583     0.649     0.761
Beta-sheet 11–18 (chains B-D, G-I)    0.640    0.490    0.111     0.090     0.935

H/O correlation coefficient. This indicates that the conformation of this fragment
is dominated by intrinsic hydrophobicity, even though RD (T-O-H) remains slightly
below 0.5 (note that the beta sheet in question also encompasses some residues which
do not belong to the “divergent” fragment identified in Fig. 13). If, in the course of
complexation, a hydrophobic core were to emerge, indefinite elongation of the fib-
ril would not be possible. The proteins discussed at the beginning of this chapter
possess hydrophobic cores and therefore adopt globular conformation, without the
risk of indefinite propagation in any direction. Similarly, the “caps” which termi-
nate solenoid fragments in antifreeze proteins prevent complexation of additional
peptides. The lack of similar structures in amyloids opens the door to unrestricted
elongation, resulting in a fibril where the distribution of hydrophobicity is governed
by intrinsic properties of each residue and no hydrophobic core may form. The linear
propagation of discordant fragments can be seen in Fig. 14.

3 Discussion

The presented spectrum of proteins is another example of how the fuzzy oil drop
model can be used to study the relation between the protein’s hydrophobic core and its
biological properties. Structures classified as “misfolding proteins” are represented
by amyloid β-peptide (Aβ) fibrils 1–40 (PDB ID: 2MVX).
A near-perfect match between the theoretical and observed distribution of
hydrophobicity (equivalent to a molecular surface composed entirely of hydrophilic
residues) is observed in certain antifreeze proteins. Such proteins attain their tertiary
conformation by directing all hydrophobic residues towards the center, where they
can be shielded from contact with water. Exposure of polar (or charged) residues on
the surface causes the surrounding water particles to adapt, and disrupts their natural

Fig. 14 Amyloid 2MVX—residues distinguished as red visualise the residues identified as dis-
cordant according to profiles shown in Fig. 13. Here the red fragments represent the fragments
distinguished as pink in Fig. 13

ordering, preventing the formation of an ice crystal. This mechanism is identical to
macroscale deicing efforts (e.g. scattering salt on pavement in subzero temperatures).
A different way to enforce specific structural ordering of water can be observed in
solenoid-containing antifreeze proteins, with the example discussed in this chapter
serving as an expansion of the group analyzed in [14]. In this case, deviations from
the theorized hydrophobic core structure involve the entire molecule, with alternating
bands of high and low hydrophobicity propagating along the solenoid’s axis. This
strongly affects the surrounding water: hydrophilic bands cause water molecules to
align themselves with polar residues, while the effect of hydrophobic bands is not
precisely known, but most likely differs from hydrophilic conditions. Some exper-
imental studies suggest that water “levitates” above hydrophobic surfaces [9]. This
hypothesis is supported by the observed increase in mobility of water when in contact
with the surface of antifreeze proteins [10, 11].
Local deviations from the theoretical distribution of hydrophobicity often corre-
spond to areas where biological activity occurs [12]. Ligand binding pockets and
catalytic active sites require a suitable cavity, which is frequently buried deep inside
the molecule; sometimes at its very core. Such cavities often contain polar residues
(required in the catalysis process), resulting in localized departures from theoretical
distribution. In FOD-based analysis this is manifested by high RD values calculated
for specific fragments. Eliminating these fragments from computations and calcu-
lating RD from the remainder of the chain usually results in a significantly lower
value. Similarly, fragments adjacent to catalytic residues are also frequently discor-
dant—this is interpreted as a means of equipping the protein with structural flexibility
required during the multi-step catalysis process (note that the presence of a prominent
hydrophobic core is known to stabilize the tertiary conformation, as explained
in any modern biochemistry textbook).
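The effect of eliminating a discordant fragment before computing RD can be sketched as below. The sketch assumes the T-O-R form of RD (Kullback-Leibler divergence of O from T, relative to the sum of the divergences from T and from the uniform distribution) and renormalizes both profiles over the residues that remain. The profiles are toy numbers, not data for any protein discussed here, and the function names are hypothetical.

```python
import numpy as np

def relative_distance(observed, theoretical):
    """T-O-R relative distance for strictly positive profiles."""
    o = np.asarray(observed, dtype=float)
    t = np.asarray(theoretical, dtype=float)
    o = o / o.sum()
    t = t / t.sum()
    r = np.full(o.size, 1.0 / o.size)       # uniform reference distribution
    d_ot = np.sum(o * np.log2(o / t))
    d_or = np.sum(o * np.log2(o / r))
    return float(d_ot / (d_ot + d_or))

def rd_without_fragment(observed, theoretical, start, stop):
    """RD after removing residues start..stop (1-based, inclusive)."""
    o = np.asarray(observed, dtype=float)
    t = np.asarray(theoretical, dtype=float)
    keep = np.ones(o.size, dtype=bool)
    keep[start - 1:stop] = False
    return relative_distance(o[keep], t[keep])

# Toy chain: residues 3-4 are polar where T expects a hydrophobic centre.
t_profile = [1, 2, 4, 8, 4, 2, 1]
o_profile = [1, 2, 1, 1, 4, 2, 1]
rd_full = relative_distance(o_profile, t_profile)
rd_trimmed = rd_without_fragment(o_profile, t_profile, 3, 4)
```

In this toy case the trimmed value falls below the value for the full chain, mirroring the behaviour described for fragments that host a binding pocket or catalytic site.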
Amyloids represent an extreme deviation from the monocentric hydrophobic core
structure (whether computed for individual molecules or multiprotein complexes).
As shown in [13, 14] and also highlighted in this work, the amyloid is characterized
by linear propagation of alternating bands of high and low hydrophobicity. Amyloid
structures listed in PDB exhibit strong ordering of peptides due to their sequential
identity. The corresponding repetitive pattern of hydrophobicity results in propaga-
tion which can continue indefinitely in the absence of “caps”. A consequence of this
geometric recurrence is that identical residues are located in close proximity to, and
interact with one another. This behavior is not favored by electrostatic forces, but it
is consistent with hydrophobic effects, which appear to dominate.
In light of the presented argument we can theorize that amyloidosis results from
improper interactions with the aqueous environment. Many chemical factors [16]
have been implicated in amyloidogenesis; however, one important factor—namely,
shaking—is not chemical in nature. Under natural conditions, the structuralization
of water guides the folding process, producing a structure consistent with the fuzzy
oil drop model, with hydrophobic residues internalized and hydrophilic residues
exposed on the surface. If water adopts a different structuralization (for example
during shaking), this may affect the folding process. The conformation of amy-
loids is dominated by local effects rather than by the tendency to generate a shared
hydrophobic core—this “selfish” process depends on the intrinsic hydrophobicity of
each residue.
The molecules analyzed in this work have been selected in such a way as to
describe the role of the aqueous environment, and its influence upon various confor-
mational and functional properties of proteins. In this sense, the amyloid represents
an extreme case of discordance, where the environment fails to exert its normal effect
upon the polypeptide chain. From the point of view of information theory, a protein
folded in accordance with the fuzzy oil drop model carries very little information.
The emergence of a catalytic active site requires additional information, which man-
ifests itself as a localized deviation from the Gaussian distribution. Thus, we can
determine where—and how—proteins encode information.
2MVX provides an example of an amyloid which is wholly discordant versus the
theoretical distribution of hydrophobicity. This amyloid exhibits linear propagation
of alternating bands of high and low hydrophobicity. Its solubility is due to relatively
strong exposure of hydrophilic residues, while complexation of individual fibrils
appears to result from favorable contact between interface residues.
Another property which distinguishes amyloids from antifreeze proteins is the
lack of “stop” fragments (or “caps”)—such fragments are present in some antifreeze
proteins (as well as in lyases) and their purpose is to prevent unrestricted propagation
of solenoids.
The presented hypothesis concerning the amyloidogenesis mechanism is based
on active influence of the aqueous environment upon protein folding. As long as the
environment retains its usual structural characteristics (which are yet unknown), the
polypeptide chain will tend to produce a monocentric hydrophobic core. If, however,
the structure of water changes, the protein may adopt a conformation which is
dominated by intrinsic hydrophobicity, potentially favoring amyloid aggregation. As
already noted, the natural structuralization of water is unknown—however it should
be noted that from among the multitude of chemical factors which promote amy-
loidogenesis [16] none involve actual chemical reactions. Furthermore, shaking is a
known causative factor of amyloid transformation. While not chemical in nature, this
process may alter the structural properties of water in a way which allows amyloid
fibrils to form.
The search for an adequate model of the protein-water relation has a long
history. The fuzzy oil drop model builds on the oil drop model introduced
by Kauzmann [17, 18]. The role of hydrophobic interactions has been a central topic
of research, particularly with respect to folding, unfolding and refolding phenomena
[19–27]. The influence of the water environment has been widely discussed [28, 29].
Structural analysis of packing, treated as the final effect of folding in a water
environment, introduced new aspects of the folding process [30–32]. General models
of protein folding have incorporated hydrophobicity, particularly
in the context of hydrophobicity exposed on the surface of proteins [33–48]. The water
environment has also been treated as an important partner in the folding process [49–53].
Many fundamental papers have contributed to the history of the protein-water relation
[54–60].
The fuzzy oil drop model described in this chapter enables quantitative assessment
of the balance between the internal force field (inter-atomic interactions within the
protein molecule) and the external force field, the characteristics of which appear to
have a critical influence on the final form of the polypeptide chain [61].
This paper does not discuss disease-related problems or medical treatment
techniques. A thorough review of the basic molecular level as well as of medical
aspects and therapy is given in [62, 63], particularly with regard to the historical
context of research on the mechanism of amyloidosis. Self-assembly and misfolding
processes are critical for cellular activity and can lead to cellular devastation. The list of
neurodegenerative diseases has grown even longer since the discovery of defective amyloid
processing in preeclampsia [64]. In this context the search for effective therapy is of high
importance. An example of a proposal focused on inhibition of the fibrillation
process is given in [65], where the short polypeptide FVFLM is reported to inhibit the
fibril elongation of KLVFF. However, the “stop” mechanism preventing unlimited
elongation of amyloid-like structures, identified in biologically active proteins
[15], suggests instead polypeptides with a strong preference for helical structural forms
[66]. A helix, especially an amphipathic one, is able to aggregate with the hydrophobic part
of the amyloid with its opposite, hydrophilic side exposed toward the water environment. This
arrangement allows water penetration, which precludes continuation of the fibrillation process
[67].

Acknowledgements The work was financially supported by the Jagiellonian University Medical
College grants system (grant #006363).
The authors are very thankful to Piotr Nowakowski for translation and to Anna Smietanska for
technical support.

References

1. Roterman, I., Konieczny, L., Banach, M., Marchewka, D., Kalinowska, B., Baster, Z., Tomanek,
M., Piwowar, M.: Simulation of protein folding process. In: Liwo A. (ed) Computational
Methods To Study the Structure And Dynamics of Biomolecules and Biomolecular Processes,
pp. 599–638. Springer (2014)
2. Ko, T.P., Robinson, H., Gao, Y.G., Cheng, C.H., DeVries, A.L., Wang, A.H.: The refined crystal
structure of an eel pout type III antifreeze protein RD1 at 0.62-Å resolution reveals structural
microheterogeneity of protein and solvation. Biophys. J. 84, 1228–1237 (2003)
3. Mol, C.D., Kuo, C.F., Thayer, M.M., Cunningham, R.P., Tainer, J.A.: Structure and function
of the multifunctional DNA-repair enzyme exonuclease III. Nature 374, 381–386 (1995)
4. Hall, D.R., Leonard, G.A., Reed, C.D., Watt, C.I., Berry, A., Hunter, W.N.: The crystal structure
of Escherichia coli class II fructose-1,6-bisphosphate aldolase in complex with
phosphoglycolohydroxamate reveals details of mechanism and specificity. J. Mol. Biol. 287, 383–394
(1999)
5. Li, C., Guo, X., Jia, Z., Xia, B., Jin, C.: Solution structure of an antifreeze protein CfAFP-501
from Choristoneura fumiferana. J. Biomol. NMR 32(3), 251–256 (2005)
6. Schütz, A.K., Vagt, T., Huber, M., Ovchinnikova, O.Y., Cadalbert, R., Wall, J., Güntert, P.,
Böckmann, A., Glockshuber, R., Meier, B.H.: Atomic-resolution three-dimensional structure
of amyloid β fibrils bearing the Osaka mutation. Angew. Chem. Int. Ed. Engl. 54, 331–335
(2015)
7. Kalinowska, B., Banach, M., Konieczny, L., Roterman, I.: Application of divergence entropy to
characterize the structure of the hydrophobic core in DNA interacting proteins. Entropy 17(3),
1477–1507 (2015). https://doi.org/10.3390/e17031477
8. Banach, M., Kalinowska, B., Konieczny, L., Roterman, I.: Role of disulfide bonds in stabilizing
the conformation of selected enzymes—an approach based on divergence entropy applied to
the structure of hydrophobic core in proteins. Entropy 18(3), 67 (2016). https://doi.org/10.
3390/e18030067
9. Schutzius, T.M., Jung, S., Maitra, T., Graeber, G., Köhme, M., Poulikakos, D.: Spontaneous
droplet trampolining on rigid superhydrophobic surfaces. Nature 527(7576), 82–85 (2015).
https://doi.org/10.1038/nature15738
10. Modig, K., Qvist, J., Marshall, C.B., Davies, P.L., Halle, B.: High water mobility on the
ice-binding surface of a hyperactive antifreeze protein. Phys. Chem. Chem. Phys. 12(35),
10189–10197 (2010). https://doi.org/10.1039/c002970j
11. Miskowiec, A., Buck, Z.N., Hansen, F.Y., Kaiser, H., Taub, H., Tyagi, M., Diallo, S.O., Mamon-
tov, E., Herwig, K.W.: On the structure and dynamics of water associated with single-supported
zwitterionic and anionic membranes. J. Chem. Phys. 146(12), 125102 (2017). https://doi.org/
10.1063/1.4978677
12. Banach, M., Konieczny, L., Roterman, I.: The fuzzy oil drop model, based on hydrophobicity
density distribution, generalizes the influence of water environment on protein structure and
function. J. Theor. Biol. 359, 6–17 (2014)
13. Roterman, I., Banach, M., Konieczny, L.: Application of the fuzzy oil drop model describes
amyloid as a ribbonlike micelle. Entropy 19(4), 167 (2017). https://doi.org/10.3390/e19040167
14. Roterman, I., Banach, M., Kalinowska, B., Konieczny, L.: Influence of the aqueous environment
on protein structure—a plausible hypothesis concerning the mechanism of amyloidogenesis.
Entropy 18(10), 351 (2016)
15. Banach, M., Konieczny, L., Roterman, I.: Why do antifreeze proteins require a solenoid?
Biochimie 144, 74–84 (2018)
16. Serpell, L.C.: Alzheimer’s amyloid fibrils: structure and assembly. Biochim. Biophys. Acta
1502, 16–30 (2000)
17. Kuntz Jr., I.D., Kauzmann, W.: Hydration of proteins and polypeptides. Adv. Protein Chem.
28, 239–345 (1974)
18. Kauzmann, W.: Some factors in the interpretation of protein denaturation. Adv. Protein Chem.
14, 1–63 (1959)

19. Tanford, C.: How protein chemists learned about the hydrophobic factor. Protein Sci. 6(6),
1358–1366 (1997)
20. Tanford, C., Pain, R.H., Otchin, N.S.: Equilibrium and kinetics of the unfolding of lysozyme
(muramidase) by guanidine hydrochloride. J. Mol. Biol. 15(2), 489–504 (1966)
21. Kirshner, A.G., Tanford, C.: The dissociation of hemoglobin by inorganic salts. Biochemistry
3, 291–296 (1964)
22. Tanford, C.: Extension of the theory of linked functions to incorporate the effects of protein
hydration. J. Mol. Biol. 39(3), 539–544 (1969)
23. Tanford, C.: Protein denaturation. Adv. Protein Chem. 23, 121–282 (1968)
24. Tanford, C.: Formation of the native structure of proteins: inferences from the kinetics of
denaturation and renaturation. Ciba Found. Symp. 7, 125–146 (1972)
25. Nozaki, Y., Tanford, C.: The solubility of amino acids and two glycine peptides in aqueous
ethanol and dioxane solutions. Establishment of a hydrophobicity scale. J. Biol. Chem. 246(7),
2211–2217 (1971)
26. Tanford, C., Nozaki, Y., Reynolds, J.A., Makino, S.: Molecular characterization of proteins in
detergent solutions. Biochemistry 13(11), 2369–2376 (1974)
27. Tanford, C.: Protein-lipid interactions. Neurosci. Res. Program Bull. 11(3), 193–195 (1973)
28. Baldwin, R.L., Rose, G.D.: How the hydrophobic factor drives protein folding. Proc. Natl. Acad.
Sci. U S A 113(44), 12462–12466 (2016)
29. Baldwin, R.L.: Dynamic hydration shell restores Kauzmann’s 1959 explanation of how the
hydrophobic factor drives protein folding. Proc. Natl. Acad. Sci. U S A 111(36), 13052–13056
(2014)
30. Richardson, J.S., Richardson, D.C., Tweedy, N.B., Gernert, K.M., Quinn, T.P., Hecht, M.H.,
Erickson, B.W., Yan, Y., McClain, R.D., Donlan, M.E., et al.: Looking at proteins: represen-
tations, folding, packing, and design. Biophysical society national lecture, 1992. Biophys. J.
63(5), 1185–1209 (1992)
31. Richardson, J.S.: Introduction: protein motifs. FASEB J. 8(15), 1237–1239 (1994)
32. Richardson, J.S.: The protein surface is a moving target. Structure 12(6), 912–913 (2004)
33. Chothia, C.: Hydrophobic bonding and accessible surface area in proteins. Nature 248(446),
338–339 (1974)
34. Chothia, C.: Principles that determine the structure of proteins. Annu. Rev. Biochem. 53,
537–572 (1984)
35. Chothia, C., Janin, J.: Orthogonal packing of beta-pleated sheets in proteins. Biochemistry
21(17), 3955–3965 (1982)
36. Lesk, A.M., Chothia, C.: Solvent accessibility, protein surfaces, and protein folding. Biophys.
J. 32(1), 35–47 (1980)
37. Chothia, C.: The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105(1),
1–12 (1976)
38. Janin, J., Miller, S., Chothia, C.: Surface, subunit interfaces and interior of oligomeric proteins.
J. Mol. Biol. 204(1), 155–164 (1988)
39. Miller, S., Janin, J., Lesk, A.M., Chothia, C.: Interior and surface of monomeric proteins. J.
Mol. Biol. 196(3), 641–656 (1987)
40. Miller, S., Lesk, A.M., Janin, J., Chothia, C.: The accessible surface area and stability of
oligomeric proteins. Nature 328(6133), 834–836 (1987)
41. Creighton, T.E., Chothia, C.: Protein structure. Selecting buried residues. Nature 339(6219),
14–15 (1989)
42. Gerstein, M., Chothia, C.: Packing at the protein-water interface. Proc. Natl. Acad. Sci. U S A
93(19), 10167–10172 (1996)
43. Gong, H., Porter, L.L., Rose, G.D.: Counting peptide-water hydrogen bonds in unfolded pro-
teins. Protein Sci. 20(2), 417–427 (2011)
44. Gong, H., Rose, G.D.: Assessing the solvent-dependent surface area of unfolded proteins using
an ensemble model. Proc. Natl. Acad. Sci. U S A 105(9), 3321–3326 (2008)
45. Fitzkee, N.C., Rose, G.D.: Sterics and solvation winnow accessible conformational space for
unfolded proteins. J. Mol. Biol. 353(4), 873–887 (2005)

46. Creamer, T.P., Srinivasan, R., Rose, G.D.: Modeling unfolded states of proteins and peptides.
II. Backbone solvent accessibility. Biochemistry 36(10), 2832–2835 (1997)
47. Rose, G.D., Wolfenden, R.: Hydrogen bonding, hydrophobicity, packing, and protein folding.
Annu. Rev. Biophys. Biomol. Struct. 22, 381–415 (1993)
48. Rose, G.D., Geselowitz, A.R., Lesser, G.J., Lee, R.H., Zehfus, M.H.: Hydrophobicity of amino
acid residues in globular proteins. Science 229(4716), 834–838 (1985)
49. Dill, K.A., Truskett, T.M., Vlachy, V., Hribar-Lee, B.: Modeling water, the hydrophobic effect,
and ion solvation. Annu. Rev. Biophys. Biomol. Struct. 34, 173–199 (2005)
50. Southall, N.T., Dill, K.A.: Potential of mean force between two hydrophobic solutes in water.
Biophys. Chem. 101–102, 295–307 (2002)
51. Chan, H.S., Dill, K.A.: Solvation: how to obtain microscopic energies from partitioning and
solvation experiments. Annu. Rev. Biophys. Biomol. Struct. 26, 425–459 (1997)
52. Alonso, D.O., Dill, K.A.: Solvent denaturation and stabilization of globular proteins. Biochem-
istry 30(24), 5974–5985 (1991)
53. Dill, K.A., Shortle, D.: Denatured states of proteins. Annu. Rev. Biochem. 60, 795–825 (1991)
54. Chan, H.S., Dill, K.A.: Origins of structure in globular proteins. Proc. Natl. Acad. Sci. U S A
87(16), 6388–6392 (1990)
55. Mobley, D.L., Bayly, C.I., Cooper, M.D., Shirts, M.R., Dill, K.A.: Correction to small molecule
hydration free energies in explicit solvent: an extensive test of fixed-charge atomistic simula-
tions. J. Chem. Theory Comput. 11(3), 1347 (2015)
56. Drechsel, N.J., Fennell, C.J., Dill, K.A., Villà-Freixa, J.: TRIFORCE: tessellated semianalytical
solvent exposed surface areas and derivatives. J. Chem. Theory Comput. 10(9), 4121–4132
(2014)
57. Cohen, P., Dill, K.A., Jaswal, S.S.: Modeling the solvation of nonpolar amino acids in
guanidinium chloride solutions. J. Phys. Chem. B 118(36), 10618–10623 (2014)
58. Rocklin, G.J., Mobley, D.L., Dill, K.A., Hünenberger, P.H.: Calculating the binding free ener-
gies of charged species based on explicit-solvent simulations employing lattice-sum methods:
an accurate correction scheme for electrostatic finite-size effects. J. Chem. Phys. 139(18),
184103 (2013)
59. Lukšič, M., Urbic, T., Hribar-Lee, B., Dill, K.A.: Simple model of hydrophobic hydration. J.
Phys. Chem. B. 116(21), 6177–6186 (2012)
60. Fennell, C.J., Dill, K.A.: Physical modeling of aqueous solvation. J. Stat. Phys. 145(2), 209–226
(2011)
61. Schmit, J.D., Ghosh, K., Dill, K.: What drives amyloid molecules to assemble into oligomers
and fibrils? Biophys. J. 100(2), 450–458 (2011)
62. Chiti, F., Dobson, C.M.: Protein misfolding, functional amyloid, and human disease. Annu.
Rev. Biochem. 75, 333–366 (2006)
63. Chiti, F., Dobson, C.M.: Protein misfolding, amyloid formation, and human disease: a summary
of progress over the last decade. Annu. Rev. Biochem. 86, 27–68 (2017)
64. Buhimschi, I.A., Nayeri, U.A., Zhao, G., Shook, L.L., Pensalfini, A., Funai, E.F., Bernstein,
I.M., Glabe, C.G., Buhimschi, C.S.: Protein misfolding, congophilia, oligomerization, and
defective amyloid processing in preeclampsia. Sci. Transl. Med. 6(245), 245ra92 (2014)
65. Kouza, M., Banerji, A., Kolinski, A., Buhimschi, I.A., Kloczkowski, A.: Oligomerization of
FVFLM peptides and their ability to inhibit beta amyloid peptides aggregation: consideration
as a possible model. Phys. Chem. Chem. Phys. 19(4), 2990–2999 (2017)
66. Roterman, I., Banach, M., Konieczny, L.: Propagation of fibrillar structural forms in proteins
stopped by naturally occurring short polypeptide chain fragments. Pharmaceuticals 10(4), 89
(2017)
67. Roterman, I., Banach, M., Konieczny, L.: Towards the design of anti-amyloid short peptide
helices. Bioinformation 14(1), 1–7 (2018)
13C Chemical Shifts in Proteins: A Rich Source of Encoded Structural Information

Jorge A. Vila and Yelena A. Arnautova

Abstract Despite the formidable progress in Nuclear Magnetic Resonance (NMR)
spectroscopy, quality assessment of NMR-derived structures remains an important
problem. Thus, validation of protein structures is essential for spectroscopists,
since it could enable them to detect structural flaws and potentially guide their efforts
in further refinement. Moreover, the availability of accurate and efficient validation tools
would help molecular biologists and computational chemists to evaluate the quality of
available experimental structures and to select the protein model most
suitable for a given scientific problem. The 13Cα nuclei are ubiquitous in proteins;
moreover, their shieldings are easily obtainable from NMR experiments and represent
a rich source of encoded structural information, which makes 13Cα chemical shifts an
attractive candidate for use in computational methods aimed at determination and
validation of protein structures. In this chapter, the basis of a novel methodology for
computing, at the quantum chemical level of theory, the 13Cα shielding for the amino
acid residues in proteins is described. We also identify and examine the main factors
affecting the 13Cα-shielding computation. Finally, we illustrate how the information
encoded in the 13C chemical shifts can be used for a number of applications, ranging
from protein structure prediction of both α-helical and β-sheet conformations, to
determination of the fraction of the tautomeric forms of the imidazole ring of histidine
in proteins as a function of pH, to accurate detection of structural flaws, at the residue
level, in NMR-determined protein models.

J. A. Vila (B)
IMASL-CONICET, Universidad Nacional de San Luis, Ejército de Los Andes,
950-5700 San Luis, Argentina
e-mail: jv84@cornell.edu
J. A. Vila
Baker Laboratory of Chemistry and Chemical Biology, Cornell University,
Ithaca, NY 14853-1301, USA
Y. A. Arnautova
Molsoft L.L.C, 11199 Sorrento Valley Road, S209, San Diego,
CA 92121, USA

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_20
660 J. A. Vila and Y. A. Arnautova

1 Introduction

Before a protein structure can be analyzed in light of its biological function it is nec-
essary to validate it, i.e., to have a clear understanding of its reliability in terms of both
the overall structure and of its details at per-residue level. However, an accurate and
fast validation of protein structures constitutes a long-standing problem in Nuclear
Magnetic Resonance (NMR) spectroscopy [1–4]. For this reason, investigators have
proposed a plethora of methods to determine the accuracy and reliability of protein
structures in recent years [5–12]. Despite this progress, there is a growing need for
more sophisticated, physics-based and fast structure-validation methods [1, 2, 6, 7,
11].
The 13 Cα chemical shifts provide important information about conformations of
peptides and proteins in solution [13–39] and, therefore, can be used as an exquisitely
sensitive probe with which to assess the quality of protein models. We recently
developed a new, physics-based methodology [34] that makes use of observed and
computed {at the density functional theory (DFT) level [40]} 13 Cα chemical
shifts for an accurate validation of protein structures in solution and in crystals [41].
The first step in the development of this new methodology involved determining the
factors that affect 13 Cα shielding calculations, such as the protonation/deprotonation
state of distant ionizable groups, sequential nearest-neighbor or covalent geometry
effects (i.e., due to variations in the bond lengths and bond angles of residues) and
the sensitivity of the shielding/deshielding of 13 Cα nuclei to changes in side-chain
conformation. Once all these factors affecting 13 Cα -shielding have been properly
identified and considered, a very important test is to determine the accuracy and
speed of the computation of the 13 Cα -shielding as a function of the size of the basis
set chosen and the Density Functional Theory (DFT) model adopted. These are
important tests because DFT-based quantum mechanical (QM) calculations are very
CPU demanding, despite the ever-increasing computational power available.
The new DFT-based method has been applied to study a number of problems,
such as unblocked statistical-coil tetrapeptides in aqueous solution [32], polyproline
II helix conformation in a proline-rich environment [31], the 13 Cα and 13 Cβ chemical
shifts of cysteines in disulfide-bonded cysteine [42] or determination of the fraction
of the tautomeric forms of histidine in proteins as a function of pH [43]. This new
strategy also provides a unified, self-consistent method to determine high-quality
protein structures, without relying on knowledge-based information [44]. Thus, a
β-sheet or an all α-helical protein structure can be accurately determined by simply
identifying a set of conformations which simultaneously satisfy a number of con-
straints, namely 13 Cα -dynamically-derived torsional angle constraints and Nuclear
Overhauser Effect (NOE) derived distance constraints [29, 44].
The currently used 13 Cα chemical shift-based validation and determination proto-
col [29, 33, 44, 45, 34] exploits the following features: (a) the assignment of chemical
shifts is a fundamental step in a protein structure determination by NMR spectroscopy
[46], and no extra experimental work is needed; (b) in addition to the impact of the
covalent structure, 13 Cα chemical shifts are modulated mainly by the intraresidue
13 C Chemical Shifts in Proteins: A Rich … 661

backbone and side-chain dihedral angles [16, 17, 19, 20–22, 27, 47, 35, 39], with
no significant influence of the amino acid sequence [48]; (c) 13 Cα is ubiquitous in
proteins; and, (d) 13 Cα chemical shifts can be computed with high accuracy at the
QM level of theory.
This chapter is intended to be an overview of the authors' contributions to the field of
protein structure determination and validation using, mainly, information decoded
from the 13 Cα chemical shifts. Consequently, the chapter is organized as follows:
first, the methods used to compute the 13 Cα chemical shifts and to analyze the results
are briefly described; second, the main factors affecting the 13 Cα chemical shifts
computation are enumerated and discussed; third, the capabilities of the computed
13 Cα chemical shifts, as a rich source of encoded structural information, are illustrated
by a series of applications that involves, but is not limited to, the determination of
protein structures; and finally a new protein-structure validation server, CheShift-2
[49], with which NMR spectroscopists can assess the quality of their protein models,
before they are deposited in the Protein Data Bank (PDB) [50], is presented. It is worth
noting that the theory, and details, behind alternative protein structure determination
and validation methods are not discussed here and, hence, the reader is referred
instead to an extensive collection of such methods [1, 5–12, 26, 51–61].

2 Methods

2.1 Calculation of 13 Cα Chemical Shifts

All the experimentally determined conformations, unless noted otherwise, were reg-
ularized, i.e., all residues were replaced by the standard Empirical Conformational
Energy Program for Peptides (ECEPP) [62] residues in which bond lengths and bond
angles are fixed (rigid-body geometry approximation) at the standard values [62] and
hydrogen atoms were added, if necessary.
Computations of the 13 Cα chemical shifts involve a series of approximations. For
each amino acid residue X in the protein sequence: (a) the 13 Cα shielding depends,
mainly, on its own backbone [21, 27] and side-chain [19, 20, 35] conformations,
with no significant influence of either the amino acid sequence or the position of
the given residue in the sequence, except for residues preceding proline [48]; (b)
each amino acid residue X in the protein sequence can be treated as a terminally-
blocked tripeptide with the sequence Ac-GXG-NMe, with X in the conformation
of the protein structure; (c) the 13 Cα isotropic shielding values (σ) for each amino
acid residue X can be computed at the OB98/6-311+G(2d,p) level of theory [28]
with the Gaussian 03 package [63], while the remaining residues in each tripeptide are
treated at the OB98/3-21G level of theory, i.e., by using the locally-dense basis set
approach [64]; (d) all ionizable residues can be considered neutral during the QM
calculations [45], unless noted otherwise; (e) no geometry optimization is necessary
because such optimization by ab initio (HF) or DFT methods has only a small effect
on the computed chemical shifts [19].
The computed 13 Cα shieldings (σsubst,th) are converted to 13 Cα chemical shifts (δ)
by employing the equation δth = σref − σsubst,th, where the indices denote a theoretical
(th) computation, the reference substance (ref), and the substance of interest (subst),
i.e., the 13 Cα shielding of a given amino acid residue X. The observed shielding value
of tetramethylsilane (TMS) in the gas phase [65], namely 188.1 ppm, was adopted
as an initial reference value (see Sect. 2.2). All the computed 13 Cα shielding (σsubst,th)
values are calculated using the Gauge-Invariant Atomic Orbital method at the DFT
level of theory as implemented in the Gaussian 03/09 suite of programs [63].
Throughout this chapter we have used only one exchange-correlation functional,
OB98, because it was shown [30] to be one of the most
accurate and fastest functionals with which to reproduce the observed 13 Cα chemical
shifts of proteins in solution (see Sect. 3.2).
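
The shielding-to-shift conversion described above can be illustrated with a minimal Python sketch. The function name, the residue labels and the shielding values are hypothetical and serve only as an illustration; the only number taken from the text is the observed TMS shielding of 188.1 ppm.

```python
# Observed gas-phase TMS shielding quoted in the text (ppm).
TMS_OBSERVED_GAS_PHASE = 188.1


def shielding_to_shift(sigma_subst, sigma_ref=TMS_OBSERVED_GAS_PHASE):
    """Convert a computed isotropic shielding (ppm) into a chemical shift (ppm).

    Implements delta_th = sigma_ref - sigma_subst,th.
    """
    return sigma_ref - sigma_subst


# Hypothetical computed 13C-alpha shieldings (illustrative numbers only).
computed_shieldings = {"ALA-28": 132.4, "VAL-5": 126.9, "GLY-47": 141.0}

computed_shifts = {
    residue: shielding_to_shift(sigma)
    for residue, sigma in computed_shieldings.items()
}
```

Passing a different `sigma_ref` switches the conversion from the observed TMS reference to a functional-specific reference.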

2.2 Determination of an Effective TMS Shielding Value

Determination of a proper TMS shielding value for each functional is crucial for
an accurate computation of the 13 Cα chemical shifts because it will enable us to
minimize the presence of systematic errors which might bias the chemical shifts-
based analysis. From this point of view the effective TMS value will provide the
most accurate approach to solve the problem because it will not require further
adjustments. Consequently, computation of an effective TMS value is central to our
calculations.
By adopting the observed TMS value of 188.1 ppm [65] as a reference, it is possible
to find, for any functional, the characteristic mean (x0) and
standard deviation (σ) of the Normal (or Gaussian) fit of the frequency distribution
of the errors. For all functionals tested in our work the characteristic mean value (x0)
appears displaced from its ideal value of 0.0 by a positive, or negative, amount, e.g.,
for OB98, x0 = +3.6 ppm was found. Further analysis [30] indicates that for any
of the 10 functionals tested a straightforward use of the observed TMS shielding
value (188.1 ppm) is not appropriate if no further corrections are introduced. Hence,
for each functional and basis set chosen it is feasible to find an ‘effective’ TMS
shielding value for which the Normal (or Gaussian) fit shows a zero displacement,
i.e., an effective TMS value that gives x0 = 0.0. For example, use of OB98 with a
large [6-311+G(2d,p)/3-21G] basis set leads to an effective TMS of 184.5 ppm, i.e.,
by subtracting 3.6 ppm from 188.1 ppm [30], that gives x0 = 0.0 ppm. Likewise,
use of a small (6-31G/3-21G) basis set leads to an effective TMS of 195.4 ppm.
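
As a rough illustration of how such an effective reference could be obtained, the sketch below shifts the reference by the mean of the error distribution. It rests on two assumptions not stated in the text: the error is taken as computed minus observed shift, and the Normal fit is replaced by its maximum-likelihood estimates (sample mean and population standard deviation). The function name is hypothetical.

```python
from statistics import fmean, pstdev


def effective_tms(computed_shifts, observed_shifts, reference=188.1):
    """Return (effective reference, x0, sd) in ppm.

    computed_shifts must have been obtained with `reference` as the TMS
    shielding. Errors are assumed to be computed - observed, so a positive
    mean x0 is removed by lowering the reference by x0.
    """
    errors = [c - o for c, o in zip(computed_shifts, observed_shifts)]
    x0 = fmean(errors)   # mean of the fitted Normal (ML estimate)
    sd = pstdev(errors)  # standard deviation of the fitted Normal (ML estimate)
    return reference - x0, x0, sd


# Illustrative data only: every computed shift is 3.6 ppm too high.
observed = [50.0, 55.0, 60.0, 65.0]
computed = [o + 3.6 for o in observed]
reference_eff, x0, sd = effective_tms(computed, observed)
```

With this toy data the mean displacement is 3.6 ppm, so the effective reference comes out 3.6 ppm below the observed TMS value, mirroring the OB98 example in the text.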

2.3 Computation of the Ca-RMSD Model

The observed chemical shift for each residue i, 13 Cαobserved, i , represents contributions
from an ensemble of rapidly interconverting conformers that coexist in solution.
Then, an accurate comparison between the observed and computed 13 Cα chemical
shifts requires consideration of an ensemble of NMR-derived conformers, rather
than of a single conformation [41, 33]. Consequently, for each amino acid residue
in the sequence, i, the average of the chemical shifts calculated for the individual
residues in the ensemble of Ω conformers representing the NMR structure, <13 Cα>i,
is computed as:

\langle {}^{13}C^{\alpha} \rangle_i = (1/\Omega) \sum_{k=1}^{\Omega} {}^{13}C^{\alpha}_{i,k}, \qquad (1)

where {}^{13}C^{\alpha}_{i,k} is the computed chemical shift for residue i in conformer k, with 1 ≤ i
≤ N, where N is the number of residues in the sequence. Equation (1) was
obtained through the following approximation: for each residue i the quantity to be
computed must, in principle, be \langle {}^{13}C^{\alpha} \rangle_i = \sum_{k=1}^{\Omega} \lambda_k \, {}^{13}C^{\alpha}_{i,k}, where λk is the Boltzmann
factor for conformer k, with \sum_{k=1}^{\Omega} \lambda_k \equiv 1. However, computation of the Boltzmann
factors at the QM level of theory is not possible with the existing computational facilities,
because it would require computation of the total energy at the QM level of theory
for each of the conformers in the ensemble used to represent the NMR structure.
Therefore, the following approximation was used: λk = 1/Ω [48]; in other words,
in this approximation each conformer contributes equally to the average chemical
shift obtained by fast conformational averaging. Whether a computation of a Boltzmann
average, rather than the arithmetic average, would lead to a more accurate
representation of the 13 Cα chemical shifts needs further investigation.
The <13 Cα>i value obtained from Eq. (1) is used to compute the conformational-
average difference Δi between the observed and computed 13 Cα chemical shifts for
each amino acid residue i,

\Delta_i = {}^{13}C^{\alpha}_{observed,i} - \langle {}^{13}C^{\alpha} \rangle_i \qquad (2)

Hereafter, the conformational-average root-mean-square-deviation (rmsd) parameter,
ca-rmsd [48], is obtained as:

ca\text{-}rmsd = [(1/N) \sum_{i=1}^{N} \Delta_i^2]^{1/2}, \qquad (3)

which is a global property of the protein NMR structure, given as the root-mean-square
of the differences between the experimental 13 Cα chemical shifts and the <13 Cα>i
values for all the residues in the protein.
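
The three equations of this section can be combined in a short Python sketch. The function names and the toy numbers are hypothetical; the arithmetic follows Eqs. (1) to (3) with equal weights 1/Ω for all conformers.

```python
import math


def ensemble_average(shifts_by_conformer):
    """Eq. (1): arithmetic mean of the computed shifts over the conformers.

    shifts_by_conformer is a list of Omega lists, each holding the N computed
    13C-alpha shifts (ppm) of one conformer, in sequence order.
    """
    n_conformers = len(shifts_by_conformer)
    n_residues = len(shifts_by_conformer[0])
    return [
        sum(conformer[i] for conformer in shifts_by_conformer) / n_conformers
        for i in range(n_residues)
    ]


def ca_rmsd(observed, shifts_by_conformer):
    """Eqs. (2) and (3): per-residue differences and their root-mean-square."""
    averages = ensemble_average(shifts_by_conformer)
    deltas = [obs - avg for obs, avg in zip(observed, averages)]
    rmsd = math.sqrt(sum(d * d for d in deltas) / len(deltas))
    return rmsd, deltas


# Toy ensemble: 2 conformers, 2 residues (illustrative numbers only).
observed_shifts = [56.0, 60.0]
ensemble = [[55.0, 61.0], [57.0, 63.0]]
rmsd, deltas = ca_rmsd(observed_shifts, ensemble)
```

The per-residue `deltas` are returned alongside the global value because the same differences are what a residue-level validation would inspect.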

2.4 13 Cα -Based Protein Structure Determination Method

The 13 Cα -based procedure used for determination of protein structures consists of
three steps. The flow chart of this protocol [44] is shown in Fig. 1 and a brief
description of each step follows.

Fig. 1 Flow-chart of the 13 Cα -based protein structure determination protocol described in the
Methods section. Figure adapted from Vila et al. [44]. Copyright 2007 American Chemical Society

Step 1: The Variable-Target-Function (VTF) approach with a simplified soft-sphere
potential function [66] is used to generate, at random, an ensemble of conformations
that simultaneously satisfy a set of long-range distance constraints derived from
the experimental NOEs and (ϕ, ψ) torsional constraints derived from the observed
13 Cα and 13 Cβ conformational shifts [27]. The derived torsional constraints are only
for those amino acid residues in the sequence that pertain to a regular structure, i.e.,
to an α-helix or β-sheet. Consequently, these (ϕ,ψ)α,β torsional constraints (shown
in Fig. 1) are limited to, on average, ~50% of the amino acid residues in proteins
because the remaining ones populate non-regular structures.
Then, a clustering procedure, e.g., the Minimal Spanning Tree method [67], is
used to select a small sub-set of the total number of the VTF-derived conformations,
namely those possessing a maximum NOE-derived distance violation lower than
some arbitrary fixed value. For each of these conformations the 13 Cα chemical shifts
are computed as described in Sect. 2.1. Examination of the chemical shifts of all the
amino acids in the ensemble of conformations enables us to identify the amino acid
at each position in the sequence whose computed chemical shifts most closely match
the observed ones, among all these conformations. This identified set of individual
amino acid conformations corresponds to only one conformation of the whole chain:
the ‘theoretical minimal-rmsd model’ [33]. In this model, the 13 Cα chemical shift
of each residue individually best matched the experimental one, thereby providing
a new set of φ, ψ, and χ torsional angle constraints for all amino acid residues in
the sequence, i.e., not just for the amino acid residues in regular structures. Because
the chemical shifts are a multivalued function of the φ, ψ, and χ torsional angles,
the set of torsional angles derived from the ‘theoretical minimal-rmsd model’ does
not, necessarily, represent a unique solution to a given set of observed 13 Cα chemical
shifts values.
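
The per-residue selection that defines the ‘theoretical minimal-rmsd model’ can be sketched as follows. This is only an illustration under simplifying assumptions: the data layout, the function names and the numbers are hypothetical, and ties are broken by taking the first matching conformer.

```python
def minimal_rmsd_model(observed, shifts_by_conformer):
    """For each residue, pick the conformer whose computed shift is closest
    to the observed one. Returns one conformer index per residue."""
    n_conformers = len(shifts_by_conformer)
    selection = []
    for i, obs in enumerate(observed):
        best_k = min(
            range(n_conformers),
            key=lambda k: abs(shifts_by_conformer[k][i] - obs),
        )
        selection.append(best_k)
    return selection


def collect_torsions(selection, torsions_by_conformer):
    """Gather the (phi, psi, chi) angles of each residue from its selected
    conformer; these would serve as the new torsional constraints."""
    return [torsions_by_conformer[k][i] for i, k in enumerate(selection)]


# Toy data: 2 conformers, 2 residues (illustrative numbers only).
observed_shifts = [56.0, 60.0]
computed = [[55.0, 60.5], [56.2, 63.0]]
torsions = [
    [(-60.0, -45.0, 180.0), (-120.0, 130.0, 60.0)],   # conformer 0
    [(-65.0, -40.0, -60.0), (-110.0, 120.0, 180.0)],  # conformer 1
]
selection = minimal_rmsd_model(observed_shifts, computed)
constraints = collect_torsions(selection, torsions)
```

Because the chemical shift is a multivalued function of the torsional angles, the angles collected this way are one admissible set, not a unique solution.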
Step 2: Only one conformation among all the conformations produced in Step
1 is selected, for example, the conformation possessing the lowest rmsd between
the computed and observed 13 Cα chemical shifts. The selected conformation is used
as a starting one in a new conformational search with the Monte Carlo with Mini-
mization (MCM) method [68, 69]. The MCM search is carried out with two types of
constraints: the original set of NOE-derived distance constraints and the new set of φ,
ψ, χ torsional angles derived in Step 1. This time the conformational search is carried
out using a complete force-field including the internal potential energy described by
ECEPP/05 [70], the solvent free energy calculated by using a solvent-accessible
surface area model [71], and additional energy terms aimed at penalizing violations
of the distance and torsional angle constraints [72]. Convergence of the determination
protocol is monitored using the ca-rmsd between the computed and observed
13 Cα chemical shifts.
Step 3: If the computed ca-rmsd is lower than a certain, arbitrarily chosen, cutoff
value (ξ), then the procedure is ended. Otherwise, Step 2 is repeated using a new
set of (φ,ψ,χ) derived from the minimal-rmsd-model of the previous step.
It is worth noting that after our physics-based protocol was published [44] an alter-
native knowledge-based method that makes use of 1 H, 13 Cα , 13 Cβ and 15 N chemical
shifts as restraints, was successfully applied to structure determination of several
proteins [53]. A blind test of computational methods, including several that also use
chemical shifts as restraints, aimed at fully automated determination of protein
structures has been carried out recently [60].

2.5 Computation of the 13 Cα Chemical Shifts as a Function of the pH

For a given residue i, of a protein in a conformation k, the average charge distribution,
<ρi,k>, could be determined by solving the Poisson equation by considering the
2^ξ ionization states, with ξ being the number of ionizable groups in the molecule.
Regarding this problem, it is worth noting that ξ could be a large number because
~30% of all residues in a protein sequence are, on average, ionizable and, hence,
an accurate solution would require a fast algorithm. Consequently, in all the appli-
cations mentioned in this chapter, we used the Multiple Boundary Element (MBE)
method [73, 74], in which the free energy associated with the state of ionization of
the ionizable groups at a fixed pH value, namely 6.5, is calculated with the general
multi-site titration formalism [75, 76]. The charges and atomic radii from the PARSE
(Parameters for Solvation Energy) algorithm [77] were used for the solvation free
energy calculations using the MBE method, and the internal (εint ) and solvent (εsolv )
dielectric constants of 2 and 80, respectively [76], were adopted for the calculations
of <ρi,k>. The value of εint = 2 is consistent with the use of PARSE charges [78]
and is also commonly assumed as an adequate representation of the protein interior.
Following these approximations, for a given conformation k, the average degree of
ionization of the ith ionizable group of this conformation is computed as:

\langle \rho_{i,k} \rangle = Z^{-1} \sum_{n=1}^{2^{\xi}} \rho^{n}_{i,k} \exp[-G(P_k, x_k^n)/k_B T] \qquad (4)

where Z is the partition function, kB is the Boltzmann constant, T is the absolute
temperature, x_k^n = (\rho^n_{1,k}, \ldots, \rho^n_{i,k}, \ldots, \rho^n_{N,k}) with \rho^n_{i,k} = (1 or 0) is the nth
protonation microstate of conformation k for protein Pk, and G(Pk, x_k^n) is the free energy of
ionization of the nth microstate of protein Pk in conformation k [75].
It should be noted that for any ionizable residue i of a single conformation k,
Eq. (4) can lead to a non-integer average degree of charge, although we know that
such non-integer charges do not make physical sense. Due to the Boltzmann nature
of the averaged value computed by Eq. (4), a fractional charge should physically be
interpreted as follows: for a given conformation k, there are many identical replicas
of such a conformation in solution and, hence, a fractional charge computed by
Eq. (4), e.g., 0.75, means that 75% of these replicas possess the ionizable group
i protonated/deprotonated with an integral charge while the remaining 25% of the
replicas possess the same ionizable group as deprotonated/protonated, depending on
whether the ionizable group is basic or acidic.
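
A brute-force evaluation of Eq. (4) for a single conformation can be sketched as below. The free-energy function is a hypothetical stand-in for the MBE-derived free energy, the units (kcal/mol) and the temperature are assumptions, and the explicit enumeration is feasible only for a small number of ionizable groups because the number of microstates grows as 2^ξ.

```python
import math
from itertools import product

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def average_ionization(n_groups, free_energy, temperature=298.15):
    """Boltzmann-averaged degree of ionization of each group, Eq. (4).

    free_energy maps a microstate (tuple of 0/1 flags, one per ionizable
    group) to its free energy of ionization in kcal/mol.
    """
    kt = K_B * temperature
    partition = 0.0
    weighted = [0.0] * n_groups
    for state in product((0, 1), repeat=n_groups):  # all 2**n_groups microstates
        weight = math.exp(-free_energy(state) / kt)
        partition += weight
        for i, rho in enumerate(state):
            weighted[i] += rho * weight
    return [w / partition for w in weighted]


# Toy model with one group whose ionized state is three times more
# populated than its other state (illustrative only).
kt_room = K_B * 298.15
rho_single = average_ionization(1, lambda s: -kt_room * math.log(3.0) * s[0])

# Two non-interacting groups with no energetic preference.
rho_pair = average_ionization(2, lambda s: 0.0)
```

The single-group toy case reproduces the fractional value of 0.75 used as an example in the preceding paragraph.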
Assuming that the protonation/deprotonation reactions are instantaneous on the
NMR time scale, i.e., microsecond to millisecond [79], the theoretical 13 Cα chemical
shifts, \delta_i^{computed}(pH), for a given residue i in the sequence (except for histidine,
which possesses two tautomers) are computed as a function of the pH using the following
equation:

\delta_i^{computed}(pH) = (1/\Omega) \sum_{k=1}^{\Omega} \{ \langle \rho_{i,k} \rangle \, \delta_{+,i,k} + (1 - \langle \rho_{i,k} \rangle) \, \delta_{0,i,k} \} \qquad (5)

where δ+,i,k and δ0,i,k are the computed 13 Cα chemical shifts, for the amino acid i in
conformation k, with fully charged and neutral side chains, respectively, Ω is the
number of conformers in the protein ensemble, and < ρi,k > the averaged degree of
charge, as given by Eq. (4).
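
Equation (5) amounts to a charge-weighted mixture of two computed shifts, averaged over the ensemble. A minimal Python sketch, with a hypothetical function name and illustrative numbers, is:

```python
def ph_dependent_shift(rho, delta_charged, delta_neutral):
    """Eq. (5) for one residue i.

    rho, delta_charged and delta_neutral are lists over the Omega conformers:
    the averaged degree of charge from Eq. (4) and the computed 13C-alpha
    shifts (ppm) with fully charged and neutral side chains, respectively.
    """
    omega = len(rho)
    total = sum(
        r * d_charged + (1.0 - r) * d_neutral
        for r, d_charged, d_neutral in zip(rho, delta_charged, delta_neutral)
    )
    return total / omega


# Toy residue in a two-conformer ensemble (illustrative numbers only).
shift = ph_dependent_shift(
    rho=[0.75, 0.25],
    delta_charged=[58.0, 58.0],
    delta_neutral=[56.0, 56.0],
)
```

Setting every `rho` to 0 recovers the all-neutral approximation, and setting it to 1 recovers the fully charged one.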

3 Factors Affecting the Calculation of 13 Cα Chemical Shifts

3.1 Transferability of the Results

The current methodology [33, 34] relies on a crucial observation: once residue con-
formations are established by their interactions with the rest of the protein the 13 Cα
shielding of each residue depends, mainly, on its backbone and side-chain confor-
mations, with no significant influence by the nature of the nearest-neighbor amino
acids, except for residues immediately preceding proline [48].
The above observation allows us to parallelize the 13 Cα shielding calculations in
proteins and, hence, to make them computationally feasible. Moreover, a given set of
accurately-determined amino acid residue conformations representing the accessible
conformational space for all the 20 naturally occurring amino acids and showing a
good distribution of side-chain conformations will constitute a reasonable ensemble
with which to carry out tests of the current methodology. The results of these tests
should be transferable to proteins of any class or size. Consequently, we used struc-
tures of three proteins solved by NMR and X-ray, namely PDB id 1D3Z, 2JVD and
1NS1 to evaluate the performance of different DFT functionals and basis sets, as
explained below.

3.2 Performance of Different DFT Functionals to Reproduce Observed 13 Cα Chemical Shifts

DFT has become a method of choice for QM calculations of the electronic structure
and properties of many molecular and solid systems. Because the exact exchange-correlation
functional is unknown, a large number of approximations have been proposed
in the literature, making it essential to pursue more accurate and reliable approximate
functionals, a process which, on the other hand, depends on the applications.
Selection of the most appropriate density functional model for a particular application
becomes one of the main problems of the DFT method. For this reason we decided
[28] to test several density functional models (namely B3LYP, OLYP, PBE1PBE,
OPBE, O3LYP, OPW91, OB98, BPW91, BPBE and B971). The benchmarking was
intended to find not only the most accurate functional with which to reproduce the
observed 13 Cα chemical shifts in solutions but also the fastest one, in terms of CPU
time, because speed of DFT calculations could severely limit their applicability to
proteins. The test was applied to 10 NMR-derived conformations of the 76-residue
α/β protein ubiquitin (PDB id 1D3Z).
Comparison of the observed and computed 13 Cα chemical shifts shows that there
are five functionals, namely OPW91, OB98, OPBE, OLYP, and O3LYP, which are
among the faster ones and, even more importantly, behave very similarly in their
ability to reproduce accurately the observed 13 Cα chemical shifts. In particular, we
observe that OB98 appears to be slightly better than any other of the five functionals
in terms of both the correlation coefficient, R, (or Pearson coefficient) between the
observed and the conformational-averaged 13 Cα chemical shifts and the standard
deviation of the computed conformational-averaged 13 Cα chemical shifts from a
linear regression. Consequently, we chose OB98 for all the applications [30].
We also compared the results obtained using OB98 with those obtained with
B3LYP, a very popular functional that has been used extensively in our group, and
elsewhere. The correlation between the averaged 13 Cα chemical shift values
obtained for the 10 conformations of 1D3Z with the OB98 and B3LYP functionals is
excellent [30], showing a correlation coefficient R = 0.998 and a standard deviation
of 0.300 ppm. This test provides solid evidence that the results and conclusions
obtained using B3LYP do not need to be revised if the OB98 functional is adopted
[30].

3.3 Performance of Different Basis Sets to Reproduce Observed 13 Cα Chemical Shifts

To study the dependence of the accuracy and speed of DFT calculations of the 13 Cα
chemical shifts in proteins on the size of the basis set used, six basis sets, viz., the
6-31G/3-21G, 6-31G(d)/3-21G, 6-311G(d,p)/3-21G, 6-311+G(d,p)/3-21G, and
6-311+G(2d,p)/3-21G locally-dense basis-set approximations, and the uniform
3-21G/3-21G set, were initially applied [28] to 10 NMR-derived conformations of ubiquitin
[54]. For each of these six basis sets, combined with the OB98 functional, the 13 Cα
shielding was computed for 760 amino acid residues by treating each amino acid X
in the sequence as a terminally-blocked tripeptide with the sequence Ac-GXG-NMe
in the conformation of the regularized experimental protein structure. Analysis of
the results [28], in terms of the agreement between the computed and observed 13 Cα
chemical shifts shows that the accuracy with which the observed 13 Cα chemical shifts
are reproduced by using either the small basis set (6-31G/3-21G) or the larger basis
set [6-311+G(2d,p)/3-21G] is very similar, although use of the small basis set leads
to a significant decrease in computational time.
The results also indicate that the 13 Cα chemical shifts computed with the large
[6-311+G(2d,p)/3-21G] basis set can be reproduced accurately (within an average error
of ~0.4 ppm) and faster (by ~9 times) by using the small (6-31G/3-21G) basis set after
extrapolating it with: {}^{13}C^{\alpha} = -1.597 + 1.040 \times {}^{13}C^{\alpha}_{\mu}. In effect, the correlation
between the averaged 13 Cα chemical shift values computed for the 32 conformations
of 1NS1 with these two basis sets is excellent [28], showing a correlation
coefficient R = 0.999 and a standard deviation of 0.284 ppm. Even more important,
an analysis of the magnitude of the errors and their distribution carried out for Val
and Arg hypersurfaces, constructed by calculating a grid of 6864 and 6794 points,
respectively, corresponding to different combinations of the φ, ψ, χ1, and χ2 (only
for Arg) torsional angles, indicates that ~70% of them are within ~0.6 ppm and that
the most populated regions of the Ramachandran map are not affected by errors
higher than ~1.0 ppm [28].
In conclusion, the described analysis enabled us to select the smaller basis set
(6-31G/3-21G) that provides accuracy similar to that of a ‘basis set limit’ [6-311+G(2d,p)/3-21G]
to reproduce the computed chemical shifts, but at a significantly
lower computational cost [28].
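
The linear extrapolation quoted in this section is simple enough to state as a one-line function. In this sketch the input is assumed to be a chemical shift (ppm) computed with the small 6-31G/3-21G basis set, which is how the extrapolation sentence reads; the function name is hypothetical.

```python
def extrapolate_to_large_basis(shift_small_basis):
    """Estimate the large-basis-set 13C-alpha shift (ppm) from the value
    computed with the small basis set, using the linear relation
    shift_large = -1.597 + 1.040 * shift_small quoted in the text."""
    return -1.597 + 1.040 * shift_small_basis


# Illustrative small-basis-set shifts (ppm).
small_basis_shifts = [45.0, 50.0, 60.0]
extrapolated = [extrapolate_to_large_basis(s) for s in small_basis_shifts]
```

According to the text, the large-basis-set values are recovered in this way within an average error of about 0.4 ppm.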

3.4 Effect of Sequential Nearest-Neighbors on the 13 Cα Chemical Shifts Calculations

The 13 Cα chemical shifts for a residue X in the model peptide Ac-G-X-G-NMe
have always been computed [44, 34] considering that all the torsional angles of the
residue X are exactly those of the residue in the protein conformation and that the
surrounding Gly residues and the end-blocking groups are free to rotate. It is implicit
in this approach that the 13 Cα chemical shifts of residue X do not depend on the
identity of the nearest-neighbor residues. This assumption needs to be tested.
The structure of the Nucleic Acid Binding (NAB) protein of the SARS coronavirus
[80], a 116-residue α/β protein containing 9 Prolines (Pro) and with 50% of its
residues in loops and turns, was chosen to further evaluate the origin of differences
between computed and observed 13 Cα chemical shifts, as well as to study the influence
of the nearest-neighbor residues on the computed 13 Cα chemical shifts.
The results [48] indicate that computation of the 13 Cα chemical shifts of a given
residue in the sequence of the NAB protein is not influenced significantly, i.e., within
~0.5 ppm, by the nature of the nearest-neighbor amino acids, except for residues
immediately preceding proline (see Fig. 2a). For such residues, Pro must be consid-
ered during the computation of the 13 Cα chemical shifts; otherwise, an overestimation
of the computed 13 Cα chemical shifts by about +1.7 ppm occurs. This finding is in
good agreement with both the experimental evidence [36, 81, 82] and the empirical
observations [37, 81]. It is equally important to emphasize the physical nature of this
effect: “…an imide bond formed by an Xxx–Pro pairing is generally thought to be
much less electron-withdrawing than an amide bond…” [37].
Overall, except for the Pro effects, use of the Ac-G-X-G-NMe model peptide for
the computation of the 13 Cα chemical shifts of residue X is a good approximation
because the computed values are accurate within ±0.5 ppm for all residue-types, even if
neither the subsequent nor the preceding residue-type effects are taken into account (see
Fig. 2).

3.5 Rigid-Geometry Approximation and Accuracy of the Calculations of 13 Cα Chemical Shifts

Experimental protein structures are often solved using force fields which allow vari-
ation of bond lengths and bond angles. However, it is known that QM calculations
are very sensitive to bond lengths and bond angles [16]. Therefore, we have explored
the dependence of the computed 13 Cα -chemical shifts on the bond lengths and bond
angles to establish whether a rigid- rather than non-rigid geometry approximation is
a more accurate representation with which to compute the chemical shifts.

Fig. 2 Histogram of the average, over all 20 conformers of the protein PDB id 2K87, second-
order differences ΔΔ: a with ΔΔ = <(ΔX − ΔYX)> arising from the nature of the sequentially
preceding residue-type (Yyy). ΔX and ΔYX are the differences between the observed chemical
shifts and those computed using the Ac-Gly-Xxx-Gly-NMe and Ac-Gly-Yyy-Xxx-Gly-NMe model
peptides, respectively; b with ΔΔ = <(ΔX − ΔXY)> for the differences arising from the nature of
the subsequent residue-type, i.e., with ΔXY computed with Ac-Gly-Xxx-Yyy-Gly-NMe. Figure
adapted from [48] (with permission of Springer)

For this test, the structure of ubiquitin deposited in the PDB (PDB id 1UBQ)
was chosen because it possesses non-regularized geometry and has been solved by
X-ray diffraction at 1.8 Å resolution [83]. We have also examined the corresponding
structure with regularized geometry, i.e., the one with all the residues replaced by the
standard ECEPP residue geometry [62], named here as 1UBQregular . Analysis of the
differences between the computed and observed 13 Cα chemical shifts for the 1UBQ
and 1UBQregular structures leads to rmsd values of 3.28 ppm and 2.38 ppm, respectively.
The better agreement obtained with 1UBQregular , rather than 1UBQ, is consistent
with the long-standing recognition that the bond lengths and bond angles of both X-ray
and NMR-derived structures are not as accurately defined as in studies of
small molecules [16], with which the ECEPP geometry [62] has been parameterized.
Further analysis of the agreement of the two ubiquitin structures with the deposited
electron density data [83] of 1UBQ, in terms of the R-factor, leads to 19.2 and 23.1%
for 1UBQ and 1UBQregular , respectively; while the all-heavy-atom rmsd between
these two structures is 0.142 Å [34].
Overall, the use of regularized geometry, i.e., ECEPP geometry, is an accurate
approximation with which to compute the 13 Cα chemical shifts in proteins and, hence,
is used in most of the applications discussed in this chapter.

3.6 13 Cα Chemical Shifts as a Function of the Charge Distribution

Among the factors that affect 13 Cα -shielding, which are important for an accu-
rate computation of chemical shifts, is the sensitivity of 13 Cα nuclei to the shield-
ing/deshielding induced by changes in the protonation/deprotonation of distant ion-
izable groups [84–87]. However, these factors have not been taken into account
explicitly in current computations of 13 Cα chemical shifts in proteins at the QM level
of theory because, usually, the calculations are carried out in the gas phase, and the
ionizable residues are treated as neutral groups.
The question of whether the use of neutral, rather than charged, side chains is more
accurate for computation of the 13 Cα chemical shifts of ubiquitin, at a given fixed pH,
was investigated as follows [45]. For a given ionizable residue i in a conformation k,
first, the average charge distribution, <ρi,k>, was computed by using Eq. (4), i.e., by
explicit consideration of the 2^ξ ionization states for every conformation [75], with ξ
being the number of ionizable groups in the molecule, namely 22; and second, the
13 Cα chemical shifts as a function of the pH, δi(pH), were computed by using Eq. (5).
This analysis was applied to 139 conformations of ubiquitin: 138 (10 conformations
from PDB id 1D3Z plus 128 conformations from PDB id 1XQQ) NMR-derived
conformations [54, 88], while the remaining one is an X-ray structure (PDB id
1UBQ) solved at 1.8 Å resolution [83].
Additionally, an extra set of 50 randomly generated conformations for each amino
acid residue X, in the terminally-blocked tripeptide with the sequence Ac-GXG-
NMe, with X being lysine (Lys), ornithine (Orn), diaminobutyric acid (Dab),
glutamic acid (Glu) or aspartic acid (Asp), were also obtained. This set of randomly
generated conformations was used to determine: (i) the range of shielding/deshielding
of the 13 Cα nucleus of free acidic/basic amino acid residues in solution, in their
fully charged and neutral forms, respectively; (ii) how these ranges
of shielding/deshielding variations compare with those derived from 3058 ionizable
groups of the 139 conformations of the protein ubiquitin; and (iii) how the computed
shielding/deshielding range of variations are influenced by the distance between the
charged side-chain group and the 13 Cα nucleus (for example, there are two chemical
bonds in Asp, rather than three in Glu, separating the deprotonated carboxyl group
from the 13 Cα nucleus). To examine an analogous effect for a basic side-chain group,
such as Lys, use was made of the non-natural amino acids Orn and Dab because, for
these amino acids, the protonated amino group is separated from the 13 Cα nucleus
by four and three chemical bonds, rather than by five in Lys.
The results of this study [45], based on the analysis of 139 conformations of
ubiquitin at pH 6.5, indicate that use of neutral, rather than charged, amino acids is
a significantly better approximation of the observed 13 Cα chemical shifts in solution
for the acidic groups, and a slightly better representation, though significantly less
expensive computationally, for the basic groups (see Fig. 3).
Additionally, our analysis of Lys, Orn and Dab revealed a significantly greater
deshielding of the 13 Cα nucleus (due to the deprotonation of the acidic groups)
than the shielding due to the protonation of the basic groups. The origin of such a
difference can be found in the distance between the ionizable groups and the 13 Cα
nucleus, which is shorter for the acidic than for the basic groups.

3.7 13 Cα Chemical Shifts as a Function of Side-Chain Flexibility

To what extent are the chemical shifts of the amino acid residues in a protein affected
by the side-chain orientation? The basis for such a query arises from the fact that the
three torsion angles φ, ψ and χ1 are not independent of each other over the whole
range because they involve a common N-Cα bond [89, 90]. To find an answer to this
question, the dependence of the 13 C chemical shifts on side-chain orientation was
investigated [35], at the DFT level of theory, for a two-strand antiparallel β-sheet model
peptide with the amino acid sequence Ac-A3 -X-A12 -NH2 where X represents any of
the 17 naturally-occurring amino acids considered here, i.e., not including alanine,
glycine and proline. Because the majority of β-sheets are twisted, rather than planar,
with a right-hand twist in the approximately ±30° range for the backbone dihedral
angles [91–94], conformational parameters for β-sheets may deviate from those for
planar pleated sheets and, hence, are difficult to model by using canonical values. The
fact that β-sheets in proteins appear as parallel or antiparallel strands, or a combination
of both, only exacerbates the modeling problem. For this reason, the dihedral angles
adopted for the backbone were taken, and kept fixed, from the experimental structure
of an antiparallel β-sheet, specifically from the 16-residue segment (G41-G56) of the
B3 binding domain of protein G (PDB id 1P7E).

Fig. 3 Average difference, computed over a set of 9 conformations of protein ubiquitin using
Eq. (2) for: a acidic and b basic groups, respectively. Grey and white bars denote charged and
neutral side chains, respectively. Figure adapted from [45] (with permission of John Wiley and Sons)

For the 17 naturally occurring amino acids considered, the analysis indicates that
there is: (a) good agreement between computed and observed 13 Cα and 13 Cβ chemical
shifts, i.e., with correlation coefficients, R, of 0.95 and 0.99, respectively; (b)
significant variability of the computed 13 Cα and 13 Cβ chemical shifts as a function of
χ1 for all 17 residues, except for Ser; and (c) a smaller, although still significant,
dependence of the computed 13 Cα chemical shifts on χξ (with ξ ≥ 2) than on χ1 for
11 out of 17 residues.
The above results obtained by Villegas et al. [35] for an antiparallel (16-residue
segment) β-sheet were later validated on a 76-residue α/β protein, i.e., by exploring
the effects of side-chain conformation on the computed 13 Cα chemical shifts [45].
This validation process involved an exhaustive conformational search, starting from
an arbitrarily selected conformation of the NMR-determined ubiquitin protein (PDB
id 1D3Z), in which only the torsional angles of the side chains were allowed to vary,
i.e., all backbone dihedral angles (φ, ψ, ω) were fixed at their corresponding observed
values. Furthermore, the correlation coefficient, R, between the vicinal coupling
constants 3 J N-Cγ and 3 J C′-Cγ computed by using the Karplus equation [95] and those
observed for 17 valine, threonine and isoleucine residues was used to check the
accuracy of the side-chain conformational search.
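The coupling-constant check described above can be sketched as follows. This is only an illustration, assuming the generic three-term Karplus form 3J(θ) = A cos²θ + B cosθ + C and taking θ equal to χ1 with no phase offset; the coefficients A, B, C and all numerical values are hypothetical placeholders, not the parameters of [95] or data from [45].

```python
import math

def karplus(theta_deg, a, b, c):
    """Generic Karplus relation: 3J = a*cos^2(theta) + b*cos(theta) + c (Hz)."""
    cos_t = math.cos(math.radians(theta_deg))
    return a * cos_t ** 2 + b * cos_t + c

def pearson_r(x, y):
    """Pearson correlation coefficient R between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical coefficients (Hz) and hypothetical data, for illustration only.
A, B, C = 2.0, -0.6, 0.4
chi1_deg = [-60.0, 180.0, 60.0, -65.0, 175.0]  # side-chain angles from a search
j_observed = [0.9, 2.8, 0.5, 1.0, 3.1]         # made-up "observed" couplings (Hz)

# Back-calculate the couplings from the angles and correlate with observation.
j_computed = [karplus(chi, A, B, C) for chi in chi1_deg]
print(f"R = {pearson_r(j_computed, j_observed):.3f}")
```

A high R indicates that the side-chain angles produced by the search are compatible with the measured couplings; a low R would flag side chains placed in the wrong rotamer.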
The obtained results on an antiparallel β-sheet segment and the ubiquitin protein
enabled us to determine the role and impact of a proper side-chain conformation for
an accurate computation of the observed 13 Cα chemical shifts in solution.

4 Use of the Structural Information Decoded from 13 C Chemical Shifts

We have chosen three examples to illustrate how the structural information decoded
from the observed 13 C chemical shifts can be used in practice: (1) to determine the
fraction of the tautomeric forms of the imidazole ring of histidine (His) in proteins as
a function of pH, provided that the observed 13 Cγ and 13 Cδ2 chemical shifts and the
protein structure, or the fraction of H+ form are known; (2) to determine either all
α-helical or all β-sheet protein structures in solution; and (3) to assess the reliability
of NMR-determined protein models before they are published or deposited in the
PDB. Each of these applications is described in the following subsections.

4.1 The Importance of Being His

In 1965 Mandel [96], in a pioneering NMR experiment, detected the imidazole (C2)
protons of histidine (His) residues in Ribonuclease A and in 1966, Bradbury and
Scheraga [97], were able to distinguish between the histidine residues of Ribonu-
clease A, i.e., they resolved the NMR-peaks of three out of four histidines of this
enzyme. Subsequently, use of NMR spectroscopy, X-ray crystallography and theo-
retical studies, based on QM calculations, have continuously evolved in their ability
to determine properties of the histidine residues in solution and in the solid state [43,
79, 98–116]. This persistent interest in His arises because this
residue is unique among all 20 naturally occurring amino acids: ~50% of all
enzymes use His in their active sites [117]. This is, mainly, because of the versatility
of the imidazole ring of His, which includes two neutral, chemically-distinct forms,
referred to as Nδ1 -H and Nε2 -H tautomers, and a protonated form, the charged H+
form, with one form favored over the other two by the protein environment and pH. In
addition, His with a pK° of 6.6 [118] is the only ionizable residue that titrates around
neutral pH, allowing the non-protonated nitrogen of its imidazole ring to serve as an
effective ligand for metal binding [79], or to play a crucial role in the proton-transfer
process [103].
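To make the statement about titration near neutral pH concrete, the following minimal sketch evaluates the Henderson-Hasselbalch relation for a free, non-interacting His side chain with the pK° of 6.6 quoted above. It is an idealized single-site model used here only for illustration; for His residues in a protein the chapter relies on the average degree of protonation computed with Eq. (4), which this sketch does not reproduce.

```python
def protonated_fraction(ph, pka):
    """Fraction of a single ionizable site in its protonated form
    (Henderson-Hasselbalch, no interactions with other groups)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

PK_HIS = 6.6  # pK of free His quoted in the text [118]

# The protonated (H+) fraction changes steeply around neutral pH.
for ph in (5.0, 6.5, 6.6, 7.4):
    print(f"pH {ph:4.1f}: f(H+) = {protonated_fraction(ph, PK_HIS):.2f}")
```

At pH equal to the pK the two populations are equal, and at pH 6.5 slightly more than half of a free His would be in the H+ form under this model.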
Certainly, determination of the fraction of the tautomeric forms of the imidazole
ring of His in proteins in solution is an important problem for a number of reasons.
At a given fixed pH proteins in solution exist as an ensemble of conformations and,
hence, the form of each His residue among different protein conformers may vary
significantly because the tautomeric equilibrium is determined by the environment
[43]. Moreover, because the exchange between different protonation states is assumed
to occur in the fast exchange regime [79], the NMR resonances of a given nucleus
arising from the different rotational, protonation and tautomeric states merge into a
single average signal. Decoding the information from these exchange processes offers
the possibility to determine the extent to which the His residues in proteins behave as
free His, where the Nε2 -H tautomer is favored over the Nδ1 -H tautomer in a ratio of 4:1 [108].
To find a solution to this long-standing problem in the biophysical chemistry of
proteins, first, each form of His was treated as a terminally-blocked model tripeptide
with the sequence: Ac-GHξ G-NMe, with Hξ in the Nδ1 -H, the Nε2 -H tautomeric
form or the protonated form H+ , respectively. For each of the forms, a set of ~35,000
conformations, representing a uniform sampling of the whole Ramachandran map
as a function of the φ, ψ, ω, χ1 and χ2 torsional angles, was generated. Afterward, the
gas-phase, isotropic shielding value was computed using the method described in
Sect. 2.1. Finally, the distribution of the computed shielding of the imidazole ring
of His was analyzed in terms of all 13 C nuclei, namely 13 Cγ , 13 Cδ2 , and 13 Cε1 (see
Fig. 4). Specifically, the histogram of the shielding distribution (among all ~35,000
conformations) was fit by a Gaussian function with a mean value σo (shown as bars
in Fig. 4) and standard deviation sd (data not shown). A visual inspection of the
histogram shown in Fig. 4 revealed that the mean σo shielding value obtained for
the 13 Cε1 nucleus is not sensitive to changes in the form of the imidazole ring and,
therefore, we confine our interest to those nuclei that are sensitive to such changes,
namely 13 Cδ2 and 13 Cγ .

Fig. 4 Bar diagram of the average σo shielding values computed for each carbon of the imidazole
ring of His for each of the two tautomers: Nδ1 -H, Nε2 -H, and for the H+ form. The values were
averaged over ~35,000 conformations of histidine in the model tripeptide Ac-GHG-NMe. Grey,
black and white colors indicate the results obtained for the 13 Cγ , 13 Cδ2 and 13 Cε1 nuclei, respectively.
Figure adapted from Vila et al., 2011 (with permission of PNAS)

Use of first-order shielding differences for a pair of selected nuclei, 13 Cδ2 and 13 Cγ ,
rather than chemical shifts, is a very convenient approach because the experimental
referencing problem may be a source of errors [99]. Consequently, we define the
first-order shielding difference, $\Delta^{\xi}$, as $\Delta^{\xi} = |\sigma_o^{\delta 2} - \sigma_o^{\gamma}|_{\xi}$, with ξ denoting the form of
the imidazole ring, and $\sigma_o^{\delta 2}$ and $\sigma_o^{\gamma}$ being the computed mean values of the shielding
distribution for the 13 Cδ2 and 13 Cγ nuclei, respectively. In other words, the following
convention is adopted: ξ = δ, ε, or +, to designate the Nδ1 -H, Nε2 -H or the H+ form,
respectively.
Analysis of the first-order shielding differences indicates that the following
inequality holds: $\Delta^{\varepsilon} > \Delta^{+} > \Delta^{\delta}$, with $\Delta^{\delta} \sim 0$. Therefore, once the fraction of the
protonated H+ form, $f^{+} = \langle\rho\rangle$, computed with Eq. (4), and $\Delta^{obs} = |{}^{13}C^{\delta 2} - {}^{13}C^{\gamma}|$, with
13 Cδ2 and 13 Cγ being the observed chemical shifts in solution, at a given pH, are
known, the fraction of the Nε2 -H tautomer ($f^{\varepsilon}$) can be obtained assuming: (a) that
all forms are in fast exchange on the NMR chemical shift time-scale [79], i.e.,
$\Delta^{obs} = f^{\varepsilon}\Delta^{\varepsilon} + f^{+}\Delta^{+} + f^{\delta}\Delta^{\delta}$; and (b) that $\Delta^{\delta} \equiv 0$.
Using these assumptions, together with some physical constraints, enables us to
find an analytical expression with which to compute $f^{\varepsilon}$, namely $f^{\varepsilon} = (1 - \langle\rho\rangle)\,\Delta^{obs}/\Delta^{\varepsilon}$,
with $\Delta^{\varepsilon}$ the single-valued first-order shielding difference computed for the Nε2 -H
tautomer ($\Delta^{\varepsilon}$ ~ 31 ppm). The fraction of the Nδ1 -H tautomer is obtained straightforwardly
as $f^{\delta} = 1 - \langle\rho\rangle - f^{\varepsilon}$.
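The bookkeeping behind these fractions can be sketched in a few lines. The sketch assumes the expression for the Nε2 -H fraction reads f(ε) = (1 − <ρ>) Δobs/Δε, with Δε of about 31 ppm as quoted in the text; the clamping of f(ε) to its physically allowed range is an added assumption of this illustration, and the input values are hypothetical, not taken from [43].

```python
def tautomer_fractions(delta_obs, rho, delta_eps=31.0):
    """Return (f_eps, f_delta, f_plus) for one His residue.

    delta_obs : observed |13C(delta2) - 13C(gamma)| shift difference (ppm)
    rho       : average degree of protonation, i.e. fraction of the H+ form
    delta_eps : first-order shielding difference of the N(eps2)-H tautomer (ppm)
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    f_eps = (1.0 - rho) * delta_obs / delta_eps
    # Assumption of this sketch: keep f_eps inside the allowed range [0, 1 - rho].
    f_eps = min(max(f_eps, 0.0), 1.0 - rho)
    f_delta = 1.0 - rho - f_eps  # the three fractions must sum to one
    return f_eps, f_delta, rho

# Hypothetical input: half-protonated His with an observed difference of 15.5 ppm.
f_eps, f_delta, f_plus = tautomer_fractions(delta_obs=15.5, rho=0.5)
print(f"f(eps) = {f_eps:.2f}, f(delta) = {f_delta:.2f}, f(+) = {f_plus:.2f}")
```

Because only a difference of two shifts enters, any constant referencing offset common to both nuclei cancels, which is the practical advantage noted in the text.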
The above formulation was used to determine the tautomeric forms of His for each
of 8 selected proteins for which both the structure and the 13 Cδ2 and 13 Cγ chemical
shifts of the imidazole ring of His, are available. In each of these applications the
average degree of protonation < ρ > for all ionizable residues was computed by using
Eq. (4). The tautomeric forms of His are determined by using the expressions for f δ
and f ε given above [43]. Likewise, using the observed values, $\Delta^{obs}$, obtained from
solid-state NMR for unblocked dipeptides, with the sequence His-Leu, His-Met,
Gly-His, Leu-His, His-Ala, His-Glu, Ala-His and His-Asp [99], we also determined
the tautomeric fractions of the imidazole ring of His for each of these 8 compounds.
Results obtained from the 8 proteins indicate that the protonated form is the
most populated one while the distribution of the tautomeric forms for the imidazole
ring varies significantly among different histidine residues in the same protein (see
Fig. 5a). Thus, His226 and His250 show a comparable degree of protonation, <ρ>,
although their tautomeric distributions are very different (see Fig. 5a), which shows the
importance of the environment of the histidines in determining the tautomeric forms.
Let us explain the origin of this observation. On one hand, the Nδ1 nucleus of H250
is located only 2.9 Å from the carbonyl backbone oxygen of S248 (see Fig. 5b),
presumably forming a hydrogen-bond (green dots in Fig. 5b), while the Nε2 nucleus
is exposed to the solvent but the imidazole ring is surrounded by fully protonated
R264 and R266 (data not shown), which lowers the probability that a proton
binds to Nε2 , in good agreement with the computed tautomeric distribution for H250
in Fig. 5a. On the other hand, the Nε2 nucleus of the imidazole ring of H226 is at
3.3 Å from a backbone carbonyl oxygen of W246 (see Fig. 5c), while the Nδ1 is at
3.1 Å from a backbone amino group of H226 (see Fig. 5c). As a result, a preference
of Nε2 -H over the Nδ1 -H tautomeric form for H226 is expected, in agreement with
the computed tautomeric fractions for this residue in Fig. 5a.
In addition, our results show that for ~70% of the neutral histidine-containing
dipeptides the method leads to fairly good agreement between the calculated and
the experimental tautomeric form. Co-existence of different tautomeric forms in the
same crystal structure may explain the disagreement obtained for the remaining 30%
of dipeptides.

4.2 Protein Structure Determination

In this section we illustrate, with two examples, how the structural information
encoded in the 13 Cα chemical shifts can be used to determine an ensemble of
conformations, provided that a set of NOE-derived distance constraints is available.
However, since the chemical shifts are sensitive to the dynamics of a protein on the
microsecond time scale [88], the question of whether a single structure, rather than an
ensemble of conformations, is a better representation of the NMR observables, such as
the chemical shifts, must be investigated first.

4.2.1 The Crystallographer Dilemma: A Single Structure or an Ensemble of Conformations?

In protein crystallography it is conventional to represent the conformation of a protein
by a single structure, although proteins are very flexible in solution, and, hence, the
question whether a single structure, rather than an ensemble of conformations, is a
more accurate representation of the observed 13 Cα chemical shifts in solution deserves
to be investigated.
Proteins in solution are flexible molecules which exhibit anisotropic motion and
exist as a dynamic ensemble of conformations. Although protein flexibility in the
crystalline state is reduced (compared to solution) as a result of crystal packing, some
dynamics and heterogeneity still remain [119, 120] because of the high solvent content
in most protein crystals [104]. Despite this, protein structures solved by X-ray
diffraction are traditionally represented by a single conformation. Crystallographic
temperature (B) factors, which contain information about atomic displacements arising
from the combined effects of dynamic, static and lattice disorders within the
crystal lattice, provide an important indication of protein motions in the crystalline
state.

Fig. 5 a Fraction of His form distribution for 3 out of 6 His residues in protein PDB id 1E1A, for
which the chemical shifts were determined in solution at pH 6.5. Blue and green bars represent the
fraction of the Nε2 -H and Nδ1 -H tautomers, respectively, and the red bars represent the fraction of
the protonated form, H+ . The dotted horizontal line indicates the fraction of the H+ form that a free
His residue would have in solution at pH 6.5; b Ball and stick representation of H250 in protein
1E1A. The grey, blue and red colors designate carbon, nitrogen and oxygen atoms, respectively.
The background shows a ribbon diagram of part of protein 1E1A. The Nδ1 nucleus of H250 is
located at only 2.9 Å from the carbonyl backbone oxygen of S248, presumably forming a
hydrogen-bond (indicated by green dotted line); c Same as b for H226. All displayed distances are
in Angstroms. Figure a adapted from Vila and Arnautova [43] (with permission of PNAS)

Consequently, consideration of an ensemble of protein conformations generated
by using B-factor values as a guide may potentially improve the agreement between
the NMR- and X-ray-derived protein models in terms of some NMR observables,
such as 13 Cα chemical shifts. To explore such a possibility we selected ubiquitin, a
76-residue α/β protein. The structure of this protein was solved by X-ray (PDB id
1UBQ [83]), and NMR (PDB id 1D3Z [54]) methods, with the latter providing the
available 13 Cα chemical shifts.
Since the deposited PDB structure of 1UBQ was solved and refined by using
software and force-field parameters different from those employed in our method,
a new set of conformations was generated using MCM and rigid geometry starting
from the corresponding regularized experimental X-ray structure (1UBQregular ). During
the MCM search, variations of the (φ, ψ, χ) torsional angles were allowed for all
the residues in the sequence. The reported B-factors for 1UBQ were used to estimate
the upper limit of the torsional angle variation adopted (±10°). The generated set of
conformations was subjected to several rounds of refinement using a standard procedure
in X-ray crystallography, i.e., the Crystallography and NMR System (CNS)
program [51, 52]. As a result, 5 conformations were selected.
All the 5 generated models are quite different among themselves and from the
corresponding starting structure, with an all-atom rmsd of 0.36–1.13 Å. Moreover,
for all 5 models, no residues were in disallowed regions of the Ramachandran plot [8]
and all unfavorable contacts occur between the atoms from the last five residues in the
sequence, which were not visible in the electron-density map. In addition, the R and
Rfree factors of the 5 models are equivalent to or better than those obtained
for a Simulated Annealing Refined (SAR) structure of PDB 1UBQ. This refinement
of the deposited 1UBQ structure, which yields the SAR structure, is a necessary step for a
consistent comparison between the chemical shifts of the generated 5 models and the
PDB structure, because 13 C chemical shifts are very sensitive to small differences in
bond lengths and bond angles [16].
Figure 6 shows the rmsd values between the observed and computed 13 Cα chem-
ical shifts obtained for each of the 5 new models (light-grey bars) and the SAR
structure (black-filled bar). The ca-rmsd, computed from the ensemble of 5 new
models, is shown as a horizontal solid line in Fig. 6. The ca-rmsd (2.36 ppm) is
lower than the value for the SAR structure (2.74 ppm) or for any of the new models.
These results obtained for ubiquitin demonstrate that consideration of an ensemble
of 5 conformations, derived from the regularized experimental X-ray (1UBQregular )
structure, leads to better agreement with the observed 13 Cα chemical shifts than does
a single conformation (the SAR structure).
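Why an ensemble can agree with experiment better than any of its members can be illustrated with a short sketch. It assumes that the ca-rmsd is the rmsd between the observed shifts and the computed shifts averaged over the conformations with equal weights; the weighting scheme and every number below are assumptions made for illustration, not the ubiquitin data of Fig. 6.

```python
import math

def rmsd(computed, observed):
    """Root-mean-square deviation between two equal-length lists of shifts (ppm)."""
    n = len(observed)
    return math.sqrt(sum((c - o) ** 2 for c, o in zip(computed, observed)) / n)

def ca_rmsd(ensemble, observed):
    """rmsd of the conformationally averaged shifts (equal weights assumed)."""
    n_conf = len(ensemble)
    averaged = [sum(conf[i] for conf in ensemble) / n_conf
                for i in range(len(observed))]
    return rmsd(averaged, observed)

# Hypothetical observed shifts (ppm) for four residues and three model conformations.
observed = [56.0, 60.5, 52.3, 58.1]
ensemble = [
    [57.5, 59.0, 53.8, 56.9],
    [54.8, 62.1, 51.0, 59.6],
    [56.4, 60.0, 52.9, 57.5],
]

for k, conf in enumerate(ensemble, start=1):
    print(f"model {k}: rmsd = {rmsd(conf, observed):.2f} ppm")
print(f"ensemble: ca-rmsd = {ca_rmsd(ensemble, observed):.2f} ppm")
```

Errors of opposite sign in different models partly cancel on averaging. With equal weights the ca-rmsd can never exceed the root-mean-square of the individual rmsd values, although it is not guaranteed to fall below the best single model, as it does in this constructed example.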
The above conclusion is in line with the crystallographers' suggestion that
“…a more suitable representation of a macromolecular crystal structure would be
an ensemble of models...” [121]. Analysis of NMR-determined ensembles of conformations
also led to a similar conclusion, i.e., use of the ca-rmsd value led to closer
agreement with the observed 13 Cα chemical shifts in solution than when individual,
or the mean, rmsd is used [33]. In other words, proteins in solution are conformationally
labile, as indicated by both the ca-rmsd and the theoretical minimal-rmsd
model analyses, and this must be taken into account to predict the 13 Cα chemical
shifts most accurately.

Fig. 6 Bar diagram of the rmsd (ppm) between the computed and observed 13 Cα chemical shifts
of ubiquitin. Black-filled bar (2.74 ppm) represents the results from the SAR structure. Grey-filled
bars represent the rmsd for each of the generated 5 new models; the horizontal black line represents
the ca-rmsd (2.36 ppm) computed from the ensemble of 5 new models. Figure adapted from [41]
(with permission of the International Union of Crystallography)

4.2.2 Determination of β-Sheet Structures

Evidence obtained from the probability-based secondary structure identification
method of Wang and Jardetzky [122] suggests that the reliability to distinguish an
α-helix from a statistical coil based on chemical shift information follows, for the
heavy nuclei only, the ranking: 13 Cα > 13 C′ > 13 Cβ > 15 N, whereas a different trend
(13 Cβ > 13 Cα ~ 13 C′ ~ 15 N) was found for the corresponding reliability to distinguish a
β-strand conformation from a statistical coil. This trend raises the question of whether
a mainly 13 Cα -driven methodology can be used to predict predominantly β-sheet
structures and, if so, how good the corresponding 13 Cβ chemical shift predictions
would be.
To answer this question, our recently-introduced physics-based protocol (see
Fig. 1) was applied to determine the structure of a 20-residue peptide capable of forming
a three-stranded antiparallel β-sheet in aqueous solution, i.e., the BS2 peptide
with the sequence: TWIQND PGTKWYQND PGTKIYT, for which both a complete
set of 13 Cα chemical shifts and a reduced number of NOEs were reported. The experimental
structure determination of small proteins and peptides, which are able to
fold as monomers and do not contain disulfide bonds, is very valuable because such
determinations can provide important information for force-field development and
evaluation or improvement of search algorithms aimed at an efficient exploration of
the conformational space [123–126].
The results obtained indicate that an accurate all β-sheet structure can be determined
by simply identifying a set of conformations which simultaneously satisfy a
set of constraints including 13 Cα -dynamically-derived torsional angle constraints for
all amino acid residues in the sequence and a fixed set of NOE-derived distance constraints
[29]. Among the thousands of conformations generated by the VTF approach,
i.e., during step 1 of the protein structure determination protocol shown in Fig. 1,
25 (see Fig. 7a) were selected by using a clustering procedure. This small
set of conformations was used to determine the theoretical minimal-rmsd model that
provides us with a set of φ, ψ, and χ torsional angle constraints for all the residues
in the sequence, not just for those in α-helix or β-sheet regions. Using this set of torsional
angle constraints (φ, ψ, χ), combined with different numbers of NOE-derived
constraints, 2 sets of conformations of the BS2 peptide were determined after step
2 of the protocol. One set of 20 conformations (shown in Fig. 7b) was obtained by
using 118 NOE-derived distance constraints, while the other set of 10 conformations
(shown in Fig. 7c) was obtained by using 130 NOE-derived distance constraints.
Regardless of the number of NOE-derived distance constraints used, addition
of the 13 Cα -derived torsional constraints led to noticeably lower ca-rmsds (2.2 and
3.5 ppm, for the sets of 20 and 10 conformations, respectively) compared to the 20
models obtained by Santiveri et al. [127], who used a full set of 130 NOE-derived
distance constraints but no 13 Cα chemical shift information (4.6 ppm). In line with
this finding, graphical inspection of the results shown in Fig. 7b–c also indicated
that use of 13 Cα -derived torsional constraints led to sets of conformations with less
side-chain torsional angle spreading, as can be seen from comparison of Fig. 7b
and c against Fig. 7d, with the latter obtained by Santiveri et al. [127]. In addition, the
correlation coefficient, R, between the observed and computed 13 Cβ chemical shifts
was somewhat better for the two sets obtained using the 13 Cα -based determination
protocol (shown in Fig. 1). Thus, R is 0.99 and 0.98 for the 20 and 10 conformation
sets, respectively, while R is 0.97 for the set of conformations derived by Santiveri
et al. [127].
Overall, analysis of the ca-rmsd, the NOE-derived distance violations, the 13 Cβ
chemical shifts, and some stereochemical quality factors for these sets, as a measure
of the closeness with which the calculations reproduce the structure in solution,
indicates that our self-consistent physics-based method is able to produce a more
accurate set of conformations (shown in Fig. 7b and c) than that obtained with the
traditional methods [127] (shown in Fig. 7d). Our results also suggest that for a
flexible molecule in solution, like BS2, it may not be possible to determine a single
structure that would satisfy all the constraints simultaneously. This is a consequence
of the well-known fact that NMR parameters, such as the observed NOE-derived
distances and the 13 Cα chemical shifts, correspond to a dynamic ensemble of con-
formations and, therefore, may not be reproduced exactly by a limited set of static
structures [44, 128].

Fig. 7 a Superposition of 25 NMR-derived conformations of BS2 peptide (represented by ribbon
diagrams) obtained in Step 1 after the VTF procedure (see Flow-chart in Fig. 1); b Superposition
of 20 NMR-derived conformations of BS2 obtained after the conformational search in Step 2 (see
Flow-chart in Fig. 1); 118 out of 130 NOE distance constraints were used; c Same as b for 10
NMR-derived conformations; 130 NOE distance constraints were used; d Superposition of 20
NMR-derived conformations obtained by Santiveri et al. [127] using traditional methods

Characterization of the structural flexibility of molecules in solution is of fundamental
importance for the study of biological function, stability and folding [129,
130]. Therefore, additional analysis of the per-residue average 13 Cα conformational
shifts was carried out and the results indicated that the third, C-terminal, strand in
the β-sheet of the BS2 peptide is the most flexible strand, although less flexible
than the turns. In addition, 20 ns molecular dynamics (MD) simulations using
the AMBER 8.0 package [131] were performed. The MD runs yielded a plausible
atomic description of the motion of BS2 peptide in solution, as revealed by both the
pattern of hydrogen bonds and the generalized Lindemann parameter [132]. The MD
results were in line with the per-residue average 13 Cα conformational shifts analysis,
providing additional evidence of greater flexibility of the C-terminal strand.
The fact that the observed 13 Cα chemical shifts, supplemented only by NOE-
derived distance constraints, provide accurate information for validation and refine-
ment of protein structures, as well as site-specific information about the flexibility of
a molecule in solution, may be very useful for NMR spectroscopists and theoreticians
interested in analysis of the stability and protein-folding mechanism.
4.2.3 A Blind Test to Determine an α-Helical Structure

The solution NMR structures of both full length (residues 1–77) and truncated
(residues 1–46) forms of YnzC protein (PDB id 2JVD) from Bacillus subtilis [133],
that is part of the small yneA SOS response operon that regulates cell division in
this organism [134], have been determined recently [135]. The corresponding X-ray
crystal structure (PDB ID, 3BHP) was solved by Kuzin et al. [133] at 2.0 Å resolu-
tion. The unique two-helix monomeric structure of YnzC, with no disulfide bonds,
makes it an attractive subject for testing our physics-based methodology for protein
structure determination.
The goal of this application is two-fold. First, as a blind test, we attempted to
determine whether it is possible to obtain an ensemble of conformations for which
each individual conformer simultaneously satisfies the NOE-derived distance con-
straints and the 13 Cα -derived torsional constraints for the YnzC protein in solution
[136]. Although the solution NMR structure [135] of this protein had been solved at
the time of this blind test, the only information provided was a full set of both the
observed 13 Cα chemical shifts and the NOE-derived distance constraints. In particu-
lar, no information about the coordinates of the solved structures of the YnzC protein
[135] or the heteronuclear 15 N-1 H NOE data was provided at the moment of the test.
Our second goal was to carry out a cross-validation test of high-quality sets of
conformations obtained for the YnzC protein in solution by using alternative deter-
mination methods, namely, the solution NMR set of conformations (PDB id, 2JVD)
obtained by using NOE-derived distance constraints, dihedral-angle constraints and
hydrogen-bond constraints [135], and the 2.0-Å X-ray crystal structure (PDB id,
3BHP) (Kuzin et al. [133]). For this second goal, several validation scores were used
[136], including: (i) Recall, Precision, F-measure (RPF) analysis [6]; (ii) several
global quality score indicators provided by Verify3D [10], ProsaII [137], Procheck
[8], and MolProbity [5]; (iii) the ca-rmsd and rmsd between observed 13 Cα chemical
shifts and those computed at the DFT level, and (iv) the backbone rmsd between
these refined structures and the mathematical average coordinates of the ensemble
of NMR structures of YnzC(1–48) deposited in the PDB.
By carrying out a blind test we demonstrated [136] that an accurate all α-helical set
of protein structures can be determined by simply identifying conformations which
simultaneously satisfy a set of constraints, including 13 Cα -dynamically-derived tor-
sional angle constraints for all amino acid residues in the sequence and a fixed set
of 1022 NOE-derived distance constraints. The protein structure determination was
carried out as follows. In step 1, thousands of conformations were generated using the
VTF procedure, and the 10 conformations possessing a maximum NOE-derived distance
violation lower than a fixed cutoff value were retained (Fig. 8b); only one of these 10
conformations was selected for the next step. In step 2, the selected conformation was
used as the starting point of a conformational search carried out with two types of
constraints: the original fixed, limited set of NOE-derived distance constraints and
the set of φ, ψ, χ torsional angles derived from step 1. The resulting new set of 10
conformations is shown in Fig. 8c. Repetition of step 2 with a tighter tolerance range
for the torsional-angle constraints than in the previous iteration enabled us
to determine the final set of 10 conformations shown in Fig. 8d, i.e., the so-called
Set-NOE-CS.

Fig. 8 Results for the 77-residue YnzC protein from Bacillus subtilis. a Bar diagram indicating the
rmsd (ppm) between the computed and observed 13Cα chemical shifts for each of the 10 conformations
from Set-NOE-CS (red bars), for the 20 conformations from 2JVD (yellow bars), and for each
of the three chains in the 2.0 Å crystal structure of YnzC protein, PDB id 3BHP, namely chain a, b
and c (black, cyan and green bars). Black (1.54 ppm) and red (1.38 ppm) horizontal lines show the
ca-rmsd values computed for the residues 1–46 of 2JVD and Set-NOE-CS, respectively; b Superposition
of 10 NMR-derived conformations of YnzC (represented by ribbon diagrams) obtained
after the VTF procedure, in Step 1 (see Flow-chart in Fig. 1); c Same as b after the conformational
search in Step 2; d Same as c after repeating the conformational search in Step 2 (Set-NOE-CS),
i.e., this time by using a new set of torsional angles (ϕ, ψ, χ) derived from the set of conformations
shown in panel (c); e Superposition of 20 NMR-derived conformations (PDB id 2JVD) of YnzC
protein obtained by Aramini et al. [135]; and f Graphic representation of the X-ray determined
structure of YnzC protein (PDB id 3BHP); the asymmetric unit contains 3 similar, but not identical,
copies of the YnzC protein molecule, namely chain a, b and c. Panel a adapted from [136] (with
permission of PNAS)

A comparative analysis of the rmsd between the computed and observed 13Cα
chemical-shift values for residues 1–46 is shown in Fig. 8a as a bar diagram for all
three sets of conformations, viz., Set-NOE-CS (shown in Fig. 8d), 2JVD
(shown in Fig. 8e) and the three chains of the X-ray crystal structure 3BHP
(shown in Fig. 8f). The results shown in Fig. 8a reveal that the two NMR-derived
ensembles of structures (2JVD and Set-NOE-CS) are a better representation of the
observed 13Cα chemical shifts in solution, in terms of the ca-rmsd (solid horizontal
black and red lines in Fig. 8a), than any single conformer (red or yellow bars in
Fig. 8a) or any single chain of the X-ray structure (black, cyan and green bars
in Fig. 8a). This result is in line with previous calculations for 10 NMR-derived
conformations (PDB id 1D3Z) and the X-ray structure (PDB id 1UBQ) of ubiquitin.
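
The comparison between single conformers and whole ensembles can be illustrated with a minimal Python sketch. It assumes that the ca-rmsd is obtained by first averaging the computed 13Cα shifts of each residue over all conformers of the ensemble and then taking the rmsd against the observed shifts. The function names and the toy shift values are hypothetical and are not taken from [136].

```python
import math


def rmsd(computed, observed):
    """Root-mean-square deviation (ppm) between two equal-length shift lists."""
    if len(computed) != len(observed):
        raise ValueError("shift lists must have the same length")
    squared = sum((c - o) ** 2 for c, o in zip(computed, observed))
    return math.sqrt(squared / len(observed))


def ca_rmsd(ensemble, observed):
    """rmsd of conformer-averaged computed shifts against the observed shifts.

    Assumption: every conformer contributes with equal weight.
    """
    n_conf = len(ensemble)
    averaged = [
        sum(conformer[i] for conformer in ensemble) / n_conf
        for i in range(len(observed))
    ]
    return rmsd(averaged, observed)


# Toy data (ppm): three residues, two conformers that deviate in opposite directions.
observed = [56.0, 58.0, 54.0]
ensemble = [
    [57.0, 57.0, 55.0],
    [55.0, 59.0, 53.0],
]

per_conformer = [rmsd(conformer, observed) for conformer in ensemble]
ensemble_value = ca_rmsd(ensemble, observed)
print(per_conformer, ensemble_value)
```

In this toy case each conformer deviates by 1 ppm at every residue, whereas the averaged shifts match the observed ones, which mirrors the situation in which an ensemble represents the observed shifts better than any of its members.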
Since the ca-rmsd analysis might be biased by the fact that the 10 conformations
of Set-NOE-CS were computed using a 13 Cα -based method while the others were
not, a cross-validation quality test was also carried out. These structures consistently
show good values for the RPF and DP-scores as well as for global structure quality
indicators. This analysis reveals that all three sets of structures analyzed here display
very good agreement with the experimental NOE data, as well as dihedral angle
distributions and atomic clash scores typical of good quality protein structures. Taken
together, these results indicate that the 20 conformations from the 2JVD set, the DFT-
computed 10 conformations from Set-NOE-CS, and each of the three chains of the
X-ray structure are highly-accurate sets of conformations which represent the YnzC
protein in solution.

4.3 Protein Structure Validation

The PDB is the most important archive of experimental protein structures solved
by X-ray crystallography and NMR spectroscopy. The large number of structures
deposited in PDB constitutes an extraordinary source of information that has been,
and continuously is, used for a wide range of applications in structural drug design,
molecular modeling, force-field parameterization, molecular biology applications,
etc. Some deposited protein structures, showing a few or a large number of flaws,
are formally withdrawn from the database and, hence, considered obsolete, even
though their coordinates remain available in the PDB. In most cases, a successor (or
superseding) structure replaces the obsolete one. The large number of obsolete
structures indicates that the development of accurate validation protocols remains an
important task.

4.3.1 A Chemical-Shift-Based Server

An ideal validation method should meet two requirements. First, it should be strong
rather than weak. A validation method is considered ‘strong’ if it is able to assess how
well a structure, or an ensemble of structures, predicts experimental data not used in
the structure-determination process; otherwise it should be considered ‘weak’, since
it is limited to reproducing the observed experimental data used in the determination
of the protein models [138]. Second, it should be able to detect, quickly and accurately,
the existence of structural flaws at the residue level. With these goals in mind, a new
server (CheShift) has been developed recently to predict 13 Cα chemical shifts of
protein structures. It is based on a database of chemical shifts computed for 696,916
conformations as a function of the φ, ψ, ω, χ1 and χ2 torsional angles for all 20
naturally occurring amino acids. The 13 Cα chemical shifts were computed at the
DFT level of theory using the methodology described in Sect. 2.1. Because of the
large number of conformations, the computed shielding values were obtained using
a small basis set (6-31G/3-21G) and later extrapolated to a large basis set
[6-311+G(2d,p)/3-21G], as described in the Methods section.
An analysis of the accuracy and sensitivity of the CheShift predictions, in terms
of the correlation coefficient R between the observed and predicted 13 Cα chemical
shifts, was carried out on 36 X-ray-derived protein structures solved at 2.3 Å, or
better, resolution. Results indicate that for all the proteins the R values obtained
using the CheShift, SHIFTX [24], SPARTA [25], SHIFTS [38, 39], and PROSHIFT
[23] servers were comparable, although the CheShift values were systematically
lowest. This raises the following question: do these servers provide a more sensitive
validation than CheShift? To answer this question we chose protein 1RGE,
solved at 1.15 Å resolution [139]. The corresponding crystal structure of this protein
contains two chemically identical but crystallographically independent molecules in
the asymmetric unit, named here as A and B [139]. The main structural difference
between molecules A and B (with an all-heavy-atom rmsd of 1.1 Å) is due to dif-
ferences in side chain conformations, especially those occupying different rotameric
states. For this test, which does not require a comparison with the observed 13Cα chemical
shifts, we computed the correlation coefficient R between the 13Cα chemical-shift
predictions obtained for molecules A and B, respectively, by using the five servers listed
above. The results of this test give the following R values: 0.96, 1.00, 1.00, 0.98,
and 1.00 for CheShift, SHIFTX, SPARTA, SHIFTS, and PROSHIFT, respectively.
Except for CheShift (0.96) and SHIFTS (0.98), none of the servers is able to dis-
criminate, beyond doubt, between molecules A and B. From a statistical point of
view the R values obtained from SHIFTX (1.00), SPARTA (1.00) and PROSHIFT
(1.00) servers indicate that molecules A and B are practically indistinguishable pro-
tein models. Therefore a lower R value between the predicted and observed 13 Cα
chemical shifts does not necessarily mean poorer accuracy but it could mean higher
sensitivity to subtle structural differences. This conclusion can be confirmed by a
similar analysis carried out at a higher level of accuracy, for example, by using a
larger basis set and the actual geometry of chains A and B, i.e., without need for any
torsional angle interpolations as with the CheShift server. In this case, the R value
(0.93) computed with the larger basis set was significantly lower than the R value
obtained with CheShift (0.96), or any other server, namely, 1.00, 1.00, 0.98, and 1.00
for SHIFTX, SPARTA, SHIFTS, and PROSHIFT, respectively.
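
The sensitivity test described above compares two sets of predictions with each other rather than with experiment. A minimal Python sketch of that comparison is given below, assuming that R is the ordinary Pearson correlation coefficient. The shift values and the two hypothetical predictors are invented for illustration and do not reproduce the 1RGE results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predicted 13Calpha shifts (ppm) for five residues of molecule A.
shifts_a = [56.1, 58.4, 54.0, 61.2, 52.7]

# Predictor 1 (hypothetical): blind to side-chain differences, so molecule B
# receives the same predictions as molecule A.
insensitive_b = list(shifts_a)

# Predictor 2 (hypothetical): two residues with different rotameric states in
# molecule B receive different predictions.
sensitive_b = [56.1, 57.2, 54.0, 62.5, 52.7]

r_insensitive = pearson_r(shifts_a, insensitive_b)
r_sensitive = pearson_r(shifts_a, sensitive_b)
print(round(r_insensitive, 2), round(r_sensitive, 2))
```

A predictor that returns identical values for both molecules yields R = 1 by construction, so an R slightly below 1 is the signature of a method that responds to the structural differences between A and B.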
So far, we have shown that the QM basis of the CheShift server enables us to
predict the 13 Cα chemical shifts with reasonable accuracy in seconds. Our results
suggest that CheShift can provide a standard with which to evaluate the quality of
protein structures solved by either X-ray crystallography or NMR-spectroscopy, if
the experimentally observed 13 Cα chemical shifts are available.

4.3.2 CheShift-2: A Picture Is Worth a Thousand Words

Differences between the observed and CheShift-predicted 13Cα chemical shifts can be
used as a sensitive probe with which to detect possible local flaws in NMR-determined
protein structures; hence, a graphical user interface has been added to the CheShift-2
server [49] to render such flaws easily visible. CheShift was originally developed to
return a list of 13 Cα predicted chemical-shift values, one for each amino acid in the
sequence of a protein, except for the first and last residues [28, 33]. The validation
process, i.e., the comparison between the predicted and the observed 13 Cα chemical-
shift values, is left to the user of the server who can use the provided information to
determine the quality of the NMR structure as a whole, e.g., by computing the ca-
rmsd [33]. However, it is a highly-desirable goal of any accurate validation method
[11, 34] to identify the existence of local flaws in the sequence rather than only the
global quality. Therefore, we added a graphical user interface (GUI) to the CheShift
server. The GUI facilitates the validation process by displaying
the differences between the observed and computed 13Cα chemical shifts by using a
three-color code mapped onto a 3D protein model. This graphic validation method,
far from being only an aesthetic improvement, enables users of CheShift-2 to
detect local flaws in proteins on a per-residue basis quickly and accurately, without the
need for the user to carry out the extensive DFT calculations on which the server is
based.
The CheShift-2 server [49] makes use of the following sequential steps: (i) for
each amino acid residue i the average difference between the observed and predicted
13Cα chemical shifts, Δi, is computed by using Eq. (2); (ii) the Δi value is smoothed
by averaging it over the values of the two nearest-neighbor residues (⟨Δi⟩); (iii) the
resulting nearest-neighbor averaged value, ⟨Δi⟩, is discretized, i.e., it is assigned
an integer value of 1, 0 or −1, depending on the magnitude of ⟨Δi⟩; and (iv) these
discrete values are mapped onto the 3D protein model and color coded as blue,
white and red, respectively. This color-code assignment is based on the assumption
that ⟨Δi⟩ values within ~1.7 ppm (blue) are considered small; within
~3.4 ppm (white), medium; and beyond 3.4 ppm (red), large. Differences
corresponding to blue and white colors are considered acceptable, while red color
indicates possible flaws in the structure. In addition, the yellow color was adopted to
specify the absence of observed or computed 13Cα chemical shifts [49].
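
The smoothing, discretization and color-assignment steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's implementation: the per-residue differences are taken as already computed, the smoothing is assumed to be a plain mean over residue i and whichever of its two sequence neighbors have data, the thresholds are applied to the absolute value, and the boundaries are treated as inclusive. All function names are hypothetical.

```python
from typing import List, Optional, Sequence

SMALL_PPM = 1.7   # upper bound of the "small" (blue) range quoted in the text
MEDIUM_PPM = 3.4  # upper bound of the "medium" (white) range quoted in the text


def smooth(deltas: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Average each difference with its available nearest neighbors.

    None marks a residue without an observed or computed shift.
    Assumption: missing neighbors are simply left out of the average.
    """
    smoothed: List[Optional[float]] = []
    for i, value in enumerate(deltas):
        if value is None:
            smoothed.append(None)
            continue
        window = [v for v in deltas[max(0, i - 1): i + 2] if v is not None]
        smoothed.append(sum(window) / len(window))
    return smoothed


def color(averaged: Optional[float]) -> str:
    """Map one smoothed difference to a color name."""
    if averaged is None:
        return "yellow"            # no observed or computed shift
    magnitude = abs(averaged)      # assumption: only the magnitude matters
    if magnitude <= SMALL_PPM:
        return "blue"              # discrete value 1: small difference
    if magnitude <= MEDIUM_PPM:
        return "white"             # discrete value 0: medium difference
    return "red"                   # discrete value -1: possible flaw


def color_code(deltas: Sequence[Optional[float]]) -> List[str]:
    """Apply smoothing and color assignment to a whole sequence."""
    return [color(v) for v in smooth(deltas)]


# Hypothetical per-residue differences (ppm); None = missing data.
example = [0.5, 1.0, None, 4.0, 5.0, 0.2]
colors = color_code(example)
print(colors)
```

The resulting list of color names could then be passed to any molecular viewer that accepts per-residue coloring.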
When more than one protein model exists, the averaged differences, Δi, are computed
considering all the deposited conformations, although the colored representation is
illustrated by using only the first model. This situation is illustrated in Fig. 9 for the
20 NMR-determined conformations (see Fig. 9a) of a membrane-associated protein
from Bacillus cereus, PDB id 2K5Q. The large dispersion of conformations in the loops
and at the N- and C-termini shown in Fig. 9a, rather than being a poor representation
of the protein, reflects the flexibility of these segments of the molecule in solution,
as is clearly shown by the CheShift-2 validation of 2K5Q (see Fig. 9b).

4.3.3 Global Versus Local Validation of Proteins

The NMR-determined ensembles of the dynein light chain 2A protein, PDB ids 1TGQ
and 2B95, show different folds: 1TGQ (now obsolete) has a wrong fold, while 2B95
(which replaced the obsolete 1TGQ in the PDB) shows a correct fold. This difference
is a result of the oligomeric state assumed during the protein-structure determination,
namely a monomer for 1TGQ, and a homodimer for 2B95, as pointed out by Nabuurs et al. [11].

Fig. 9 a Superposition of 20 NMR-derived conformations of a membrane-associated protein
from Bacillus cereus, PDB id 2K5Q; b Protein 2K5Q colored according to CheShift-2. The BMRB
accession number, from which the observed 13Cα chemical shifts were obtained, is 15846

Validation of both protein ensembles, as a whole, shows that 2B95 is a slightly
better representation of the observed 13 Cα chemical shifts, in terms of the ca-rmsd
[34], than 1TGQ, viz., ca-rmsd = 2.08 and 2.35 ppm, for 2B95 and 1TGQ, respec-
tively. However, the ca-rmsd difference between these two ensembles (~0.30 ppm)
is not large enough to assure, unambiguously, that the 1TGQ ensemble needs fur-
ther refinement. In fact, a similar difference in terms of rmsd, i.e., within a range of
~0.30 ppm, was found among 5 new models of the protein ubiquitin (see grey bars
in Fig. 6), all of which fit X-ray diffraction data with R and Rfree factors similar to
those for the deposited X-ray structure, PDB id 1UBQ, solved at 1.8 Å resolution
[41]. Certainly, these 5 new models can be considered to be of comparable structural
quality. Consequently, variations of ca-rmsd ~0.30 ppm cannot be used as a univer-
sal criterion to unequivocally determine if a protein, such as 1TGQ, needs further
refinement.
Analysis of dynein light chain 2A protein illustrates that validation of a protein as
a whole (global validation), e.g., with the ca-rmsd, may not enable us to determine
unambiguously whether one protein model is of better quality than another model
of the same protein, while validation on a per-residue basis (local validation),
e.g., as with the CheShift-2 server, does (see Fig. 10). To further test the ability of the
CheShift-2 server to detect small differences between protein models, a small set of
15 obsolete/successor pairs of proteins was also considered (see Supplementary Data
of [49]). The results indicate that the CheShift-2 server constitutes a fast and accurate
validation tool with which to determine, on a per-residue basis, the existence of local
flaws in protein models even for conformations that differ in small details, as for the
obsolete and successor models of Membrane-bound Lytic Murein Transglycosylase
D (fragment LysM Domain) (see Fig. 11).
In general, pairs of obsolete and successor proteins present in PDB can be used
as a benchmark set with which to test validation methods. These ensembles of
obsolete/successor pairs of proteins are very appealing because their members possess
different topologies and numbers of residues, and complete sets of 13Cα chemical
shifts are available for a large number of them from the Biological Magnetic Resonance
Data Bank (BMRB) [117].

5 Conclusions and Future Directions

In this chapter we have illustrated how the information encoded in the 13 C chemical
shifts can be used for a variety of applications, ranging from protein
structure prediction to accurate detection of structural flaws, at the residue level, in
NMR-determined protein models.

Fig. 10 Two models of the dynein light chain 2A protein: a 1TGQ (obsolete) and b 2B95 (succes-
sor). Both models are shown as ribbons and colored according to CheShift-2. The BMRB accession
number, from which the observed 13 Cα chemical shifts were obtained, is 6527. Figure adapted from
[49] (with permission of Oxford University Press)

Fig. 11 Two models of Membrane-bound Lytic Murein Transglycosylase D (fragment LysM
Domain): a PDB id 1E01 (obsolete) and b 1E0G (successor). The BMRB accession number, from
which the observed 13Cα chemical shifts were obtained, is 4680. Figure adapted from [49] (with
permission of Oxford University Press)

The ability to detect and accurately characterize the mobility of the surface side
chains by computing 13 Cα chemical shifts constitutes one of the strengths of the cur-
rent methodology. Hence, we are planning to focus our research on the development
of new physics-based algorithms for a fast and accurate determination and validation
of side-chain conformations, with the goal to improve the quality of NMR-determined
protein models. Since NMR spectroscopy provides chemical shifts for several other
nuclei besides 13Cα, the feasibility of their DFT computation and the benefits of including
the information encoded in these data in structure determination protocols are currently
under investigation in our group. In general, new developments in the field
of NMR spectroscopy are needed in order to develop protocols for high-throughput
NMR determination of high-quality protein structures in solution.

References

1. Bhattacharya, A., Tejero, R., Montelione, G.T.: Evaluating protein structures determined by
structural genomics consortia. Proteins 66, 778–795 (2007)
2. Billeter, M., Wagner, G., Wüthrich, K.: Solution NMR structure determination of proteins
revisited. J. Biomol. NMR 42, 155–158 (2008)
3. Williamson, M.P., Craven, C.J.: Automated protein structure calculation from NMR data. J.
Biomol. NMR 43, 131–143 (2009)
4. Williamson, M.P., Kikuchi, J., Asakura, T.: Application of 1H-NMR chemical shifts to measure
the quality of protein structures. J. Mol. Biol. 247, 541–546 (1995)
5. Davis, I.W., Leaver-Fay, A., Chen, V.B., Block, J.N., Kapral, G.J., Wang, X., Murray, L.W.,
Arendall III, W.B., Snoeyink, J., Richardson, J.S., Richardson, D.C.: MolProbity: all-atom
contacts and structure validation for proteins and nucleic acids. Nucleic Acids Res. 35,
W375–W383 (2007)
6. Huang, Y.J., Powers, R., Montelione, G.T.: Protein NMR Recall, Precision, and F-measure
scores (RPF scores): Structure quality assessment measures based on information retrieval
statistics. J. Am. Chem. Soc. 127, 1665–1674 (2005)
7. Huang, Y.J., Tejero, R., Powers, R., Montelione, G.T.: A topology-constrained distance net-
work algorithm for protein structure determination from NOESY data. Proteins 62, 587–603
(2006)
8. Laskowski, R.A., MacArthur, M.W., Moss, D.S., Thornton, J.: PROCHECK—a program to
check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291 (1993)
9. Lovell, S.C., Davis, I.W., Arendall III, W.B., de Bakker, P.I.W., Word, J.M., Prisant, M.G.,
Richardson, J.S., Richardson, D.C.: Structure validation by Cα geometry: φ, ψ, and Cβ devi-
ation. Proteins 50, 437–450 (2003)
10. Lüthy, R., Bowie, J.U., Eisenberg, D.: Assessment of protein models with three-dimensional
profiles. Nature 356, 83–85 (1992)
11. Nabuurs, S.B., Spronk, C.A.E.M., Vuister, G.W., Vriend, G.: Traditional biomolecular structure
determination by NMR spectroscopy allows for major errors. PLoS Comput. Biol. 2, 71–79
(2006)
12. Vriend, G.: WHAT IF: a molecular modeling and drug design program. J. Mol. Graph. 8,
52–56 (1990)
13. Berjanskii, M., Wishart, D.S.: A simple method to predict protein flexibility using secondary
chemical shifts. J. Am. Chem. Soc. 127, 14970–14971 (2005)
14. Berjanskii, M., Wishart, D.S.: The RCI server: rapid and accurate calculation of protein
flexibility using chemical shifts. Nucleic Acids Res. 35, W531–W537 (2007)
15. Cornilescu, G., Delaglio, F., Bax, A.: Protein backbone angle restraints from searching a
database for chemical shift and sequence homology. J. Biomol. NMR 13, 289–302 (1999)
16. de Dios, A.C., Pearson, J.G., Oldfield, E.: Chemical shifts in proteins: An ab initio study
of carbon-13 nuclear magnetic resonance chemical shielding in glycine, alanine, and valine
residues. J. Am. Chem. Soc. 115, 9768–9773 (1993)
17. de Dios, A.C., Pearson, J.G., Oldfield, E.: Secondary and tertiary structural effects on protein
NMR chemical shifts: An ab initio approach. Science 260, 1491–1496 (1993)
18. Frank, A., Möller, H.M., Exner, T.E.: Toward the quantum chemical calculation of NMR
chemical shifts of proteins. 2. Level of theory, basis set, and solvent model dependence. J.
Chem. Theory Comput. 8, 1480–1492 (2012)
19. Havlin, R.H., Le, H., Laws, D.D., de Dios, A.C., Oldfield, E.: An ab initio quantum chemical
investigation of carbon-13 NMR shielding tensors in glycine, alanine, valine, isoleucine,
serine, and threonine: Comparisons between helical and sheet tensors, and effects of χ1 on
shielding. J. Am. Chem. Soc. 119, 11951–11958 (1997)
20. Iwadate, M., Asakura, T., Williamson, M.P.: Cα and Cβ carbon-13 chemical shifts in proteins
from an empirical database. J. Biomol. NMR 13, 199–211 (1999)
21. Kuszewski, J., Qin, J., Gronenborn, A.M., Clore, M.: The impact of direct refinement against
13Cα and 13Cβ chemical shifts on protein structure determination by NMR. J. Magn. Reson.
Ser. B 106, 92–96 (1995)
22. Luginbühl, P., Szyperski, T., Wüthrich, K.: Statistical basis for the use of 13Cα chemical shift
in protein structure determination. J. Magn. Reson. 109, 229–233 (1995)
23. Meiler, J.: PROSHIFT: protein chemical shift prediction using artificial neural networks. J.
Biomol. NMR 26, 25–37 (2003)
24. Neal, S., Nip, A.M., Zhang, H., Wishart, D.S.: Rapid and accurate calculation of protein 1H,
13C and 15N chemical shifts. J. Biomol. NMR 26, 215–240 (2003)
25. Shen, Y., Bax, A.: Protein backbone chemical shifts predicted from searching a database for
torsional angle and sequence homology. J. Biomol. NMR 38, 289–302 (2007)
26. Shen, Y., Lange, O., Delaglio, F., Rossi, P., Aramini, J.M., Liu, G., Eletsky, A., Wu, Y.,
Singarapu, K.K., Lemak, A., et al.: Consistent blind protein structure generation from NMR
chemical shift data. Proc. Natl. Acad. Sci. U. S. A. 105, 4685–4690 (2008)
27. Spera, S., Bax, A.: Empirical correlation between protein backbone conformation and Cα
and Cβ 13C nuclear magnetic resonance chemical shifts. J. Am. Chem. Soc. 113, 5490–5492
(1991)
28. Vila, J.A., Arnautova, Y.A., Martin, O.A., Scheraga, H.A.: Quantum-mechanics-derived 13Cα
chemical shift server (CheShift) for Protein Structure validation. Proc. Natl. Acad. Sci. U. S.
A. 106, 16972–16977 (2009)
29. Vila, J.A., Arnautova, Y.A., Scheraga, H.A.: Use of 13Cα chemical shifts for accurate deter-
mination of β-sheet structures in solution. Proc. Natl. Acad. Sci. U. S. A. 105, 1891–1896
(2008)
30. Vila, J.A., Aramini, J.M., Rossi, P., Kuzin, A., Su, M., Seetharaman, J., Xiao, R., Tong, L.,
Montelione, G.T., Scheraga, H.A.: Quantum chemical 13Cα chemical shift calculations for
protein NMR structure determination, refinement, and validation. Proc. Natl. Acad. Sci. U.
S. A. 105, 14389–14394 (2008)
31. Vila, J.A., Baldoni, H.A., Ripoll, D.R., Ghosh, A., Scheraga, H.A.: Polyproline II helix
conformation in a proline-rich environment: a theoretical study. Biophys. J. 86, 731–742 (2004)
32. Vila, J.A., Baldoni, H.A., Ripoll, D.R., Scheraga, H.A.: Unblocked statistical-coil tetrapep-
tides in aqueous solution: quantum-chemical computation of the carbon-13 NMR chemical
shifts. J. Biomol. NMR 26, 113–130 (2003)
33. Vila, J.A., Villegas, M.E., Baldoni, H.A., Scheraga, H.A.: Predicting 13Cα chemical shifts
for validation of protein structures. J. Biomol. NMR 38, 221–235 (2007)
34. Vila, J.A., Scheraga, H.A.: Assessing the accuracy of protein structures by quantum mechan-
ical computations of 13Cα chemical shifts. Acc. Chem. Res. 42, 1545–1553 (2009)
35. Villegas, M.E., Vila, J.A., Scheraga, H.A.: Effects of side-chain orientation on the 13C chem-
ical shifts of antiparallel β-sheet model peptides. J. Biomol. NMR 37, 137–146 (2007)
36. Wishart, D., Bigam, C.G., Yao, J., Abildgaard, F., Dyson, H., Oldfield, E., Markley, J., Sykes,
B.: 1H, 13C and 15N chemical shift referencing in biomolecular NMR. J. Biomol. NMR 6,
135–140 (1995)
37. Wishart, D., Bigam, C.G., Holm, A., Hodges, R.S., Sykes, B.D.: 1H, 13C and 15N random
coil NMR chemical shifts of the common amino acids. I. Investigation of nearest-neighbor
effects. J. Biomol. NMR 5, 67–81 (1995)
38. Xu, X.-P., Case, D.A.: Probing multiple effects on 15N, 13Cα, 13Cβ and 13C′ chemical shifts
in peptides using density functional theory. Biopolymers 65, 408–423 (2002)
39. Xu, X.-P., Case, D.A.: Automated prediction of 15N, 13Cα, 13Cβ and 13C′ chemical shifts
in proteins using a density functional database. J. Biomol. NMR 21, 321–333 (2001)
40. Parr, R.G., Yang, W.: Density functional theory of atoms and molecules. Oxford University
Press, New York (1989)
41. Arnautova, Y.A., Vila, J.A., Martin, O.A., Scheraga, H.A.: What can we learn by computing
13Cα chemical shifts for X-ray protein models? Acta Crystallogr. D D65, 697–703 (2009)
42. Martin, O.A., Villegas, M.E., Vila, J.A., Scheraga, H.A.: Analysis of 13Cα and 13Cβ chemical
shifts of cysteine and cystine residues in proteins: A quantum chemical approach. J. Biomol.
NMR 46, 217–225 (2010)
43. Vila, J.A., Arnautova, Y.A., Vorobjev, Y.N., Scheraga, H.A.: Assessing the fractions of tautomeric
forms of the imidazole ring of histidine in proteins as a function of pH. Proc. Natl. Acad. Sci.
U. S. A. 108, 5602–5607 (2011)
44. Vila, J.A., Ripoll, D.R., Scheraga, H.A.: Use of 13Cα chemical shifts in protein structure
determination. J. Phys. Chem. B 111, 6577–6585 (2007)
45. Vila, J.A., Scheraga, H.A.: Factors affecting the use of 13Cα chemical shifts to determine,
refine, and validate protein structures. Proteins: Struct. Funct. Bioinf. 71, 641–654
(2008)
46. Wüthrich, K.: NMR of Proteins and Nucleic Acids. Wiley, New York, NY, U. S. A. (1986)
47. Sun, H., Sanders, L.K., Oldfield, E.: Carbon-13 NMR shielding in the twenty common amino
acids: comparisons with experimental results in proteins. J. Am. Chem. Soc. 124, 5486–5495
(2002)
48. Vila, J.A., Serrano, P., Wüthrich, K., Scheraga, H.A.: Sequential nearest-neighbor effects on
computed 13Cα chemical shifts. J. Biomol. NMR 48, 23–30 (2010)
49. Martin, O.A., Vila, J.A., Scheraga, H.A.: CheShift-2: graphic validation of protein structures.
Bioinformatics 28, 1538–1539 (2012)
50. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov,
I.N., Bourne, P.E.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
51. Brünger, A.T., Adams, P.D., Clore, G.M., DeLano, W.L., Gros, P., Grosse-Kunstleve, R.W.,
Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N.S., Read, R.J., Rice, L.M., Simonson, T.,
Warren, G.L.: Crystallography and NMR system: a new software suite for macromolecular
structure determination. Acta Crystallogr. D 54, 905–921 (1998)
52. Brünger, A.T.: Version 1.2 of the Crystallography and NMR system. Nat. Protoc. 2, 2728–2733
(2007)
53. Cavalli, A., Salvatella, X., Dobson, C.M., Vendruscolo, M.: Protein structure determination
from NMR chemical shifts. Proc. Natl. Acad. Sci. U.S.A. 104, 9615–9620 (2007)
54. Cornilescu, G., Marquardt, J.L., Ottiger, M., Bax, A.: Validation of protein structure from
anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. J. Am. Chem. Soc.
120, 6836–6837 (1998)
55. Frank, A., Onila, I., Möller, H.M., Exner, T.E.: Toward the quantum chemical calculation of
nuclear magnetic resonance chemical shifts of proteins. Proteins 79, 2189–2202 (2011)
56. Guerry, P., Herrmann, T.: Advances in automated NMR protein structure determination. Q.
Rev. Biophys. 44, 257–309 (2011)
57. Güntert, P.: Structure calculation of biological macromolecules from NMR data. Q. Rev.
Biophys. 31, 145–237 (1998)
58. Güntert, P.: Automated structure determination from NMR spectra. Eur. Biophys. J. 38,
129–143 (2009)
59. Güntert, P., Braun, W., Wüthrich, K.: Efficient computation of three-dimensional protein
structures in solution from nuclear magnetic resonance data using the program DIANA and the
supporting programs CALIBA, HABAS and GLOMSA. J. Mol. Biol. 217, 517–530 (1991)
60. Rosato, A., Aramini, J.M., Arrowsmith, C., Bagaria, A., Baker, D., Cavalli, A., Doreleijers,
J.F., Eletsky, A., Giachetti, A., Guerry, P., et al.: Blind testing of routine, fully automated
determination of protein structures from NMR data. Structure 20, 227–236 (2012)
61. Rosato, A., Bagaria, A., Baker, D., Bardiaux, B., Cavalli, A., Doreleijers, J.F., Giachetti, A.,
Guerry, P., Güntert, P., Herrmann, T., et al.: CASD-NMR: critical assessment of automated
structure determination by NMR. Nat. Methods 6, 625–626 (2009)
62. Némethy, G., Gibson, K.D., Palmer, K.A., Yoon, C.N., Paterlini, G., Zagari, A., Rumsey, S.,
Scheraga, H.A.: Energy parameters in polypeptides. 10. Improved geometrical parameters
and nonbonded interactions for use in the ECEPP/3 algorithm, with application to
proline-containing peptides. J. Phys. Chem. 96, 6472–6484 (1992)
63. Frisch, M.J., Trucks, G.W., Schlegel, H.B., Scuseria, G.E., Robb, M.A., Cheeseman, J.R.,
Zakrzewski, V.G., Montgomery Jr., J.A., Stratmann, R.E., Burant, J.C., et al.: Gaussian 03,
Revision E.01, Gaussian, Inc., Wallingford CT (2003)
64. Chesnut, D.B., Moore, K.D.: Locally dense basis-sets for chemical-shift calculations. J. Comp.
Chem. 10, 648–659 (1989)
65. Jameson, A.K., Jameson, C.J.: Gas-phase 13C chemical shifts in the zero-pressure limit:
Refinements to the absolute shielding scale for 13C. Chem. Phys. Lett. 134, 461–466 (1997)
66. Vásquez, M., Scheraga, H.A.: Variable-target-function and buildup procedures for the calcula-
tion of protein conformation—application to bovine pancreatic trypsin-inhibitor using limited
simulated nuclear magnetic-resonance data. J. Biomol. Struct. Dyn. 5, 757–784 (1988)
67. Kruskal Jr., J.B.: On the shortest spanning subtree of a graph and the traveling salesman
problem. Proc. Am. Math. Soc. 7, 48–50 (1956)
68. Li, Z., Scheraga, H.A.: Monte Carlo minimization approach to the multiple minima problem
in protein folding. Proc. Natl. Acad. Sci. U. S. A. 84, 6611–6615 (1987)
69. Li, Z., Scheraga, H.A.: Structure and free energy of complex thermodynamic systems. J.
Molec. Str. (Theochem) 179, 333–352 (1998)
70. Arnautova, Y.A., Jagielska, A., Scheraga, H.A.: A new force field (ECEPP05) for peptides,
proteins, and organic molecules. J. Phys. Chem. B 110, 5025–5044 (2006)
71. Vila, J., Williams, R.L., Vásquez, M., Scheraga, H.A.: Empirical solvation models can be used
to differentiate native from near-native conformations of bovine pancreatic trypsin inhibitor.
Proteins: Struct. Funct. Genet. 10, 199–218 (1991)
72. Ripoll, D.R., Ni, F.: Refinement of the thrombin-bound structure of a hirudin peptide by a
restrained electrostatically driven monte-carlo method. Biopolymers 32, 359–365 (1992)
73. Vorobjev, Y.N., Scheraga, H.A.: A fast adaptive multigrid boundary element method for
macromolecule electrostatic computations in solvent. J. Comp. Chem. 18, 569–583 (1997)
74. Vorobjev, Y.N., Vila, J.A., Scheraga, H.A.: FAMBE-pH: a fast and accurate method to compute
the total solvation free energies of proteins. J. Phys. Chem. B 112, 11122–11136 (2008)
75. Ripoll, D.R., Vorobjev, Y.N., Liwo, A., Vila, J.A., Scheraga, H.A.: Coupling between folding
and ionization equilibria: Effects of pH on the conformational preferences of polypeptides. J.
Mol. Biol. 264, 770–783 (1996)
76. Vila, J.A., Ripoll, D.R., Arnautova, Y.A., Vorobjev, Y.N., Scheraga, H.A.: Coupling between
conformation and proton binding in proteins. Proteins 61, 56–68 (2005)
77. Sitkoff, D., Sharp, K.A., Honig, B.: Accurate calculation of hydration free energies using
macroscopic solvent models. J. Phys. Chem. 98, 1978–1988 (1994)
78. Barth, P., Alber, T., Harbury, P.B.: Accurate, conformation-dependent predictions of solvent
effects on protein ionization constants. Proc. Natl. Acad. Sci. U. S.A. 104, 4898–4903 (2007)
79. Hass, M.A.S., Hansen, D.F., Christensen, H.E.M., Led, J.J., Kay, L.E.: Characterization of
conformational exchange of a histidine side chain: protonation, rotamerization, and tautomer-
ization of His61 plastocyanin from Anabaena variabilis. J. Am. Chem. Soc. 130, 8460–8470
(2008)
696 J. A. Vila and Y. A. Arnautova

80. Serrano, P., Johnson, M.A., Chatterjee, A., Neuman, B., Joseph, J.S., Buchmeier, M.J., Kuhn,
P., Wüthrich, K.: NMR structure of the nucleic acid-binding domain of the SARS coronavirus
nonstructural protein 3. J. Virol. 83, 12998–13008 (2009)
81. Schwarzinger, S., Kroon, G.J.A., Foss, T.R., Chung, J., Wright, P.E., Dyson, H.J.: Sequence-
dependent correction of random coil NMR chemical shifts. J. Am. Chem. Soc. 123, 2970–2978
(2001)
82. Wang, Y., Jardetzky, O.: Investigation of the neighboring residue effects on protein chemical
shifts. J. Am. Chem. Soc. 12, 14075–14084 (2002)
83. Vijay-Kumar, S., Bugg, C.E., Cook, W.J.: Structure of ubiquitin refined at 1.8 Å resolution.
J. Mol. Biol. 194, 531–544 (1987)
84. Quirt, A.R., Lyerla Jr., J.R., Peat, I.R., Cohen, J.S.: Reynolds WF and freedman MH Carbon-
13 nuclear magnetic resonance titration shifts in amino acids. J. Am. Chem. Soc. 96, 570–574
(1974)
85. Rabenstein, D.L., Sayer, T.L.: Carbon-13 shifts parameters for amines, carboxylic acids and
amino acids. J. Magn. Res. 24, 27–39 (1976)
86. Sayer, T.L., Rabenstein, D.L.: Nuclear magnetic resonance studies of the acid-base chemistry
of amino acids and peptides. III Determination of the microscopic and macroscopic acid
dissociation constants of α, ω-diaminocarboxylic acids Can. J. Chem. 54, 3392–3400 (1976)
87. Surprenant, H.L., Sarneski, J.E., Key, R.R., Byrd, J.T., Reilley, C.N.: Carbon-13 studies of
amino acids: chemical shifts, protonation shifts, microscopic protonation behavior. J. Magn.
Res. 40, 231–243 (1980)
88. Lindorff-Larsen, K., Best, R.B., Depristo, M.A., Dobson, C.M., Vendruscolo, M.: Simulta-
neous determination of protein structure and dynamics. Nature 433, 128–132 (2005)
89. Chakrabarti, P., Pal, D.: Main-chain conformational features at different conformations of the
side-chains in proteins. Protein Eng. 11, 631–647 (1998)
90. Dumbrack Jr., R.L., Karplus, M.: Conformational analysis of the backbone-dependent rotamer
preferences of protein sidechains. J. Mol. Biol. 230, 543–574 (1993)
91. Chothia, C., Levitt, M., Richardson, D.: Structure of proteins: packing of α-helices and β-
sheets. Proc. Natl. Acad. Sci. U. S. A. 74, 4130–4134 (1977)
92. Chou, K.-C., Pottle, M., Némethy, G., Ueda, Y., Scheraga, H.A.: Structure of β sheets. Origin
of the right handed twist and of the increased stability of antiparallel over parallel sheets. J.
Mol. Biol. 162, 89–112 (1982)
93. Chou, K.-C., Scheraga, H.A.: Origin of the right handed twist of β sheets of poly(L Val)
chains. Proc. Natl. Acad. Sci. USA 79, 7047–7051 (1982)
94. Creighton, T.E.: Proteins: Structure and Molecular Properties, pp. 186, 223. W.E. Freeman
and Company, New York (1984)
95. Karplus, M.: Contact electron-spin coupling of nuclear magnetic moments. J. Chem. Phys.
30, 11–15 (1959)
96. Mandel, M.: Proton Magnetic resonance spectra of some proteins: I. Ribonuclease, oxidized
ribonuclease, lysozyme, and cytochrome c. J. Biol Chem. 240, 1586–1592 (1965)
97. Bradbury, J.H., Scheraga, H.A.: Structural studies of ribonuclease. XXIV. The application
of nuclear magnetic resonance spectroscopy to distinguish between the histidine residues of
ribonuclease. J. Am. Chem. Soc. 88, 4240–4246 (1966)
98. Bachovchin, W.W.: 15 N NMR spectroscopy of hydrogen-bonding interactions in the active
site of serine proteases: evidence for a moving histidine mechanism. Biochemistry 25,
7751–7759 (1986)
99. Cheng, F., Sun, H., Zhang, Y., Mukkamala, D., Oldfield, E.: A solid state 13C NMR, crystal-
lographic, and quantum chemical investigation of chemical shifts and hydrogen bonding in
histidine dipeptides. J. Am. Chem. Soc. 127, 12544–12554 (2005)
100. Farr-Jones, S., Wong, W.Y.L., Gutheil, W.G., Bachovchin, W.W.: Direct observation of the tau-
tomeric forms of histidine in 15 N NMR spectra at low temperatures. Comments on intramolec-
ular hydrogen bonding on tautomeric equilibrium. J. Am. Chem. Soc. 115, 6813–6819 (1993)
101. Harbison, G., Herzfeld, J.: Griffin RGJ Nitrogen-15 chemical shifts tensors in L-histidine
hydrochloride monohydrate. J. Am. Chem. Soc. 103, 4752–4754 (1981)
13 C Chemical Shifts in Proteins: A Rich … 697

102. Hass, M.A.S., Yilmaz, A., Christensen, H.E.M., Led, J.J.: Histidine side-chain dynamics
and protonation monitored by 13C CPMG NMR relaxation dispersion. J. Biomol. NMR 44,
225–233 (2009)
103. Hu, F., Wenbin, L., Hong, M.: Mechanism of proton conduction and gating in influenza M2
proton channels from solid-state NMR. Science 330, 505–508 (2010)
104. Jensen, M.R., Has, M.A.S., Hansen, D.F., Led, J.J.: Investigating metal-binding in proteins
by nuclear magnetic resonance. Cell. Mol. Life Sci. 64, 1085–1104 (2007)
105. Markley, J.L.: Observation of histidine residues in proteins by means of nuclear magnetic
resonance spectroscopy. Acc. Chem. Res. 8, 70–80 (1974)
106. Meadows, D.H., Jardetzky, O., Epand, R.M., Ruterjans, H.H., Scheraga, H.A.: Proc. Natl.
Acad. Sci. U.S.A. 60, 766–772 (1968)
107. Pelton, J.G., Torchia, D.A., Meadow, N.D., Roseman, S.: Tautomeric states of the active-site
histidine of phosphorylated and unphosphorylated IIIGlc, a signal-transducing protein from
Escherichia coli, using two-dimensional heteronuclear NMR techniques ProtSci 2, 543–558
(1993)
108. Reynolds, W.F., Peat, I.R., Freedman, M.H., LyerlaJr, J.R.: Determination of the tautomeric
form of the imidazole ring of L-Histidine in basic solution by carbon-13 magnetic resonance
spectroscopy. J. Am. Chem. Soc. 95, 328–331 (1973)
109. Schuster, I.I., Roberts, J.D.: Nitrogen-15 nuclear magnetic resonance spectroscopy. Effects of
hydrogen bonding and protonation on nitrogen chemical shifts in imidazoles. J. Org. Chem.
44, 3864–3867 (1979)
110. Shimba, N., Serber, Z., Lewidge, R., Miller, S.M., Craik, C.S., Dotsch, V.: Quantitative iden-
tification of the protonation state of histidine in vitro and in vivo. Biochem 42, 9227–9234
(2003)
111. Shimba, N., Takahashi, H., Sakakura, M., Fuji, I., Shimada, I.: Determination of protonation
and deprotonation forms and tautomeric states of histidine residues in large proteins using
nitrogen-carbon J couplings in imidazole ring. J. Am. Chem. Soc. 120, 10988–10989 (1998)
112. Steiner, T.: L-Histidyl-L-alanine dehydrate. Acta. Cryst. C 52, 2554–2556 (1996)
113. Steiner, T., Koellner, G.: Coexistence of both histidines tautomers in the solid state and sta-
bilization of the unfavorable Nδ-H form by intramolecular hydrogen bonding: rystalline L-
His-Gly hemihydrates. Chem. Commun. 13, 1207–1208 (1997)
114. Strohmeier, M., Stueber, D., Grant, D.M.: Accurate 13C and 15 N chemical shift and 14 N
quadrupolar coupling constant calculations in amino acid crystals: Zwitterionic, hydrogen-
bonded systems. J. Phys. Chem. A 107, 7629–7642 (2003)
115. Sudmeier, J.L., Bradshaw, E.M., Coffman Haddad, K.E., Day, R.M., Thalhauser, C.J., Bullock,
P.A., Bachovchin, W.W.: Identification of histidine tautomers in proteins by 2D 1H/13Cδ2
one-bond correlated NMR. J. Am. Chem. Soc. 125, 8430–8431 (2003)
116. Wüthrich, K.: NMR in Biological Research: Peptides and Proteins. North-Holland, Amster-
dam (1976)
117. Ulrich, E.L., Akutsu, H., Doreleijers, J.F., Harano, Y., Ioannidis, Y.E., Lin, J., Livny, M.,
Mading, S., Maziuk, D., Miller, Z., Nakatani, E., Schulte, C.F., Tolmie, D.E., Wenger, R.K.,
Yao, H., Markley, J.L.: BioMagResBank nucleic. Acids Res. 36, D402–D408 (2008)
118. Demchuk, E., Wade, R.C.: Improving the continuum dielectric approach to calculating pKas
of ionizeable groups in proteins. J. Phys. Chem. 100, 17373–17387 (1996)
119. DePristo, M.A., de Bakker, P.I.W., Blundell, T.L.: Heterogeneity and inaccuracy in protein
structures solved by X-ray crystallography. Structure 12, 831–838 (2004)
120. Ringe, D., Petsko, G.A.: Study of protein dynamics by X-ray diffraction Methods in Emzy-
mology 131, 389–433 (1986)
121. Furnham, N., Blundell, T.L., DePristo, M.A., Terwilliger, T.C.: Is one solution good enough?
Nature Struct. Mol. Biol. 13, 184–185 (2006)
122. Wang, Y., Jardetzky, O.: Probability-based protein secondary structure identification using
combined NMR chemical-shift data. Prot Sci 11, 852–861 (2002)
123. Höfinger, S., Almeida, B., Hansmann, U.H.E.: Parallel tempering molecular dynamics folding
simulation of a signal peptide in explicit water. Proteins 68, 662–669 (2007)
698 J. A. Vila and Y. A. Arnautova

124. Jang, S., Kim, E., Pak, Y.: Free energy surfaces of miniproteins with a beta beta alpha motif:
replica exchange molecular dynamics simulation with an implicit solvation model. Proteins
62, 663–671 (2006)
125. Mohanty, S., Hansmann, U.H.E.: Folding of proteins with diverse folds. Biophy. J. 91,
3573–3578 (2006)
126. Zhou, R.: Free energy landscape of protein folding in water: Explicit versus implicit solvent.
Proteins 53, 148–161 (2003)
127. Santiveri, C.M., Santoro, J., Rico, M., Jiménez, M.A.: Factors involved in the stability of
isolated beta-sheets: turn sequence, beta-sheet twisting, and hydrophobic surface burial. Prot.
Sci. 13, 1134–1147 (2004)
128. Zhao, D., Jardetzky, O.: An assessment of the precision and accuracy of protein structures
determined by NMR–dependence on distance errors. J. Mol. Biol. 239, 601–607 (1994)
129. Korzhnev, D.M., Orekhov, V.Y., Arseniev, A.S.: Model-free approach beyond the borders of
its applicability. J. Mag. Res. 127, 184–191 (1997)
130. Palmer III, A.G.: NMR characterization of the dynamics of biomacromolecules. Chem. Rev.
104, 3623–3640 (2004)
131. Case, D.A., Darden, T.A., Cheatham, T.E., III, Simmerling, C.L., Wang, J., Duke, R.E., Luo,
R., Merz, K.M., Wang, B., Pearlman, D.A., et al.: AMBER 8 University of California, San
Francisco (2004)
132. Zhou, Y., Vitkup, D., Karplus, M.: Native proteins are surface-molten solids: Application of
the Lindemann criterion for the solid versus liquid state. J. Mol. Biol. 285, 1371–1375 (1999)
133. Kuzin, A.P., Su M., Seetharaman, J., Janjua, H., Cunningham, K., Maglaqui, M., Owens,
L.A., Zhao, L., Xiao, R., Baran, M.C., Acton, T.B., Rost, B., Montelione, G.T., Hunt, J.F.,
Tong, L.: Crystal structure of UPF0291 protein ynzC from Bacillus subtilis at resolution 2.0
A. (2008) Northeast Structural Genomics Consortium target SR384. https://doi.org/10.2210/
pdb3bhp/pdb
134. Kawai, Y., Moriya, S., Ogasawara, N.: Identification of a protein YneA, responsible for
cell division suppression during the SOS response in Bacillus subtilis. Mol. Microbiol. 47,
1113–1122 (2003)
135. Aramini, J.M., Sharma, S., Huang, Y.J., Swapna, G.V.T., Ho, C.K., Shetty, K., Cunningham,
K., Ma, L.-C., Zhao, L., Owens, L.A., Jiang, M., Xiao, R., Liu, J., Baran, M.C., Acton, T.B.,
Rost, B., Montelione, G.T.: Solution NMR structure of the SOS response protein YnzC from
Bacillus subtilis Proteins: Structure. Funct. Bioinformatics 72, 526–530 (2008)
136. Vila, J. A., Baldoni, H. A., Scheraga, H. A.: performance of density functional models to
reproduce observed 13Cα chemical shifts of proteins in solution. J. Comp. Chem. 38, 884–892
(2008b)
137. Sippl, M.J.: Recognition of errors in three-dimensional structures of proteins. Proteins 17,
355–362 (1993)
138. Kleywegt, G.J.: On vital aid: the why, what and how of validation Acta. Cryst, D 65, 134–139
(2009)
139. Sevcik, J., Dauter, Z., Lamzin, V.S., Wilson, K.S.: Ribonuclease from streptomyces aureofa-
ciens at atomic resolution. Acta Cryst D D52, 327–344 (1996)
Protein Secondary Structure
Assignments and Their Usefulness
for Dihedral Angle Prediction

Eshel Faraggi and Andrzej Kloczkowski

Abstract We present and compare different protein secondary structure assignment
methods and the effect of their use in dihedral angle prediction. It is found
that consensus reassignment of secondary structure tends to improve the accuracy
of secondary structure prediction. However, it is less useful for the prediction of the
dihedral angles than a better defined reassignment method based on angle values.
Considering reassigned residues, we find them to be hard to predict. We find the
most significant improvement for reassigned residues is due to our new reassignment
method. This method also reassigns a smaller number of residues as compared
to consensus methods. We additionally find that improvements to the accuracy of
dihedral angle prediction are due both to single residue and local-neighborhood effects.

E. Faraggi
Department of Physics, Indiana University Purdue University Indianapolis,
Indianapolis, IN 46202, USA
E. Faraggi
Department of Physics, Butler University, Indianapolis, IN 46208, USA
E. Faraggi · A. Kloczkowski
Battelle Center for Mathematical Medicine,
The Research Institute at Nationwide Children’s Hospital,
Columbus, OH 43215, USA
E. Faraggi (B)
Physics Division, Research and Information Systems,
LLC, Indianapolis, IN 46240, USA
e-mail: efaraggi@gmail.com
A. Kloczkowski
Department of Pediatrics, The Ohio State University, Columbus, OH 43215, USA
e-mail: Andrzej.Kloczkowski@nationwidechildrens.org

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_21

1 Introduction

Proteins are the most important part of the biological machinery that carries out the
instructions contained in genetic code. As such, proteins are responsible on some
level for all biological functions. For example, charge regulation, tissue building and
repair, and cellular transport are all achieved by the use of proteins. Proteins perform
their functional duties by various interactions associated with their three-dimensional
structures. Their three-dimensional structures are thought to be determined by their
amino acid sequence, which in turn is specifically encoded in the genetic material of
the hosting organism.
While the genetic code, and the amino acid sequences of proteins, can be obtained
using automated experimental whole-genome sequencing and whole exome sequenc-
ing procedures relatively cheaply, currently the three-dimensional structures of pro-
teins can only be experimentally determined using labor-intensive and costly pro-
cedures. This creates a widening gap between the number of proteins for which
the sequence is known and the number of proteins for which the three-dimensional
structure is solved. Furthermore, for some proteins it is difficult to obtain an
experimental structure, either because isolating the protein is difficult or because
standard experimental procedures such as solvation and crystallization are difficult to apply.
These considerations lead to extensive interest and activity in the field of protein
structure prediction. Protein structures are typically categorized into four levels of
increasing structural information. The first level in this categorization, sometimes
called the primary structure, is the amino acid sequence of the protein. This is the
chemical structure of the protein. The second level is the so-called secondary
structure, which involves the patterns of hydrogen bonds corresponding to α-helices or
β-sheets along the amino acid sequence. These patterns are manifested as local
three-dimensional structures. The third level of structural information, the tertiary
structure, is associated with the packing of secondary structure elements into a
single-domain protein structure. In theory tertiary structure can be deduced from the
dihedral angles. In practice, however, small errors in the coil regions create large errors
for the overall arrangement of secondary structure elements and additional local
refinement is needed. The fourth level of structural classification is called the
quaternary structure and is associated with packing of the tertiary structure of individual
protein chains into functioning biological multimeric assemblies of several chains.
This hierarchy enables proteins to bridge the size gap between individual atoms and
biological components.
Prediction of protein structure usually exploits this hierarchy as well. Secondary
structure predictions [1–26] are used to set initial conditions and act as constraints
in three-dimensional prediction schemes [27–34]. Recently it was shown that
substitution of dihedral angles for secondary structure constraints in template-free
modeling of three-dimensional structure results in a 100% improvement in the prediction
accuracy (twice the number of structures predicted to within 6 Å of native structure)
[35]. Part of that approach to predicting dihedral angles uses secondary structure
predictions as input features. The present work shows how these secondary
structures were obtained and analyzes their usefulness relative to other assignment
schemes as input features for the prediction of the dihedral angle ψ.
In addition to structured proteins, there are proteins that are partially disordered
(containing both ordered and disordered regions) or fully disordered. These
types of proteins are not considered here.

2 Materials and Methods

2.1 Secondary Structure Assignment

A necessary step in developing a predictor for secondary structure is the assignment of
the different amino acids into structural motifs. A common approach for this
assignment is to use the program DSSP (Define Secondary Structure of Proteins) [36],
which analyzes the pattern of the hydrogen bonds between neighboring amino acids
in the protein sequence. Other methods for secondary structure assignment exist,
e.g., STRIDE (Structure Identification) [37]. In the so-called hard classification the
eight assignment states of DSSP are converted into three states by grouping the
3₁₀-helix, alpha-helix and pi-helix (G, H, I) into helix (H); beta-bridge and extended
strand (B, E) into strand (E); and hydrogen bonded turn, bend and other (T, S, _) into
coil (C). Other approaches for reducing the DSSP states also exist, such as the
so-called easy classification where alpha-helix is taken as helix, extended strand as
strand and all the rest of the DSSP states are assigned as coil. As discussed by Wei
et al. [38], different classification approaches will lead to different assignments for
some residues. They also showed that a consensus based on several secondary
structure assignment methods results in improved agreement of the assigned secondary
structures for structurally aligned proteins.
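The two reductions described above can be written as a short lookup. The following is a minimal sketch in Python, not the authors' code; the function name `reduce_dssp` and the treatment of a blank or '-' character as the "other" state are assumptions made for illustration.

```python
# Three-state reductions of the eight DSSP states, as described in the text.
# '_' denotes the DSSP "other" state; treating ' ' and '-' the same way
# is an assumption of this sketch.
HARD = {"G": "H", "H": "H", "I": "H",   # 3-10, alpha, and pi helix -> helix
        "B": "E", "E": "E",             # bridge and extended strand -> strand
        "T": "C", "S": "C", "_": "C"}   # turn, bend, other -> coil

def reduce_dssp(states, scheme="hard"):
    """Map a string of DSSP states to helix (H), strand (E), or coil (C)."""
    if scheme not in ("hard", "easy"):
        raise ValueError("scheme must be 'hard' or 'easy'")
    out = []
    for s in states:
        if s in (" ", "-"):
            s = "_"
        if scheme == "hard":
            out.append(HARD[s])
        else:
            # Easy classification: only alpha-helix and extended strand keep
            # their class; every other state becomes coil.
            out.append(s if s in ("H", "E") else "C")
    return "".join(out)
```

For instance, `reduce_dssp("GHIBETS_")` gives `"HHHEECCC"` under the hard scheme, while the easy scheme maps the same string to `"CHCCECCC"`.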
Since we use these secondary structure assignments for dihedral angle predictions
it is instructive to study the angle distributions for the three secondary structure
classes. In Fig. 1a we show the frequency of the observed dihedral ψ angles for the
three classes according to the hard classification. While the helix class is well peaked
around a specific range of ψ angle values, the sheet class shows a major peak around
ψ ≈ 130° and a minor one around ψ ≈ −50°. A second peak is observable also for
the φ angle of the sheet conformation and similarly for helix, though for helix the
frequency of occurrence is much reduced. We shall refer to angles in the second peak
as odd angles. A typical three-dimensional structure associated with these angles is
shown in Fig. 1b (plot produced with Rasmol [39]). These dihedral angles appear to
be associated with sharp turns in the sheet. We can quantify the curvature along the
sheet by the angle between consecutive segments of the Cα backbone, labeled θ. In
Fig. 1c we give θ(i) as a function of ψ in the sheet conformation. We see that these
ψ angles correspond uniquely to right angle turns. These considerations led us [35]
to modify the assignment of residues involved in the formation of these odd angles
from their original assignment into the coil class.
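As an illustration of the curvature measure, the sketch below computes an angle between consecutive Cα segments with NumPy. It assumes that θ is the angle between the direction vectors of the two segments meeting at a residue (0° for a straight chain, 90° for a right angle turn); the text does not spell out the convention, and the function name `curvature_angles` is hypothetical.

```python
import numpy as np

def curvature_angles(ca):
    """Angle in degrees between consecutive C-alpha segments.

    ca: array of shape (n, 3) holding C-alpha coordinates in chain order.
    Returns n - 2 angles; entry k belongs to residue k + 1 (0-based).
    Assumed convention: angle between segment direction vectors.
    """
    ca = np.asarray(ca, dtype=float)
    seg = ca[1:] - ca[:-1]              # vectors between neighboring C-alpha atoms
    a, b = seg[:-1], seg[1:]            # incoming and outgoing segment per residue
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

With this convention a right angle turn yields 90° whichever of the two common definitions (direction-vector angle or interior angle) is used, so the qualitative observation in the text does not depend on the choice.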

[Figure 1, three panels: (a) frequency versus ψ for the sheet, coil, and helix classes; (b) three-dimensional structure image; (c) θ versus ψ.]

Fig. 1 a Distribution of dihedral angle ψ for all three secondary structures. b Example of a
deformation in the three-dimensional structure associated with a beta-sheet; the dihedral angle ψ there
would be in the range of values covered by alpha-helix structures. c The distribution of the curvature
angle (θ) along the backbone

Another approach we can take is to reassign the secondary structure for residues
with odd angles to the opposite structure classes. That is, reassign sheet residues
with odd angles into the helix class and reassign helix residues with odd angles
into the sheet class. Since for the distribution of ψ the locations of the odd angles
approximately correspond to the opposite structure class, this modification may help
in the prediction of quantities such as dihedral angles where predicted secondary
structures are used as input. As we shall see later, these reassignments actually do
not change the accuracy of secondary structure prediction.
The following nomenclature will be used to designate the types of secondary
structure assignment. A1 will be used to designate the original assignment by DSSP.
A2 and A5 will be used to designate the modified assignment where helix and sheet
are interchanged for residues with odd angles. A3 and A4 will be used to designate
reassignment into coil of residues with odd angles. For A2 and A4 we perform the
modification on the DSSP assignments, while for A3 and A5 we perform odd angle
reassignments on the consensus secondary structure as discussed by Wei et al. [38].
We will use A6 to designate the easy classification of the DSSP assignment as
discussed above.
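The two kinds of odd angle reassignment can be sketched as follows. This is an illustration only: the text gives approximate peak positions (ψ ≈ −50° for the minor sheet peak, ψ ≈ 130° for the main sheet peak) but no cutoffs, so the `half_width` of 60° and the function names are hypothetical. Only ψ is treated here; the φ-based rule mentioned for A2 is left out.

```python
def angular_distance(a, b):
    """Smallest separation in degrees between two angles, with wrap-around."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def is_odd(ss, psi, half_width=60.0):
    """Hypothetical odd-angle test on psi for a three-state label."""
    if ss == "E":   # sheet residue with psi near the minor (helix-like) peak
        return angular_distance(psi, -50.0) <= half_width
    if ss == "H":   # helix residue with psi near the sheet peak
        return angular_distance(psi, 130.0) <= half_width
    return False

def reassign(ss_seq, psi_seq, mode):
    """mode='swap' exchanges helix and sheet (the A2/A5 kind of change);
    mode='coil' moves odd-angle residues to coil (the A3/A4 kind)."""
    if mode not in ("swap", "coil"):
        raise ValueError("mode must be 'swap' or 'coil'")
    swap = {"H": "E", "E": "H"}
    out = []
    for ss, psi in zip(ss_seq, psi_seq):
        if is_odd(ss, psi):
            out.append(swap[ss] if mode == "swap" else "C")
        else:
            out.append(ss)
    return "".join(out)
```

The input string can be either the DSSP three-state assignment or a consensus assignment, which is the distinction between A2/A4 and A3/A5 in the nomenclature above.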

2.2 Neural Network

We have used several types of neural networks to analyze this problem. In the first part
of the analysis we used a neural network to predict the secondary structure assignment
according to the classifications described above. One should note that at this step
we are interested in a comparative analysis between the different methods and not in
the overall accuracy. Results for the overall accuracy of the SPINE-X server will be
given later. The general form of the neural networks used to predict the secondary
structure was described in detail earlier [35, 40]; here we give only a brief overview
of them. The input layer for the network is composed of an n-residue input window
where each residue has the following descriptors. The first are twenty values from the
position specific scoring matrix as obtained from the PSI-BLAST program [41], normalized
by 9.0. The second are seven physical parameters describing the physicochemical
properties of the residue: a steric parameter (graph shape index), hydrophobicity, volume,
polarizability, isoelectric point, helix probability, and sheet probability. These parameters
were identified by Meiler et al. [42] and have proved useful in protein structure
prediction [18, 35, 40, 43, 44]. They were linearly normalized such that their values vary
between −1 and 1. We also constructed our own mutation profile by taking aligned
sequences from the PSI-BLAST NR dataset with bit values between 20 and 60.
Included in this profile were also probabilities of gaps at particular residues. Finally,
we calculated a sequence complexity parameter [45] for window sizes of 5, 11, 21, and
31 around a given residue. Thus, for a given residue we have 21 × 52 = 1092 input
features. Three-state secondary structure probabilities were predicted for the central
residue in a 21-residue input window.
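The feature count can be checked with a few lines of Python. The sizes of the PSSM (20), physical parameter (7), and sequence complexity (4) blocks are stated in the text; the size of the mutation profile (21, taken here as 20 residue types plus one gap probability) is an inference made so that the per-residue total equals the stated 52. The helper `window_range` illustrates one way to limit the window at the chain termini and is hypothetical.

```python
# Per-residue descriptor counts. The mutation-profile size is inferred
# (52 - 20 - 7 - 4 = 21), not stated explicitly in the text.
PER_RESIDUE = {"pssm": 20, "physical": 7, "mutation_profile": 21, "complexity": 4}
WINDOW = 21

def n_input_features(window=WINDOW, per_residue=PER_RESIDUE):
    """Total number of inputs for a window of residues."""
    return window * sum(per_residue.values())

def window_range(i, chain_length, window=WINDOW):
    """Chain positions covered by a window centered on residue i (0-based),
    clipped so that vacant positions beyond the termini are left out."""
    half = window // 2
    return range(max(0, i - half), min(chain_length, i + half + 1))
```

For a residue in the middle of a chain the window covers 21 positions, while for the first residue only the 11 positions from the residue itself to ten residues downstream remain.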
Two hidden layers, each with 71 nodes, were used for a preprocessing network.
Probability predictions from this network were further refined using a filter network
with a single hidden layer of 51 nodes. In addition, guiding weights were used
to control the dependence on sequence separation as described in Refs. [35, 40].
Training and testing for all neural networks considered here were performed on the
SPINE dataset [18].
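To make the layer sizes concrete, the following NumPy sketch runs a forward pass through a network with the stated shape (1092 inputs, two hidden layers of 71 nodes, three outputs) using the activation f(x) = tanh(αx) with α = 0.2 given in this section. Several details are assumptions for illustration: the random initialization, the use of the same activation on the output layer, and the input size of the filter network (taken here as a 21-residue window of three values). Guiding weights and training are omitted.

```python
import numpy as np

ALPHA = 0.2  # slope of the bipolar activation f(x) = tanh(alpha * x)

def activation(x):
    return np.tanh(ALPHA * x)

def init_layers(sizes, rng):
    """Small random weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Apply each layer followed by the activation; outputs lie in (-1, 1)."""
    for w, b in layers:
        x = activation(x @ w + b)
    return x

rng = np.random.default_rng(0)
# First-stage network: two hidden layers of 71 nodes, three output states.
first_stage = init_layers([1092, 71, 71, 3], rng)
# Filter network: one hidden layer of 51 nodes. The input size 21 * 3 is an
# assumption of this sketch; the text does not state it.
filter_stage = init_layers([21 * 3, 51, 3], rng)

scores = forward(rng.normal(size=1092), first_stage)
```

How the three outputs are converted into probabilities is not described in this section, so the sketch stops at the raw outputs.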
Predicted secondary structure assignments are then used as input for ψ angle
prediction. We chose the ψ angle since its variation is a good discriminator between
the helix and sheet configurations. In general the networks used to predict ψ angles
were composed of two hidden layers, each with 51 nodes. The input to the neural
network varies as we wish to study the dependence on the various assignments, and
how these assignments complement other input features such as physical parameters
and Position Specific Scoring Matrix (PSSM).
We would like to study the exact effect of the secondary structure reassignment on
the accuracy of angle prediction. In the first case we will use only the three
predicted probabilities for the secondary structure assignment for a single-residue
window along the chain. However, additional information is probably achievable by
the introduction of a bigger window size. Information about residues in harder-to-predict
odd angle regions can also be contained in the probabilities of the secondary
structures. Hence, we shall also study the accuracies for a window size of 21 residues.
Ten-fold cross-validation [35] will be used to judge the accuracy of the predictions
over the set. Vacant positions in the windows around residues near the termini of the
protein chain are explicitly excluded from the computation by limiting the range of the
input window. We use a bipolar activation function given by f(x) = tanh(αx), with
α = 0.2, momentum of 0.4, and the back-propagation method with relatively slow
learning (learning rate 0.001) to optimize the weights. To determine the quality
of the prediction we use the Mean Absolute Error (MAE) in degrees, Pearson's
correlation coefficient (pc), and the probability that the predicted and native angles
are separated by less than 10% (Q10%). To test for possible overfitting we take secondary
structure predictions based on the weights trained for the first cross-validation fold
and compare angle prediction between the folds.
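The three quality measures can be computed as in the sketch below. Two points are assumptions rather than statements from the text: "less than 10%" is read as 10% of the full 360° range, i.e. 36°, and the optional `periodic` flag, which wraps angle differences around ±180°, is off by default. The function name `angle_metrics` is hypothetical.

```python
import numpy as np

def angle_metrics(pred, native, periodic=False):
    """Return (MAE in degrees, Pearson correlation, Q10 as a fraction).

    pred, native: predicted and native angles in degrees, in [-180, 180].
    periodic: if True, use the shorter way around the circle for differences.
    """
    pred = np.asarray(pred, dtype=float)
    native = np.asarray(native, dtype=float)
    diff = np.abs(pred - native)
    if periodic:
        diff = np.minimum(diff, 360.0 - diff)
    mae = diff.mean()
    q10 = np.mean(diff < 36.0)          # assumed cutoff: 10% of 360 degrees
    pc = np.corrcoef(pred, native)[0, 1]
    return mae, pc, q10
```

The Q10 value is returned as a fraction; multiplying by 100 gives a percentage.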

3 Results

3.1 Comparative Analysis of Assignment Methods

We start with the accuracy of predicting the secondary structure assignments. Table 1
gives the accuracy for the six different classifications A1 through A6. For each
classification type we give the overall accuracy and the accuracy broken down by
native and predicted secondary structure types. For a given secondary structure
type we also show its density. We first notice that the overall accuracy is the same
for A1 and A2. As we shall see later, A2 is still useful for improving the dihedral
angle prediction. We also note that the accuracy is noticeably better for the consensus
assignments A3 and A5. Overall the relationship between the predicted and native
densities is similar among the three assignment methods. It is interesting to note
also that the improvement in prediction accuracy for method A3 comes mostly from
improved identification of structured segments: helix then sheet, and lastly coil. For
the A6 method we find that the secondary structure prediction accuracy is improved.
This is a known effect and is due to the relative ease of this assignment scheme.
As we shall see later, this improved accuracy does not translate into improved
accuracy for the prediction of derived quantities.

Table 1 Fraction of correctly predicted secondary structure residues and densities of structure
types
Type       A1     ρ1     A2     ρ2     A3     ρ3     A4     ρ4     A5     ρ5     A6     ρ6
All        0.800  -      0.800  -      0.812  -      0.803  -      0.809  -      0.821  -
Sheet (n)  0.729  0.232  0.723  0.222  0.736  0.236  0.719  0.220  0.734  0.239  0.734  0.221
Coil (n)   0.800  0.388  0.810  0.394  0.806  0.371  0.810  0.403  0.806  0.364  0.838  0.439
Helix (n)  0.842  0.380  0.834  0.385  0.863  0.393  0.846  0.377  0.856  0.398  0.857  0.340
Sheet (p)  0.788  0.215  0.779  0.206  0.794  0.218  0.785  0.202  0.785  0.223  0.795  0.204
Coil (p)   0.746  0.416  0.751  0.424  0.749  0.399  0.757  0.431  0.745  0.393  0.792  0.464
Helix (p)  0.867  0.369  0.867  0.370  0.888  0.383  0.868  0.367  0.888  0.384  0.878  0.332
Ten-fold cross-validated prediction accuracy for three-state secondary structure as a function of
the native (n) and predicted (p) secondary structure types. Also given is the density, ρ, of the
specific secondary structure type.
A1: Original DSSP assignment.
A2: Switch between helix and sheet for residues with special dihedral ψ angle and switch to coil
assignment for special φ angles.
A3: First apply consensus assignment [38], then shift all special dihedral angles to coil.
A4: Switch all residues with special angles to coil assignment.
A5: Apply consensus assignment, then switch helix/sheet assignment for residues with special ψ angles.
A6: Easy assignment from original DSSP classification, with only alpha-helix assigned as helix,
extended sheet as sheet, and all others assigned as coil.

Overall it seems that for prediction accuracy of secondary structure assignments
A3 works best. However, we should consider the effects these classifications and
their prediction have on prediction of derived quantities. Here we shall analyze what
happens to the prediction of the dihedral angle ψ as we use the different classifications
mentioned above.
First we wish to consider as much as possible the local reassignment by taking
a window of a single residue and using only the predicted secondary structures as
inputs. In Table 2 we give the prediction accuracy of ψ in terms of Q10%, MAE,
and pc. It seems that on the single residue level A2 gives an improved prediction for
ψ, even though A2, unlike the other assignment methods, gives no improvement for
secondary structure prediction. We see that overall the accuracies of the reassignment
methods remain relatively constant, with A2 slightly ahead of the rest. In terms of
specific secondary structure states, we see that all reassignment methods produce
better results for the native sheet and coil conformations, a challenging task that is
important for tertiary structure prediction [35].

Table 2 Prediction accuracy for the ψ dihedral angle using predicted secondary structure three
state assignment vector for the central residue (input window of one residue)
Type       A1                     A2                     A3                     A4                     A5
All        (70.1, 49.0, 0.637)    (70.4, 48.5, 0.640)    (70.5, 48.3, 0.643)    (69.7, 49.8, 0.632)    (70.3, 48.5, 0.638)
Sheet (n)  (86.5, 25.7, 0.019)    (87.0, 24.9, 0.163)    (86.7, 25.5, 0.006)    (86.5, 25.7, 0.013)    (86.4, 25.9, 0.053)
Coil (n)   (45.5, 80.4, 0.144)    (45.4, 80.7, 0.141)    (46.0, 79.0, 0.168)    (45.4, 80.7, 0.139)    (45.8, 79.3, 0.165)
Helix (n)  (85.1, 31.2, 0.310)    (85.7, 30.1, 0.317)    (85.5, 30.8, 0.312)    (84.2, 32.8, 0.328)    (85.6, 30.8, 0.309)
Sheet (p)  (83.2, 31.1, 0.005)    (85.7, 27.8, 0.005)    (85.0, 28.5, 0.008)    (85.1, 28.8, −0.002)   (83.6, 30.6, −0.001)
Coil (p)   (43.9, 86.2, −0.002)   (44.1, 85.8, 0.001)    (43.6, 86.8, 0.001)    (43.9, 86.5, 0.002)    (43.7, 86.6, 0.006)
Helix (p)  (91.1, 18.6, −0.010)   (91.3, 18.3, −0.009)   (90.4, 19.2, −0.006)   (91.3, 18.4, −0.005)   (89.8, 20.1, −0.003)
Ten-fold cross-validated prediction accuracies for the ψ dihedral angle as a function of the native (n)
and predicted (p) secondary structure types. For each assignment and secondary structure type we give
a three-vector of the form (Q10%, MAE, correlation). The assignments A1 through A5 are as
given in Table 1.

By including neighboring residues and expanding the window size to 21, we


find an additional improvement in the accuracy that is observed for all reassignment
methods. This is presented in Table 3. Additionally, the consistent improvement for
native sheets and coils is also more pronounced for this case. It is interesting that
besides improving the overall accuracy, increasing the window size increases the
benefits from the reassignment methods. If instead of the probability vector we use
the assignment three state vector as inputs for the 21-residue window we find an even
more significant improvement for the reassignment methods with Q10% relatively
improving by as much as 1.5% for method A3 and by about 1% for method A2.
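The windowed input described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `window_features`, the zero padding at the chain termini, and the toy one-hot encoding of the three states are all assumptions made for the example.

```python
def window_features(per_residue, half_width=10):
    """Concatenate per-residue feature vectors over a sliding window.

    per_residue: list of equal-length lists (e.g. three-state vectors).
    half_width=10 gives a 21-residue window centred on each residue.
    Positions beyond the chain termini are zero-padded (an assumption).
    """
    n_feat = len(per_residue[0])
    pad = [0.0] * n_feat
    rows = []
    for i in range(len(per_residue)):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            row.extend(per_residue[j] if 0 <= j < len(per_residue) else pad)
        rows.append(row)
    return rows


# Toy three-state assignment vectors (helix, sheet, coil) for 5 residues.
one_hot = {"H": [1.0, 0.0, 0.0], "E": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}
states = "CHHEC"
features = window_features([one_hot[s] for s in states], half_width=10)
```

Replacing the one-hot vectors with three predicted probabilities per residue would give the probability-vector variant of the same input; a window of one residue corresponds to `half_width=0`.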
In Table 4 we give the prediction accuracy of ψ in terms of Q10p, MAE, and
pc. In this case we use the PSSM, physical parameters, and predicted secondary
structure probabilities as input features. Note that angle symmetry is not used in
these calculations. These accuracies are given using the same splicing as previously.
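As a rough illustration of two of the reported measures, the sketch below computes a mean absolute error and a Pearson correlation between predicted and native angles. It is an assumption-laden example: the function names are hypothetical, the angles are plain numbers in degrees, no periodic wrapping is applied (in line with the remark that angle symmetry is not used), and Q10p is not reproduced here because its definition is given elsewhere in the chapter.

```python
import math


def mae(pred, native):
    """Mean absolute error, without exploiting angle periodicity."""
    return sum(abs(p - n) for p, n in zip(pred, native)) / len(pred)


def pearson(pred, native):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_n = sum(native) / n
    cov = sum((p - mean_p) * (q - mean_n) for p, q in zip(pred, native))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_n = sum((q - mean_n) ** 2 for q in native)
    return cov / math.sqrt(var_p * var_n)


# Toy ψ values in degrees (illustrative numbers, not data from the chapter).
native_psi = [-45.0, 120.0, 170.0]
predicted_psi = [-40.0, 130.0, 150.0]
toy_mae = mae(predicted_psi, native_psi)
toy_corr = pearson(predicted_psi, native_psi)
```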
It is interesting to investigate the influence of the training set on the prediction
accuracies. This will also allow further investigation of over-training. In Fig. 2 we
show the MAE of the dihedral angle ψ for the ten different sets used in the ten-fold
cross-validation. In this case the position specific scoring matrix, physical param-
eters, and predicted secondary structure probabilities are used as inputs. Predicted
secondary structure probabilities are taken from the weights tested on the first fold.
Hence, depending on the amount of over-training for the secondary structure, the
first fold would be at a disadvantage and would generally have a higher MAE. As is
seen from Fig. 2, in this case the amount of over-training is small as the accuracy of
predicted ψ angle for the first fold is comparable to the accuracy from the other folds.
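The per-fold comparison discussed above can be sketched in a few lines. This is a hypothetical helper, not the authors' code: it assumes the absolute ψ errors and the fold index of each residue are already available, that fold indices start at 0, and that no fold is empty.

```python
def per_fold_mae(abs_errors, fold_ids, n_folds=10):
    """Mean absolute error for each cross-validation fold."""
    sums = [0.0] * n_folds
    counts = [0] * n_folds
    for err, fold in zip(abs_errors, fold_ids):
        sums[fold] += err
        counts[fold] += 1
    return [s / c for s, c in zip(sums, counts)]


def first_fold_excess(fold_mae):
    """MAE of the first fold minus the average MAE of the remaining folds.

    A value close to zero suggests little over-training carried over from
    the secondary structure predictor tested on the first fold.
    """
    others = fold_mae[1:]
    return fold_mae[0] - sum(others) / len(others)


# Toy example with two folds (illustrative numbers only).
toy_fold_mae = per_fold_mae([0.2, 0.4, 0.1, 0.3], [0, 0, 1, 1], n_folds=2)
toy_excess = first_fold_excess(toy_fold_mae)
```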
Protein Secondary Structure Assignments … 707

Table 3 Prediction accuracy for the ψ dihedral angle using predicted secondary structure three
state probability vector for a 21 residues window

          A1                     A2                     A3                     A4                     A5
All       (74.1, 42.1, 0.651)    (74.5, 41.8, 0.653)    (74.6, 41.8, 0.651)    (74.4, 42.1, 0.645)    (74.6, 41.9, 0.645)
Sheet n   (84.2, 29.5, 0.031)    (85.1, 28.6, 0.149)    (85.1, 28.8, 0.105)    (84.4, 30.1, 0.064)    (84.4, 30.1, 0.091)
Coil n    (50.3, 72.7, 0.256)    (50.5, 72.7, 0.250)    (51.6, 71.5, 0.257)    (51.1, 72.2, 0.254)    (51.8, 71.3, 0.261)
Helix n   (92.3, 18.4, 0.250)    (92.4, 18.2, 0.263)    (91.7, 19.3, 0.276)    (92.2, 18.5, 0.258)    (91.8, 19.1, 0.254)
Sheet p   (83.4, 30.8, 0.107)    (85.9, 27.5, 0.059)    (85.2, 28.3, 0.043)    (85.3, 28.6, 0.053)    (84.0, 30.2, 0.095)
Coil p    (53.0, 70.7, 0.285)    (52.9, 71.1, 0.272)    (52.4, 72.5, 0.252)    (54.2, 69.7, 0.287)    (52.8, 71.8, 0.257)
Helix p   (91.9, 17.3, 0.243)    (92.2, 17.1, 0.236)    (91.9, 17.3, 0.271)    (92.1, 17.1, 0.231)    (91.3, 18.2, 0.260)

Ten-fold cross-validated prediction accuracies for the ψ dihedral angle as a function of the native
and predicted secondary structure types. For each assignment and secondary structure type we give
a three-vector of the form (Q10p, MAE, correlation). The assignments A1 through A5 are as
given in Table 1. n: native; p: predicted

Table 4 Prediction accuracy for the ψ dihedral angle using PSSM, physical parameters and predicted
secondary structure probabilities for a 21 residue window using the indicated assignment
method A1 through A6

          A1                     A2                     A3                     A4                     A5                     A6
All       (78.5, 36.7, 0.698)    (78.8, 36.4, 0.701)    (78.6, 36.7, 0.697)    (78.4, 36.9, 0.695)    (78.3, 37.3, 0.693)    (78.1, 37.4, 0.693)
Sheet n   (86.7, 27.2, 0.170)    (87.0, 26.8, 0.207)    (86.7, 27.4, 0.188)    (86.6, 27.5, 0.190)    (86.5, 27.7, 0.178)    (87.0, 26.8, 0.176)
Coil n    (60.5, 59.6, 0.393)    (60.6, 59.6, 0.393)    (60.8, 59.4, 0.395)    (60.4, 59.8, 0.390)    (60.6, 59.6, 0.392)    (60.3, 60.0, 0.387)
Helix n   (91.9, 19.3, 0.247)    (92.2, 18.5, 0.253)    (91.7, 19.4, 0.257)    (91.7, 19.4, 0.253)    (91.4, 20.2, 0.252)    (90.9, 20.9, 0.257)
Sheet p   (85.4, 29.0, 0.248)    (87.1, 26.5, 0.175)    (86.4, 27.2, 0.145)    (86.6, 27.4, 0.201)    (85.4, 28.8, 0.198)    (86.2, 28.1, 0.252)
Coil p    (62.4, 58.9, 0.401)    (62.4, 58.9, 0.400)    (61.3, 60.5, 0.384)    (62.6, 58.7, 0.406)    (61.3, 60.7, 0.381)    (63.1, 58.3, 0.424)
Helix p   (92.1, 17.1, 0.292)    (92.4, 16.9, 0.302)    (92.1, 17.3, 0.291)    (92.2, 17.1, 0.273)    (91.5, 18.2, 0.308)    (94.2, 13.9, 0.239)

Ten-fold cross-validated prediction accuracies for the ψ dihedral angle as a function of the native
and predicted secondary structure types. For each assignment and secondary structure type we give
a three-vector of the form (Q10p, MAE, correlation). The assignments A1 through A6 are as
given in Table 1. n: native; p: predicted

[Plot: MAE (vertical axis, approximately 0.199 to 0.209) versus fold (horizontal axis, 1 to 10)]

Fig. 2 Mean absolute error of the dihedral angle ψ as a function of the different folds in a ten-fold
cross-validation. In this case the position specific scoring matrix, physical parameters, and predicted
secondary structure probabilities are used as inputs. Predicted secondary structure probabilities are
taken from the weights tested on the first fold. Hence, depending on the amount of over-training for
the secondary structure, the first fold would be at a disadvantage and would generally have a higher
mean absolute error. As is seen from the plot in this case the amount of over-training is small as the
ψ accuracy for the first fold is comparable to the accuracies from the other folds

For the assignment method A6, which produced a significantly better secondary
structure prediction, we find that the accuracy of the ψ angle predicted from these values
is significantly worse than for the rest of the assignment methods. Both the improved
secondary structure prediction and the diminished accuracy of the ψ angle prediction
result from the relatively coarser discrimination between secondary structure states.

4 Discussion

In general, one should not be surprised that the effect of reassignment over the entire
dataset is small. The reason is that only a small set of residues had its secondary
structure reassigned. It is instructive to look at the effect of a modification specifically
for those residues that are modified.
In Table 5, we consider specifically those residues for which the DSSP assign-
ment was modified. For modifications by method A3, there are 7% of residues with
reassigned secondary structure out of the total database of 583,935 residues. The
accuracy for secondary structure prediction for these reassigned residues is 49% for
the A1 assignment and 65% for the A3 assignment. This improved secondary struc-
ture prediction is translated to better dihedral angle prediction. The MAE is reduced
from 51.6° to 50.9°, while the correlation increases from 0.545 to 0.551. In terms of
peak prediction [35] the accuracy is improved from 78.5 to 79.3% correctly predicted
peaks.

Table 5 Prediction accuracy for reassigned residues

                             A1                            A2                            A3
Set modified by method A2    (36.0, 99.5, 0.194, 0.504)    (39.6, 94.0, 0.232, 0.546)    (37.6, 96.7, 0.211, 0.527)
Set modified in method A3    (66.7, 51.6, 0.545, 0.785)    (67.6, 50.3, 0.558, 0.797)    (67.2, 50.9, 0.551, 0.793)

Ten-fold cross-validated prediction accuracies for the ψ dihedral angle for
the reassigned sets of residues. For each case a four-vector of the form
(Q10p, MAE, correlation, peak prediction accuracy) gives the accuracy. The assignments A1
through A3 are as given in Table 1

Out of the entire database, 1.5% of residues had their secondary
structure assignment modified by the A2 method. The accuracy for secondary
structure prediction for this set of residues is 46% using method A1, 44% using
method A2, and 67% using method A3. Overall the secondary structure prediction
accuracy resulting from method A2 was reduced for this case. A plausible reason is
that reassigned residues occur within larger blocks of secondary structure types. Reas-
signing their secondary structure types creates isolated secondary structure regions
which are harder to predict. It is interesting to note that over the set of residues
reassigned by method A3 the secondary structure accuracy from A2 is 51%, better
than that of method A1. While not significantly improving the accuracy of secondary
structure prediction, method A2 produced the most significant effect in terms of the
accuracy of ψ prediction. For the set modified by method A2, the MAE for ψ using
method A1 is 99.5° and the correlation is 0.194. Using method A3 the dihedral angle
prediction accuracy is improved to an MAE of 96.7° and correlation of 0.211. Using
method A2 the accuracy is further improved with an MAE of 94.0° and a correlation
of 0.232. In terms of peak prediction the improvement is from 50.4 to 54.6%.
The hypothesis of equal distribution of error can be rejected at greater than a 99%
confidence level according to a t-test calculation. It is also interesting to note that the
accuracy of method A2 over the set of residues modified by method A3 is also significantly
better with an MAE of 50.3° and a correlation of 0.558. Note that judging
by these accuracy parameters it is evident that both the set of residues modified by
A2 and those modified by A3 are difficult to predict.
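The text does not say which form of t-test was used. As one possible reading, the sketch below computes a paired t statistic on per-residue absolute errors obtained with two assignment methods over the same set of residues. The function name and the toy numbers are assumptions; the resulting statistic would then be compared with the critical value of Student's t distribution for the given degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for two error lists.

    errors_a, errors_b: absolute errors for the same residues obtained
    with two different methods (paired observations).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Toy absolute errors in degrees (illustrative only, not chapter data).
t_value, dof = paired_t_statistic([10.0, 12.0, 9.0, 11.0],
                                  [8.0, 9.0, 8.0, 8.0])
```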

5 Conclusions

We have presented a comparison between different secondary structure assignment
methods, and the effect of their use in dihedral angle prediction. We find the following
key points. Improvements to the accuracy of dihedral angle ψ prediction are due both
to single residue and local-neighborhood effects. That is, reassignment, window
size, secondary structure type, etc., all determine the secondary structure prediction
accuracy and its usefulness for the prediction of the ψ angle.

We also find that consensus reassignment of secondary structure tends to improve
the accuracy of secondary structure prediction. It decreases the accuracy of ψ angle
obtained using these secondary structure predictions on a single residue window level.
However, when the window size is increased to 21 residues both reassignments A2
and A3 give better ψ predictions. In terms of usefulness for ψ prediction, the overall
best reassignment method we found was A2.
When specifically considering those residues for which there is disagreement
between the different reassignment methods, we find this to be a hard-to-predict
group. For example, MAE for this group is two to three times greater than the
overall average. For this group we also find the most significant improvement using
reassignment. It is interesting to note that though only a small set of residues were
reassigned by method A2, the removed ambiguity increases the accuracy over the
prediction of all residues.

Acknowledgements We gratefully acknowledge support from National Science Foundation grant
DBI 1661391, and Bridge funds provided by The Research Institute at Nationwide Children’s
Hospital.

References

1. Garnier, J., Osguthorpe, D.J., Robson, B.: Analysis of the accuracy and implications of simple
methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120(1), 97–
120 (1978)
2. Gibrat, J.-F., Garnier, J., Robson, B.: Further developments of protein secondary structure
prediction using information theory: new parameters and consideration of residue pairs. J.
Mol. Biol. 198(3), 425–443 (1987)
3. Holley, L.H., Karplus, M.: Protein secondary structure prediction with a neural
network. Proc. National Acad. Sci. 86(1), 152–156 (1989)
4. Kneller, D.G., Cohen, F.E., Langridge, R.: Improvements in protein secondary structure pre-
diction by an enhanced neural network. J. Mol. Biol. 214(1), 171–182 (1990)
5. Sikorski, A.: Prediction of protein secondary structure by neural networks: Encoding short and
long range patterns of amino acid packing. Acta. Biochim. Pol., 39(4), (1992)
6. Rost, B., Sander, C.: Prediction of protein secondary structure at better than 70% accuracy. J.
Mol. Biol. 232(2), 584–599 (1993)
7. Rost, B., Sander, C., Schneider, R.: Phd-an automatic mail server for protein secondary structure
prediction. Comput. Appl. Biosci.: CABIOS 10(1), 53–60 (1994)
8. Garnier, J., Gibrat, J.-F., Robson, B.: Gor method for predicting protein secondary structure
from amino acid sequence. Methods Enzymol. 266, 540 (1996)
9. Frishman, D., Argos, P.: Seventy-five percent accuracy in protein secondary structure predic-
tion. Proteins-Struct. Funct. Genet. 27(3), 329–335 (1997)
10. Cuff, J.A., Clamp, M.E., Siddiqui, A.S., Finlay, M., Barton, G.J.: Jpred: a consensus secondary
structure prediction server. Bioinformatics 14(10), 892–893 (1998)
11. Jones, D.T.: Protein secondary structure prediction based on position-specific scoring matrices.
J. Mol. Biol. 292, 195–202 (1999)
12. Cuff, J.A., Barton, G.J.: Application of multiple sequence alignment profiles to
improve protein secondary structure prediction. Proteins: Struct. Funct. Bioinformatics 40(3),
502–511 (2000)
13. Hua, S., Sun, Z.: A novel method of protein secondary structure prediction with high segment
overlap measure: support vector machine approach. J. Mol. Biol. 308(2), 397–408 (2001)

14. Kloczkowski, A., Ting, K.-L., Jernigan, R.L., Garnier, J.: Protein secondary structure prediction
based on the gor algorithm with multiple sequence alignments. Polymer 43, 441–449 (2002)
15. Kloczkowski, A., Ting, K.-L., Jernigan, R.L., Garnier, J.: Combining the gor v algorithm with
evolutionary information for protein secondary structure prediction from amino acid sequence.
Proteins: Struct. Funct. Gen. 49, 154–166 (2002)
16. Kolinski, A.: Protein modeling and structure prediction with a reduced representation. Acta
Biochim. Pol.-English Edition- 51, 349–372 (2004)
17. Cheng, H., Sen, T.Z., Kloczkowski, A., Margaritis, D., Jernigan, R.L.: Prediction of protein
secondary structure by mining fragments database. Polymer 46, 4314–4321 (2005)
18. Dor, O., Zhou, Y.: Achieving 80% ten-fold cross-validated accuracy for secondary structure
prediction by large-scale training. Proteins 66, 838–845 (2007)
19. Homaeian, L., Kurgan, L.A., Ruan, J., Cios, K.J., Chen, K.: Prediction of protein secondary
structure content for the twilight zone sequences. Proteins: Struct. Funct. Bioinformatics 69(3),
486–498 (2007)
20. Kurgan, L., Cios, K., Zhang, H., Zhang, T., Chen, K., Shen, S., Ruan, J.: Sequence-based
methods for real value predictions of protein structure. Curr. Bioinformatics 3(3), 183–196
(2008)
21. Kurgan, L., Cios, K., Chen, K.: Scpred: accurate prediction of protein structural class for
sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 9(1),
226 (2008)
22. Cole, C., Barber, J.D., Barton, G.J.: The jpred 3 secondary structure prediction server. Nucleic
Acids Res. 36, W197–W201 (2008)
23. Kountouris, P., Hirst, J.D.: Prediction of backbone dihedral angles and protein secondary struc-
ture using support vector machines. BMC Bioinformatics 10(1), 437 (2009)
24. Faraggi, E., Zhang, T., Yang, Y., Kurgan, L., Zhou, Y.: Spine X: improving protein secondary
structure prediction by multistep learning coupled with prediction of solvent accessible surface
area and backbone torsion angles. J. Comp. Chem. 33, 259–267 (2012)
25. Sen, T.Z., Jernigan, R.L., Garnier, J., Kloczkowski, A.: Gor v server for protein secondary
structure prediction. Bioinformatics 21(11), 2787–2788 (2005)
26. Kouza, M., Faraggi, E., Kolinski, A., Kloczkowski, A.: The gor method of protein secondary
structure prediction and its application as a protein aggregation prediction tool. In: Prediction
of Protein Secondary Structure, pp. 7–24. Springer (2017)
27. Rost, B.: TOPITS: threading one-dimensional predictions into three-dimensional structures.
In: Third international conference on intelligent systems for molecular biology, pp. 314–321.
AAAI Press (1995)
28. Rost, B., Sander, C.: Protein fold recognition by prediction-based threading. J. Mol. Biol. 270,
471–480 (1997)
29. Kihara, D., Lu, H., Kolinski, A., Skolnick, J.: Touchstone: an ab initio protein structure
prediction method that uses threading-based tertiary restraints. Proc. National Acad. Sci. 98(18),
10125–10130 (2001)
30. Przybylski, D., Rost, B.: Improving fold recognition without folds. J. Mol. Biol. 341, 255–269
(2004)
31. Cheng, J., Baldi, P.: A machine learning information retrieval approach to protein fold recog-
nition. Bioinformatics 22, 1456–1463 (2006)
32. Qiu, J., Elber, R.: SSALN: an alignment algorithm using structure-dependent substitution
matrices and gap penalties learned from structurally aligned protein pairs. Proteins 62, 881–
891 (2006)
33. Liu, S., Zhang, C., Liang, S., Zhou, Y.: Fold recognition by concurrent use of solvent accessi-
bility and residue depth. Proteins 68, 636–645 (2007)
34. Blaszczyk, M., Kurcinski, M., Kouza, M., Wieteska, L., Debinski, A., Kolinski, A., Kmiecik,
S.: Modeling of protein-peptide interactions using the cabs-dock web server for binding site
search and flexible docking. Methods 93, 72–83 (2016)
35. Faraggi, E., Yang, Y., Zhang, S., Zhou, Y.: Predicting continuous local structure and the effect of
its substitution for secondary structure in fragment-free protein structure prediction. Structure
17, 1515–1527 (2009)

36. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: Pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983)
37. Heinig, M., Frishman, D.: Stride: a web server for secondary structure assignment from known
atomic coordinates of proteins. Nucleic Acids Res. 32(suppl_2), W500–W502 (2004)
38. Zhang, W., Dunker, A.K., Zhou, Y.: Assessing secondary-structure assignment of protein struc-
tures by using pairwise sequence-alignment benchmarks. Proteins 71, 61–67 (2008)
39. Sayle, R.A., Milner-White, E.J.: Rasmol: biomolecular graphics for all. Trends
Biochem. Sci. 20(9), 374–376 (1995)
40. Faraggi, E., Xue, B., Zhou, Y.: Improving the prediction accuracy of residue solvent acces-
sibility and real-value backbone torsion angles of proteins by fast guided-learning through a
two-layer neural network. Proteins 74, 857–871 (2009)
41. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl.
Acids Res. 25, 3389–3402 (1997)
42. Meiler, J., Muller, M., Zeidler, A., Schmaschke, F.: Generation and evaluation of dimension
reduced amino acid parameter representations by artificial neural networks. J. Mol. Model. 7,
360–369 (2001)
43. Dor, O., Zhou, Y.: Real-SPINE: an integrated system of neural networks for real-value predic-
tion of protein structural properties. Proteins 68, 76–81 (2007)
44. Xue, B., Dor, O., Faraggi, E., Zhou, Y.: Real value prediction of backbone torsion angles.
Proteins 72, 427–433 (2008)
45. Wootton, J.C.: Statistic of local complexity in amino acid sequences and sequence databases.
Comput. Chem. 17, 149–163 (1993)
Part V
Applications of Molecular Quantum
Mechanics
When Water Plays an Active Role
in Electronic Structure. Insights from
First-Principles Molecular Dynamics
Simulations of Biological Systems

Giovanni La Penna and Oliviero Andreussi

Abstract Changes of electronic structure and movements of positive holes (mostly
protons and metal ions) are closely connected in biological processes. These changes
occur in an environment mostly dominated by liquid water. Thanks to theoretical
advances in first-principles computer simulations and to high performance comput-
ers, these two ingredients can be combined to set up reliable models. This is of
particular help in understanding the role of metal cofactors in biology.

1 Introduction

Thanks to the advancement of structural biology, a large number of high-resolution
structures of impressive protein assemblies, with sizes in the range of several tens of
nm, are now available [1, 2] (for a gateway to structural biology services, visit the
European Bioinformatics Institute (EBI)). One beautiful example is given by the structure
of photosystem II, containing several proteins assembled together in a machinery
that drives the energy of absorbed light into a charge separation.
The charge separation produced by the light absorption is compensated by chemical
reactions, with the formation of an electron pump: electrons are extracted from
water molecules and injected into a mobile reductant transported within the membrane
(hydroquinone). This impressive machine, which allows on the Earth's surface the

Electronic supplementary material The online version of this chapter
(https://doi.org/10.1007/978-3-319-95843-9_22) contains supplementary material, which is
available to authorized users.

G. La Penna (B)
Institute for Chemistry of Organo-Metallic Compounds,
National Research Council of Italy, via Madonna del Piano 10,
50019 Sesto fiorentino (Firenze), Italy
e-mail: giovanni.lapenna@cnr.it
O. Andreussi
Department of Physics, University of North Texas, Denton, TX 76203, USA
e-mail: Oliviero.Andreussi@unt.edu

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_22
716 G. La Penna and O. Andreussi

harvesting of solar energy and its incorporation into chemical compounds rich in
electrons (reductant species), operates within a protein architecture that also involves
lipids, water, many ions, including transition metal ions, and other non-protein
cofactors.
Many atomic details that are missing in the crystal structure (like hydrogen atoms,
disordered water molecules and organic cofactors) can be recovered by resorting to
interatomic interactions represented via empirical equations, including also a significant
portion of the water molecules or membrane components that are in the assembly
environment. One of the first computer programs to perform this essential task was
named “Assisted Model Building with Energy Refinement” (AMBER) [3, 4]. But
this approach has a long history, documented in the first chapter of this book and in a
recent contribution [5] by one of the main pioneers in the field, Harold Scheraga. The
visionary approach of describing biological macromolecules in terms of elementary
microscopic interactions was recognized by the scientific community
with the Nobel Prize in Chemistry of the year 2013. For a concise summary of the
pathway from the beginning to the present day, with a perspective view, see the comment
on the Nobel Prize published by Tamar Schlick [6], while the complete Nobel lectures
can be seen on the Nobel Prize web-site.
The many scales of atomic interactions able to capture most of the driving force
for the folding of the proteins into assemblies like photosystem II, are described in
the other chapters of this book.
Going back to our example (photosystem II), other details emerge if we select
in the structure particular elements, like Mn, Fe, Mg and the cofactors that interact
with these ions (right-hand side of Fig. 1). For instance: why are Mn and Fe present

Fig. 1 X-ray structure of dimeric photosystem II in cyanobacteria, PDB 3BZ2 (left) and 3BZ1
(right) [7]. The monomer on the right-hand side is represented according to the PDB secondary
structure (cofactors are not represented). The monomer on the left-hand side is represented only in
its cofactor content (no proteins). The content of bound ions is indicated in the nearby regions.
PLM indicates the approximate region occupied by the phospholipid membrane. Volumes of the
different objects are arbitrary. Different colors indicate different macromolecules in the assembly.
This and the following figures are drawn with the VMD package [8]
When Water Plays an Active Role in Electronic Structure … 717

in the assembly? These elements can change the number of electrons in their respective
environments. This occurs in chemical reactions: the electrons move and the
charge distribution changes when molecules are close enough in space. The change
in charge distribution is accompanied by movements of mobile charge carriers that,
to a first approximation, do not explicitly change the sharing of the electrons. However,
among these mobile ions, in biological systems the protons (or better, the proton
distribution in the water or protein environment) are the most available.
To make the movement possible, centers that can easily change the number of
electrons in their vicinity are recruited by proteins. This is the basis of the catalysis
operated by enzymes and of the electron transport operated by some proteins [9].
Photosystem II, for instance, is an assembly where the electron transport within
several metalloporphyrins is coupled with the catalysis of water oxidation
on one side (the site containing Mn and Ca, known as the oxygen-evolving center) and
quinone reduction on the other side (the bundle of α-helices at the rightmost interface
of the assembly with the phospholipid membrane).
The aim of this chapter is to emphasize one of the most advanced tools to under-
stand the role of those details that subtly affect the structure and strongly affect
the chemical reactivity of the system. These details can hardly be represented with
the tools developed for the protein portion, and deserve a special treatment that
will be introduced in this chapter. The method that reached the best compromise
between accuracy and computational affordability is the density-functional theory
(DFT) for the description of the ground-state electronic structure and, therefore, only
this approach will be described in more detail.
This chapter is practical in nature, while the theoretical background is well described
in many books and reviews included as selected references. Our aim is to provide
indications on how to build models for affordable DFT calculations, what to check
during the calculations, which results are reliable and which are severely limited by
the models behind the results. At the end of this chapter, the reader should be able to
extract a part of a large model built on the basis of empirical potentials (possibly a set of
configurations selected according to a given statistical ensemble) and to investigate
the behaviour of the selected small portion. Carefully tailoring the focused region
allows the investigator to explicitly account for the behaviour of the electrons in the
system, which is key to modeling chemical reactions.
In terms of modeling, the description of electrons in extended systems like portions
of proteins in contact with transition metal ions is a real challenge. This is
because the description of the quantum nature of many electrons interacting with
atomic cores and among themselves requires a huge number of variables. In this
respect the development of density functional theory has helped this description
considerably. Nevertheless, the number of degrees of freedom needed for
electrons greatly limits the number of atoms that may be included in a model. A
strategy based on an incremental refinement from coarse-grained empirical models to
such limited samples of atoms including electrons must be devised. The goal of this
chapter is to describe how to perform this refinement in practice and in a robust way.

2 A Simple Example

A single transition metal ion interacting with an organic ligand, representing a protein
region, and a few water molecules, representing the solvent, illustrate some of the
features that will be the subject of this chapter. Therefore, we introduce a simple
example showing how ions and water interact in a biological environment: a single
Cu2+ ion bound to a peptide that represents a region in the N-terminus of the human
prion protein.
Let us, to begin with, ignore the details of how the atomic forces are modelled.
Let us assume we have a Hamiltonian describing both atomic cores and valence
electrons. Under the assumption that the valence electrons are in the ground state (i.e.
there is no effect from excited electronic states), forces are uniquely defined and can
be computed by using the Hellmann-Feynman theorem (see Ref. [10] and references
therein, for a recent simple formulation of these and related approximations). In a
few words, let us assume we can describe within classical mechanics the motion
of atomic cores driven by the forces due to the valence electrons distributed around
the cores.
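For reference, the Hellmann-Feynman force on an atomic core can be written compactly as below. The notation is an assumption introduced here for illustration and is not taken from the chapter: R_I is the position of core I, Ĥ the Hamiltonian, Ψ0 the normalized electronic ground state and E0 its energy. The expression holds when Ψ0 is an exact eigenstate of Ĥ.

```latex
\mathbf{F}_I
  = -\frac{\partial E_0}{\partial \mathbf{R}_I}
  = -\left\langle \Psi_0 \left|
      \frac{\partial \hat{H}}{\partial \mathbf{R}_I}
    \right| \Psi_0 \right\rangle ,
\qquad
E_0 = \langle \Psi_0 | \hat{H} | \Psi_0 \rangle ,
\quad
\langle \Psi_0 | \Psi_0 \rangle = 1 .
```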
Now, let us assume we have a computer that is able to simulate the motion of these
atomic cores (the core electrons are frozen as in the corresponding isolated atoms).
In practice, we have an initial set of positions for the atomic cores, we assign to each
atomic core a reasonable velocity, and, according to the forces, we move the atomic
cores using Newton’s equations, keeping electrons in the ground state. This is the
so-called molecular dynamics (MD) simulation method.
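The procedure just described can be sketched with a standard time-stepping scheme. The example below is a toy, one-dimensional illustration under stated assumptions: the velocity Verlet integrator is chosen here for convenience (the text does not name an integration scheme), and a harmonic force stands in for the forces that, in a first-principles simulation, would come from the ground-state electrons at every step.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one coordinate with the velocity Verlet scheme.

    force: callable returning the force at position x. In a
    first-principles simulation this is where the ground-state
    electronic structure would be recomputed (not done here).
    """
    f = force(x)
    trajectory = [x]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # forces at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append(x)
    return x, v, trajectory


def harmonic_force(x, k=1.0):
    """Toy stand-in for the true interatomic force (assumption)."""
    return -k * x


# Initial position and velocity, then 1000 steps of length 0.01.
x_end, v_end, traj = velocity_verlet(1.0, 0.0, harmonic_force,
                                     mass=1.0, dt=0.01, n_steps=1000)
energy = 0.5 * v_end ** 2 + 0.5 * x_end ** 2
```

Monitoring a conserved quantity such as the total energy, as done in the last line, is a simple check that the time step is small enough for the forces at hand.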
In certain cases, the presence of a single transition metal ion in the proximity of
a peptide allows unusual proton extractions away from the peptide. The process is
the rearrangement of electrons from the peptide to the ion, i.e. the ligand (a Lewis
base) binds to the ion (the Lewis acid). This local rearrangement cooperates with the
transfer of protons from the ligand to the solvent. In other words, a slight reshuffling
of positive holes (metal ions and protons) occurs to reach a stable state for the peptide
in its environment.
In video V1, a single short trajectory representing this process is displayed. The
details will be described later in this chapter, but first we invite the reader to visualize
the cooperation between the formation of Cu–N bonds (Cu is the orange sphere, N is in
blue) and the release of the proton (white sphere), initially bound to one N atom,
to the water molecules around the peptide. The H3O+ ion is transiently formed and
then the positive charge is shared with the bulk water. We recommend that
the reader watch the video several times, trying to isolate the chemical
transformation that occurs.
The result of this process is the transformation of a peptide weakly interacting
with a copper ion into a more stable complex. The process is definitely assisted by
water molecules in the vicinity of the solute: with no water molecules, the state of the
extracted proton can hardly be modelled.

3 The State of the Art

The algorithms to simulate the statistical mechanics and the development of high
performance computers (HPC) will allow in the future the simulation of the time-
evolution of a huge number of degrees of freedom, in samples of large size. On the
other hand, many properties of condensed and fluid matter can be described in terms
of a few variables thanks to the effects of averaging over time and space of many
microscopic variables. At the end, a few collective variables can be used to represent
the environment of a more detailed system.
We are now at a stage where the two descriptions are complementary. Computer
simulations allow one to observe the statistical averaging in action and to understand how
an average property emerges from the apparent disorder of the microscopic variables,
that is, the particles’ positions and velocities.
From this cooperation, two large classes of methods emerged:
• A fine description of one part of the system, coupled with coarser and more
approximate descriptions of the environment. This approach is inspired by the concept of
an isolated molecule perturbed, at different scales, by its environment, an approach
widely used, for instance, in spectroscopy.
• An approximate description (but not too approximate) of one piece of the system,
including an essential part of its environment. This is an approach inspired by
crystallography, where the unit cell, sometimes including many molecules,
replicated in the three directions of space, is assumed to be a good representation
of the same molecules in the liquid state of a water solution or in a more-or-less
hydrated crowded state (like the living cell).
One example of the interplay between the statistical nature (the nature of a polar
and protic liquid) and the molecular nature (the interface between the solute and the
liquid bulk environment) is given by the behaviour of peptides and proteins in water
solution.
Even though water has long been investigated in depth from every possible
perspective, the peculiar role of this solvent is still an arena for experimental,
theoretical and modeling techniques. One example is the entire issue of Acc. Chem.
Res. dedicated to water in January 2012 [11]. A second example is the recent
discovery of a new form of ice (number XVII!) [12], which shows the amazing extent
of structural disorder of water even in the solid state [13]. The difficulty of modeling
water in confined regions and close to interfaces described at a molecular level is
witnessed by the slow convergence towards a unified and simple approach describing
protein hydration within a mean–field approach (see the contributions to this book).
Many classes of methods are widely used to model biological macromolecules
in water, i.e. the environment most commonly used in experimental studies in vitro.
Water is also the environment where metal ions, in different forms, reside, at least for
a given time, in the intra- and extra-cellular compartments. In the following sections
the methods are briefly described together with simple examples.
720 G. La Penna and O. Andreussi

4 Discrete Solvation Models in Molecular Mechanics and Statistics

In the ’80s the age of simulations began in the field of liquids [14]. Simulations were
devised as virtual laboratories in which to test theories for intermolecular and, later,
interatomic interactions. Since then, computers have been routinely used as microscopes to
look at atoms in molecules embedded in their phases, depending on thermodynamic
(macroscopic) parameters projected onto atomic (microscopic) properties (positions
and velocities). Over the years, the experimental techniques able to probe
similar atomic scales (diffraction, spectroscopy, micromanipulation, and cryo-electron
microscopy) have been used together with computational models and theoretical
equations to complete the understanding of observations in different phases.
There is no doubt that empirical models for interatomic forces between water
molecules, and between water molecules and peptides, continuously provide
information of extreme importance for understanding biological macromolecules. As a
simple example, even crude models of water have greatly improved the understanding of
the behaviour of water around proteins, up to critical conditions that can be
experimentally monitored [15]. The shift of these conditions induced by the nature of
water, compared to other solvents, is among the reasons for its strong relationship
with life (proteins, nucleic acids and polysaccharides) [16, 17].
As an ongoing evolution of this description of atomic forces, we mention
the development of polarizable empirical force-fields, in principle also able to mimic
strong interactions between water molecules and charged groups in proteins, or metal
ions in competition for the same charged groups [18–21]. These models are not yet
fully reproducible and/or transferable, because of the many parameters involved, but
they are expected eventually to converge to a big step forward in modeling “soft” biological
systems of large size.
Most of the refined models for proteins immersed in water environments and
represented at a molecular level are built on the basis of empirical models. In practice,
a sample of 216 water molecules based on the Monte Carlo chain at T = 300 K
and P = 1 bar [22] is included in every package for molecular dynamics
simulations based on empirical force-fields. Therefore, there is a chain effect in which
the coarser model provides configurations to the more refined model. This
common procedure, on the one hand, decreases the time needed to adapt the configuration to
the improved (and usually computationally more expensive) model but, on the other
hand, introduces a bias coming from the approximations of the coarser (and
more approximate) models.

5 Continuum Solvation Models in Molecular Mechanics and Statistics

Theoretical equations, mostly including empirical parameters, are routinely used to
model the interactions between atoms mediated by the water solvent. The main results
are models that include the following contributions: the hydrophobic effect
acting on a mechanical model of the solute, i.e. the driving force of protein folding;
and the electrostatic interactions in a polarizable environment, combined with interatomic
forces modeled for simple liquids. These models are extensively described in
other chapters of this volume.

6 The Perspective of Quantum Mechanics

The aim of polarizable force-fields for describing strong polar interactions, possibly
involving charge transfer between two sites, is a description of the first stage of the
formation of chemical bonds. The difficulty of representing this event with
empirical models has been partially circumvented by density functional theory [23]
(DFT, hereafter) and by the huge improvements in computer hardware and in algorithm
development for linear algebra. In the ’90s the success of DFT applications increased
its popularity, as it reached a good compromise between the size of the systems and
the accuracy of the models. Later, first-principles simulations began when the
accuracy of atomic forces reached a high-quality level.

7 Continuum Solvation Models in Quantum Mechanics

The calculation of electronic properties for molecules in solution is still a very active
area. One reason for this intense activity is the need to understand the large amount of
spectroscopic data routinely collected for molecules dissolved in liquid samples. The
theoretical and computational models for these observables require, in most
cases, a quantum mechanical description of the degrees of freedom that are perturbed
by the electromagnetic field. The response to the perturbation is strongly affected
by the environment in which the observable is measured (see also Sect. 17). The
assumption that the many degrees of freedom describing the state of the
environment can be averaged into a continuum or a mean–field is at the heart of the physics
of condensed matter [24]. The representation of nature in terms of a few laws and
a few variables is actually one of the major achievements of mankind. The
availability of computers handling many rules and many variables should not
take us too far from this great achievement. Rather, calculations teach us how to
derive simple rules and a few variables from the complicated manifold of the model
statistics.
The polarizable continuum model for quantum mechanics is a good compromise
between the complexity of quantum mechanics (necessary to understand and
compare spectra of different compounds) and the simplification of the environment of
the quantum mechanical region. This is important even when the solute is a small
molecule with a well-defined geometry differing from that of a regular (spherical,
cylindrical, etc.) cavity. In the following, we sketch the theory coming from the field
of isolated molecules; after the solid-state approach is described (Sect. 11),
we shall come back to the most recent developments where continuum approaches
and solid-state physics are combined (Sect. 12).
Let us assume we have an ensemble consisting of a system s interacting with
its environment e, both described at a quantum mechanical level. We know from
quantum mechanics that the ensemble at equilibrium is described by a set of
stationary states that are the eigenstates of the quantum hamiltonian (the solutions of the
Schrödinger equation, SE hereafter):

\[
\hat{H} \, \Psi(\mathbf{r}_s, \mathbf{r}_e) = E \, \Psi(\mathbf{r}_s, \mathbf{r}_e) , \tag{1}
\]

where r_s and r_e are the arrays containing all the 3-dimensional microscopic
variables (positions and momenta of atoms and electrons) in the system s and in the
environment e, respectively. The operator Ĥ is the hamiltonian of the whole ensemble and
Ψ is the complex function describing its possible states. The lowest
energy eigenvalue and the corresponding Ψ function, solution of the
equation above, entirely describe the ground state of the ensemble of particles. The
basis of every mean–field theory is to split the hamiltonian into several parts that can
be summed:

\[
\hat{H}(\mathbf{r}_s, \mathbf{r}_e) = \hat{H}_s(\mathbf{r}_s) + \hat{H}_e(\mathbf{r}_e) + \hat{H}_{\mathrm{int}}(\mathbf{r}_s, \mathbf{r}_e) , \tag{2}
\]

where Ĥ_int is the term containing the coupling between system and
environment. Assuming that the r_e variables span all the relevant values (the
range Ω_e) for each single value of r_s, it is possible to average over r_e in its entire
space Ω_e for each single value of r_s, obtaining an effective hamiltonian that does not
depend on r_e, but rather on its averaged effect:

\[
\hat{H}_{\mathrm{eff}}(\mathbf{r}_s, R_e) = \hat{H}_s(\mathbf{r}_s) + \langle \hat{H}_e \rangle_{R_e} + \int_{\Omega_e} \hat{H}_{\mathrm{int}}(\mathbf{r}_s, \mathbf{r}_e) \, d\mathbf{r}_e . \tag{3}
\]

The symbol R_e summarizes the set of parameters describing the way the environment
is averaged.
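
A minimal numerical sketch may help to see how averaging over the environment leaves only a few parameters behind. We assume, purely for illustration, a toy environment made of a unit charge smeared uniformly over a spherical shell of radius a (playing the role of R_e) and a Coulomb-like coupling 1/|r_s − r_e|; the function name and the midpoint quadrature are our own choices and do not belong to any package.

```python
import math

def averaged_interaction(s, a, n=2000):
    """Average of 1/|r_s - r_e| with r_e spread uniformly on a sphere of radius a.

    s is the distance of the system coordinate r_s from the centre of the shell.
    By symmetry only u = cos(theta) matters, so the surface average reduces to
    a one-dimensional integral over u in [-1, 1] (midpoint rule, n intervals).
    """
    total = 0.0
    for k in range(n):
        u = -1.0 + (k + 0.5) * 2.0 / n  # midpoint of the k-th interval in cos(theta)
        total += 1.0 / math.sqrt(a * a + s * s - 2.0 * a * s * u)
    return total / n

if __name__ == "__main__":
    a = 2.0  # hypothetical shell radius, the only surviving environment parameter
    for s in (0.0, 0.5, 1.5, 3.0):
        print(f"s = {s:3.1f}  averaged coupling = {averaged_interaction(s, a):.6f}")
```

For any r_s inside the shell the averaged coupling equals 1/a, whatever the position: the detailed environment coordinates have disappeared and only the parameter a is left, which is the spirit of Eq. 3.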
The most effective approximation for the environment is the polarizable
continuum medium, separated from the solute molecule by a thin layer of empty space. The
averaging is therefore not carried out explicitly; rather, it comes from the electrostatics of
dielectric materials [24]. The polarization of the interface between the solute and the
solvent is described via boundary charge elements, i.e. discrete small
charges distributed on elementary surface elements describing the solvent-accessible
solute surface. The shape of this interface is usually described numerically because,
with the exception of mesoscopic approximations, the shape of the solute molecule
cannot be approximated as a single sphere or a single analytically manageable solid
object.
The quantum properties of the solute can be described via a Schrödinger equation
where the hamiltonian includes electrostatic interactions between electrons and a
continuum polarizable environment. The main achievement is a set of continuum
models used in quantum mechanics (QM) [25]. The most popular form is the PCM
method implemented in many QM packages [26]. The method can be used with all the
different approaches to solving the SE and is not limited to the ground state, as DFT is (see below).
The different ways to describe the solute cavity, the approximations made in the
solution of the SE and the different ways to compute energy gradients (to perform
energy minimization or geometry optimization) are still improving, and different
approaches are still competing for the best performance and accuracy. See
Refs. [27, 28] for a recent debate on this subject.

8 The Mean–Field in Action

In order to visualize the effect of a polarizable model for the environment, a simple
model can be devised. Let us extract a C-terminal lysine residue from a small protein,
the 1–16 segment of the amyloid-β (Aβ) peptide. This terminal residue is also very
important for the propensity of Aβ to aggregate, a challenging arena for computer
simulations. The possibility for the C-terminus of a peptide to form salt-bridges
with N-terminal partners is influenced by the alternative possibility of forming
intramolecular salt-bridges or hydrogen bonds. In the case of Lys, the protonation state at
pH∼7 is expected to be ruled by the pKa of the two groups that can carry a proton at
pH values within a range of 3 pH units, i.e. the ammonium group of the sidechain (Nζ)
and the carboxylic group of the C-terminus (C). The N-terminus of Lys is, in this
example, blocked by the peptide bond with the rest of the protein chain. In our model
this blocking is provided by an acyl group, and the small fragment will be indicated
as Ac-Lys. Between pH∼2 and 9 it is expected that the C-terminus is predominantly
a carboxylate group (deprotonated) and Nζ is in the ammonium form (protonated).
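
The expected protonation pattern can be checked with the Henderson-Hasselbalch relation. The sketch below is only illustrative: the two pKa values are assumed, typical textbook numbers and are not taken from this chapter, and the function name is hypothetical.

```python
def protonated_fraction(pH: float, pKa: float) -> float:
    """Fraction of a titratable group that carries its proton at a given pH
    (Henderson-Hasselbalch relation for an isolated, non-interacting site)."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

# Assumed, illustrative pKa values (not from the text).
PKA_C_TERMINUS = 3.5    # carboxylic group of the C-terminus
PKA_LYS_NZETA = 10.5    # ammonium group of the Lys sidechain

if __name__ == "__main__":
    pH = 7.0
    cooh = protonated_fraction(pH, PKA_C_TERMINUS)
    nh3 = protonated_fraction(pH, PKA_LYS_NZETA)
    print(f"pH {pH}: protonated C-terminus fraction = {cooh:.4f}")
    print(f"pH {pH}: protonated N-zeta fraction     = {nh3:.4f}")
```

With these assumed values, at pH 7 the C-terminus is almost entirely deprotonated and Nζ almost entirely protonated, i.e. the zwitterionic state discussed above.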
If we now perform the minimization of the energy of this Ac-Lys fragment in
vacuum with a model at the density functional theory level of approximation, we obtain
the structure displayed in the left panel of Fig. 2. The proton is rapidly transferred
from the ammonium group in the sidechain to the carboxylate group, the neutral Lys
sidechain is hydrogen bonded to the neutralized C-terminus with two dihedral angles
in the gauche state, and the whole amino acid adopts a compact structure.
The addition of the simplest level of polarizable continuum for the water solution
consists of adding to the solute, described at the same DFT level, a homogeneous
dielectric with relative permittivity εr = 78 beyond the solvent-accessible surface
of the molecule. In the PCM method, this is done for every configuration
built iteratively during the energy minimization process. The minimization of
energy with this model produces the expected protonation state and a single gauche
dihedral angle in the conformation of the Lys sidechain (right panel in Fig. 2).

Fig. 2 Minimal energy configurations obtained for Ac-Lys in vacuum (left) and in the
polarizable continuum model (PCM, right), obtained at the DFT level. A localized basis-set
(6-31+G(d,p)), the B3LYP hybrid exchange–correlation functional and a dielectric permittivity
of 78 for the solvent were used with the Gaussian09 package [29]

This exercise, which we shall continue below by increasing the resolution of the solute
environment, shows that at low temperature the Ac-Lys fragment keeps its charged
groups separated in space, with no strain on the chain mechanics, thanks to the solvent
polarizability. The polarizability is that of liquid water at room conditions, so there is
a conceptual gap between the temperature of the solute and that of its environment.
Proper vibrational and entropic corrections can partially bridge this gap.
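
To get a feeling for the size of the stabilization that a dielectric with εr = 78 provides to a charged group, one can use the Born model, the simplest continuum estimate: a point charge q at the centre of a spherical cavity of radius a gains ΔG = −(q²/2a)(1 − 1/εr) in atomic units. This is a much cruder model than PCM (single ion, spherical cavity), and the 2 Å radius used below is an assumed, illustrative value.

```python
HARTREE_TO_KCAL_MOL = 627.509   # energy conversion factor
ANGSTROM_PER_BOHR = 0.529177    # length conversion factor

def born_solvation_energy(charge: float, radius_angstrom: float,
                          eps_r: float = 78.0) -> float:
    """Born electrostatic solvation free energy (Hartree) of a point charge
    centred in a spherical cavity embedded in a dielectric of permittivity eps_r."""
    radius_bohr = radius_angstrom / ANGSTROM_PER_BOHR
    return -(charge ** 2) / (2.0 * radius_bohr) * (1.0 - 1.0 / eps_r)

if __name__ == "__main__":
    # Hypothetical monovalent ion with an assumed cavity radius of 2 Angstrom.
    dg = born_solvation_energy(charge=1.0, radius_angstrom=2.0)
    print(f"Born solvation energy: {dg * HARTREE_TO_KCAL_MOL:.1f} kcal/mol")
```

Even this crude estimate gives a stabilization of several tens of kcal/mol for each charged group, which makes it plausible that the continuum is enough to keep the ammonium and carboxylate groups of Ac-Lys charged and apart.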

9 More Than a Mean–Field: Beyond the Continuum

In many more complicated molecular systems, when liquid water is the solvent, the
problem of sampling the relevant molecular configurations (even in terms of
distribution among several countable energy minima) becomes more challenging. Among
the many sources of problems, those of particular interest here are summarized below.
Besides the effect on the electronic structure due to the polarization of liquid water
as the reaction to the electric field induced by the solute, water molecules can form
specific interactions with certain chemical groups in the solute, like hydrogen bonds.
When the hydrogen bond network between water molecules changes compared to
bulk liquid water because of the presence of the solute, the structure of water
is perturbed. The mere replacement of a portion of bulk water by
the solute exerts, therefore, an effect on the hydrogen bond structure of water,
thus also changing the polar nature of the solvent. This effect is summarized in
the so-called “hydrophobic” effect [30, 31]. Most of the theoretical and empirical
descriptions of such effects are covered in other contributions to this book.
A second important effect is that water is a protic solvent. Water molecules
exchange protons among themselves and with the solute, thus providing a buffer of discrete
charges possibly interacting with mobile charges in the solute. A detailed
description of this effect, which is in fact the movement of a positive hole in a cloud of electrons
(thus resembling the movement of charge defects within metals), becomes mandatory
when a possible flux of such mobile charges is the property of interest. This is an
important issue in redox chemistry.

10 To the Infinity... and Beyond

The discrete nature of part of the environment close to the quantum portion of the
system can be included for certain atomic shells. This provides an elegant
combination of quantum mechanics with the molecular mechanics of the environment, and
possibly with the continuum of its solvent at a longer spatial scale. The idea is to
describe at a quantum-mechanical level the portion of the system where electrons are
important (the QM portion), at a molecular mechanics level (described via a set of
empirical parameters) the portion of the system acting as a constrained mean–field
(where electronic variables are averages corresponding to each position of the nuclei,
the MM portion), and finally the farther portion of the system as a continuum (where
both electronic and nuclear variables are averaged).
The first level is known as the quantum-mechanics/molecular-mechanics approach
(QM/MM) and is nowadays a routine method for addressing catalysis in
enzymes, also in the presence of a sample of water molecules [32]. The addition of
an infinite continuum is relatively recent and is becoming attractive for
computational spectroscopy of large assemblies in water solution (see for instance the GLOB
method in Ref. [33]).

11 Cutting Out a Piece: The Solid-State Approach

In solid state physics, density functional theory (DFT) within periodic
boundary conditions has also become a routine tool for studying lower symmetry states,
like solid-liquid interfaces, phase transitions, fractures and samples under stress. Since
disorder is in most cases related to finite temperature, the latter must be taken
into account. The best review of these methods is the comprehensive book by Marx
and Hutter [34].
Density functional theory soon became a computationally affordable approach to
describe the electronic structure of extended systems in the ground state using the
same set-up developed over the years for the solid state. In terms of modeling,
the attribute “extended” refers to systems containing a large number of atoms that
are not significantly affected by the boundaries of the system. The sample, therefore,
is repeated periodically in the three directions of space, and it is the unit cell
of an infinite crystal. When this unit cell is large compared to the unit cell
of a real crystal, the concept of super–cell is used. The interactions and the
associated thermal fluctuations within the super–cell are an approximation of those
in the infinite sample. For instance, fluctuations with wavelength larger than the
cell sides are neglected. The accuracy of the results depends heavily on the cell
size and shape, and the reliability of the information obtained from such
models concerning the interactions, fluctuations and motions of interest must be
critically assessed. There is still a lively debate on the dependence of models on
their boundaries, both in the case of mean–field and of periodic boundary conditions.
Basically, in both cases the accuracy depends on how the properties of interest depend
on long–range interactions, also mediated by the solvent, between the particles
most involved.
The basis of DFT (see also the next chapter in this book) is that in the ground state
there is a one-to-one correspondence between a given electron density ρ(r) and a
given electrostatic external potential V(R) acting on the sample of electrons. In the
specific case of a large sample of atoms, the vector R spans all the positions of the atomic
cores interacting with the electrons via electrostatic forces. The unique minimum of
the energy of the system with respect to the electron density identifies the one-to-one
correspondence [23, 35]:

\[
U(\mathbf{R}) = \min_{\rho} E(\rho, \mathbf{R}) \tag{4}
\]

where E is the energy functional of ρ once the positions R are given. The existence
of a unique potential energy U allows the derivation of the forces F_I = −∂U/∂R_I for
each core I, evaluated at the function ρ that minimizes the functional E. All the
complications of quantum mechanics are thrown into the functional form E(ρ).
The electron density is represented in the Kohn-Sham form: the density is
obtained from a single determinant built from one-electron states, called Kohn-Sham
(KS) states and indicated with ψ_i, with i running over the possible occupations (which
depend on the available number of electrons):

\[
\begin{aligned}
E(\{\psi_i\}, \mathbf{R}) ={}& \sum_i f_i \int_V \psi_i^*(\mathbf{r}) \left( -\frac{1}{2} \nabla^2 \right) \psi_i(\mathbf{r}) \, d\mathbf{r} \\
&+ \frac{1}{2} \int_V \int_V \frac{\rho(\mathbf{r}) \rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|} \, d\mathbf{r} \, d\mathbf{r}' \\
&+ \int_V \varepsilon_{XC}[\rho(\mathbf{r})] \, \rho(\mathbf{r}) \, d\mathbf{r} + E_{e,I}(\{\psi_i(\mathbf{r})\}, \mathbf{R}) + U_I(\mathbf{R})
\end{aligned}
\tag{5}
\]

with V the space of the super–cell, ρ(r) = Σ_i f_i |ψ_i(r)|², E_e,I the electrostatic
interaction between electrons and atomic cores, U_I the electrostatic interaction between
atomic cores, and f_i the occupation (1 or 2) of each one-electron KS state. The most
exotic quantity is the functional ε_XC, the exchange–correlation functional, containing all the
complications of the exclusion principle holding for electrons (while the
indeterminacy is in the first term in the sum). Atomic units are used for convenience. There is
no space here to enter into the details of the exchange–correlation functionals, but a few words
are in order for the practitioner. The functionals are usually referred to by
acronyms (PBE [36] and B3LYP [37], just to mention the most popular) coming
from the names of those who designed them. They are divided into three classes. One is
the original local density approximation (LDA [23]), where the functional depends
on the electron density at a single point. Starting from this basic approximation, which
provides very poor descriptions of the electron density in molecules, the non-local
properties of the functional were modeled by expansions of the electron density around
the single point, providing the first generalized gradient approximations (GGA, see
Ref. [38] and references therein). The Hartree-Fock solution to the SE provides the
exact exchange functional:

\[
E_x^{HF} = \frac{1}{2} \sum_{i,j} \int_V \int_V \psi_i^*(\mathbf{r}_1) \, \psi_j^*(\mathbf{r}_2) \, \frac{1}{r_{1,2}} \, \psi_i(\mathbf{r}_2) \, \psi_j(\mathbf{r}_1) \, d\mathbf{r}_1 \, d\mathbf{r}_2 , \tag{6}
\]

where the sum over i and j runs over occupied states and the two position vectors (r1
and r2) refer to the two electrons. Although this expression is exact for the exchange contribution,
the lack of an equally accurate description of the correlation effects has led to the
development of hybrid approaches: a fraction of the exact exchange is combined with
the GGA approximations, with coefficients derived by fitting experimental results
or accurate calculations performed with non-DFT methods. By relying on a specific
parameterization, hybrid approaches are able to exploit the cancellation of errors in
the description of correlation and exchange contributions. Even if the hybrid scheme
is the most accurate description of the exchange–correlation functional, it is computationally
expensive, because of the matrix elements above (also involving an extension of the
KS set over zero-occupation states) that must be computed. In practice the evaluation of the
exact exchange is about ten times slower than pure GGA approximations. The latter
can also include different corrections, and in our examples the PBE approximation will
be used because it provides the best compromise between computational performance
and accuracy for calculations of atomic forces and their statistical effects, especially
for water at room conditions [39]. Most of the corrections for pure water then come
from the quantum nature of the hydrogen atom [40].
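
The mixing of exact exchange with a GGA can be written compactly. As a sketch, in the simplest one-parameter hybrid scheme (the mixing coefficient a and the splitting of the GGA into exchange and correlation parts are our notation, not that of the equations above):

```latex
\[
E_{XC}^{\mathrm{hyb}} = a \, E_x^{HF} + (1 - a) \, E_x^{\mathrm{GGA}} + E_c^{\mathrm{GGA}} ,
\qquad 0 \le a \le 1 .
\]
```

With a = 1/4 and PBE as the GGA this is the form usually called PBE0, while B3LYP uses three fitted coefficients instead of one. Setting a = 0 recovers the pure GGA, which avoids the matrix elements of Eq. 6 altogether.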
Despite the presence of r1 and r2 in Eq. 6 and of one-electron KS states in Eq. 5, we
must warn the reader that electrons are entirely dematerialized in quantum mechanics.
All variables and indices related to single particles are used to conveniently
“represent” the quantum nature of many interacting electrons. In many books and articles
concerning the DFT method, this concept is referred to as non-locality. The concept
is, for instance, made particularly evident when the simple Lennard-Jones
approximation for interacting atoms is derived from DFT (see for instance the JuNoLo code
development and references therein [41]). Another example is the theory developed
to describe electron correlation in the ground state, also known as Hubbard-U
theory [42]. In non-DFT textbooks, the same concept arises from the combinations of
complicated determinants in the framework of multi-reference and configuration
interaction methods (also called post-Hartree-Fock methods, see the chapter about iron
porphyrins in this book).
The KS states ψi representing the electron density are in turn represented as linear
combinations of simpler orthogonal functions. Quantum chemists prefer to expand
the one-electron KS states in atomic contributions (or manageable representations
of them, such as sets of gaussian functions). This representation greatly helps in
understanding the chemical structure of the electron density at the energy minimum,
especially for isolated molecules (either in the gas phase or in mean–field
representations of their environment). Solid state physicists prefer to use plane-waves
consistent with the super–cell used as the sample. The advantage of this second
approach is that the representation does not depend on the variables R_I, leaving the
representation of ρ and the variables identifying the external potential that determines
ρ uncoupled. This allows a rigorous total energy conservation when the equations of
motion are solved using forces and velocities of each atomic core I: the existence of
invariants in dynamical methods is very useful in practice to check for errors in code
implementation, in the writing of equations, etc. The disadvantage is that a large super–cell
may require a huge number of plane-waves. However, in the last 20 years, especially
because of the engineering of Fourier transform algorithms, the sizes of the
affordable samples have made this DFT approach the best performing tool for investigating also
small samples of biological models cut out of larger empirical models.
The E functional written in Eq. 5 is a complicated potential energy functional of
a complicated, yet manageable, representation of the electron density and of the atomic
core positions. The idea was then to use this potential energy in a lagrangian including
the kinetic energy of the atomic cores (treated as classical point masses) and some
artificial masses associated with the KS states. This is an extended lagrangian
method, known as the Car-Parrinello molecular dynamics method [43]. Since
the KS states must be orthogonal at any time, an additional term representing this
constraint can be included:

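\[
\begin{aligned}
L ={}& \mu \sum_i \int_V |\dot{\psi}_i(\mathbf{r})|^2 \, d\mathbf{r}
+ \frac{1}{2} \sum_I M_I \dot{\mathbf{R}}_I^2 - E_{KS}(\{\psi_i\}, \mathbf{R}) \\
&+ \sum_{i,j} \Lambda_{i,j} \left( \int_V \psi_i^*(\mathbf{r}) \, \psi_j(\mathbf{r}) \, d\mathbf{r} - \delta_{i,j} \right)
\end{aligned}
\tag{7}
\]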

where Λ_i,j is a set of Lagrange multipliers introduced to keep the KS states orthogonal
during the time evolution, M_I are the atomic masses and μ is the artificial
mass associated with each KS state. This lagrangian describes a system of electrons and
atomic cores coupled together: since the original basis of DFT assumes their dynamical
independence (otherwise atomic forces are not meaningful), the chosen value of the
parameter μ must guarantee the slow time evolution of the KS states when atoms are
moved. In other words, the electrons must stay in the ground state at any time. This
behaviour must always be checked during the time evolution, even if some degree
of variation may be tolerated. This check is analogous to that required for the conservation
of total energy (or of another rigorous invariant in the case of extended lagrangians)
in any empirical molecular dynamics simulation.
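
The two checks mentioned above can be tried out on a toy extended lagrangian. The sketch below is not a DFT calculation: a single scalar c stands in for the KS states, a single coordinate R for the atomic cores, and the energy E(c, R) = ½ k_el (c − R)² + ½ k_ion R² is a hypothetical model whose minimum over c is at c = R. The fictitious kinetic energy is written with the ½ μ ċ² convention and no orthogonality constraint is needed for one scalar; all parameter values are assumptions chosen for illustration.

```python
def toy_car_parrinello(n_steps=20000, dt=0.05, mu=1.0, mass=1000.0,
                       k_el=10.0, k_ion=1.0):
    """Velocity-Verlet integration of a two-variable extended lagrangian.

    c : fictitious 'electronic' degree of freedom with artificial mass mu
    r : 'ionic' coordinate with mass `mass`
    Returns the conserved quantity and the fictitious kinetic energy per step.
    """
    def energy(c, r):
        return 0.5 * k_el * (c - r) ** 2 + 0.5 * k_ion * r ** 2

    def forces(c, r):
        f_c = -k_el * (c - r)              # -dE/dc
        f_r = k_el * (c - r) - k_ion * r   # -dE/dr
        return f_c, f_r

    c, r = 1.0, 1.0        # start with c at its minimum for the given r
    v_c, v_r = 0.0, 0.0
    f_c, f_r = forces(c, r)
    conserved, fake_kinetic = [], []
    for _ in range(n_steps):
        v_c += 0.5 * dt * f_c / mu         # first half kick
        v_r += 0.5 * dt * f_r / mass
        c += dt * v_c                      # drift
        r += dt * v_r
        f_c, f_r = forces(c, r)
        v_c += 0.5 * dt * f_c / mu         # second half kick
        v_r += 0.5 * dt * f_r / mass
        ke_fake = 0.5 * mu * v_c ** 2
        ke_ion = 0.5 * mass * v_r ** 2
        fake_kinetic.append(ke_fake)
        conserved.append(ke_fake + ke_ion + energy(c, r))
    return conserved, fake_kinetic

if __name__ == "__main__":
    conserved, fake_kinetic = toy_car_parrinello()
    print("drift of conserved quantity:", max(conserved) - min(conserved))
    print("largest fictitious kinetic energy:", max(fake_kinetic))
```

With these parameters the invariant stays essentially constant and the fictitious kinetic energy remains a tiny fraction of the total energy, i.e. c keeps following its instantaneous minimum while R moves. Increasing mu towards mass spoils this separation, which is the behaviour one has to monitor in a real simulation.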

There is no space to go into the details of the construction of a numerical solution
for the equations of motion described by the lagrangian in Eq. 7, or into the
description of the various functionals contributing to the KS functional representation of E.
But one important point must be stressed. The possibility of describing via dynamical
equations the electronic structure that follows the movement of atomic cores
according to consistent first-principles forces (called “on-the-fly” Car-Parrinello molecular
dynamics) was a “numerical” revolution. Every quantum chemist who has struggled to reach
self-consistency (the usual way the variational principle is applied to get the ground
state of the SE) for slightly distorted atomic configurations, built by just adding a few water
molecules around a solute, can testify to the dramatic change when the extended
lagrangian method is used. In the latter case, a small oscillation of the “fake” electron
kinetic energy (that associated with the mass μ, the first term in Eq. 7) allows the system to get past
many problematic points in configurational space. Even if it may be considered
as a great reduction of the elegance and power of the method, this mere advantage
allows the sampling of first–principles descriptions of extended molecular systems
that would not be feasible with the usual self-consistency approach. The statistics at
room conditions (and even at extreme conditions) suddenly became affordable.
There are two other important practical points that must be emphasized about
DFT applications in super–cells. All the electrostatic contributions to the E functional
have a 1/r dependence that makes them effective even at long distances from
each of the atomic cores. Achieving a decent accuracy in the sum of electrostatic
contributions when a mixture of positive and negative charges is involved is a difficult
task. The evaluation of Madelung constants for regular crystals is helped by the
periodicity, but when there is disorder the generalization is not trivial (see Ref. [44]
for a clear report on this issue). In periodic systems made of neutral unit cells the
standard Ewald summation technique allows a rigorous calculation of electrostatic
interactions: there is no cut-off for long-range tails. In practice, interactions between
charges are smoother than in isolated systems and total energy is better conserved.
This means that the effects of any thermostats are smaller and statistics are more
reliable. On the other hand, the infinite periodicity of the super–cell may influence
the behaviour of the molecule of interest: this molecule is still weakly affected by its
periodic images in the infinite lattice.
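
The difficulty of summing 1/r contributions from charges of alternating sign already shows up in the simplest textbook case, an infinite one-dimensional chain of alternating unit charges with unit spacing, whose Madelung constant is 2 ln 2. The sketch below sums the series directly, shell by shell; it is not the Ewald technique, and the function name is our own.

```python
import math

def chain_madelung(n_shells: int) -> float:
    """Direct partial sum of the Madelung constant of an infinite 1D chain of
    alternating unit charges: 2 * sum_{n=1..n_shells} (-1)**(n+1) / n."""
    return 2.0 * sum((-1) ** (n + 1) / n for n in range(1, n_shells + 1))

if __name__ == "__main__":
    exact = 2.0 * math.log(2.0)
    for n_shells in (10, 100, 1000, 10000):
        error = chain_madelung(n_shells) - exact
        print(f"{n_shells:6d} shells: error = {error:+.2e}")
```

The error of the direct sum decreases only as the inverse of the number of neighbour shells included, so a plain cut-off converges very slowly even in one dimension. This is the kind of behaviour that the Ewald summation is designed to avoid in periodic systems.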
The other point concerns the interactions between the electrons and the atomic
cores. It is implicit that in the equations above the electrons are the valence electrons, while
the cores include the electrons less perturbed by changes in the atomic environment
(i.e. they are frozen cores, an approximation widely used in any QM approach to
condensed matter). Only with this approximation can systems of thousands of atoms
be investigated. The interaction between diffuse valence electrons and the concentrated
electrons in the cores requires a high-resolution representation of the KS states, i.e. a
huge number of plane-waves. The limit of the plane-wave representation is identified
by a single parameter, the energy cut-off. All the functions of type exp(iG · r) are
used up to G = G_max, with G_max the modulus of the wave vector corresponding to the
maximal energy E_max = (ħ²/2m) G_max². With the advent of ultrasoft
pseudopotentials [45], double-grid methods were developed for separating contributions affected
by the regions close to atomic cores (energy cut-off in the range of 300 Ry) from
those between different atomic cores, where the change of the electron density in space is
less steep (energy cut-off in the range of 30 Ry) [46]. The real-space resolution of
the electron density became about 10 pm, i.e. that corresponding to the larger energy
cut-off.
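
The relation between energy cut-off, spatial resolution and basis-set size can be made concrete with a few lines of arithmetic. In atomic units the cut-off in Rydberg is simply G_max² (with G_max in bohr⁻¹). In the sketch below we assume that the real-space resolution is the grid spacing π/G_max and that the number of plane-waves is estimated by counting reciprocal lattice points inside a sphere, N ≈ V G_max³/(6π²); both are common estimates, not statements about a specific code, and the 10 Å cubic cell is a hypothetical example.

```python
import math

ANGSTROM_PER_BOHR = 0.529177
PM_PER_BOHR = 52.9177

def g_max(cutoff_ry: float) -> float:
    """Largest wave-vector modulus (bohr^-1) for a cut-off given in Rydberg:
    E = (hbar^2 / 2m) G^2 equals G^2 Rydberg in atomic units."""
    return math.sqrt(cutoff_ry)

def grid_spacing_pm(cutoff_ry: float) -> float:
    """Real-space spacing pi / G_max (in pm) resolved by the cut-off (assumed definition)."""
    return math.pi / g_max(cutoff_ry) * PM_PER_BOHR

def n_plane_waves(cutoff_ry: float, cell_side_angstrom: float) -> float:
    """Sphere-counting estimate V * G_max^3 / (6 pi^2) for a cubic super-cell."""
    volume = (cell_side_angstrom / ANGSTROM_PER_BOHR) ** 3
    return volume * g_max(cutoff_ry) ** 3 / (6.0 * math.pi ** 2)

if __name__ == "__main__":
    for cutoff in (30.0, 300.0):
        print(f"cut-off {cutoff:5.0f} Ry: spacing {grid_spacing_pm(cutoff):5.1f} pm, "
              f"about {n_plane_waves(cutoff, 10.0):.0f} plane-waves in a 10 Angstrom cube")
```

Under these assumptions the 300 Ry cut-off corresponds to a spacing close to 10 pm, and the tenfold increase of the cut-off multiplies the number of plane-waves by 10^(3/2), roughly 32. This is why the larger cut-off is reserved for the contributions that actually need it.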
In the end, a limited set of words and numbers summarizes the type
of DFT model one is using:
1. the type of exchange–correlation functional;
2. the two energy cut-offs for the plane-wave basis-set (the two spatial resolutions
of the electron density);
3. the type of pseudo-potential used for modeling the atomic cores (related to the
above parameter).

12 Coupling Continuum and Super-Cell Approaches

As we have seen in the previous sections, two complementary approaches have been
developed to characterize solvated systems: implicit continuum models, coupled
to high-level quantum-mechanical calculations, usually performed on static isolated
systems; and explicit, fully atomistic, QM or QM/MM simulations, in periodic boundary
conditions and using molecular dynamics as a way to sample statistical configurations
at finite temperature. In fact, reformulations of continuum solvation have
been recently proposed in the literature [47–54] to allow the seamless coupling of
the two approaches.
The starting point of this new class of continuum approaches is the definition,
thanks to Fattebert and Gygi (FG) [47, 48], of the electrostatic free energy functional
of the system embedded in a polarizable continuum. The energy functional E of Eq. 4
now becomes a free energy functional F, because of the averaging of solvent variables
implicit in the dielectric function ε:
  

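\[
F(\phi, \rho, \mathbf{R}) = \int \left[ \rho(\mathbf{r})\,\phi(\mathbf{r}) + \sum_I z_I\,\delta(\mathbf{r}-\mathbf{R}_I)\,\phi(\mathbf{r}) - \frac{\varepsilon(\rho, \mathbf{R}; \mathbf{r})}{8\pi}\,|\nabla\phi(\mathbf{r})|^2 \right] d\mathbf{r}
\tag{8}
\]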
where φ is the electrostatic potential in the simulation cell, $z_I$ are the atomic (pseudo)
charges, and the key ingredient is the dielectric function, ε(ρ, R; r),
which is assumed to vary smoothly from a value of 1 (vacuum) in the region where
the QM degrees of freedom are present, to the solvent bulk value $\varepsilon_0$ outside the
QM system. The above expression makes it easy to derive all the important equations
related to the QM/continuum interaction through a rigorous and elegant
variational approach. In order to find the equilibrium minimum-energy state of the
system, the functional derivatives with respect to the different fields entering the free
energy functional must vanish. In particular, by imposing a vanishing derivative with
respect to the electrostatic potential, the generalized Poisson equation

\[
\nabla \cdot \varepsilon(\mathbf{r})\,\nabla\phi(\mathbf{r}) = -4\pi\rho(\mathbf{r})
\tag{9}
\]


is obtained, which links the QM charge densities and the smooth dielectric function
with the electrostatic potential. Similarly, to optimize the electronic or ionic degrees
of freedom, functional derivatives of Eq. 8 can be computed analytically to provide the
right descent directions that will automatically include the presence of the dielectric
embedding. The solute electron density ρ is then represented in terms of one-electron
states as in the usual KS approach (see above). The KS potential used in DFT to
optimize the electronic density will be given by

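\[
\frac{\delta F}{\delta\rho}(\mathbf{r}) = \phi(\mathbf{r}) - \frac{1}{8\pi}\,|\nabla\phi(\mathbf{r})|^2\,\frac{\delta\varepsilon}{\delta\rho}(\mathbf{r}),
\tag{10}
\]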

while inter-atomic forces, used to perform geometry optimizations or molecular
dynamics simulations, are computed as
  
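\[
\frac{\partial F}{\partial \mathbf{R}} = \int \left[ \rho(\mathbf{r})\,\frac{\partial\phi}{\partial \mathbf{R}}(\mathbf{r}) - \frac{1}{8\pi}\,|\nabla\phi(\mathbf{r})|^2\,\frac{\partial\varepsilon}{\partial \mathbf{R}}(\mathbf{r}) \right] d\mathbf{r} .
\tag{11}
\]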
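The step from the free energy functional to the generalized Poisson equation can be written out explicitly. The sketch below assumes that the gradient term of the functional carries the prefactor 1/8π (the same prefactor that appears in Eqs. 10 and 11), that $\rho_{tot}$ is a shorthand for the total (electronic plus ionic) charge density, and that surface terms vanish when integrating by parts, as they do in a periodic cell.

```latex
\begin{align*}
\delta F &= \int \left[ \rho_{tot}(\mathbf{r})\,\delta\phi(\mathbf{r})
  - \frac{\varepsilon(\mathbf{r})}{4\pi}\,\nabla\phi(\mathbf{r})\cdot\nabla\delta\phi(\mathbf{r}) \right] d\mathbf{r} \\
 &= \int \left[ \rho_{tot}(\mathbf{r})
  + \frac{1}{4\pi}\,\nabla\cdot\big(\varepsilon(\mathbf{r})\,\nabla\phi(\mathbf{r})\big) \right]
  \delta\phi(\mathbf{r})\, d\mathbf{r} , \\
\frac{\delta F}{\delta\phi} = 0 \;&\Longrightarrow\;
 \nabla\cdot\big(\varepsilon(\mathbf{r})\,\nabla\phi(\mathbf{r})\big) = -4\pi\,\rho_{tot}(\mathbf{r}) .
\end{align*}
```

The first line varies the linear and quadratic terms in φ, the second moves the gradient off δφ by integration by parts, and the stationarity condition then reproduces Eq. 9.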

In contrast to PCM and other state-of-the-art continuum approaches for isolated
systems, where a sharp boundary is introduced, the formulation of FG relies on a
continuum embedding interface which varies smoothly over space. This difference
has important practical consequences. Its main drawback is that the electrostatic
problem is a more complex three-dimensional one, while in PCM the problem can
be projected onto and solved on the two-dimensional boundary. On the other hand, in
simulations that exploit periodic boundary conditions and plane-wave basis sets,
the use of a smooth interface avoids the need to introduce an additional numerical
domain (e.g. a complex discretized molecular-shaped cavity surface), while the
same structured three-dimensional grid can be used for the solution of both the QM
problem and the electrostatic one. On such a standard domain, the solution of the
generalized Poisson equation (Eq. 9) can exploit fast numerical solvers, such as fast
Fourier transforms (FFTs) or multigrid methods. Finally, the main advantage
of approaches based on the method of FG is the possibility of having
clean and analytical contributions to the forces, which is crucial to perform stable
molecular dynamics simulations.
For the purpose of reducing the number of parameters involved in continuum
solvation approaches, FG also proposed to define the dielectric function in terms of
the electronic density of the system (see Fig. 3), as opposed to standard PCM-type
models, where the continuum boundary is defined in terms of atom-centered spheres,
with each atomic species associated with a different solvation radius.
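A density-dependent dielectric function of this kind can be sketched in a few lines. The switching form below is one commonly quoted for FG-type models; it is given here as an assumption, since the text does not state the formula, and the parameter values (`eps0`, `rho0`, `beta`) are illustrative defaults rather than values taken from this chapter.

```python
def dielectric(rho: float, eps0: float = 78.3,
               rho0: float = 4.0e-4, beta: float = 1.3) -> float:
    """Smooth dielectric function of the local electron density (a.u.).

    Assumed switching form:
        eps = 1 + (eps0 - 1)/2 * [1 + (1 - x) / (1 + x)],  x = (rho/rho0)**(2*beta)
    It tends to eps0 where the density vanishes (bulk solvent) and to 1
    where the density is high (inside the solute).
    """
    if rho <= 0.0:
        return eps0
    x = (rho / rho0) ** (2.0 * beta)
    return 1.0 + 0.5 * (eps0 - 1.0) * (1.0 + (1.0 - x) / (1.0 + x))


# Scan a few densities from the solvent side to the solute side.
for rho in (0.0, 1.0e-4, 4.0e-4, 1.0e-3, 1.0e-2):
    print(f"rho = {rho:8.1e} a.u. -> eps = {dielectric(rho):7.3f}")
```

Only two parameters (a density threshold and a steepness) control the whole interface, which is the reduction in parameter count that the paragraph above refers to.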
The limited number of empirical parameters involved in this class of approaches
simplifies the extension of the model to different embedding environments, for which
limited experimental data is available. This avoids the risk of having more param-
eters than observables when a comparison between calculations and experiments is
performed.
With the aim of reproducing experimental free energies of solvation, the model of
FG was coupled by Scherlis et al. [50] with a free energy functional to characterize
Fig. 3 Self-consistent continuum solvation of an acetamide cation. Notice that the additional proton
is bound to carbonyl oxygen, as expected from the higher basicity of amide O compared to amide N. a
The self-consistent boundary is built in terms of isosurfaces of the electronic density, the transparent
surface corresponds to a value of 0.01 a.u. b The dielectric screening of the environment is effec-
tively modelled through an induced polarization density, the transparent red and blue isosurfaces
correspond to a value of plus and minus 0.001 a.u., respectively. c Value of the electronic density (in
black) and of the dielectric permittivity (in red) as a function of position along the axis visualized
in panel a. d Value of the polarization charge as a function of position along the axis visualized in
panel a

the energy penalty involved with the creation of the continuum/vacuum interface
inside the embedding medium. The cavitation energy functional was introduced,
similarly to what was done by Cococcioni et al. for the enthalpy functional [55], by
exploiting the concept of quantum surface of a QM system: similarly to the dielectric
continuum, also in this case the embedding energy is expressed as a functional of a
smooth interface function defined in terms of the QM degrees of freedom, namely

\[
G_{cav} = \int |\nabla s(\mathbf{r})|\, d\mathbf{r}
\]
where the interface function s(r) now goes from a value of 1 inside the QM region
to a value of 0 inside the environment region.
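The integral of |∇s| measures the area of the smooth interface. A minimal numerical check, assuming a hypothetical spherically symmetric interface s(r) = erfc((r − R)/Δ)/2 (this profile and the function names are illustrative, not the interface used in the models cited above): for this profile the integral equals 4π(R² + Δ²/2), which tends to the area of a sphere as the interface sharpens.

```python
import math


def interface(r: float, radius: float, width: float) -> float:
    """Assumed smooth interface: about 1 inside the sphere, about 0 outside."""
    return 0.5 * math.erfc((r - radius) / width)


def quantum_surface(radius: float, width: float, n: int = 20000) -> float:
    """Integral of |grad s| over space for a spherically symmetric s(r).

    With radial symmetry the volume integral reduces to
    4*pi * integral of r^2 |ds/dr| dr, evaluated with the midpoint rule
    and a central finite difference for ds/dr.
    """
    r_max = radius + 10.0 * width
    h = r_max / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * h
        ds_dr = (interface(r + 0.5 * h, radius, width)
                 - interface(r - 0.5 * h, radius, width)) / h
        total += r * r * abs(ds_dr) * h
    return 4.0 * math.pi * total


area = quantum_surface(radius=3.0, width=0.3)
print(area, 4.0 * math.pi * 3.0 ** 2)  # smooth-interface area vs. sharp sphere
```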
In order to extend the capability and the accuracy of the model, following similar
approaches developed within the PCM framework, Andreussi et al. [54] substantially
revised the FG model, by improving the definition of the dielectric function, by
combining it with the enthalpy functional of Cococcioni et al. [55], and by carefully
parameterizing and testing the model on a reliable set of experimental data. The
resulting self-consistent continuum solvation (SCCS) approach proved to be close
to chemical accuracy in reproducing aqueous solvation energies of small organic
compounds [54]. Moreover, the largest deviations from the experimental results are
observed for strongly interacting functional groups, such as acidic or basic groups
or hydrogen bonding groups, for which the continuum approximation introduced by
the model is, in fact, expected to break down.
SCCS was later extended and tested on charged systems in solution [56]. Compared
with high-level quantum-chemistry calculations, when using the super-cell approach
one needs to take proper care of periodic boundary conditions, which can introduce
significant artefacts when used to model charged systems. Among
the most common approaches used to correct artefacts due to periodic boundary
conditions for isolated systems in vacuum, the Makov-Payne, Martyna-Tuckerman and
point-countercharge methods were extended to include the presence of a continuum
dielectric embedding as defined in the FG or SCCS models [57]. Results on
charged systems show accuracies that are comparable to state-of-the-art continuum
approaches, but a reparametrization of the model specific for anions was shown
to be required [56]. This is probably a consequence of the poor description of the
hydrogen bond in continuum solvation together with the charge asymmetry in water
solvation, where negative and positive compounds are solvated via the hydrogen
atoms or oxygen lone pairs, respectively.
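To give a feeling for the size of the periodic-image artefact for a charged solute, the sketch below evaluates only the leading (monopole) Makov-Payne term for a cubic cell, q²α/(2L) in Hartree atomic units, with α the Madelung constant of the simple cubic lattice. Dividing by a uniform dielectric constant is the simplest possible assumption for an embedded system and is not necessarily the form derived in Ref. [57]; the function name and the example cell size are hypothetical.

```python
MADELUNG_SC = 2.837297  # Madelung constant of a simple cubic lattice


def makov_payne_leading(charge: float, box_length: float, eps: float = 1.0) -> float:
    """Leading Makov-Payne correction (Hartree) to add to the periodic energy.

    charge:     net charge of the cell in units of e
    box_length: edge of the cubic cell in bohr
    eps:        uniform dielectric constant of the embedding (1 = vacuum);
                the 1/eps scaling is an assumption made for illustration
    """
    return charge ** 2 * MADELUNG_SC / (2.0 * eps * box_length)


HARTREE_IN_KJ_MOL = 2625.4996
for eps in (1.0, 78.3):
    e_corr = makov_payne_leading(charge=1.0, box_length=20.0, eps=eps)
    print(f"eps = {eps:5.1f}: correction = {e_corr * HARTREE_IN_KJ_MOL:7.2f} kJ/mol")
```

Under these assumptions a singly charged solute in a 20 bohr cubic cell needs a correction of the order of 10² kJ/mol in vacuum, and a much smaller one when the images are screened by a water-like dielectric.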
The elegant formulation of the FG derived continuum approaches allows the easy
coupling of this embedding strategy with most of the available techniques to compute
spectroscopic properties in solution, which provide key results to compare theory
and experiments. In particular, similarly to what was done in the PCM framework,
optical spectroscopies in solution can be computed [58] by exploiting linear response
approaches and assuming that the solvent dielectric screening during a fast process,
such as an electronic excitation, is the high-frequency optical one, ε∞. Couplings
with vibrational, magnetic or core-level spectroscopies in solution also require minor
modifications with respect to the same calculations in vacuum and are the object of
on-going research.
In addition to the smooth self-consistent definition of the solvent interface presented
above, smooth atom-centered definitions have been proposed in the literature
[52, 59], leading to a more tunable model that can better adapt to specific applications.
For example, the possibility of specifying different cavity parameters for different
atoms or portions of the QM system makes it possible to have in the same simulation cell
neutral, positively and negatively charged compounds, which require different interface
parameters.
While for many applications smooth-interface solvation models can be equivalent
to other well-assessed continuum approaches, their coupling with super-cell
calculations greatly extends their range of applications. The most evident advantage
of this class of methods is that it allows the simulation of solvation effects in partially
periodic systems, such as two-dimensional interfaces or one-dimensional systems.
Examples of this kind of system are easily found in the fields of heterogeneous
catalysis and electrochemistry [60–64], but are also common in biophysical applications,
where DNA strands or cell membranes represent just two examples. In order
to overcome some of the limitations intrinsic to the continuum approach without
resorting to overparameterization, one of the most promising solutions is the use of
hybrid discrete/continuum simulations, where those solvent molecules that have a
more structural role are explicitly included in the QM system. Also in this case, the
possibility of running stable molecular dynamics simulations in a continuum embedding,
as granted by the methods presented in this section, is crucial to sample the statistical
properties of the explicit portions of the solvent and to provide an accurate evaluation
of the solvation free energy of the hybrid model.
Although they belong to a family of methods that dates back to the beginning of
the twentieth century [65], continuum solvation approaches and their applications to
more and more complex and exciting fields are still advancing actively.

13 The Super–Cell in Action

In order to compare the mean–field approach of QM for an isolated system with
the super–cell approach for a small sample in a periodic environment, the Ac-Lys
exercise will be performed again. A question may be: can a periodic system be similar
to the isolated Ac-Lys molecule in the infinite dielectric medium investigated above?
The same initial configuration, representing what is expected to be the minimal-energy
structure of the isolated Lys molecule in water solution at physiological pH, is now
embedded in a super–cell with 0.8 nm of empty space between nearest images in
the periodic crystal with an orthorhombic unit cell. The choice of
0.8 nm as the initial distance between nearest images is dictated by experience with
similar systems: 0.4–0.5 nm of empty space around the solute in an orthorhombic
unit cell replicated in the three directions of space is enough to let the electron
density be negligible at the boundary of the super–cell, even when the solute has a
net non-zero charge.
First we let the solute move at a temperature T of 50 K. The molecule,
initially in the all-trans configuration for the Lys sidechain, attempts several dihedral
movements, but the torsional potential, approximated at the DFT level, does not
allow the kinking of the sidechain. If we increase the temperature to 100 K, then we
observe the kinking of the sidechain and the consequent approach of the ammonium
group towards the carboxylate, with the donation of the proton to it. In the final configuration
the carboxylic group is formed, the amino group is hydrogen bonded to it and there are two
gauche defects in the Lys sidechain. Apart from details (energy differences, N–H or
O–H distances, values for the dihedral angles, etc.) the final result is the same as the
minimal-energy in vacuo model represented with a localized basis-set (Fig. 2, left
panel).
It is interesting to notice that the carboxylate neutralization with a proton donated
by the ammonium group of the Lys sidechain may occur via a different mechanism.
If the super–cell is smaller (for instance with 0.5 nm of minimal distance between
nearest images) there is a lower-energy pathway that drives a proton towards the carboxylate.
The ammonium group of the nearest image comes close to the carboxylate
via the rigid rotation of the all-trans configuration. This pathway is followed even at T =
50 K, with no need to increase the temperature to overcome local energy barriers. The
final configuration is, in this case, the carboxylic group of an all-trans Lys interacting,
via a hydrogen bond, with the amino group of the nearest-image Lys sidechain.
Since there is one molecule per super–cell, this means that again the Lys molecule
is amino-carboxylic. This result is an artefact of the small empty space provided
around the solute. Nevertheless, it is an indication that the molecule tends to stay in
a neutral state, and if there is a proton closer than that provided by the solute itself, it
is easily obtained from the environment when available.
In order to include the effects of the water solvent, the same initial configuration
is now merged into the same super–cell filled with water molecules. The configurations
of the water molecules are extracted from the Monte Carlo simulation of an empirical
model of water, the rigid TIP3P water molecule [22]. In the merging process, the
water molecules with the oxygen atom closer than 1.5 Å to any atom in the Ac-Lys solute
are removed. This set-up results in a system composed of the 29 atoms of the Ac-Lys
solute and 105 water molecules in a super–cell of 1.79 × 1.35 × 1.34 nm³.
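The merging step described above amounts to a simple distance filter. A minimal sketch, assuming coordinates in Å, water molecules stored with the oxygen atom first, and no treatment of periodic images (all of these, and the function name, are illustrative assumptions rather than the authors' set-up script):

```python
import math


def remove_overlapping_waters(solute, waters, cutoff=1.5):
    """Keep the waters whose oxygen is at least `cutoff` angstrom from every solute atom.

    solute: list of (x, y, z) tuples for the solute atoms
    waters: list of molecules, each a list of (x, y, z) tuples, oxygen first
    """
    kept = []
    for molecule in waters:
        oxygen = molecule[0]
        if all(math.dist(oxygen, atom) >= cutoff for atom in solute):
            kept.append(molecule)
    return kept


# Toy example: one solute atom at the origin and two water molecules.
solute = [(0.0, 0.0, 0.0)]
waters = [
    [(1.0, 0.0, 0.0), (1.6, 0.8, 0.0), (1.6, -0.8, 0.0)],  # oxygen at 1.0 A: removed
    [(3.0, 0.0, 0.0), (3.6, 0.8, 0.0), (3.6, -0.8, 0.0)],  # oxygen at 3.0 A: kept
]
print(len(remove_overlapping_waters(solute, waters)))
```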
This system is then heated up to T = 300 K. One important effect that is introduced
in this simulation is the mechanism of proton transfer among water molecules in the
liquid state and at room conditions. The proton transfer from the ammonium group to
a more basic group in the environment, including activated water molecules in the
solvent, can occur via a fast rearrangement of X–H chemical bonds (the Grotthuss
mechanism, see Ref. [66] and references therein). This kind of mechanism can be
observed even in short simulations of about 1 ps (see below). The video V2 collects,
within 60 s, 1800 configurations of the hydrated Ac-Lys fragment (it is encoded at
30 frames per second). A summary of the frames is reported in the table below.

| Video time since start (s) | Simulation stage | Simulation time (ps) |
|---|---|---|
| 0–5 | T = 50 K | 0.18 |
| 5–10 | T = 100 K | 0.18 |
| 10–15 | T = 200 K | 0.18 |
| 15–27 | T = 300 K | 0.42 |
| 27–40 | T = 300 K, d0 = 5.5 Å | 0.48 |
| 40–47 | T = 300 K, d0 = 4.5 Å | 0.24 |
| 47–60 | T = 300 K, d0 = 3.5 Å | 0.48 |
In the first 15 s of the video, during the heating of the system up to 200 K, it can be
observed that the water molecules do not change their positions significantly, but
rather start changing their orientation. Although one can say that at 200 K water
should be a crystal, the simulation, which started from a liquid sample, is not long
enough to freeze the sample. Hydrogen atoms, which are lighter than oxygen, attempt
different hydrogen bonds with nearby oxygen atoms. The dynamics of hydrogen
bonds becomes apparent at T = 300 K (about 20 s), when several “bonds” are drawn
just because atoms come closer than 1.6 Å. At this stage hydrogen bonds
between the ammonium H and water O also become visible, as well as those around the
carboxylate. Once the system entered this dynamic regime, an external potential was
added in order to sample configurations with progressively shorter Nζ–C distances.
This method is called umbrella sampling (US) and it is related to the generalized
ensembles described in many other chapters of this book. In this case the US consists
of adding a harmonic potential $U_{US} = (K/2)\,(d - d_0)^2$, with $K = 200$ kcal mol$^{-1}$ Å$^{-2}$
and a mobile equilibrium value $d_0$ for that distance. Between 20 and 40 s, the
US does not alter the room-temperature dynamics, with the Lys sidechain sampling
gauche states. It is at a time of about 50 s that the close distance ($d_0$ is now 3.5 Å)
starts affecting the solute. It can be observed that the C–C bonds are stretched to
distances larger than the visual cut-off (1.6 Å), showing the strong forces exerted by
the electrostatic interactions between the ammonium and the carboxylate groups on
the aminoacid scaffold. Nevertheless, the conclusion is the same as that of the mean–field
model: the proton cannot be transferred from the ammonium to the carboxylate, even
though in some cases the H–O bond becomes visible. The hydrogen bonds between
the two charged groups and water molecules are more likely than the hydrogen
bonds of type N–H···O, and the water molecules around the two groups do not seem to
be perturbed by this compact state. If one follows the left-hand portion of the images
(the methyl group of the acyl fragment) it can be observed how the water molecules
are more rigidly oriented throughout the whole minute of the video. This hydrophobic
interaction is not affected by the US potential and by the closure of the Lys sidechain.
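The harmonic umbrella potential and the forces it adds to the two atoms defining the distance can be sketched as follows. The force constant is the value quoted above; the function names and the use of plain coordinate tuples (in Å) are illustrative assumptions.

```python
import math

K_US = 200.0  # kcal mol^-1 A^-2, force constant quoted in the text


def us_energy(d: float, d0: float, k: float = K_US) -> float:
    """Harmonic bias U = (K/2) (d - d0)^2 in kcal/mol."""
    return 0.5 * k * (d - d0) ** 2


def us_forces(pos_a, pos_b, d0: float, k: float = K_US):
    """Bias forces (kcal mol^-1 A^-1) on atoms a and b, with d = |r_b - r_a|."""
    delta = [b - a for a, b in zip(pos_a, pos_b)]
    d = math.sqrt(sum(c * c for c in delta))
    unit = [c / d for c in delta]
    du_dd = k * (d - d0)
    force_a = [du_dd * u for u in unit]    # pulls a towards b when d > d0
    force_b = [-du_dd * u for u in unit]   # equal and opposite on b
    return force_a, force_b


# A distance 1 A longer than the target d0 = 3.5 A costs 100 kcal/mol.
print(us_energy(4.5, 3.5))
print(us_forces((0.0, 0.0, 0.0), (4.5, 0.0, 0.0), 3.5))
```

The example shows how stiff the bias is: a deviation of 1 Å from the target distance already costs about 170 times the thermal energy at 300 K.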
The water rotations and translations observed in the second half of the video,
both without and with the US potential, are consistent with the experimental radial
distribution function for O–H pairs belonging to different molecules.
In both cases the structural change in the water solvent is small, thus indicating
that the Ac-Lys fragment is soluble in water. The bulk structure of water is not
dramatically affected by the Ac-Lys solute, with the cavity formed by the solute not
perturbing the liquid water sample around it. When the US is introduced, the more
significant change in the Ow–Hw RDF compared to Ow–Ow (right and left panels
in Fig. 4, respectively) indicates that the structure of water is not changing in the
average relative position of the centers of water molecules (the oxygen atoms), but rather in
the orientation of O–H bonds in water molecules. The more pronounced structure
of Ow–Hw RDF (right panel) corresponds to a partial rotational freezing of O–H
bonds in the region close to the charged groups that are kept close in the US region
(thin line in Fig. 4), i.e. when the Ac–Lys molecule is forced towards more compact
configurations.
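A radial distribution function of the kind discussed here can be accumulated from stored configurations. The sketch below handles a single configuration in an orthorhombic periodic cell with the minimum-image convention; the function name, the argument layout and the binning are illustrative assumptions, not the analysis code used for Fig. 4. Intramolecular pairs are not excluded, and for pairs within the same set (e.g. O–O) the zero self-distance is simply skipped.

```python
import math


def radial_distribution(pos_a, pos_b, box, r_max, n_bins):
    """g(r) between two sets of atoms in an orthorhombic periodic cell.

    pos_a, pos_b: lists of (x, y, z) coordinates
    box:          (Lx, Ly, Lz) cell edges, same length unit as the coordinates
    r_max:        largest distance binned (at most half the shortest edge)
    Returns (bin centres, g values).
    """
    dr = r_max / n_bins
    counts = [0] * n_bins
    for a in pos_a:
        for b in pos_b:
            d2 = 0.0
            for k in range(3):
                delta = b[k] - a[k]
                delta -= box[k] * round(delta / box[k])  # minimum image
                d2 += delta * delta
            r = math.sqrt(d2)
            if 0.0 < r < r_max:
                counts[min(int(r / dr), n_bins - 1)] += 1
    density_b = len(pos_b) / (box[0] * box[1] * box[2])
    centres, g = [], []
    for i, c in enumerate(counts):
        shell = 4.0 / 3.0 * math.pi * (((i + 1) * dr) ** 3 - (i * dr) ** 3)
        centres.append((i + 0.5) * dr)
        g.append(c / (len(pos_a) * density_b * shell))  # ideal-gas normalization
    return centres, g


# Two atoms 1.05 apart across the periodic boundary of a 10 x 10 x 10 cell.
r, g = radial_distribution([(0.5, 0.5, 0.5)], [(9.45, 0.5, 0.5)],
                           box=(10.0, 10.0, 10.0), r_max=5.0, n_bins=50)
print([round(x, 2) for x, y in zip(r, g) if y > 0.0])
```

In practice the histogram would be averaged over all the configurations of a simulation stage before normalization.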
[Fig. 4 plot: g(r) versus r (Å), two panels; curves labelled CP-MD, exp. and US]

Fig. 4 Radial distribution functions for O–O pairs (left panel) and H–O pairs (right panel), with
H and O atoms belonging to water molecules. Solid line is for the T = 300 K stage, while the
thin line is for the stage where the US external potential is present and d0 = 3.5 Å. Points are the
experimental data reported in Ref. [39]

Fig. 5 Time evolution of the potential energy $E_{KS}$ in Eq. 7 along the simulation of
Ac-Lys displayed in video V2. Horizontal bars are the thermal averages for the same
degrees of freedom in the hypothesis of equipartition. The inset is the same function
for the system simulated in the vacuum

However, the main result of this short simulation is that, since both the polar groups
(the ammonium and carboxylate) interact with water almost like the water
molecules they replace, the propensity for exchanging one of the protons between
the two groups, observed in the vacuum, is removed. Even when the H–O distance
approaches values in the range of 1.5 Å (the last 10 s of the video), the H–O bond is
not formed.
In Fig. 5, the behaviour of the potential energy along the entire simulation
(including the heating of the system from zero to 300 K) is displayed. The initial
decrease of energy is an effective energy minimization due to the low temperature
imposed on the system built as the sum of the minimal-energy solute and a sample of
liquid water made of rigid “empirical” molecules. The atoms adapt their positions
to the chemical bonds now available in the quantum model, slightly rearranging the
interatomic distances according to the ground-state electron density. The comparison
of this part of the time evolution with the inset, which displays the behaviour of the
energy for the same model in the vacuum at low temperature, gives an estimate of the
energy contribution due to solvent rearrangement compared to the contribution of a
competitive solute change. In the solvent the energy decrease is about 3000 kJ/mol
with no significant changes in the solute structure. In the vacuum, the drop of energy
is 200 kJ/mol with a significant structural and electronic reshuffling of the solute:
it becomes more compact and the proton migrates between two groups. Even if
the number of atoms in the solvent (315) is large compared to that in the solute (29),
the number of water molecules is small enough to be compared with a single shell
of water molecules around an aminoacid. The energy involved in the adjustment of
a mechanical model of a few water molecules interacting with a single aminoacid
to its quantum nature is about ten times that involved in a deep change of electronic
structure and consequent molecular structure in the isolated solute.
Continuing with the observation of Fig. 5, once a decent minimum of the energy
is achieved (at T = 100 K), the thermal fluctuations start to increase the energy,
transferring energy among the different degrees of freedom in the super–cell. These
include the water molecules, the major part of this thermal bath. As a significant
improvement compared to mean–field models, water molecules explicitly react to
the movement of bonds in the solute. The horizontal lines represent the average potential
energy increase $U = \frac{RT}{2} N_{deg}$ from the energy minimum (the zero reference in the
figure), due to the population of $N_{deg} = 3N_{at} - 6$ independent degrees of freedom in
the super–cell ($N_{at}$ being the number of atoms in the sample). The increase of energy due
to thermal fluctuations approximately reaches the allowed average level, and the agreement
improves when the temperature is high enough to facilitate equipartition.
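The equipartition levels can be estimated directly from the composition of the super–cell given earlier (29 solute atoms and 105 water molecules). A minimal sketch, with the gas constant in kJ mol⁻¹ K⁻¹ and a hypothetical function name:

```python
R_KJ = 8.314462618e-3  # gas constant, kJ mol^-1 K^-1


def equipartition_energy(n_atoms: int, temperature: float) -> float:
    """Average potential energy increase U = (R T / 2) * (3 N_at - 6), in kJ/mol."""
    n_deg = 3 * n_atoms - 6
    return 0.5 * R_KJ * temperature * n_deg


n_atoms = 29 + 3 * 105  # solute atoms plus the atoms of 105 water molecules
for temperature in (50, 100, 200, 300):
    print(f"T = {temperature:3d} K -> U = "
          f"{equipartition_energy(n_atoms, temperature):7.1f} kJ/mol")
```

For this sample the expected level at 300 K is of the order of 1300 kJ/mol, which sets the scale against which the energy changes discussed in this section can be compared.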
Again, it is interesting to notice that the energy decrease produced by the addition
of the external potential (the US region) is huge for a quantum chemist,
about 200 kJ/mol at least. Notice that once the external potential is added, the total energy
is not conserved. Since in the simulation an external thermal bath is coupled to the
system to keep the temperature of the super–cell constant [67], the work done by
the external potential is incorporated into the potential energy of the super–cell. The
potential energy change of about −200 kJ/mol is the balance between even larger
quantities that cannot easily be separated from the sum. One is the energy decrease
due to the formation of an intramolecular hydrogen bond between the ammonium
and the carboxylate group. The other is the energy increase due to the slight reorientation
of O–H bonds within the surrounding water molecules. The size of the sum of
the two quantities shows that, even if the sample is very small, the involvement of
many degrees of freedom in the structural changes that accompany the external bias
turns the change of a single collective variable into a large mechanical work, where the word
“mechanical” is used here just to indicate that there is no breaking or formation of
covalent bonds.
Indeed, the energy decrease observed in the vacuum with the proton transfer is
about 200 kJ/mol. This quantity for the 29 atoms of the solute is huge and there was
no way to disperse the same energy among the limited set of degrees of freedom of
the isolated molecule. The water solvent is therefore an excellent buffer for energy,
also when alternative formation of localized bonds is in theory possible.
The experiment reported above is not sufficient to access free energy changes,
because of the limited statistics acquired. Even perturbation methods cannot be
safely applied because the external US potential is too strong to be managed with
the perturbation equations typical of umbrella sampling and reweighting techniques.
Nevertheless, most of the methods recently developed for computing free energy
changes are based on the same approach briefly described above: an external potential
is added to the system, allowing the biasing of configurations over a range of
values for a convenient collective variable. In the case above, the collective variable
was the distance between a possible donor of protons (the ammonium group) and a
possible acceptor (the carboxylate group). But a wide range of collective variables
based on complicated combinations of microscopic variables (atomic coordinates)
can be exploited. The estimate of the free energy at room temperature as a function
of the collective variables makes it possible to identify stable states, intermediates and
transition states in rugged energy landscapes. Within a quantum mechanical description,
this becomes a tool for understanding a hypothetical reaction mechanism, once this
has been translated into a set of collective variables. Although many numerical methods
have been developed for this purpose [68–70], only a few tentative applications are
reported in the context of first-principles calculations, because of the many technical
problems. Among these, we mention the effects of transient excited states (forces not
well defined) and changes of spin configuration (which is encoded in the $f_i$ populations
in Eq. 5, parameters that must be kept fixed along the simulation). In many cases, the
marks of this kind of problem are strong oscillations and drifts in the fictitious electron
kinetic energy, a quantity that must be kept small during the simulation in order to
guarantee a ground state definition and a proper separation between electronic and
atomic degrees of freedom.
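When the bias is weak enough and the statistics are sufficient (which, as noted above, is not the case for the short run discussed here), the free energy along the collective variable can be recovered from a single biased window by removing the bias from the logarithm of the sampled histogram, F(d) = −k_B T ln P_biased(d) − U_US(d) + const. The sketch below illustrates only this standard relation; the function name, the histogram layout and the synthetic data are assumptions made for illustration.

```python
import math

KB = 0.0019872041  # Boltzmann constant, kcal mol^-1 K^-1


def unbiased_free_energy(centres, counts, k, d0, temperature):
    """Free energy profile (kcal/mol) from one harmonically biased histogram.

    F(d) = -kT ln P_biased(d) - (k/2)(d - d0)^2, shifted so that the minimum is 0.
    Bins with no counts are returned as None.
    """
    kt = KB * temperature
    total = sum(counts)
    raw = []
    for d, c in zip(centres, counts):
        if c > 0:
            raw.append(-kt * math.log(c / total) - 0.5 * k * (d - d0) ** 2)
        else:
            raw.append(None)
    f_min = min(v for v in raw if v is not None)
    return [None if v is None else v - f_min for v in raw]


# Synthetic check: if the underlying free energy is flat, the biased histogram
# is the Boltzmann factor of the bias alone and the recovered profile is flat.
k, d0, temperature = 200.0, 3.5, 300.0
centres = [3.3, 3.4, 3.5, 3.6, 3.7]
counts = [1.0e6 * math.exp(-0.5 * k * (d - d0) ** 2 / (KB * temperature))
          for d in centres]
print(unbiased_free_energy(centres, counts, k, d0, temperature))
```

Combining several windows with different d0 requires an additional matching step (for example a histogram reweighting scheme), which is not shown here.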
In most cases concerning chemical reactions, the presence of an explicit buffer
of solvent is mandatory. One reason is scientific: the solvent, especially
water, is the means for coupling together portions of molecules far apart in space. It can
also be observed in video V2 that at a certain time one water molecule is bridging
the two charged groups. This coupling can be enhanced in the model via a collective
variable containing the two groups, while the explicit solvent, with its reaction to the
change of collective variable, introduces the necessary means for the coupling to be
effective. Another reason is technical: the solvent provides the explicit electrons
for transient electron pairs that allow the smooth rearrangement of electrons in a
ground state for the assembly (solute+solvent). This explicit electron bath, in addition
to the atomic bath compensating for strong oscillations of kinetic energy (the thermal
bath), makes it feasible to exploit complicated reaction mechanisms that
in a mean–field solvent appear forbidden.

14 Metal Ions and Protons Compete

The description of a solute merged into a protic liquid like water provided by DFT,
together with the dynamic algorithms sampling the statistics at room conditions emerging
from this model, is opening a new perspective in the field of biological molecules.
The picture of the water mobility sketched by the exercise of Ac-Lys above holds
also when other kinds of positive holes dynamically attract electrons provided by
electronegative atoms, both on the protein and in the water solvent (the oxygen
atoms). This occurs when metal ions are present in the water solution in contact with
the protein. Among these ions, some are more or less spherical charges (like the
usual assumption in empirical models for Na⁺ or Ca²⁺), others provide geometrical
constraints to the electron donation (the coordination chemistry of transition metal
ions like Zn, Cu, Fe, all abundant in biological environments).
Such directional interaction occurs in the N-terminal disordered region of the prion
protein, where a segment containing the aminoacid sequence HGGG is repeated several
times. The amide protons of two G residues are given to water and
replaced with a single Cu²⁺ ion. A neutral Cu[HGGG] complex was first observed
by NMR in water solution at pH 7.4 (phosphate buffer, almost neutral) [71] and it
was then isolated as a crystal from the same solution [72]. The same kind of metal
ion-induced peptide neutralization occurs in many other systems of biological interest
[73], thus showing that the reshuffling of positive holes is a common process, extremely
relevant in changing the behaviour of the protein backbone when slight changes in the
protein environment occur.
So we come back to our initial example (see Sect. 2), the single Cu²⁺ ion interacting
with the peptide representing the HGGG repeat in the N-terminus of the human prion
protein. Now we have an idea of how the glue (the valence electrons), filling the dark
space in the video images, is represented in the computer.
In order to monitor the chain of microscopic events behind the transfer of a proton
from the peptide to water, models of the HGGG peptide in contact with Cu ions have
been investigated in detail within the frame of the computational approach described
above for Ac-Lys [74, 75]. The idea was to deposit the Cu ion within the peptide
already templated as in the Cu[HGGG] crystal. The mechanism by which the amide
H atoms are extracted by the water solvent was then observed in all the microscopic
details.
In video V1 this process is displayed. The video lasts about 60 s and covers the
simulation at T = 50 K of the complex with 25 water molecules in an orthorhombic
super–cell for the time of 1.9 ps. At the beginning (a minimal energy structure of
the DFT model) one water molecule is located along the axis of the square-planar
coordination of Cu²⁺. The copper ion is bound by one imidazole Nδ (His), two
backbone amide N atoms (Gly 2, Gly 3) and the amide O atom (Gly 3) (see left

Fig. 6 The initial (left) and final (right) configurations for the Cu²⁺[HGGG] complex in a
super–cell with 25 water molecules. The simulation was performed at T = 50 K for 1.9 ps. The elongated
bonds at the top-right of the right panel indicate the location of the proton initially attached to N(Gly
3). Cu is displayed as an orange sphere
panel in Fig. 6). Since the amide groups are protonated and the amide N lone pair is
delocalized over the carbonyl group of the peptide bond, its propensity for bonding
Cu is low. Nevertheless, the initial configuration is an energy minimum and the N–Cu
bonds are formed, as shown by the first few seconds of the video (and by the
time evolution of other electron parameters, data not shown here). After 10 s, the
axial water molecule moves from Cu towards the H atoms of Gly 3 (in anti to the His
sidechain), forming a hydrogen bond. At the same time, the N(Gly 3)-Cu distance
becomes shorter (the bond is drawn in the images when the distance between Cu
and a ligand is smaller than 2 Å). At about 30 s, the same water molecule moves
away from the complex carrying the proton of the amide group. Once the H3 O+ ion
moves into the layer of water molecules, the proton is passed to another molecule,
thus representing a snapshot of the Grotthuss mechanism for dissipating the excess
proton introduced in the water layer by the Cu ligand.
The proton of Gly 2 points away from the plane of the Cu ligands, interacting,
especially at the end of the video, with several water molecules. Once the ligand is
wrapped around Cu because of the two aligned Cu–N bonds, there is no chance to
avoid the further proton donation to water.

15 Water and Amino Acids Compete for the Same Metal Ion

In many cases, copper ions interact with proteins via His sidechains. Like the prion
protein, the amyloid-β (Aβ) peptide, which is the major component of the amyloid fibrils
observed in Alzheimer’s disease, is characterized by a relatively high content of
His residues in the N-terminal region. Moreover, two of these His residues are adjacent
in the sequence, thus increasing their potential role in Cu binding. In experiments,
all three His residues (6, 13 and 14) are affected by Cu addition and binding [76], but
the possibility of a dynamic exchange of ligand atoms around Cu strongly affects the
conformational sampling of the peptide: the better defined the Cu binding, the
better defined the ligand conformation, because of the many constraints due to the
Cu coordination. The Aβ peptide is intrinsically disordered [77] and the interaction
with one or more metal ions changes the population of conformers, thus modifying
the propensity for mutual interactions between peptides [78]. Moreover, metal ions
can form bridges between two or more peptides. This kind of effect is common to all
disordered ligands and a large number of disordered protein regions are to date known
to interact with metal ions. The type and fluxionality of metal binding determine the
assembling of peptides and ions together, with the consequent extrusion of water
from the peptide shell.
Several models for the Cu binding to the region 1–16 of Aβ were investigated [79].
Since Cu+ is better investigated in monomeric complexes with Aβ, that complex
allows an easier characterization.
The models help to understand the conditions under which a single Cu can bind
three His sidechains at the same time. In oxidation state I, no such
condition exists: starting from a high His crowding (Fig. 7, left panel), the structure with the
742 G. La Penna and O. Andreussi

Fig. 7 Initial (left) and final (right) states for one of the models of the Cu+-Aβ(1–16) complex.
The colour scheme is as in Fig. 6. Atoms within 2 Å of Cu (in orange) are emphasized in green.
Orange bonds emphasize the His residues, while green bonds emphasize Asp 1 and Asp
7, activating the His sidechains

digonal coordination of Cu by the two sidechains of His 13 and His 14 is formed. This is indeed
the structure found in the experiments. This coordination tends to transform Cu
into a hydrophobic center and the peptide forms a cleft between two domains (1–7
and 10–16), expelling water molecules away from Cu. Remarkably, this situation is
completely changed when Cu is oxidized: Cu binds water molecules more easily,
together with two His sidechains.
In video V3 the whole process for the extraction of one His sidechain away from
the coordination sphere of Cu is displayed.

Video time since start (s)   Simulation stage   Simulation time (ps)
0–5                          T = 50 K           0.18
5–11                         T = 100 K          0.22
11–17                        T = 200 K          0.22
17–26                        T = 300 K          0.32

The 3-His coordination (first 20 s) is a low energy state for the solute. To show that
the mechanical stress of the peptide is able to break a Cu–N bond it is necessary to
reach the temperature of 300 K. Moreover, in smaller models, containing separated
Ac-His-His-NHmet and Ac-His-NHmet fragments replacing the Aβ(1–16) chain, the
mechanical stress is larger in the His-His segment and the coordination of Cu+ by His
6 and His 13 is more stable than that of the His 13-His 14 segment. Therefore, with
the entire 1–16 fragment the situation is completely different from that in the truncated
models. Finally, no water molecules approach Cu and the radial distribution of Ow–Cu
pairs is very similar to that observed for square-planar complexes of Cu2+, where
the interaction with axial water molecules is weak. The chance to have a ligand like
water in the plane from which His is extracted (the right-hand side of the images in
the video) is low, but this can be observed only with a model including explicit water
molecules and the possibility to form new chemical bonds.

16 Water Molecules Are Often Required

A close inspection of video V3 shows that when a stable Cu+ digonal complex is
formed, water molecules are not allowed to enter into the Cu coordination sphere.
This is not only an impression: the reduced state of Cu in this His-Cu-His coordination
is more hydrophobic than the oxidized state of Cu. The reactivity of Cu in this type of
coordination can be further probed by using the above-described dynamic methods
combined with external forces, as sketched in the example of the Ac-Lys molecule.
Again, we first built a configuration for Cu in the oxidized state (Cu2+ ) bound to the
amyloid peptide. The bound state has been chosen as characterized by many potential
ligand atoms at close distance from Cu, where these potential ligand atoms include
the His sidechains of the peptide (that is here represented by the short sequences
DAGGGHD and Ac-HH-NCH3). These two short peptides are segments representing
the 16-residue sequence DAEFRHDSGYEVHHQK at the N-terminus of the Aβ peptide
mentioned above: it is a truncation of the 1–16 Cu-binding sequence, but not too
drastic. It is known that Cu ions are bound to these residues and that chemical
properties of Cu depend on its coordination to the peptide. The important point that
we want to emphasize here is that dynamic methods like those described in this
chapter allow one to include, among the potential ligand atoms, the water solvent and,
possibly, other ions and molecules dissolved in it. In the following example the Cu
ion is partially released by the peptide, with the release induced by an external force
acting on the Cu coordination number, decreasing the coordination number from 4 to
2. This bias is useful to move the ion from one coordination environment to a different
one, where the biological ligand may eventually be reorganized during the process. The
valence left free to Cu is dynamically captured by the chemical species that are able
to fill with their electrons the positive hole provided by Cu, but the external force
prevents a stable transfer of electrons into the hole. In our modeling study, at the end
of this release, the oxidation state of Cu is changed, adding one electron. Thus the
Cu-Aβ complex is reduced, in a configuration that is suitable to reduction because
close to a low-valence state of Cu. In this new reduced state, the ion is readsorbed
by the peptide, again by acting externally on the coordination number, this time
increased from 2 to 4. Then, once a high-valence state of reduced Cu is achieved, the
ion is again oxidized, and the coordination number relaxed to its more natural state,
i.e. 4–5.

Fig. 8 Simplified behaviour of the free energy as a function of a reaction coordinate when a
molecular system changes the number of electrons (oxidoreductive process). Once a hypothesis
for the reaction coordinate is given, molecular configurations representing the points indicated
with numbers can be built as models and simulated numerically

A schematic picture of the path that is performed by externally acting on the coor-
dination number is displayed in Fig. 8. Here a possible behaviour of the free energy
of the Cu-Aβ complex is displayed as a function of a reaction coordinate. The reac-
tion coordinate is not a thermodynamic measurable variable, because we measure,
when this is possible, an average of the coordinate. Therefore, the displayed free
energy is not a measurable work. Nevertheless, the quantities help in understanding
the chemistry of the atomic assembly as a function of its structure, the latter manipu-
lated via a suitable handle. Since many properties of an ion depend on the amount of
ligand atoms in its surrounding, we choose the coordination number as the reaction
coordinate to handle. As recalled above, it is possible to compute for a model the
free energy as a function of a chosen reaction coordinate [70], but in practice this is
not easy, because of the extremely large number of manipulated pathways required
to achieve a statistical convergence. However, this is the future and we describe in the
following a single manipulation, with some insights that can be obtained by analysing
a set of these manipulations.
In Fig. 8, red points display the configurations sampled in the first oxidized state
(configurations 1–3 in red); blue points display the configurations sampled in the
following reduced state (4–6); green points are configurations obtained after the
second oxidation (7–8). The parabolic shape of the free energy for, respectively,
oxidized (Cu(II)) and reduced (Cu(I)) states is just the ideal representation of the
simplest approximation around the points of stability when the respective numbers
of electrons are deposited on the molecule.
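
The parabolic picture of Fig. 8 can be made concrete with a minimal sketch. It assumes, purely for illustration, that the oxidized and reduced states share the same curvature k, have their minima at reaction-coordinate values q_ox and q_red, and are separated by a vertical offset dg0 between the minima. None of these names or values comes from the chapter; they are hypothetical parameters of the simplest harmonic approximation.

```python
def parabola(q, q_min, k, offset=0.0):
    """Harmonic free energy around a point of stability q_min."""
    return 0.5 * k * (q - q_min) ** 2 + offset


def crossing_point(q_ox, q_red, k, dg0):
    """Reaction coordinate where the two parabolas have the same free energy.

    G_ox(q)  = k/2 (q - q_ox)^2
    G_red(q) = k/2 (q - q_red)^2 + dg0
    Both curves are assumed to share the curvature k (an illustrative choice).
    """
    if q_ox == q_red:
        raise ValueError("the two minima coincide: no single crossing point")
    return 0.5 * (q_ox + q_red) + dg0 / (k * (q_red - q_ox))


# Hypothetical numbers: minima at coordination numbers 5 (oxidized) and 2 (reduced).
q_cross = crossing_point(q_ox=5.0, q_red=2.0, k=1.0, dg0=0.3)
```

In this picture, a configuration close to the crossing point is where changing the number of electrons costs the least, which is the role played by the numbered points at which the oxidation state is switched.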
The video V4 displays one of the pathways following the external change of the
coordination number and oxidation state, i.e. one pathway from point 1 to point 8 in
the schematic frame displayed in Fig. 8. The actual behaviour of the complex and its
solution environment depends on the initially chosen Cu-Aβ structure and on the rate
of the change in coordination number. Once a large number of pathways is collected,
the bias due to the initial conditions and to the way the pathway is performed becomes
irrelevant: any other bias would produce, on average, the same effect. In practice,

Fig. 9 Reduction potential of the Cu(II)-Aβ/Cu(I)-Aβ pair with respect to the [H+]qe/H2 pair.
The colour of points refers to Fig. 8. Axes: ΔE (V) versus CN (atoms)

the statistical weight of each sampled configuration and of the quantities averaged
over the sampled configurations slowly converges with the number of pathways and
time length of each pathway. However, a first view of the chemical properties can be
obtained assuming that each of the sampled configurations has the same weight.
In Fig. 9, the reduction potential of Cu(II)-Aβ is reported for several sampled
configurations (16 pathways). The reference of the reduction potential is the reduction
potential for the proton in pure water, computed with similar approximations. For
the details of these calculations, see Refs. [80, 81]. The colour of each point refers to
Fig. 8, thus a red point is for one of points 1–3 obtained in the trajectory, etc. There are
3 × 16 red points, 3 × 16 blue points and 2 × 16 green points. The black points are
transient points obtained during equilibration and/or driving the coordination number.
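
As a minimal sketch of how such a relative reduction potential and its equal-weight average could be assembled, the functions below use the generic relation E = -ΔG/(n e), with energies expressed in eV so that the result is in volts. This is not the procedure of Refs. [80, 81]; the use of plain energy differences as a stand-in for free energies, and all function names, are assumptions made only for illustration.

```python
def reduction_potential(e_reduced_ev, e_oxidized_ev, n_electrons=1):
    """Reduction potential in V from energies in eV: E = -(E_red - E_ox) / n."""
    return -(e_reduced_ev - e_oxidized_ev) / n_electrons


def relative_potential(e_red_ev, e_ox_ev, ref_red_ev, ref_ox_ev):
    """Potential of the couple of interest minus that of the reference couple."""
    return (reduction_potential(e_red_ev, e_ox_ev)
            - reduction_potential(ref_red_ev, ref_ox_ev))


def equal_weight_average_by_cn(samples):
    """Average the potential per coordination number, each configuration weighing 1.

    samples: iterable of (coordination_number, delta_e_volts) pairs.
    """
    sums, counts = {}, {}
    for cn, delta_e in samples:
        sums[cn] = sums.get(cn, 0.0) + delta_e
        counts[cn] = counts.get(cn, 0) + 1
    return {cn: sums[cn] / counts[cn] for cn in sorted(sums)}
```

The equal-weight average corresponds to the first view mentioned above, where every sampled configuration is given the same statistical weight; a converged estimate would require proper weights.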
The video V4 collects, within 1 min and 20 s, 2010 configurations of the fragment
described above (it is encoded with 25 frames per second). The total simulated time is
2.41 ps. CN0 is the equilibrium coordination number of Cu imposed by the external
bias, which is a harmonic force function of the coordination number CN. In the table
below a summary of the frames is reported.
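
The figures quoted for video V4 can be cross-checked with a few lines of arithmetic. The stage durations are those listed in the summary table for this video; the only assumption is that the time per frame is reported as an average over the whole video, since the stages are not encoded at exactly the same rate.

```python
# Stage durations (ps) in the order listed in the summary table of video V4.
stage_ps = [0.15, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10,
            0.15, 0.15, 0.15, 0.10, 0.10, 0.10,
            0.15, 0.15, 0.51]

n_frames = 2010   # configurations collected in the video (from the text)
fps = 25          # encoding rate in frames per second (from the text)

total_ps = sum(stage_ps)                      # total simulated time
video_seconds = n_frames / fps                # playback length
fs_per_frame = 1000.0 * total_ps / n_frames   # average simulated time per frame

print(f"total simulated time: {total_ps:.2f} ps")
print(f"video length: {video_seconds:.1f} s")
print(f"average simulated time per frame: {fs_per_frame:.2f} fs")
```

The sum reproduces the 2.41 ps of total simulated time, and 2010 frames at 25 frames per second give 80.4 s, i.e. the stated 1 min and 20 s.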
On the left-hand side there is most of the peptide, while on the right-hand side there
is most of the water environment. The Cu ion is displayed as an orange sphere, while
the atoms that are closer than 2.5 Å (this is the cut-off distance used to evaluate
the coordination number) are displayed as green spheres. During the first 10 s, the
equilibration of the initial high-valence state of the ion occurs. At about 5 s, the
temperature of 300 K is achieved and after approximately another 5 s we see that one of
the three His sidechains (His 14) initially bound to Cu is released. This occurs because
there are too many ligand atoms around Cu and His 14 is the most stressed part of
the peptide forced to stay around Cu at the beginning. The stress is concentrated
in the repulsion between the two sidechains of, respectively, His 13 and His 14.
The reduction of coordination number by the external force is then attempted. This
process lasts until approximately 27 s. The configuration achieved at this point does
not have the externally imposed coordination number of 2, because the oxidized state
of the complex disfavours it: this is why the approximate free energy for point
3 is larger than for point 2. However, the coordination number is 3 and Cu is bound

Video time since start (s)   Simulation stage       Simulation time (ps)   Point in Fig. 8
0–5                          T = 50 K               0.15
5–10                         T = 150 K              0.15
10–15                        T = 300 K, CN0 = 6     0.15                   1
15–18                        T = 300 K, CN0 = 5     0.10
18–21                        T = 300 K, CN0 = 4     0.10                   2
21–24                        T = 300 K, CN0 = 3     0.10
24–27                        T = 300 K, CN0 = 2     0.10                   3
30–35                        T = 50 K               0.15                   4
35–40                        T = 150 K              0.15
40–45                        T = 300 K, CN0 = 2     0.15                   5
45–48                        T = 300 K, CN0 = 3     0.10
48–51                        T = 300 K, CN0 = 4     0.10
51–54                        T = 300 K, CN0 = 5     0.10                   6
54–59                        T = 50 K               0.15                   7
59–64                        T = 150 K              0.15
64–80                        T = 300 K, CN0 = 5     0.51                   8

to two His sidechains and a water molecule from an approximately perpendicular
direction. This represents a configuration suitable for hosting the spontaneous (i.e.
with negative energy change) transformation of Cu2+ into the reduced form Cu+
(see end of video V3). According to the approximate scheme, this is exactly the
representation of point 3, suitable for reduction to point 4.
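
The external bias acting on the coordination number can be illustrated with a minimal sketch. The hard count uses the 2.5 Å cut-off quoted in the text. The smooth version relies on a rational switching function, which is a common choice for obtaining a differentiable coordination number but is an assumption here, since the chapter does not state the functional form that was used. The force constant k and the exponents n and m are hypothetical parameters, and periodic images of the super–cell are ignored for brevity.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points (no periodic images)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_cn(center, ligands, cutoff=2.5):
    """Count the ligand atoms closer than `cutoff` (in Å) to the metal centre."""
    return sum(1 for p in ligands if distance(center, p) < cutoff)


def smooth_cn(center, ligands, r0=2.5, n=6, m=12):
    """Differentiable coordination number from a rational switching function.

    Each ligand contributes s(r) = (1 - (r/r0)^n) / (1 - (r/r0)^m),
    whose limit for r -> r0 is n/m.
    """
    total = 0.0
    for p in ligands:
        x = distance(center, p) / r0
        if abs(x - 1.0) < 1e-12:
            total += n / m
        else:
            total += (1.0 - x ** n) / (1.0 - x ** m)
    return total


def harmonic_bias(cn, cn0, k):
    """Bias energy 0.5 * k * (CN - CN0)^2 restraining CN towards CN0."""
    return 0.5 * k * (cn - cn0) ** 2


# Hypothetical geometry: a metal centre with three candidate ligand atoms.
cu = (0.0, 0.0, 0.0)
atoms = [(2.0, 0.0, 0.0), (0.0, 2.4, 0.0), (0.0, 0.0, 3.0)]
bias = harmonic_bias(smooth_cn(cu, atoms), cn0=2.0, k=1.0)
```

The smooth form is what allows a force to be derived from the bias: the hard count changes in integer jumps and has no useful gradient with respect to the atomic positions.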
From time 27 to 54 s, the video displays the behaviour of the system once it is
reduced (Cu is Cu+) and the coordination number is progressively forced from low
(2) to high values (5). Despite the attempt to force ligands to approach Cu, only
one carbonyl atom (from Gly 5) enters into the Cu coordination sphere, achieving
a coordination number 4 that is suitable for a newly oxidized state (point 6). This
process shows that the reduced state, in this pathway, is quite reluctant to adopt
high-valence states. Indeed, after oxidation (starting at approximately 54 s) and after the relaxation
in the new oxidized state, only water molecules enter into the coordination sphere
(end of the video, point 8). This final state represents the only possible high-valence
configuration available to the oxidized state in this pathway. The peptide ligand was
not able to reorganize efficiently around a new oxidized state and the configuration
ends in a state where the free valences of Cu are captured by nearby water
molecules.
The video shows that the thermal bath of water molecules provides molecules
that can suitably replace the portion of ligand atoms released by the peptide. The
exchange of ligand atoms is therefore explicitly modelled: what would happen if
water molecules were not nearby?
Finally, let us go back to the distribution of reduction potential for the couple
Cu(II)-Aβ/Cu(I)-Aβ displayed in Fig. 9. The distribution shows that when the
coordination number is low, Cu works as an oxidant, with the reduction potential
relatively high. This is due both to the low chances for the oxidized state to have the free
valences of Cu occupied by ligand atoms, including O of water molecules, and to
the high stability of the reduced state, with Cu(I) nicely pinched by two N atoms of
either His sidechains or the N-terminus (Asp 1). On the other hand, when Cu has a large
coordination number, Cu becomes a reductant. The latter condition is assisted by the
water molecules coming from the solvent: in those cases where this is not possible,
the weak binding of Cu to the peptide would imply an oxidant activity.
The active role of water molecules in giving electrons to molecules containing
cations with variable oxidation states is displayed by a catalyst that mimics the
oxygen evolving center in photosystem II [82]. This is a molecule containing several
cobalt-oxygen cores assembled together into a large anion. The cores, once oxidized,
stimulate different water molecules to come close together and to restore, to different
extents, the electrons lost by the metal ions in the large anion. In these conditions,
the oxygen atoms that are close in space are forced to eject the bound hydrogens
and to share their electrons in O-O bonds. The electron transfer, coupled to proton
release from metal-bound water molecules and to the structure of the large anion,
has been recently modelled with dynamic methods like those described above [83].

17 Not Yet Excited?

The role of electron excited states in chemical reactivity is fundamental. Among
the properties that are sensitive to excited states, the most evident for chemists and
biologists is the light absorbance of biological compounds characterized by delocalized
electrons. These compounds, including the chromophores in photosystem II
(some containing Mg ions, others containing conjugated double bonds and aromatic
fragments), absorb visible and UV light. The absorption spectra depend on both
the ground and excited states. The interaction between the absorber and the envi-
ronment changes the absorption spectra, mainly shifting and scaling the maxima of
absorption. In biology, the environment includes water molecules, sometimes form-
ing extended hydrogen bonds with the solute. The lifetime of excited states induced
by light absorption can activate a new chemistry for the absorber and for its environ-
ment. This is the field of photochemistry [84]: the charge separation induced by the
light absorption in photosystem II is a chemical reaction that, at first, occurs
in the excited state; once the environment has been reorganized, a new ground state
is achieved and the atoms evolve according to well defined forces. The macromolecu-
lar assembly works to restore the initial ground state, driving the energy of distortions
towards other chemicals (like the quinone) present in the environment. Ideally, the
assembly makes the process cyclic, otherwise the incoming light is not only wasted,
but becomes even dangerous, producing undesired aggressive chemical species like
radicals. Therefore, the photosystem works as a pump to divert a potential source of
damage into a mobile electron-rich chemical, i.e. a fuel.
The basic ingredients to model excited states are beyond the scope of this chapter
and are discussed in other chapters of this book (see for instance the chapter about iron
porphyrins). However, here we mention only the importance, in excited states, of the
requirement to extend the hypotheses of effective one-electron states (the Kohn-Sham
approximation) and of the two-variables description at the basis of density-functional
theory. Going beyond these approximations requires a full theory for many-body
interactions within a second-quantization theory. Some of the derived formalisms
that are more promising for large molecular systems are summarized in Ref. [85].

18 Perspectives

This chapter reports recent results concerning the first-principles simulation of
peptides in water, their interactions with metal ions, and, above all, some indications
of best practices for the construction of reasonable initial configurations starting from
empirical models of larger samples. Simple systems have been used as examples.
The mean-field methods, like those cast into the PCM and FG formalisms, introduced
in the high-level quantum mechanics calculations, are probably the best tools
for extracting, within a reasonable computer time, quantitative information: to simulate
spectra and to predict kinetic and thermodynamic parameters for a few minimal
energy configurations, when these are available. The description of explicit solvent
effects in super–cell models of small samples of water solutions is, on the other hand,
still an interpretative tool. It is more suited to indicate a pathway for a reaction or
to exclude other pathways because of some steric hindrance or fast charge recombi-
nation. Moreover, qualitative effects of temperature and concentration changes are
more directly monitored.
The perspective, at a medium scale of 1000 atoms, is not that of increasing the size
of the samples treated at a QM level, nor that of increasing the length of the simulations
for such samples. The perspective is to increase the statistical sampling of initial
conditions for QM models, an issue that can be attacked by combining the different
methods described in this book.
Super–cell methods will be available as a routine method in the future. This
perspective is somehow justified by history in the field of computational sciences.
In 1985 the description of liquid water as a sample of discrete molecular entities
described at an empirical level was limited to high-performance computers and to
a few hundred molecules. Today, including 10,000 such water molecules is
possible on a mobile computer and such simulations are solid references for further
approximations.

Acknowledgements Several European high-performance computing infrastructures are gratefully
acknowledged for the resources provided over the years, particularly NIC (DE) and CINECA (IT).
All the super–cell calculations reported here were possible thanks to the Quantum-Espresso com-
munity [86, 87] (see www.quantum-espresso.org for full documentation and many tutorials). All the
drawings and movies were made with the VMD program [8] (see www.ks.uiuc.edu/Research/vmd
for documentation and tutorials).

References

1. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N.,
Bourne, P.E.: The protein data bank. Nucleic Acids Res. 28(1), 235–242 (2000). https://doi.
org/10.1093/nar/28.1.235, https://www.rcsb.org
2. Lawson, C.L., Patwardhan, A., Baker, M.L., Hryc, C., Garcia, E.S., Hudson, B.P., Lagerstedt,
I., Ludtke, S.J., Pintilie, G., Sala, R., Westbrook, J.D., Berman, H.M., Kleywegt, G.J., Chiu,
W.: EM databank unified data resource for 3d EM. Nucleic Acids Res. 44(D1), D396–D403
(2016). https://doi.org/10.1093/nar/gkv1126, https://www.emdatabank.org
3. Weiner, P.K., Kollman, P.A.: Amber: Assisted model building with energy refinement. A general
program for modeling molecules and their interactions. J. Comp. Chem. 2(3), 287–303 (1981).
https://doi.org/10.1002/jcc.540020311
4. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Sim-
merling, C., Wang, B., Woods, R.J.: The AMBER biomolecular simulation programs. J. Com-
put. Chem. 26(16), 1668–1688 (2005). https://doi.org/10.1002/jcc.20290
5. Scheraga, H.A.: My 65 years in protein chemistry. Quart. Rev. Biophys. 48(2), 117–177 (2015).
https://doi.org/10.1017/S0033583514000134
6. Schlick, T.: The 2013 nobel prize in chemistry celebrates computations in chemistry and
biology. SIAM News 46(10) (2013). https://www.biomath.nyu.edu/index/papdir/fullengths/
Nobel13.pdf
7. Guskov, A., Kern, J., Gabdulkhakov, A., Broser, M., Zouni, A., Saenger, W.: Cyanobacterial
photosystem II at 2.9 Å resolution and the role of quinones, lipids, channels and chloride. Nat.
Struct. Mol. Biol. 16, 334 (2009). https://doi.org/10.1038/NSMB.1559
8. Humphrey, W., Dalke, A., Schulten, K.: VMD visual molecular dynamics. J. Molec. Graph-
ics 14(1), 33–38 (1996). https://doi.org/10.1016/0263-7855(96)00018-5, https://www.ks.uiuc.
edu/Research/vmd
9. Bertini, I., Gray, H.B., Stiefel, E.I., Valentine, J.S. (eds.): Biological Inorganic Chemistry:
Structure and Reactivity. University Science Books, Sausalito, CA (2007)
10. Morante, S., Rossi, G.C.: A novel proof of the DFT formula for the interatomic force field of
molecular dynamics. Ann. Phys. 377(Supplement C), 71–76 (2017). https://doi.org/10.1016/
j.aop.2016.12.011
11. Bryant, R.G., Johnson, M.A., Rossky, P.J.: Water. Acc. Chem. Res. 45(1), 1–2 (2012). https://
doi.org/10.1021/ar2003286
12. Del Rosso, L., Celli, M., Ulivi, L.: New porous water ice metastable at atmospheric pressure
obtained by emptying a hydrogen-filled ice. Nat. Commun. 7, 13394 (2016). https://doi.org/
10.1038/ncomms13394
13. Bartels-Rausch, T., Bergeron, V., Cartwright, J.H.E., Escribano, R., Finney, J.L., Grothe, H.,
Gutiérrez, P.J., Haapala, J., Kuhs, W.F., Pettersson, J.B.C., Price, S.D., Sainz-Díaz, C.I., Stokes,
D.J., Strazzulla, G., Thomson, E.S., Trinks, H., Uras-Aytemiz, N.: Ice structures, patterns, and
processes: A view across the icefields. Rev. Mod. Phys. 84(2), 885–944 (2012). https://doi.org/
10.1103/RevModPhys.84.885
14. Allen, M.P., Tildesley, D.J.: Computer Simulation of Liquids. Clarendon Press, Oxford, UK
(1989)
15. Mazza, M.G., Stokely, K., Pagnotta, S.E., Bruni, F., Stanley, H.E., Franzese, G.: More than one
dynamic crossover in protein hydration water. Proc. Nat. Acad. Sci. U.S.A. 108(50), 19873–
19878 (2011). https://doi.org/10.1073/pnas.1104299108
16. Ball, P.: H2 O: A Biography. Weidenfeld & Nicolson, London (1999)
17. Ben-Naim, A.: Molecular Theory of Water and Aqueous Solutions—Part I: Understanding
Water. World Scientific, Singapore (2009). https://doi.org/10.1142/7136
18. Lamoureux, G., Roux, B.: Modeling induced polarization with classical drude oscillators:
Theory and molecular dynamics simulation algorithm. J. Chem. Phys. 119(6), 3025–3039
(2003). https://doi.org/10.1063/1.1589749
19. Jiang, W., Hardy, D., Phillips, J., MacKerell, A., Schulten, K., Roux, B.: High-performance
scalable molecular dynamics simulations of a polarizable force field based on classical drude
oscillators in NAMD. J. Phys. Chem. Lett. 2, 87–92 (2011). https://doi.org/10.1021/jz101461d
20. Ponder, J.W., Wu, C., Ren, P., Pande, V.S., Chodera, J.D., Schnieders, M.J., Haque, I., Mobley,
D.L., Lambrecht, D.S., Di Stasio, R.A., Head-Gordon, M., Clark, G.N.I., Johnson, M.E., Head-
Gordon, T.: Current status of the Amoeba polarizable force field. J. Phys. Chem. B 114(8),
2549–2564 (2010). https://doi.org/10.1021/jp910674d
21. Senftle, T.P., Hong, S., Islam, M.M., Kylasa, S.B., Zheng, Y., Shin, Y.K., Junkermeier,
C., Engel-Herbert, R., Janik, M.J., Aktulga, H.M., Verstraelen, T., Grama, A., van Duin,
A.C.T.: The ReaxFF reactive force-field: Development, applications and future directions. Npj
Comput. Mater. 2, 15011 (2016). https://doi.org/10.1038/npjcompumats.2015.11
22. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., Klein, M.L.: Comparison of
simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983).
https://doi.org/10.1063/1.445869
23. Parr, R.G., Yang, W.: Density Functional Theory of Atoms and Molecules. Oxford University
Press, New York (1989)
24. Landau, L., Lifchitz, E.: Physique Statistique. MIR, Moscow, URSS (1984)
25. Mennucci, B., Cammi, R. (eds.): Continuum Solvation Models in Chemical Physics: From
Theory to Applications. Wiley, Hoboken (2008). https://doi.org/10.1002/9780470515235
26. Tomasi, J., Mennucci, B., Cammi, R.: Quantum mechanical continuum solvation models.
Chem. Rev. 105, 2999–3093 (2005). https://doi.org/10.1021/cr9904009
27. Klamt, A., Mennucci, B., Tomasi, J., Barone, V., Curutchet, C., Orozco, M., Luque, F.J.: On the
performance of continuum solvation methods. a comment on “universal approaches to solvation
modeling”. Acc. Chem. Res. 42(4), 489–492 (2009). https://doi.org/10.1021/ar800187p
28. Cramer, C.J., Truhlar, D.G.: Reply to comment on "a universal approach to solvation modeling".
Acc. Chem. Res. 42(4), 493–497 (2009). https://doi.org/10.1021/ar900004j
29. Frisch, M.J., Trucks, G.W., Schlegel, H.B., Scuseria, G.E., Robb, M.A., Cheeseman, J.R.,
Scalmani, G., Barone, V., Mennucci, B., Petersson, G.A., Nakatsuji, H., Caricato, M., Li, X.,
Hratchian, H.P., Izmaylov, A.F., Bloino, J., Zheng, G., Sonnenberg, J.L., Hada, M., Ehara, M.,
Toyota, K., Fukuda, R., Hasegawa, J., Ishida, M., Nakajima, T., Honda, Y., Kitao, O., Nakai, H.,
Vreven, T., Montgomery Jr., J.A., Peralta, J.E., Ogliaro, F., Bearpark, M., Heyd, J.J., Brothers,
E., Kudin, K.N., Staroverov, V.N., Keith, T., Kobayashi, R., Normand, J., Raghavachari, K.,
Rendell, A., Burant, J.C., Iyengar, S.S., Tomasi, J., Cossi, M., Rega, N., Millam, J.M., Klene,
M., Knox, J.E., Cross, J.B., Bakken, V., Adamo, C., Jaramillo, J., Gomperts, R., Stratmann, R.E.,
Yazyev, O., Austin, A.J., Cammi, R., Pomelli, C., Ochterski, J.W., Martin, R.L., Morokuma,
K., Zakrzewski, V.G., Voth, G.A., Salvador, P., Dannenberg, J.J., Dapprich, S., Daniels, A.D.,
Farkas, O., Foresman, J.B., Ortiz, J.V., Cioslowski, J., Fox, D.J.: Gaussian 09, Revision C.01.
Gaussian Inc., Wallingford, CT, USA (2010)
30. Muller, N.: Search for a realistic view of hydrophobic effects. Acc. Chem. Res. 23(1), 23–28
(1990). https://doi.org/10.1021/ar00169a005
31. Ben-Naim, A.: Molecular Theory of Water and Aqueous Solutions Part II: The Role of Water
in Protein Folding, Self-assembly and Molecular Recognition. World Scientific, Singapore
(2011). https://doi.org/10.1142/8154
32. Senn, H.M., Thiel, W.: QM/MM methods for biomolecular systems. Angew. Chem. Int. Ed.
48, 1198–1229 (2009). https://doi.org/10.1002/anie.200802019
33. Barone, V., Improta, R., Rega, N.: Quantum mechanical computations and spectroscopy: From
small rigid molecules in the gas phase to large flexible molecules in solution. Acc. Chem. Res.
41(5), 605–616 (2008). https://doi.org/10.1021/ar7002144
34. Marx, D., Hutter, J.: Ab Initio Molecular Dynamics: Basic Theory and Advanced Methods.
Cambridge University Press, Cambridge (2009)
35. Pastore, G., Smargiassi, E., Buda, F.: Theory of ab initio molecular dynamics calculations.
Phys. Rev. A 44, 6334–6347 (1991). https://doi.org/10.1103/PhysRevA.44.6334
36. Perdew, J.P., Burke, K., Ernzerhof, M.: Generalized gradient approximation made simple. Phys.
Rev. Lett. 77, 3865–3868 (1996). https://doi.org/10.1103/PhysRevLett.77.3865
37. Becke, A.D.: Density-functional thermochemistry. iii. the role of exact exchange. J. Chem.
Phys. 98, 5648–5652 (1993). https://doi.org/10.1063/1.464913
38. Perdew, J.P., Ruzsinszky, A., Csonka, G.I., Vydrov, O.A., Scuseria, G.E., Constantin, L.A.,
Zhou, X., Burke, K.: Restoring the density-gradient expansion for exchange in solids and
surfaces. Phys. Rev. Lett. 100(13), 136406 (2008). https://doi.org/10.1103/PhysRevLett.100.
136406
39. Schwegler, E., Grossman, J., Gygi, F., Galli, G.: Towards an assessment of the accuracy of
density functional theory for first-principles simulations of water II. J. Chem. Phys. 121, 5400
(2004). https://doi.org/10.1063/1.1782074
40. Schwegler, E., Sharma, M., Gygi, F., Galli, G.: Melting of ice under pressure. Proc. Nat. Acad.
Sci. U.S.A. 105(39), 14779–14783 (2008). https://doi.org/10.1073/pnas.0808137105
41. Lazić, P., Atodiresei, N., Alaei, M., Caciuc, V., Blügel, S., Brako, R.: Junolo - Jülich nonlocal
code for parallel post-processing evaluation of VdW-DF correlation energy. Comput. Phys.
Commun. 181(2), 371–379 (2010). https://doi.org/10.1016/j.cpc.2009.09.016
42. Kulik, H.J., Cococcioni, M., Scherlis, D.A., Marzari, N.: Density functional theory in
transition-metal chemistry: A self-consistent Hubbard U approach. Phys. Rev. Lett. 97(10), 103001 (2006).
https://doi.org/10.1103/PhysRevLett.97.103001
43. Car, R., Parrinello, M.: Unified approach for molecular dynamics and density-functional theory.
Phys. Rev. Lett. 55, 2471–2474 (1985). https://doi.org/10.1103/PhysRevLett.55.2471
44. Wolf, D., Keblinski, P., Phillpot, S.R., Eggebrecht, J.: Exact method for the simulation of
coulombic systems spherically truncated, pairwise r-1 summation. J. Chem. Phys. 110, 8254–
8282 (1999). https://doi.org/10.1063/1.478738
45. Vanderbilt, D.: Soft self-consistent pseudopotentials in a generalized eigenvalue formalism.
Phys. Rev. B 41, 7892–7895 (1990). https://doi.org/10.1103/PhysRevB.41.7892
46. Giannozzi, P., De Angelis, F., Car, R.: First-principle molecular dynamics with ultrasoft
pseudopotentials: Parallel implementation and application to extended bioinorganic systems. J.
Chem. Phys. 120, 5903–5915 (2004). https://doi.org/10.1063/1.1652017
47. Fattebert, J.L., Gygi, F.: Density functional theory for efficient ab initio molecular dynamics
simulations in solution. J. Comput. Chem. 23(6), 662–666 (2002). https://doi.org/10.1002/jcc.
10069
48. Fattebert, J.L., Gygi, F.: First-principles molecular dynamics simulations in a continuum sol-
vent. Int. J. Quantum Chem. 93(2), 139–147 (2003). https://doi.org/10.1002/qua.10548
49. Petrosyan, S.A., Rigos, A.A., Arias, T.A.: Joint density-functional theory: Ab initio study of
Cr2O3 surface chemistry in solution. J. Phys. Chem. B 109(32), 15436–15444 (2005). https://
doi.org/10.1021/jp044822k
50. Scherlis, D.A., Fattebert, J.L., Gygi, F., Cococcioni, M., Marzari, N.: A unified electrostatic and
cavitation model for first-principles molecular dynamics in solution. J. Chem. Phys. 124(7),
74103 (2006). https://doi.org/10.1063/1.2168456
51. Dabo, I., Cancès, E., Li, Y., Marzari, N.: Towards first-principles electrochemistry. arXiv
preprint arXiv:0901.0096 (2008)
52. Sanchez, V.M., Sued, M., Scherlis, D.A.: First-principles molecular dynamics simulations at
solid-liquid interfaces with a continuum solvent. J. Chem. Phys. 131(17), 174108 (2009).
https://doi.org/10.1063/1.3254385
53. Dziedzic, J., Helal, H.H., Skylaris, C.K., Mostofi, A.A., Payne, M.C.: Minimal parameter
implicit solvent model for ab initio electronic-structure calculations. Europhys. Lett. 95(4),
43001 (2011). https://doi.org/10.1209/0295-5075/95/43001
54. Andreussi, O., Dabo, I., Marzari, N.: Revised self-consistent continuum solvation in electronic-
structure calculations. J. Chem. Phys. 136(6), 064102 (2012). https://doi.org/10.1063/1.
3676407
55. Cococcioni, M., Mauri, F., Ceder, G., Marzari, N.: Electronic-enthalpy functional for finite
systems under pressure. Phys. Rev. Lett. 94(14), 145501 (2005). https://doi.org/10.1103/
PhysRevLett.94.145501
56. Dupont, C., Andreussi, O., Marzari, N.: Self-consistent continuum solvation (sccs): The case of
charged systems. J. Chem. Phys. 139(21), 214110 (2013). https://doi.org/10.1063/1.4832475
57. Andreussi, O., Marzari, N.: Electrostatics of solvated systems in periodic boundary conditions.
Phys. Rev. B 90(24), 245101 (2014). https://doi.org/10.1103/PhysRevB.90.245101
58. Timrov, I., Andreussi, O., Biancardi, A., Marzari, N., Baroni, S.: Self-consistent continuum
solvation for optical absorption of complex molecular systems in solution. J. Chem. Phys.
142(3), 034111 (2015). https://doi.org/10.1063/1.4905604
59. Fisicaro, G., Genovese, L., Andreussi, O., Mandal, S., Nair, N., Marzari, N., Goedecker, S.:
Soft-sphere continuum solvation in electronic-structure calculations. J. Chem. Theory Comput.
13(8), 3829 (2017). https://doi.org/10.1021/acs.jctc.7b00375
60. Letchworth-Weaver, K., Arias, T.A.: Joint density functional theory of the electrode-electrolyte
interface: Application to fixed electrode potentials, interfacial capacitances, and potentials of
zero charge. Phys. Rev. B 86(7), 075140 (2012). https://doi.org/10.1103/PhysRevB.86.075140
61. Fortunelli, A., Goddard, W.A., Sha, Y., Yu, T.H., Sementa, L., Barcaro, G., Andreussi, O.:
Dramatic increase in the oxygen reduction reaction for platinum cathodes from tuning the
solvent dielectric constant. Angewandte Chem. Int. Ed. 53(26), 6669–6672 (2014). https://doi.
org/10.1002/anie.201403264
62. Hamada, I., Sugino, O., Bonnet, N., Otani, M.: Improved modeling of electrified interfaces
using the effective screening medium method. Phys. Rev. B 88(15), 155427 (2013). https://
doi.org/10.1103/PhysRevB.88.155427
63. Montemore, M.M., Andreussi, O., Medlin, J.W.: Hydrocarbon adsorption in an aqueous envi-
ronment: A computational study of alkyls on Cu(111). J. Chem. Phys. 145(7), 074702 (2016).
https://doi.org/10.1063/1.4961027
64. Sementa, L., Andreussi, O., Goddard III, W.A., Fortunelli, A.: Catalytic activity of Pt38 in
the oxygen reduction reaction from first-principles simulations. Catal. Sci. Technol. 6(18),
6901–6909 (2016). https://doi.org/10.1039/C6CY00750C
65. Onsager, L.: Electric moments of molecules in liquids. J. Am. Chem. Soc. 58(8), 1486–1493
(1936). https://doi.org/10.1021/ja01299a050
66. Knight, C., Voth, G.A.: The curious case of the hydrated proton. Acc. Chem. Res. 45(1),
101–109 (2012). https://doi.org/10.1021/ar200140h
67. Nosé, S.: A molecular dynamics method for simulations in the canonical ensemble. Molec.
Phys. 52, 255–268 (1984). https://doi.org/10.1080/00268978400101201
68. Frenkel, D., Smit, B.: Understanding Molecular Simulation. Academic Press, San Diego (1996)
69. Wales, D.J.: Energy Landscapes. Cambridge University Press, Cambridge, UK (2003)
70. Laio, A., Gervasio, F.L.: Metadynamics: A method to simulate rare events and reconstruct the
free energy in biophysics, chemistry and material science. Rep. Prog. Phys. 71(126), 601–622
(2008). https://doi.org/10.1088/0034-4885/71/12/126601
71. Łuczkowski, M., Kozłowski, H., Stawikowski, M., Rolka, K., Gaggelli, E., Valensin, D.,
Valensin, G.: Is the monomeric prion octapeptide repeat PHGGWGQq a specific ligand for
Cu2+ ions? J. Chem. Soc., Dalton Trans. 2002, 2269–2274 (2002). https://doi.org/10.1039/
B201040M
72. Burns, C.S., Aronoff-Spencer, E., Dunham, C.M., Lario, P., Avdievich, N.I., Antholine, W.E.,
Olmstead, M.M., Vrielink, A., Gerfen, G.J., Peisach, J., Scott, W.G., Millhauser, G.L.: Molec-
ular features of the copper binding sites in the octarepeat domain of the prion protein. Bio-
chemistry 41, 3991–4001 (2002)
73. Miura, T., Suzuki, K., Kohata, N., Takeuchi, H.: Metal binding modes of Alzheimer’s amyloid
β-peptide in insoluble aggregates and soluble complexes. Biochemistry 39(23), 7024–7031
(2000). https://doi.org/10.1021/bi0002479
74. Furlan, S., La Penna, G., Guerrieri, F., Morante, S., Rossi, G.: Ab initio simulations of Cu
binding sites on the N-terminal region of the prion protein. J. Biol. Inorg. Chem. 12, 571–583
(2007). https://doi.org/10.1007/s00775-007-0218-x
75. Furlan, S., La Penna, G.: Metal ions and protons compete for ligand atoms in disordered
peptides: Examples from computer simulations of copper binding to the prion tandem repeat.
Coord. Chem. Rev. 256, 2234–2244 (2012). https://doi.org/10.1016/j.ccr.2012.03.036
76. Hureau, C., Balland, V., Coppel, Y., Solari, P.L., Fonda, E., Faller, P.: Importance of dynamical
processes in the coordination chemistry and redox conversion of copper amyloid-β complexes.
J. Biol. Inorg. Chem. 14, 995–1000 (2009). https://doi.org/10.1007/s00775-009-0570-0
77. Furlan, S., La Penna, G., Perico, A.: Modeling the free energy of polypeptides in different
environments. Macromolecules 41, 2938–2948 (2008). https://doi.org/10.1021/ma7022155
78. Miller, Y., Ma, B., Nussinov, R.: Zinc ions promote alzheimer aβ aggregation via population
shift of polymorphic states. Proc. Nat. Acad. Sci. U.S.A. 107(21), 9490–9495 (2010). https://
doi.org/10.1073/pnas.0913114107
79. Furlan, S., Hureau, C., Faller, P., La Penna, G.: Modeling the Cu+ binding in the 1–16 region
of the amyloid-β peptide involved in alzheimer’s disease. J. Phys. Chem. B 114, 15119–15133
(2010). https://doi.org/10.1021/jp102928h
80. La Penna, G., Hureau, C., Andreussi, O., Faller, P.: Identifying, by first-principles simulations,
Cu[amyloid-β] species making Fenton-type reactions in Alzheimer’s disease. J. Phys. Chem.
B 117, 16455–16467 (2013). https://doi.org/10.1021/jp410046w
81. La Penna, G., Hureau, C., Faller, P.: A cu-amyloid β complex activating Fenton chemistry
in Alzheimer’s disease: Learning with multiple first-principles simulations. AIP Conf. Proc.
1618(1), 112–114 (2014). https://doi.org/10.1063/1.4897690
82. Kanan, M.W., Nocera, D.G.: In situ formation of an oxygen-evolving catalyst in neutral water
containing phosphate and Co2+ . Science 321(5892), 1072–1075 (2008). https://doi.org/10.
1126/science.1162018
83. Mattioli, G., Giannozzi, P., Amore Bonapasta, A., Guidoni, L.: Reaction pathways for oxygen
evolution promoted by cobalt catalyst. J. Am. Chem. Soc. 135(41), 15353–15363 (2013).
https://doi.org/10.1021/ja401797v
84. Parsico, M., Granucci, G.: Continuum Solvation Models in Chemical Physics: From Theory
to Applications. In: Wiley, H. (ed.), Chapter Photochemistry in condensed phase. https://doi.
org/10.1002/9780470515235
85. Mosca Conte, A., Violante, C., Missori, M., Bechstedt, F., Teodonio, L., Ippoliti, E., Carloni,
P., Guidoni, L., Pulci, O.: Theoretical optical spectroscopy of complex systems. J. Electron
Spectrosc. Relat. Phenom. 189(S), 46–55 (2013). https://doi.org/10.1016/j.elspec.2013.02.002
86. Giannozzi, P., Baroni, S., Bonini, N., Calandra, M., Car, R., Cavazzoni, C., Ceresoli, D.,
Chiarotti, G.L., Cococcioni, M., Dabo, I., Dal Corso, A., de Gironcoli, S., Fabris, S., Fratesi,
G., Gebauer, R., Gerstmann, U., Gougoussis, C., Kokalj, A., Lazzeri, M., Martin-Samos, L.,
Marzari, N., Mauri, F., Mazzarello, R., Paolini, S., Pasquarello, A., Paulatto, L., Sbraccia, C.,
Scandolo, S., Sclauzero, G., Seitsonen, A.P., Smogunov, A., Paolo, U., Wentzcovitch, R.M.:
Quantum Espresso: A modular and open-source software project for quantum simulations of
materials. J. Phys. Condens. Matter 21, 395502 (2009). https://doi.org/10.1088/0953-8984/
21/39/395502, https://www.quantum-espresso.org
87. Giannozzi, P., Andreussi, O., Brumme, T., Bunau, O., Buongiorno Nardelli, M., Calandra, M.,
Car, R., Cavazzoni, C., Ceresoli, D., Cococcioni, M., Colonna, N., Carnimeo, I., Dal Corso, A.,
de Gironcoli, S., Delugas, P., Di Stasio Jr, R.A., Ferretti, A., Floris, A., Fratesi, G., Fugallo, G.,
Gebauer, R., Gerstmann, U., Giustino, F., Gorni, T., Jia, J., Kawamura, M., Ko, H.Y., Kokalj,
A., Kücükbenli, E., Lazzeri, M., Marsili, M., Marzari, N., Mauri, F., Nguyen, N.L., Nguyen,
H.V., Otero-de-la-Roza, A., Paulatto, L., Poncé, S., Rocca, D., Sabatini, R., Santra, B., Schlipf,
M., Seitsonen, A.P., Smogunov, A., Timrov, I., Thonhauser, T., Umari, P., Vast, N., Wu, X.,
Baroni, S.: Advanced capabilities for materials modelling with Quantum Espresso. J. Phys.
Condens. Matter 29(46), 465901 (2017). https://doi.org/10.1088/1361-648X/aa8f79
Electronic Properties of Iron Sites
and Their Active Forms in
Porphyrin-Type Architectures

Mariusz Radoń and Ewa Broclawik

Abstract This chapter focuses on recent advances in quantum chemical modeling of
active sites in heme proteins and iron porphyrin complexes. After introducing the
computational methods (density functional theory and correlated ab initio methods),
several case studies are reviewed to show how these methods unravel the electronic
structure of heme and heme-related systems; in particular, how they describe:
(a) spin state energetics in ferrous and ferric complexes; (b) binding properties of
CO, NO, and O2 ligands to heme; (c) the electronic structure of P450 Cpd I and
similar systems. Conclusive calculations for heme species require a balanced
treatment of electron correlation, which is a great challenge for present
computational methods. A further challenge lies in correctly translating the
computational results into chemical terms. Achievements of modern ab initio methods
on these two fronts are highlighted and discussed in relation to DFT calculations.

1 Introduction

The electronic structure of transition metal complexes lies at the heart of
bioinorganic and medicinal chemistry, industrially oriented catalysis, and materials
science. A number of proteins (particularly oxygenases, transferases, and redox
enzymes) contain a transition metal ion coordinated by organic ligands as a key part
of their active site. Moreover, a number of industrially important reactions proceed at

M. Radoń (B)
Academic Computer Center CYFRONET AGH, Nawojki 11, 30-950 Kraków, Poland
Present Address:
M. Radoń
Faculty of Chemistry, Jagiellonian University in Krakow,
Gronostajowa 2, 30-387 Kraków, Poland
e-mail: mradon@chemia.uj.edu.pl
E. Broclawik
Jerzy Haber Institute of Catalysis, Polish Academy of Sciences,
Niezapominajek 8, 30-239 Kraków, Poland
e-mail: broclawi@chemia.uj.edu.pl
© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_23

transition metal sites, in either heterogeneous or homogeneous catalysis. Quantum
chemistry (QC) can provide insights into their electronic structure and properties,
which are needed to understand how these fascinating systems operate at the
microscopic level. Unlike experiment, QC offers direct and unbiased “access” to the
electronic structure of transition metal sites. Moreover, theoretical calculations
can be applied not only to stable (and thus experimentally well characterized)
species, but also to elusive active forms and even to transition states of chemical
reactions. For these reasons, QC has become a useful and nearly indispensable tool
for today’s bioinorganic chemistry.
Most large-scale quantum chemical calculations are nowadays carried out with density
functional theory (DFT) methods, owing to their reasonable compromise between the
accuracy of the results and the computational resources needed. Despite many
successes of QC in this field (e.g., modeling of enzymatic reactions), transition
metal sites are well recognized as “difficult cases” for computational treatment,
with large errors occasionally reported. The quality of the results can frequently
be improved by upgrading the model: increasing the system size, accounting for
long-range interatomic forces and thermodynamics, performing statistical sampling or
molecular dynamics, etc. Nonetheless, intrinsic problems persist in the description
of the electronic structure of transition metal systems (e.g., variable spin state
on the metal, spin coupling, nondynamical correlation), and these are not always
satisfactorily tractable within DFT methodology. In contrast to DFT, the ab initio
wave function approach offers a systematic treatment of electron correlation via
a series of well-defined approximations, ultimately converging towards the exact
solution. The price for the accuracy and systematic character of correlated ab initio
methods is their much higher computational complexity, which (until very recently)
prevented their application to sufficiently large, biologically relevant models.
The present contribution focuses on the electronic structure of iron sites in
porphyrin-type architectures and on recent progress in quantum chemical calculations
for these systems. Iron porphyrin systems serve as models of active sites in
iron-containing heme enzymes, which form a large and important group of
metalloproteins involved in many processes, such as the transport and activation of
small inorganic ligands (oxygen, nitric oxide), the oxygenation and degradation of
organic molecules (including drugs), and electron transfer. The involvement of iron
porphyrins in many important biological functions is presumably connected to their
interesting electronic properties and their rather unique ability to stabilize
various oxidation and spin states of the iron. On the other hand, these systems pose
a notable challenge for the methods of QC, not only in achieving the computational
accuracy needed for conclusive calculations, but also in understanding their
complicated (multiconfigurational) electronic structure in chemical terms.
This chapter starts with a brief description of the relevant computational
methodology, in which basic concepts of QC as well as the foundations of correlated
ab initio and DFT methods are briefly reviewed. By no means can this part be
considered an introduction to QC, for which purpose we recommend the excellent
textbooks covering either QC in general [73, 125, 183] or DFT [83]. Subsequently,
recent advances in quantum chemical calculations for various heme models are
reviewed. The achievements of correlated ab initio methods in describing these
biologically relevant iron complexes are highlighted, and the relation of these
methods to DFT calculations is assessed.

2 Methodology

2.1 Basic Concepts in Quantum Chemistry

The fundamental goal of quantum chemistry (QC) is to provide an approximate
solution of the time-independent Schrödinger equation:

Ĥ Ψ = EΨ (1)

where Ĥ is the electronic Hamiltonian (the energy operator) of the molecular system.
The problem is then to find the energy E and the (many-electron) wave function Ψ
for the ground state and often also for a certain number of electronically excited
states. Equation (1) is obtained from the basic principles of quantum mechanics
within the so-called Born-Oppenheimer approximation, meaning roughly that one seeks
the electronic structure for fixed positions of the nuclei, i.e., for a given
molecular structure. The energy and wave function obtained from (1) thus depend on
the nuclear coordinates, and one is often interested in finding the stationary points
of the energy as a function of geometry, because the minima represent stable
geometries of a given molecular system, while the saddle points give transition
states for possible chemical reactions. Although written in a very simple way,
Eq. (1) poses a tremendously difficult many-body problem, even for atoms and very
small molecules, let alone for relatively large systems such as the porphyrin-based
complexes considered in this chapter. Many approximations, the so-called methods of
QC, have therefore been devised to provide an approximate energy and wave function
(from which any molecular property can, in principle, be calculated) that are
accurate enough for the purposes of chemistry and biology, while still being
computationally tractable for large molecular systems.
The mother of all these methods is the mean-field approximation to (1), also called
the independent particle model or Hartree-Fock (HF) theory. In this approximation
the many-electron wave function (Ψ) is given by a Slater determinant (antisymmetrized
product) of one-electron functions, the so-called molecular (spin) orbitals. The
optimal form of the molecular orbitals for a given system can be obtained by applying
the variational principle of quantum mechanics, from which it follows that these
variationally optimal orbitals are solutions of the so-called Fock equations, i.e.,
an eigenvalue problem of the Fock operator

F̂ϕi (r) = εi ϕi (r) (2)

with eigenvalues (εi) known as orbital energies. The Fock operator is an effective
one-particle energy operator that includes not only the kinetic energy and the
electrostatic energy of an electron in the field of the nuclei, but also its average
interaction with the other electrons in the molecular system. This averaging of the
electronic repulsion, a consequence of the single-determinant approximation to the
wave function, is a key feature and (as we shall see below) a serious drawback of the
method. The Fock equations (2) are usually solved by expanding the molecular orbitals
in a predefined basis set of atom-centered functions, the so-called atomic orbitals
(AO). The Fock equations are thereby reduced to a generalized eigenvalue problem for
a (symmetric) Fock matrix. However, because constructing the Fock matrix requires
knowledge of the occupied molecular orbitals, this eigenvalue problem has to be
solved iteratively until convergence is achieved, in a self-consistent field (SCF)
procedure.
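The iterative character of the SCF procedure can be illustrated with a minimal sketch. It is not an actual Hartree-Fock code: the two-function "basis", the core matrix `h`, the overlap matrix `S`, and the simplified density-dependent repulsion term (controlled by a hypothetical parameter `U`) are all assumptions chosen only to make the example self-contained. What it does share with a real SCF calculation is the structure of the loop: build the Fock matrix from the current density, solve the generalized eigenvalue problem F C = S C ε, rebuild the density from the occupied orbitals, and repeat until the density stops changing.

```python
import numpy as np

def toy_scf(h, S, U, n_occ, max_iter=200, tol=1e-10):
    """Toy closed-shell SCF loop with a model Fock matrix F(P) = h + (U/2) diag(P)."""
    # Symmetric (Loewdin) orthogonalization: X = S^(-1/2)
    s_val, s_vec = np.linalg.eigh(S)
    X = s_vec @ np.diag(s_val ** -0.5) @ s_vec.T

    P = np.zeros_like(h)  # initial guess: empty density matrix
    for iteration in range(1, max_iter + 1):
        # The Fock matrix depends on the density, hence on the occupied orbitals
        F = h + 0.5 * U * np.diag(np.diag(P))
        # Solve F C = S C eps by transforming to an orthonormal basis
        eps, C_prime = np.linalg.eigh(X.T @ F @ X)
        C = X @ C_prime
        # Doubly occupy the n_occ lowest orbitals and rebuild the density
        C_occ = C[:, :n_occ]
        P_new = 2.0 * C_occ @ C_occ.T
        if np.max(np.abs(P_new - P)) < tol:
            return eps, C, F, P_new, iteration
        P = P_new
    raise RuntimeError("SCF did not converge")

# Hypothetical two-function model (all numbers are illustrative assumptions)
h = np.array([[-1.0, -0.5], [-0.5, 0.0]])
S = np.array([[1.0, 0.2], [0.2, 1.0]])
eps, C, F, P, n_iter = toy_scf(h, S, U=0.5, n_occ=1)
print("converged in", n_iter, "iterations; orbital energies:", eps)
print("number of electrons, Tr(PS):", np.trace(P @ S))
```

Production codes replace the plain fixed-point update used here with damping or convergence acceleration, but the self-consistency condition being solved is the same.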
As mentioned above, the main limitation of the HF approach comes from the
oversimplified structure of the wave function (i.e., a single Slater determinant),
which reflects the average treatment of electron-electron repulsion in this method.
In other words, the electrons in the HF approximation behave very much like
independent particles. Oddly enough, they are not fully independent: two electrons
with the same spin cannot occupy the same position in space, because the wave
function vanishes in that case. This follows because the Slater determinant wave
function is (correctly) antisymmetric with respect to the exchange of two electrons.
It may be shown that around the position of a reference electron there is a depletion
in the conditional probability of finding electrons with the same spin, as compared
to the (unconditional) one-electron probability; this depletion is known as the
exchange (or Fermi) hole. Consequently, two like-spin electrons (↑↑ or ↓↓) have
a lower repulsion energy than two electrons with opposite spins (↑↓) occupying the
same pair of molecular orbitals. This effect, a simple consequence of the (one-half)
electronic spin and the Pauli exclusion principle, is known as electron exchange or
Fermi correlation.
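The vanishing of the determinant for coinciding like-spin electrons can be checked numerically. The sketch below is an illustration under stated assumptions: the two orbitals are the lowest particle-in-a-box functions in one dimension (a hypothetical choice, unrelated to any molecule discussed here), and only the spatial part of a two-electron determinant with parallel spins is evaluated. It is compared with a plain orbital product, which shows no Fermi hole.

```python
import numpy as np

def phi_a(x):
    """Assumed orbital a: ground state of a particle in a box of unit length."""
    return np.sqrt(2.0) * np.sin(np.pi * x)

def phi_b(x):
    """Assumed orbital b: first excited state of the same box."""
    return np.sqrt(2.0) * np.sin(2.0 * np.pi * x)

def slater_same_spin(x1, x2):
    """Spatial part of a two-electron Slater determinant for parallel spins."""
    return (phi_a(x1) * phi_b(x2) - phi_b(x1) * phi_a(x2)) / np.sqrt(2.0)

def orbital_product(x1, x2):
    """Non-antisymmetrized product of the same orbitals, for comparison."""
    return phi_a(x1) * phi_b(x2)

x_ref = 0.3  # position of the reference electron
for x2 in (0.30, 0.31, 0.40, 0.70):
    p_det = slater_same_spin(x_ref, x2) ** 2
    p_prod = orbital_product(x_ref, x2) ** 2
    print(f"x2 = {x2:.2f}: |determinant|^2 = {p_det:.4f}, |product|^2 = {p_prod:.4f}")
```

The squared determinant goes to zero as the second electron approaches the reference position, whereas the squared product stays finite there; this depletion is the exchange hole in its simplest form.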
Exchange (Fermi correlation) should be clearly distinguished from the other type of
electron correlation, the Coulomb correlation (herein called simply electron
correlation), which arises from the deficiency of a single-determinant wave function
and is thus not included in the HF method. Since the electrons are charged particles,
they should avoid each other (regardless of their spins), especially at small
distances, where they repel most strongly. Proper correlation of their positions
cannot be described by a single-determinant wave function, whereas in a correct
(i.e., correlated) wave function there is a depletion in the conditional probability
of finding other electrons around the position of a reference electron, regardless of
their spin. This depletion in the conditional probability due to the Coulomb
repulsion is defined as the correlation hole. Electron correlation (like exchange)
therefore always reduces the interelectron repulsion and thus makes a negative
contribution to the total energy. This contribution, called the correlation energy,
can be formally defined as the difference between the exact energy (in the
non-relativistic approximation) and the HF energy obtained in the limit of a complete
one-particle basis set:

$$E_{\mathrm{corr}} = E_{\mathrm{exact}} - E_{\mathrm{HF}}^{(\mathrm{limit})}. \tag{3}$$
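
As a rough numerical illustration of Eq. (3), consider the helium atom. The two energies used below are commonly quoted literature values, not results from this chapter, and are rounded; they should be read as an assumption made for the sake of the example.

$$E_{\mathrm{corr}}(\mathrm{He}) \approx -2.9037\,E_h - (-2.8617\,E_h) = -0.0420\,E_h \approx -26\ \mathrm{kcal\,mol^{-1}}, \qquad \frac{|E_{\mathrm{corr}}|}{|E_{\mathrm{exact}}|} \approx 1.4\%$$

With these numbers the correlation energy is a small fraction of the total energy, yet it is large compared with the accuracy of about 1 kcal/mol usually demanded of chemically meaningful energy differences.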

Although the correlation energy is usually a small fraction of the total energy, it
comprises an important (at times even a dominant) part of the energy differences
relevant in chemistry and biology (bonding energies, atomization energies, reaction
energies and barriers, excitation energies, ionization potentials, etc.). In fact,
the exchange and correlation energy is sometimes even called “nature’s glue that
keeps atoms bound in molecules” [121]. The exchange and correlation energy must
therefore be properly included in all quantum chemical studies attempting to provide
a quantitative description. Nonetheless, while even simple HF theory correctly
captures exchange, it is much more difficult to account for electron correlation.
This can be accomplished either by using a correlated wave function in ab initio
quantum chemistry (discussed in Sect. 2.2) or by employing an approximate
exchange-correlation functional in density functional theory (DFT, discussed in
Sect. 2.3). A third approach to electron correlation, not discussed in this chapter,
is quantum Monte Carlo (QMC) [15]. The philosophy of QMC methods comes down to
improving the description of quantum systems by performing a stochastic search. At
the moment, QMC methods are not as mature as DFT and ab initio methods, and they are
far less popular in the bioinorganic area. Nevertheless, the QMC approach is
considered promising for transition metal systems [16, 84], and more advanced QMC
simulations for systems of biological importance may be expected in the near future.
The nonrelativistic approximation has been assumed so far, which may be partly
justified by the fact that, for the properties of the first-row transition metals of
interest in this chapter, relativistic effects are often far less important than the
uncertainties due to the approximate description of electron correlation. However,
these effects become more important for heavier elements (e.g., Mo and W sites in
some enzymes) and, for the sake of consistency, they should preferably be included
also for the first-row transition metals [159].
Although very rigorous four-component relativistic methods are available, most
practical calculations mentioned in this chapter rely on simpler two-component
approximations, e.g., methods based on the Douglas-Kroll-Hess (DKH) transformation
[10], in which the relativistic effects naturally fall into two categories: scalar
effects and spin-orbit coupling. As a rule of thumb, scalar effects are far more
important than spin-orbit coupling for calculations of potential energy surfaces to
chemical accuracy, but the latter becomes crucial for calculations of certain
molecular properties (EPR parameters, magnetic moments, etc.). Instead of being
treated explicitly, scalar relativistic effects can alternatively be described by
means of effective core potentials (ECPs) [17]. The methods of dealing with electron
correlation discussed in the next two sections are relevant for both relativistic and
nonrelativistic calculations.

2.2 Correlated Ab Initio Methods

2.2.1 Single- Versus Multireference Methods

In order to account for electron correlation, ab initio (Latin: from first
principles) methods employ a wave function that is a linear combination of many
electronic configurations (determinants), rather than just a single configuration
(i.e., one Slater determinant) as in HF theory. The main advantage of this approach
is its accuracy and systematic improvability, since methods of increasing accuracy
can be used as a series of well-defined approximations that (ultimately) converge to
the exact solution of the electronic Schrödinger Eq. (1). However, with a correlated
wave function, high accuracy can be obtained only at great computational cost (in
terms of both computational time and required resources, such as memory and disk
space). Moreover, this cost increases very rapidly with the size of the problem. For
this reason it is often simply not feasible to perform ab initio calculations at the
desired level of theory for many chemically and biologically interesting systems.
If the wave function is dominated by a single electron configuration, the HF
orbitals provide a good starting point for constructing a correlated wave function in
post-Hartree-Fock methods, also known as single-reference methods. In this approach
the correlated wave function is obtained by supplementing the HF configuration with
other electronic configurations generated from the HF orbitals by virtual excitations
of a certain number of electrons out of the occupied into the unoccupied orbitals
(single excitations i → a, double excitations i, j → a, b, etc.)¹:


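$$\Psi = \Psi_0 + \sum_{i}^{\mathrm{occ.}} \sum_{a}^{\mathrm{virt.}} C_i^a \Psi_i^a + \sum_{i>j}^{\mathrm{occ.}} \sum_{a>b}^{\mathrm{virt.}} C_{ij}^{ab} \Psi_{ij}^{ab} + \sum_{i>j>k}^{\mathrm{occ.}} \sum_{a>b>c}^{\mathrm{virt.}} C_{ijk}^{abc} \Psi_{ijk}^{abc} + \ldots \tag{4}$$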

Equation (4) is known as the configuration interaction (CI) ansatz. The problem to be
solved here is to find the CI coefficients ($C_i^a$, $C_{ij}^{ab}$, etc.) in (4).
Depending on the method used to determine the CI coefficients, the post-HF approach
may be formulated as a variational (configuration interaction, CI), a perturbational
(Møller-Plesset, MP), or a coupled cluster (CC) method.
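To give a feeling for why the cost of the expansion (4) grows so rapidly, the following sketch counts the determinants at each excitation level. It is a simple combinatorial estimate under stated assumptions: spin orbitals are counted without any spin or spatial symmetry restrictions, and the orbital numbers (10 occupied, 40 virtual) are hypothetical and far smaller than those of a real heme model.

```python
from math import comb

def n_excited_determinants(n_occ, n_virt, level):
    """Determinants with exactly `level` electrons moved from occupied to virtual spin orbitals."""
    return comb(n_occ, level) * comb(n_virt, level)

def ci_space_size(n_occ, n_virt, max_level):
    """Size of the CI expansion truncated at `max_level`-fold excitations (reference included)."""
    return sum(n_excited_determinants(n_occ, n_virt, k) for k in range(max_level + 1))

n_occ, n_virt = 10, 40  # hypothetical numbers of occupied and virtual spin orbitals
for level, label in enumerate(["reference", "singles", "doubles", "triples", "quadruples"]):
    print(f"{label:10s}: {n_excited_determinants(n_occ, n_virt, level):>12d}")
print("full CI   :", ci_space_size(n_occ, n_virt, n_occ))
```

Truncating the expansion at a low excitation level keeps the number of coefficients manageable, which is what the CI, MP, and CC hierarchies do in their different ways.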
The post-HF methods provide a balanced treatment of electronic systems in which
a single configuration (Ψ0) already provides a qualitatively correct description,
i.e., this configuration is dominant in the correlated wave function. For a system in
which this condition is satisfied, the electron correlation is described as dynamical
correlation. The concept of dynamical correlation corresponds closely to the
intuitive picture of electrons that behave qualitatively as in the HF approximation
but, owing to a suitable correction added to the HF wave function, properly avoid
each other at small distances in order to minimize the electrostatic repulsion. Even
though dynamical

1 One should be aware that these formally excited configurations merely serve to describe electron
correlation and have here no direct connotation to electronically excited states.

correlation is typically connected with this short-range effect, it also gives rise to
long-range intermolecular dispersion forces, which are likewise rooted in correlation
effects.
A completely opposite picture arises when several electronic configurations
contribute to the wave function with comparable weights. In such a case the
single-determinant approximation may become qualitatively wrong. This type of
electron correlation, which is due to the near-degeneracy of several electronic
configurations, has been termed static or nondynamical correlation. In contrast to
dynamical correlation, static correlation has a large effect on the molecular
orbitals, which should therefore preferably be optimized not for a single (HF)
configuration, but rather for a multiconfigurational wave function. This is the
underlying idea of multiconfigurational/multireference methods (discussed below),
which are therefore by design best suited to treat cases with strong nondynamical
correlation.
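The transition from a single dominant configuration to near-degenerate mixing can be seen in the smallest possible CI problem, two configurations coupled by one matrix element. The sketch below is a generic two-level model, not a calculation on any system from this chapter; the energy gap and coupling values are arbitrary assumptions in arbitrary units.

```python
import numpy as np

def two_config_ground_state(gap, coupling):
    """Ground state of two configurations with energies 0 and `gap` coupled by `coupling`.

    Returns the weights (squared CI coefficients) and the ground-state energy.
    """
    H = np.array([[0.0, coupling], [coupling, gap]])
    energies, vectors = np.linalg.eigh(H)
    ground = vectors[:, 0]  # eigh returns eigenvalues in ascending order
    return ground ** 2, energies[0]

coupling = 0.1  # assumed coupling matrix element
for gap in (10.0, 1.0, 0.1, 0.0):
    weights, energy = two_config_ground_state(gap, coupling)
    print(f"gap = {gap:5.1f}: weights = {weights[0]:.3f} / {weights[1]:.3f}, "
          f"energy lowering = {-energy:.4f}")
```

When the gap is large compared with the coupling, the first configuration dominates and the second only adds a small correction; as the gap closes, both configurations carry comparable weight and no single determinant is a reasonable reference.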
In many real cases it is not possible to classify the electron correlation effects
unambiguously as “purely dynamical” or “evidently nondynamical”. Dynamical
correlation is a universal phenomenon, occurring in all multi-electron systems (both
atoms and molecules). In contrast, nondynamical correlation is system-specific: it
occurs only in selected systems (or only at certain geometries) and always alongside
the ubiquitous dynamical correlation. For rare gas atoms and for closed-shell,
well-behaved molecules near their equilibrium geometry, the electron correlation is
almost purely dynamical. In contrast, the description of transition metal complexes
often suffers from effects which are described, in a broad sense, as “nondynamical
correlation”. This means, in practice, that low-order post-HF calculations,
especially CI or perturbational Møller-Plesset methods, may fail to provide a good
description there. Multiconfigurational methods clearly provide a better and more
consistent approach to treating these effects. On the other hand, higher-order
post-HF methods can still be very useful in many cases. For instance, several
accurate ab initio benchmarks discussed further in this chapter were obtained with
the CCSD(T) method, i.e., a coupled cluster (CC) approach with full (iterative)
treatment of single and double excitations and approximate (noniterative) treatment
of triple excitations appearing in the cluster operator T̂, which defines the
correlated wave function via the exponential ansatz [9]:

ΨCC = exp(T̂ )ΦHF .

This method is often considered the “gold standard” of accuracy, providing excellent
agreement with experimental data for a number of small reference molecules.
Interestingly, the high accuracy of the CCSD(T) method is retained even for species
with moderate nondynamical correlation effects; thus it can be expected that this
method should work equally well for transition metal species [61, 107, 180],
unless nondynamical correlation effects are extreme [78]. A number of diagnostics
have been developed for single-reference methods in order to identify and filter out
such exceptionally demanding cases [78, 88], and it seems that a number of heme-related
models are still well-behaved systems for coupled cluster calculations.
Unfortunately, the applicability of CCSD(T) is strongly limited by its extreme
computational cost (scaling with the seventh power of the system size). This means that
this method cannot be (so far) applied even to the simplest iron porphyrin species,
being applicable only to their small mimics [107, 180].
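
To give a feeling for what seventh-power scaling means in practice, the short sketch below evaluates the relative cost of a calculation when the system size grows by a given factor. It assumes idealized N^7 behaviour with no prefactor changes; the function name and the chosen size ratios are illustrative only.

```python
def relative_cost(size_ratio: float, exponent: int = 7) -> float:
    """Cost multiplier for a method scaling as N**exponent when N grows by size_ratio."""
    return size_ratio ** exponent


# Illustrative size ratios (hypothetical): how much more expensive does CCSD(T) get?
for ratio in (1.5, 2.0, 3.0):
    print(f"system size x{ratio}: cost x{relative_cost(ratio):.0f}")
```

Under this idealized assumption, doubling the system size multiplies the cost by 2^7 = 128, which illustrates why moving from a small mimic to a full porphyrin is so demanding.
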
The most important example of nondynamical correlation occurring in transition
metal species is left-right correlation. This type of correlation is characteristic of any
electronic pair involved in a covalent bond. The electrons paired in a bonding orbital
have a tendency to partially separate into spatially different regions in order to reduce
their interelectron repulsion. If one electron from the pair is closer to the nucleus
on the left-hand-side of the bond, the second electron is more likely found near the
nucleus on the right-hand-side. The motions of electrons are thus partially correlated
along a chemical bond, which can give rise to a long-range correlation effect when
the bond is stretched (and ultimately dissociated). The left-right correlation thus
becomes particularly critical in the dissociation limit, where it is required for a physically
correct form of the wave function.
The role of this correlation can be intuitively explained using the example of the hydrogen
molecule (H–H), which, by homolytic breaking of the σ bond, dissociates into two
neutral hydrogen atoms (H· + H·). However, if the dissociation is modeled with (spin
restricted) HF theory, it leads to a wave function which can be read as
an equal mixture (superposition) of both neutral (H· + H·) and ionized (H+ + H−, H− +
H+) products; the latter two structures should not be present in the dissociation limit
and their appearance gives rise to an erroneously high energy. The qualitatively
correct description can be provided by a multiconfigurational wave function of the
form:
\[ \Psi_{\mathrm{H_2}} = C_1 \left|(\sigma)^2(\sigma^*)^0\right\rangle + C_2 \left|(\sigma)^0(\sigma^*)^2\right\rangle, \tag{5} \]

which involves not only the configuration with two electrons paired in the bonding
orbital (σ ), which is the one used in HF, but also the second configuration with the
two electrons occupying the antibonding orbital (σ∗). If the two coefficients C1, C2
in (5) are regarded as variational parameters, near the equilibrium geometry one
would find C1 ≈ 1 and C2 ≈ 0 (then the HF description is qualitatively correct),
while in the dissociation limit C1 = −C2 = 2^(−1/2) (then both configurations play an
important role and the HF description is qualitatively wrong). Neglect of the second
configuration in the wave function (i.e., neglect of left-right correlation) leads, within
the (spin restricted) HF method, to a very large error of 150 kcal/mol in the dis-
sociation energy of H2 . Although in this simple (textbook) example, nondynamical
correlation becomes important only near the dissociation limit, in the case of transition
metal complexes the analogous correlation effects may be pronounced already at the
equilibrium geometry (vide infra).
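
The behaviour of the coefficients C1 and C2 in (5) can be illustrated with a minimal sketch that diagonalizes a 2×2 configuration interaction matrix analytically. The diagonal elements e1 and e2 stand for the energies of the two configurations and k for their (positive) coupling; all numerical values are hypothetical, in arbitrary energy units, and are not taken from any actual H2 calculation.

```python
import math


def two_config_ground_state(e1: float, e2: float, k: float):
    """Lowest eigenvalue and normalized eigenvector (C1, C2) of [[e1, k], [k, e2]]."""
    mean = 0.5 * (e1 + e2)
    half_gap = 0.5 * (e1 - e2)
    energy = mean - math.hypot(half_gap, k)
    if k == 0.0:
        # Uncoupled configurations: the lower one is the ground state.
        return energy, ((1.0, 0.0) if e1 <= e2 else (0.0, 1.0))
    # First row of (H - E) C = 0 reads (e1 - E) C1 + k C2 = 0,
    # which is satisfied by the unnormalized vector (k, E - e1).
    c1, c2 = k, energy - e1
    norm = math.hypot(c1, c2)
    c1, c2 = c1 / norm, c2 / norm
    if c1 < 0.0:  # fix the overall sign so that C1 is positive
        c1, c2 = -c1, -c2
    return energy, (c1, c2)


# Hypothetical "equilibrium-like" case: configurations well separated in energy.
print(two_config_ground_state(e1=-1.0, e2=0.5, k=0.1))
# Hypothetical "dissociation-like" case: degenerate configurations.
print(two_config_ground_state(e1=-1.0, e2=-1.0, k=0.2))
```

In the first case the ground state is dominated by the first configuration (C1 close to 1), whereas for degenerate configurations the sketch returns C1 = −C2 = 2^(−1/2), the limiting values quoted above.
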
As may be seen already from the simple example above, a multiconfigurational
wave function can be constructed as a linear combination of all electronic configu-
rations arising from a given set of active molecular orbitals. The configurations are
obtained by distributing the available active electrons in all possible ways among the
active orbitals. One can then simultaneously optimize the shapes of the molecular
orbitals and the coefficients of the various electronic configurations. This idea is realized
in the complete active space self-consistent field (CASSCF) method of Roos [146]. Orbitals which
are not important in the description of static correlation are not put in the active space, so
they are either doubly occupied (inactive orbitals) or virtual (secondary orbitals) in
all the configurations considered. Since the number of electronic configurations grows
rapidly with the number of active orbitals, no more than 16 orbitals can be active in
practice. A modification of the CASSCF method was later proposed as the restricted
active space (RASSCF) method [93]. Here, the active space is divided into three
subspaces: RAS1, RAS2, and RAS3. The middle one, RAS2, plays exactly the same
role as the complete active space in CASSCF calculations. The orbitals in RAS1 are
mostly doubly occupied and only a limited number of electrons (often two or four)
can be excited out of this set into RAS2 and RAS3. Likewise, the orbitals in RAS3
are mostly unoccupied and can be occupied with at most a limited number of electrons
from RAS1 and RAS2. These restrictions serve to eliminate by hand a multitude
of high-energy, less important configurations (which, anyway, would obtain close
to zero coefficients if they were formally included in the CASSCF wave function),
thus keeping the problem computationally tractable even for relatively large active
spaces.
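
The rapid growth of the configuration space with the number of active orbitals can be made concrete with a short counting sketch. It uses two standard combinatorial results that are not spelled out in the text above: the number of Slater determinants for given numbers of α and β electrons, and the Weyl formula for the number of spin-adapted configuration state functions (CSFs). The function names are illustrative; the spin is passed as the integer 2S.

```python
from math import comb


def n_determinants(n_orb: int, n_alpha: int, n_beta: int) -> int:
    """Number of Slater determinants for n_alpha and n_beta electrons in n_orb orbitals."""
    return comb(n_orb, n_alpha) * comb(n_orb, n_beta)


def n_csf(n_orb: int, n_elec: int, two_s: int) -> int:
    """Number of spin-adapted CSFs with total spin S = two_s / 2 (Weyl formula)."""
    if two_s < 0 or two_s > n_elec or (n_elec - two_s) % 2:
        raise ValueError("n_elec and two_s must be compatible")
    a = (n_elec - two_s) // 2
    b = (n_elec + two_s) // 2 + 1
    return (two_s + 1) * comb(n_orb + 1, a) * comb(n_orb + 1, b) // (n_orb + 1)


# Singlet CAS(n, n) spaces of increasing size.
for n in (2, 4, 8, 12, 16):
    print(f"CAS({n},{n}): {n_csf(n, n, 0)} CSFs, "
          f"{n_determinants(n, n // 2, n // 2)} determinants")
```

For two electrons in two orbitals the sketch gives three singlet CSFs, while a singlet CAS(16,16) space already contains about 3.5 × 10^7 CSFs, consistent with the practical limit mentioned above.
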
The CASSCF or RASSCF calculations (with suitably chosen active orbitals)
serve to capture nondynamical correlation, but they cover only a very small part
of the dynamical correlation. Therefore, the missing correlation effects are included
in subsequent calculations, most typically with multireference second-order perturbation
theory (MRPT2). For this purpose, the Lund group developed the CASPT2
method [5] to include dynamical correlation on top of a CASSCF wave function;
this approach was recently generalized into the RASPT2 method [93] operating with
a RASSCF-type wave function. A different MRPT2 approach was proposed by Hirao
et al. under the name multi-reference Møller-Plesset (MRMP2) method [64]. This
method differs from CASPT2 in the choice of the zero-order Hamiltonian and in
the way the first-order correction to the wave function is constructed. However,
for heme and heme-related complexes there is much more experience
with CASPT2 than with the MRMP2 approach [24]. An alternative to these MRPT2
methods is the multireference configuration interaction (MRCI) approach. Due to their
very significant cost, MRCI calculations are in practice limited to MRCISD (with
single and double excitations only), and even this method is too expensive to be
applied to heme models. For such systems, only the simplified difference-dedicated CI
(DDCI) methods can be applied, which rely on the neglect of those excitations that
are not expected to considerably affect the energy difference between the considered
electronic states [96, 97, 100].
It is important to note that while CASSCF/RASSCF calculations already provide
a qualitatively correct description of the electronic structure at the correlated level
(the natural orbitals and leading configurations in the CI expansion), the ener-
getics cannot be trusted before it is corrected for missing dynamical correlation
effects (most typically by means of CASPT2/RASPT2). Before this step is done, the
CASSCF/RASSCF energetics is usually meaningless.
Concerning correlated ab initio calculations (both with post-HF and multirefer-
ence methods) it must be remembered that a large basis set of atomic orbitals is
usually required in order to obtain meaningful results. This is due to the slow convergence
of the correlation energy with respect to the one-particle basis set [62]. Typically, a
basis set as large as polarized quadruple-ζ has to be placed on the metal and a triple-ζ one on the
ligands (particularly in the first coordination sphere). Additionally, the energies are
often extrapolated to the infinite basis set limit based on results obtained with two (or more)
basis sets of different quality [62]. Special, systematically convergent basis sets were
devised for a balanced description of correlation energy in ab initio calculations, such
as the Dunning-type correlation consistent basis sets (cc-pVnZ, n = D, T, Q, etc) [8,
36] or atomic natural orbitals (ANO) basis sets developed by the Lund group [130,
148].
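
As an illustration of such an extrapolation, the sketch below implements one widely used two-point scheme, in which the correlation energy obtained with a basis set of cardinal number X is assumed to behave as E(X) = E_CBS + A·X^(−3). This inverse-cubic form is an assumption of the sketch and is not necessarily the scheme applied in the calculations discussed in this chapter; the input energies are synthetic numbers, not results for any real molecule.

```python
def extrapolate_correlation_energy(e_small: float, x_small: int,
                                   e_large: float, x_large: int,
                                   power: int = 3) -> float:
    """Two-point extrapolation assuming E(X) = E_CBS + A * X**(-power)."""
    w_small = x_small ** power
    w_large = x_large ** power
    return (w_large * e_large - w_small * e_small) / (w_large - w_small)


# Synthetic correlation energies (hartree) for triple-zeta (X = 3) and
# quadruple-zeta (X = 4) basis sets; hypothetical values for illustration only.
e_tz, e_qz = -0.9815, -0.9922
print(extrapolate_correlation_energy(e_tz, 3, e_qz, 4))
```

Only the correlation energy is extrapolated in this way; the HF (or CASSCF) reference energy converges much faster with the basis set and is usually treated separately.
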
An important bottleneck in ab initio calculations with large basis sets was for
a long time the evaluation and processing of two-electron repulsion integrals,
whose number grows with the fourth power of the basis set size. Fortunately, this
problem has been largely mitigated by development of techniques, such as Cholesky
decomposition (CD) [7] and resolution of identity (RI) [193], that avoid explicit use
of the two-electron integrals. The CD and RI approaches, by allowing for the use
of much larger basis sets than was previously possible, have already opened a new
route in ab initio calculations for transition metal complexes. The CD approach was
used in most of the CASSCF/CASPT2 calculations mentioned in this chapter.
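
The following sketch compares the number of symmetry-unique two-electron integrals with the number of three-index quantities that have to be handled in an RI- or CD-type factorization. The count of unique integrals follows from the eightfold permutational symmetry of real orbitals; the ratio of auxiliary (or Cholesky) vectors to basis functions, set here to 3, is a hypothetical value chosen only for illustration.

```python
def n_unique_two_electron_integrals(n_basis: int) -> int:
    """Symmetry-unique integrals (ij|kl) for real orbitals (eightfold symmetry)."""
    n_pairs = n_basis * (n_basis + 1) // 2
    return n_pairs * (n_pairs + 1) // 2


def n_three_index_elements(n_basis: int, n_aux: int) -> int:
    """Elements of the three-index factors: one vector per auxiliary function, per orbital pair."""
    return n_aux * n_basis * (n_basis + 1) // 2


AUX_PER_BASIS = 3  # hypothetical ratio of auxiliary to basis functions

for n_basis in (100, 500, 1000, 2000):
    four_index = n_unique_two_electron_integrals(n_basis)
    three_index = n_three_index_elements(n_basis, AUX_PER_BASIS * n_basis)
    print(f"N = {n_basis}: {four_index:.3e} four-index vs {three_index:.3e} three-index")
```

The four-index count grows roughly as N^4/8, whereas the three-index storage grows only as N^3, so the saving becomes larger as the basis set increases.
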
We anticipate that even more substantial improvements may be expected in the
near future due to developments in local approaches to electron correlation [57, 158]
(allowing an efficient treatment of electron correlation in spatially extended systems) as
well as the advent of explicitly correlated approaches, such as CCSD(T)-F12 [1, 82].
Here, the term “explicitly correlated” means that the wave function employed in these
methods depends explicitly on the interelectron distance in contrast to the traditional
approach, in which this dependence is implicit, i.e., achieved solely through mixing
of many electronic configurations as in Eq. (4). Due to their construction, explicitly
correlated methods can describe dynamical correlation very efficiently even with
moderate-sized basis sets [101]. However, the explicitly correlated methods are not yet
widely used in the field of bioinorganic chemistry. Although the CCSD(T)-F12
method has already been applied to simple transition metal systems [69], most of the
calculations for heme and heme-related systems still employ traditional approaches.
Some barriers stem from the (so far) single-reference character of explicitly correlated
approaches (e.g., the CCSD(T)-F12 variant of CCSD(T)), whereas many interesting
bioinorganic problems in fact require multireference methods.

2.2.2 Choice of Active Space for Transition Metal Systems

The accuracy and reliability of multireference calculations rely heavily on the under-
lying active space (in CASSCF or RASSCF calculations) which should capture all
relevant effects of nondynamical correlation. Ideally, all molecular orbitals originat-
ing from the valence shells should be made active, but this is obviously not possible
for chemically interesting systems (except for very small molecules). Actually, it is
also not obligatory since typically just a few of them give rise to nondynamical cor-
relation effects. The orbitals important for description of nondynamical correlation
are not necessarily the frontier orbitals (HOMO, LUMO, HOMO−1, LUMO+1,
etc) obtained from a single-configurational treatment. In fact, what matters is the
character of orbitals, not their orbital energy. This points to the question of (some-
what arbitrary) active space selection. With growing computational experience, some
rules were provided for choosing the appropriate active space in transition metal
species [126, 127, 147].
A general principle is to make active all molecular orbitals with significant metal
nd character. Thus, for any covalent metal–ligand bond this rule implies making
active both the bonding and the antibonding molecular orbital describing the bond. This is
because an important left-right correlation effect is typically connected with covalent
metal–ligand bonding. As a rule of thumb, the more pronounced the covalent character
(i.e., the larger the mixing of metal nd with the corresponding ligand orbital),
the more important the nondynamical correlation effect is expected to be [127].
Figure 1 shows contour plots of the (bonding and antibonding) molecular orbitals
involved in typical covalent metal–ligand interactions relevant for iron porphyrin
systems considered in this work. The contour plots shown refer to natural orbitals
(eigenvectors of one-particle density matrix) obtained from CASSCF calculations on
the respective species. Panels (a) and (b) show the orbitals of the σ and π components
of the Fe=O bond in iron-oxo porphyrin (one of the models discussed in Sect. 3.3).
The σ bond originates from an end-on interaction between Fe 3d_z² and O 2p_z orbitals;
the π bond originates from a side-on interaction between Fe 3d_xz and O 2p_x orbitals.
Panel (c) shows the bonding and the antibonding orbital involved in a tetradentate
σ Fe−N bonding between the Fe atom and the N atoms of the porphyrin ring. This
type of bonding is found in all metal porphyrins as well as in complexes with other
similar macrocyclic ligands (e.g., salen, corrole, corrin, corrolazine). Even though it is
more complicated than the σ bonding shown in (a), the tetradentate bonding in (c)
is classified as σ too, since it arises from an end-on overlap of Fe 3d_xy with the ligand
orbital of the respective symmetry (being localized mostly on the four N atoms of
the porphyrin ring).
Left-right correlation causes the actual occupation number of the
formally doubly occupied bonding orbital to be lower than two; similarly, if

Fig. 1 Pairs of bonding and antibonding orbitals involved in the description of covalent metal–ligand
bonds: (a) the σ component of the Fe–O bond in the ferryl group; (b) one of two π components
of the Fe–O bond; (c) tetradentate σ bond between Fe and the porphyrin nitrogens. The figure gives
contour plots of natural orbitals obtained from CASSCF calculations
the antibonding orbital is formally vacant, its actual occupation number is somewhat
larger than zero. This happens because the multiconfigurational wave function
contains (among many others) the configurations where the electronic pair has been
transferred from the bonding to the antibonding orbital (like the wave function for H2
molecule in Eq. 5). It must be mentioned that the same term “left-right correlation”
can be used to describe nondynamical correlation effects connected to various bond-
ing situations, including the three different ones shown in Fig. 1. For the tetradentate
σ Fe–N(porphyrin) bond shown in (c) there is, clearly, neither a “left-” nor a “right-hand side”
of the bond, but the term “left-right correlation” is used in a general—not strictly
geometrical, but rather topological sense, indicating merely that there are two ends
of the bond considered: the one closer to the metal (inward) and the one closer to
the ligand (outward). It must be stressed that left-right correlation in metal-ligand
bonds is pronounced already in the equilibrium geometry. The actual bond distance
is often imposed by the structure of the ligand or results from a compromise between
metal-ligand and ligand-ligand interactions achieved in the equilibrium geometry.
In the complexes with porphyrin-like ligands, the distance between the metal and
the coordinating atoms (N atoms in porphyrins) is determined to a large extent by the
geometry of the macrocycle. Thus, because of the small size of the 3d orbitals of first-row
transition metals, the overlap between the metal and the ligand orbitals is often
far from optimal even for the equilibrium structure. Hence, in the sense of electronic
structure, the metal-ligand bonds may be described as “partially broken” already in
the equilibrium geometry!
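
The link between left-right correlation and fractional occupation numbers can be illustrated for the simplest case of a two-configuration wave function of the form (5). For such a wave function the natural orbital occupation numbers are 2·C1² for the bonding and 2·C2² for the antibonding orbital. The coefficient values used below are hypothetical and serve only to show the trend; real CASSCF wave functions contain many more configurations.

```python
import math


def natural_occupations(c1: float, c2: float):
    """Occupation numbers (bonding, antibonding) for C1|(bond)^2> + C2|(antibond)^2>."""
    norm2 = c1 * c1 + c2 * c2  # renormalize in case the input is not normalized
    return 2.0 * c1 * c1 / norm2, 2.0 * c2 * c2 / norm2


# Hypothetical coefficients: weak, moderate and complete left-right correlation.
for c2 in (0.0, -0.15, -0.30, -1.0 / math.sqrt(2.0)):
    c1 = math.sqrt(1.0 - c2 * c2)
    n_bond, n_anti = natural_occupations(c1, c2)
    print(f"C2 = {c2:+.3f}: n(bonding) = {n_bond:.3f}, n(antibonding) = {n_anti:.3f}")
```

The two occupation numbers always sum to two; the further they depart from 2 and 0, the stronger the left-right correlation in the corresponding bond.
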
Unless a metal–ligand bonding is very covalent, the bonding orbital is dominated
by a contribution from the ligand while the antibonding one by a contribution from
the metal. This situation is illustrated by the pair of orbitals (σ_xy, σ*_xy) shown in panel
(c) of Fig. 1. In such a case it is customary to denote the antibonding orbital simply
as metal nd and relate it to the respective nd orbital appearing in crystal field theory
(CFT) considerations. For instance, the σ*_xy orbital in panel (c) of Fig. 1 could be
(and often is) denoted simply as the “Fe 3d_xy” orbital, in agreement with its principal
iron-3d character. However, if the bond is more covalent (i.e., the mixing of metal
and ligand contributions is more pronounced, as for the orbitals depicted in panels (a)
and (b)), this classification is neither valid nor useful. It is then necessary to think
about both the bonding and the antibonding orbital as containing a significant metal
nd character. This is also the situation found in Fe–O2 and Fe–NO complexes, with
particularly covalent metal–ligand interactions, which are discussed in Sect. 3.2.
The active spaces for transition metal species should also account for a double-shell
effect, i.e., a strong radial correlation in the valence nd shell (an effect especially
important for first-row transition metals due to the small radial extent of their 3d
orbitals). This effect can be tackled by including in the active space an (n + 1)d-type
orbital for each occupied (not otherwise correlated) nd orbital [5, 145]. This requires
extending the active space with up to five extra vacant orbitals. For instance, at the
orientation of the porphyrin ring given above, the Fe 3d_x²−y² orbital is essentially
nonbonding (i.e., not involved in covalent metal–ligand interactions) and is doubly
or singly occupied (at least in low-lying electronic states). One should thus include
in the active space a correlating Fe 4d_x²−y² orbital, which has the same shape as
3d_x²−y², but a larger radial extent. In contrast, for transition metal complexes it is
usually not necessary to make active the virtual orbitals with (n + 1)s and (n + 1)p
character (i.e., 4s and 4p for the first-row metals), even though these orbitals can be crucial
for a proper description of atoms and small molecules in the gas phase. In
coordination compounds, these orbitals are strongly destabilized by the ligand field;
because their orbital energy is too high, they typically do not need to be included in the
active space.

2.3 Density Functional Theory

The second and completely different approach to electron correlation is pursued in
density functional theory (DFT) methods. In DFT the basic quantity is not a wave
function, but the electron density

\[ \rho(\mathbf{r}) = \Big\langle \Psi \Big| \sum_{i=1}^{N} \delta(\mathbf{r} - \mathbf{r}_i) \Big| \Psi \Big\rangle \tag{6} \]

which depends only on the three spatial coordinates (irrespective of N , the number
of electrons) and directly corresponds to a well-defined physical property. In their
pioneering work, Hohenberg and Kohn proved that, in principle, a density-dependent
energy functional E[ρ] can be used to provide the exact energy of any multielectron
system, taking care of all electron correlation effects. If this functional were known,
one could perform virtually exact quantum calculations for any molecule of interest
without the need to use a (very complicated) correlated wave function. But, obviously,
the precise form of the mysterious energy functional is not known explicitly and all
DFT calculations are necessarily based on its approximations.
In practice, nearly all DFT calculations are based on the Kohn-Sham (KS) method,
in which the unknown energy functional is partitioned into four terms:

\[ E[\rho] = T_s[\rho] + \int d^3r\, \rho(\mathbf{r})\, v(\mathbf{r}) + J[\rho] + E_{xc}[\rho]. \tag{7} \]

The appearance of the first term, a non-interacting kinetic energy, is the essence of the
KS approach. Having recognized that it is extremely difficult to find an explicit
density-dependent functional for electrons’ kinetic energy, Kohn and Sham pro-
posed to approximate this quantity based on auxiliary molecular orbitals describing
a fictitious system of noninteracting electrons with the same electron density as in
the real system. Since the wave function of the fictitious system is given by a single
determinant, the Ts term can be easily calculated:


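\[ T_s[\rho] = \sum_{i}^{\mathrm{occ}} \left\langle \phi_i \right| -\tfrac{1}{2}\nabla^2 \left| \phi_i \right\rangle, \quad \text{where} \quad \rho(\mathbf{r}) = \sum_{i}^{\mathrm{occ}} |\phi_i(\mathbf{r})|^2. \]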
A simple integral in the second term of (7) serves to express the nuclei–electron
attraction, while the third term is a Coulomb interaction of electron density with
itself

\[ J[\rho] = \frac{1}{2} \int\!\!\int d^3r\, d^3r'\, \frac{\rho(\mathbf{r})\,\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}. \tag{8} \]

The last term in (7), known as the exchange–correlation functional, provides the necessary
correction to the first three terms to yield in sum the exact electronic energy. As its
name implies, the E_xc term covers all effects of electron exchange and correlation.
However, it also accounts for the difference between T_s and the exact kinetic energy, and
it attempts to compensate for the unphysical self-interaction which is introduced
by the J term. The exchange–correlation functional is the only term in (7) that needs
to be approximated.
Variationally optimal orbitals of the noninteracting system, the KS orbitals, can
be found as the self-consistent solution of the following KS equations:

\[ \hat{F}_{\mathrm{KS}}\, \phi_i(\mathbf{r}) = \varepsilon_i^{\mathrm{KS}}\, \phi_i(\mathbf{r}), \tag{9} \]

where the eigenvalues (ε_i^KS) are the KS orbital energies. Equation (9) is very similar to the
analogous one in HF theory (Eq. 2) and both problems are solved using a similar
methodology (i.e., by the SCF procedure after expanding molecular orbitals in a given
AO basis set, vide supra). The difference between the HF and KS theories lies
in the form of the effective one-electron Hamiltonian (F̂ in HF vs. F̂_KS in
KS theory). The KS operator is a sum of a kinetic energy operator and an effective
one-electron potential

\[ v_{\mathrm{KS}}(\mathbf{r}) = v(\mathbf{r}) + \int d^3r'\, \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|} + v_{xc}(\mathbf{r}), \tag{10} \]

consisting of the external potential v(r) (due to the nuclei), the Coulomb interaction with the
electron density, and a third term called the exchange–correlation potential, which is the
functional derivative of the E_xc functional with respect to the density. This term takes
into account all exchange and correlation effects insofar as they are (approximately)
included in E_xc[ρ].
It must be stressed that, despite a formal similarity, the Hartree-Fock and Kohn-
Sham theories are physically very different: in HF the exchange is treated exactly
while correlation is entirely neglected; the KS method includes both effects albeit
approximately. However, the formal similarity of the KS method to HF implies that
DFT calculations, although by construction covering correlation effects, are still
computationally robust and show a favorable scaling with the system size. Moreover,
within the KS formulation the chemically attractive concept of molecular orbital
survives in DFT. The KS orbitals are self-consistent with F̂_KS (which covers electron
correlation); thus, being effectively correlated, they are regarded as superior to HF
orbitals.
Practical use of DFT critically relies on the approximation of the exchange-correlation
functional in (7). Typically, this functional is split into exchange (E_x)
and correlation (E_c) functionals (where, of course, E_xc = E_x + E_c), and each of
them is approximated separately. A number of approximations, so-called approximate
functionals, are known and implemented in QC programs. An excellent review
of this subject can be found elsewhere [118, 119], while only the most important
facts are summarized here. The simplest functionals assume that the contribution of
each point to the exchange or correlation functionals is a function dependent solely
on the value of electronic density at that point:

\[ E_{x,c}^{\mathrm{LDA}}[\rho] = \int d^3r\; e_{x,c}^{\mathrm{LDA}}\big(\rho(\mathbf{r})\big). \tag{11} \]

Such simple functionals, known as the local density approximation (LDA), can be modeled
by referring to the physics of a homogeneous electron gas. More complicated generalized
gradient approximation (GGA) functionals employ an integrand function
depending not only on the density at a given point, but also on the gradient of the density,
in order to account for the inhomogeneity of the electron gas:

\[ E_{x,c}^{\mathrm{GGA}}[\rho] = \int d^3r\; e_{x,c}^{\mathrm{GGA}}\big(\rho(\mathbf{r}), |\nabla\rho(\mathbf{r})|\big). \tag{12} \]

The so-called meta-GGA functionals take an even more complicated form, in which
the integrand function depends not only on ρ and |∇ρ|, but also on the Laplacian of
the electronic density (∇²ρ) or the kinetic energy density.
The integrand for LDA exchange, e_x^LDA, is so simple that it can be deduced from
first principles (by considering the scaling relations for the exchange energy), and
the integrand for LDA correlation, e_c^LDA, can be parametrized (nearly exactly) based
on Monte Carlo simulations of a homogeneous electron gas. In contrast, the precise
form of the integrand functions for GGA and meta-GGA functionals is not known
and cannot be derived in a systematic way from first principles. Nonetheless,
scaling relations and other known properties of exact functionals provide many clues
in this regard, facilitating the judicious construction of physically reasonable approximations
to E_x and E_c [119]. The approximate functionals created in this way are normally
abbreviated using the names of the authors and the year of publication. For instance,
the B or B88 symbol refers to the exchange functional proposed by Becke in 1988 and P86 to
the correlation functional created by Perdew in 1986; a complete exchange-correlation
functional being a combination of both these parts is labeled BP86. The symbol O
(e.g., in the OLYP functional) refers to the exchange functional by Handy and Cohen,
who called it an “optimized exchange”.
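
To make the structure of an LDA functional more tangible, the sketch below evaluates the exchange part of Eq. (11) for a simple model density. It assumes the standard spin-unpolarized Dirac/Slater form of the integrand, e_x^LDA(ρ) = −(3/4)(3/π)^(1/3) ρ^(4/3) in atomic units, which is not written out in the text above, and it uses the 1s density of the hydrogen atom purely as a convenient analytic test case (a real calculation on an open-shell atom would use the spin-polarized variant). The trapezoidal radial grid and its parameters are arbitrary choices of this sketch.

```python
import math

# Prefactor of the spin-unpolarized LDA (Slater/Dirac) exchange, atomic units.
C_X = 0.75 * (3.0 / math.pi) ** (1.0 / 3.0)


def lda_exchange_integrand(rho: float) -> float:
    """Exchange energy per unit volume as a function of the local density only."""
    return -C_X * rho ** (4.0 / 3.0)


def hydrogen_1s_density(r: float) -> float:
    """Electron density of the hydrogen 1s orbital at distance r (bohr)."""
    return math.exp(-2.0 * r) / math.pi


def lda_exchange_energy(density, r_max: float = 30.0, n_steps: int = 30000) -> float:
    """Integrate the LDA integrand over all space for a spherical density (trapezoidal rule)."""
    h = r_max / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        r = i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * 4.0 * math.pi * r * r * lda_exchange_integrand(density(r))
    return total * h


print(f"E_x(LDA) for the 1s model density: {lda_exchange_energy(hydrogen_1s_density):.4f} hartree")
```

For this density the integral can also be done analytically, giving −(27/64)·C_X·π^(−1/3), about −0.213 hartree, which the numerical quadrature reproduces. A GGA functional would differ only in that the integrand additionally receives |∇ρ| at each grid point.
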
The functionals mentioned so far are all local in the sense that the integrand of
the exchange and correlation energy depends on the electronic density, its gradient, and
possibly higher derivatives only at a given point. In contrast, the exact expression for
the exchange energy, known from HF theory, cannot be written in such a way. The exact
exchange may thus be regarded as a nonlocal functional of the electron density. Mixing
of the exact (nonlocal) and approximate (local) exchange functionals is the underlying
idea of hybrid functionals, such as the famous three-parameter B3LYP functional:

\[ E_{xc}^{\mathrm{B3LYP}} = a E_x^{\mathrm{exact}} + (1-a) E_x^{\mathrm{S}} + b \left( E_x^{\mathrm{B88}} - E_x^{\mathrm{S}} \right) + c E_c^{\mathrm{LYP}} + (1-c) E_c^{\mathrm{VWN}}, \tag{13} \]

with a = 0.20, b = 0.72, c = 0.81,

where E_x^S is the Slater exchange functional (LDA), E_c^VWN is the Vosko-Wilk-Nusair
correlation functional (LDA), E_x^B88 is the Becke 1988 exchange functional (GGA), and
E_c^LYP is the Lee-Yang-Parr correlation functional (GGA). Due to the importance and the
great impact of hybrid functionals, the functionals not containing the exact exchange
(in particular GGA functionals) are often referred to as non-hybrid or pure functionals.
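
The mixing prescribed by Eq. (13) is simple arithmetic once the individual energy components are available, as the following sketch shows. The component energies used in the example are hypothetical numbers (in hartree) chosen only for illustration; in a real calculation they would all be evaluated for the same density and orbitals by a quantum chemistry program.

```python
def hybrid_xc_energy(e_x_exact: float, e_x_slater: float, e_x_b88: float,
                     e_c_lyp: float, e_c_vwn: float,
                     a: float = 0.20, b: float = 0.72, c: float = 0.81) -> float:
    """Three-parameter mixing of Eq. (13); the defaults are the B3LYP parameters."""
    exchange = a * e_x_exact + (1.0 - a) * e_x_slater + b * (e_x_b88 - e_x_slater)
    correlation = c * e_c_lyp + (1.0 - c) * e_c_vwn
    return exchange + correlation


# Hypothetical component energies (hartree), for illustration only.
components = dict(e_x_exact=-8.95, e_x_slater=-8.10, e_x_b88=-8.98,
                  e_c_lyp=-0.36, e_c_vwn=-0.66)

e_b3lyp = hybrid_xc_energy(**components)
# Reducing the exact-exchange admixture to a = 0.15 while keeping b and c fixed
# (an assumption of this sketch for a B3LYP*-like reparametrization).
e_reduced = hybrid_xc_energy(**components, a=0.15)
print(e_b3lyp, e_reduced, e_reduced - e_b3lyp)
```

Setting a = 0 and b = c = 1 in this expression recovers the pure GGA combination E_x^B88 + E_c^LYP, which makes explicit that a alone controls the admixture of exact exchange.
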
Although Eq. (13) involves three empirically fitted parameters (a, b, c), the mixing
parameter a is the most important one. Perdew and coworkers gave a physical
meaning to this parameter (by using perturbation theory arguments based on the
concept of adiabatic connection) and argued why values of a ≈ 1/5–1/4 provide
a reasonable compromise for typical molecules [120]. In fact, another hybrid
functional (PBE0) employs a = 1/4, as obtained from these theoretical considerations
and not refined any further against experimental data. However, as will be shown later
in this chapter, a correct description of many transition metal systems requires a much
smaller fraction of exact exchange, e.g., a = 0.15 in a reparametrized version of
B3LYP, known as B3LYP* [141]. The key observation of Perdew et al. [120], that
the optimum amount of exact exchange should depend on the character of
electron correlation in the system, provides the underlying idea for local hybrid functionals,
in which one global mixing parameter is replaced by a local mixing function;
a related concept is used in range-separated hybrid functionals, in which different
admixtures of exact exchange are used to describe the short- and the long-range part
of the electronic repulsion [63]. Grimme developed double-hybrid functionals which (in
addition to HF exchange) also contain an admixture of MP2 correlation energy for
a better balance of correlation effects [51].
The appearance of hybrid functionals in the mid-1990s is considered a great milestone
in the development of DFT and its perception by computational chemistry [60,
83]. While LDA was, in general, not accurate enough for chemical calculations and
the GGA functionals came reasonably closer to chemical accuracy, the hybrid func-
tionals turned out to be much more reliable. When tested against the G3/99 set of
reference atomization energies, the B3LYP functional has a mean absolute error of
4.9 kcal/mol and TPSSh (a hybrid meta-GGA functional) of only 3.9 kcal/mol (cf.
Ref. [83]). These inaccuracies should be compared with the much larger and very systematic
errors of HF (211.5 kcal/mol, underbinding), LDA (121.5 kcal/mol, overbinding),
and the GGA functional PBE (22.2 kcal/mol). The hybrid functionals thus
already come close to the desired chemical accuracy of about 1–3 kcal/mol, which can be
achieved in ab initio calculations only with great computational effort.
Despite the undeniable success of DFT, there are still many problems with its accuracy,
both in main-group chemistry and (more seriously) in transition metal chemistry,
strongly pointing in favor of ab initio methods. The errors of approximate functionals
for the G3/99 dataset quoted above correspond to mean errors, while the maximum
errors are significantly larger. Moreover, this dataset does not contain transition metal
species, for which the DFT errors can be larger and certainly less systematic. Many
of these problems are rooted in the description of nondynamical left–right correlation
energy, seemingly included in approximate exchange functionals (LDA, GGA), albeit
rather unintentionally and thus imperfectly. Mixing these exchange functionals with
exact (HF) exchange can be regarded as balancing the description of nondynamical
correlation, but this solves the problem only partially in the presently available hybrid
functionals.
In an attempt to overcome the limitations of traditional DFT methods, new approaches
are still being actively developed, including functionals heavily parametrized
on experimental data [203–205] and new ideas in the description of electron correlation,
such as double-hybrid functionals (see above) and the DFT + U approach adopted from
solid state physics [86, 155]. Although some recent functionals successfully
eliminate many problems appearing in traditional DFT calculations, no single functional
that systematically outperforms “the good old” B3LYP for a range of molecules,
chemical processes and properties can be recommended yet.
Recently, major progress has been made in the description of dispersion (London)
interactions, which are not yet correctly accounted for by most approximate
exchange–correlation functionals [60, 83]. Here, the semiempirical dispersion correction by
Grimme et al., known in the literature as the DFT-D approach [50, 52], has proved
particularly simple and successful. The underlying idea is to supplement a standard
exchange–correlation functional with a damped −C6 R^−6 interatomic potential, which
should appropriately describe dispersive interactions upon a suitable parametrization,
the most recent and the most consistent being DFT-D3 [53].
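
The idea of a damped pairwise −C6 R^−6 correction can be sketched in a few lines. The example below uses a Fermi-type damping function of the kind employed in early DFT-D variants; the steepness d = 20, the scaling factor s6, and all C6 and R0 values are illustrative assumptions of this sketch rather than an actual parametrization. DFT-D3 in particular uses a more elaborate form (including higher-order terms and environment-dependent coefficients) that is not reproduced here.

```python
import math


def damping(r: float, r0: float, d: float = 20.0) -> float:
    """Fermi-type damping: close to 0 for r << r0 and close to 1 for r >> r0."""
    return 1.0 / (1.0 + math.exp(-d * (r / r0 - 1.0)))


def dispersion_energy(coords, c6, r0, s6: float = 1.0) -> float:
    """Sum of damped -C6/R^6 terms over all atom pairs.

    coords: list of (x, y, z) tuples; c6[i][j] and r0[i][j]: pair parameters.
    All quantities are assumed to be given in mutually consistent units.
    """
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            energy -= s6 * c6[i][j] / r ** 6 * damping(r, r0[i][j])
    return energy


# Two hypothetical atoms with made-up pair parameters (arbitrary units).
c6 = [[0.0, 1.0], [1.0, 0.0]]
r0 = [[0.0, 3.0], [3.0, 0.0]]
for separation in (1.0, 3.0, 10.0):
    coords = [(0.0, 0.0, 0.0), (0.0, 0.0, separation)]
    print(separation, dispersion_energy(coords, c6, r0))
```

At large separations the correction approaches the undamped −C6/R^6 limit, whereas at short range it is switched off, so that it does not interfere with the region already described by the exchange–correlation functional.
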

2.4 Molecular Models

Two factors determine the overall quality of a given quantum chemical calculation:
not only the accuracy of the applied computational method, but also the adequacy
of the molecular model used. The choice of a reliable (i.e., sufficiently large) model
may thus be very important for an accurate description of enzymatic active sites. However,
as mentioned above, the computational cost of correlated ab initio methods
mostly prevents their application to large systems. While DFT methods can
nowadays be applied to very large models (consisting of up to several hundred atoms),
CASSCF/CASPT2 calculations can be performed for mononuclear complexes with
up to ∼50 atoms; CCSD(T) calculations are feasible only for models about half that size
(∼15 to 25 atoms). This means that the CASSCF/CASPT2 method is suitable for studying
iron complexes with a porphyrin group, but the CCSD(T) method can only be applied
to small mimics of the heme systems, in which the porphyrin is truncated to amidine
or other N-donor chelating ligands (see, e.g., Ref. [107]). Although these models are
772 M. Radoń and E. Broclawik

clearly oversimplified for direct comparison with experiment, they are still useful for
benchmarking DFT or CASPT2 against CCSD(T) [189].
The use of relatively small models (indispensable for performing efficient ab
initio calculations) may be partly justified by the fact that the electronic properties of
transition metal centers in enzymatic active sites depend predominantly on the nearest
neighborhood of the metal and to a smaller extent on the distal groups (without denying their
overall importance, see below). Thus, the basic features of the electronic structure
and properties of active sites in heme proteins can be modeled by iron–porphyrin
complexes with the axial ligand(s) reflecting the iron ligation state in the active
site. This choice is an example of a minimal cluster approach to enzyme
modeling (extendable to larger clusters when necessary) [168]. In the simplest, yet
still valuable, models the porphyrin may be considered without the side substituents,
i.e., as a porphin ligand, abbreviated as P; the missing side chains often do
not exert a direct effect on the electronic structure of heme [74]. However, even if
truncating them is a routine approximation, one should be aware that these groups
may cause steric hindrance by which the protein can indirectly modulate the properties
of heme. Moreover, the propionate side chains of protoheme IX are known in some cases
to modulate the electronic structure and actively participate in electron delivery to
the iron center [55]. The axial ligands may also be truncated: e.g., the cysteine ligand
in cytochrome P450 is often modeled as a thiolate, SCH$_3^-$ or even SH$^-$; the histidine
ligand in myoglobin can be modeled as imidazole (Im), substituted imidazole or
even an ammonia molecule (NH$_3$) [90, 138]. Many examples of such model heme
complexes appear in Sect. 3. The geometry of the complexes is usually optimized at
the DFT level and used in subsequent ab initio calculations. Symmetry is often imposed
on the structures of the model complexes to accelerate the calculations and simplify
their interpretation.
The calculations for the model complexes in the gas phase, even if clearly valuable,
of course miss important effects due to the protein environment. If a distal
amino acid group has a direct influence on the studied property (e.g., by forming
a hydrogen bond), it should preferably be included in the cluster model (cf.
Sect. 3.2). A different effect is the polarity of the enzyme bulk, which can be simulated
by means of continuum solvation models (like PCM or COSMO, implemented in
many QC programs). An important parameter in these models is the effective value
of the dielectric constant (ε) of the environment, corresponding to the interior of an
enzyme. Although ε = 5.7 is a frequent choice for enzymes, their “solvation effect”
is known to converge rather quickly with the size of the molecular model, meaning that
the actual value of ε is often not so important [168]. However, the effects of the protein
environment can be most accurately treated in the QM/MM approach. In QM/MM
the most relevant part of the system (corresponding to the cluster or model complex
in the previous approach) is described at the quantum mechanics (QM) level; the QM
part is surrounded by the environment described at the molecular mechanics (MM) level.
A great advantage of QM/MM is that it accounts for polarization of the QM wave function
by the electrostatic field from the MM part. Moreover, this approach can describe
the effect of the enzyme on the structure of the active site. In most of the QM/MM calculations
performed, “QM” actually stands for the DFT (B3LYP) method, but there are
Electronic Properties of Iron Sites … 773

also a few studies reported in Sect. 3 in which ab initio methods were used as the
QM layer. More about various approaches to modeling of metalloenzymes and their
reactions can be found in the next chapter.

3 Case Studies

3.1 Spin State Energetics

A unique feature of many transition metal sites is their spin state isomerism: they can
adopt several spin states (with different numbers of unpaired electrons in the metal
nd orbitals) lying close in energy. This is especially apparent for first-row (3d)
transition metal species, and for iron species in particular. Both ferrous (Fe(II), $3d^6$)
and ferric (Fe(III), $3d^5$) complexes can exist in three different spin states: the low-spin
(singlet for Fe(II) or doublet for Fe(III)), the intermediate-spin (triplet or quartet)
and the high-spin (quintet or sextet). Likewise, two (or more) spin states can also be
expected in the case of high-valent ferryl species, i.e., those containing an iron-oxo (Fe=O)
group with a formal +IV oxidation state on Fe ($3d^4$).
Which of the possible spin states is actually the ground state for a given complex
results from a delicate balance between two counteracting factors: the splitting
of the nd orbitals by their interaction with the ligands (larger splitting favors electron
pairing) and the exchange interaction between like-spin electrons (which reduces the
electron–electron repulsion if more electrons are unpaired). In “typical molecules”
(e.g., organic or simple inorganic species), close to their equilibrium geometry, the
ground state usually contains a minimal number of unpaired electrons: either no
unpaired electrons (and a singlet ground state) for closed-shell molecules or one
unpaired electron (and a doublet ground state) for free radicals. This is because the
energy splitting between their highest occupied and lowest virtual molecular
orbitals is sufficiently large. However, in transition metal species the molecular
orbitals with predominant metal nd character are often close to degeneracy. In such
a case it may be energetically preferable to promote a certain number of electrons
from lower- to higher-energy orbital(s), and thus to increase the number of unpaired,
like-spin electrons in the system (because the higher spin state benefits from larger
exchange stabilization, i.e., smaller electron–electron repulsion, cf. Sect. 2.1).
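The competition between orbital splitting and electron pairing can be made concrete with a deliberately simplified sketch. It assumes the textbook octahedral crystal-field picture for a $d^6$ ion ($t_{2g}$ orbitals at −0.4Δ, $e_g$ orbitals at +0.6Δ, one mean pairing energy P per doubly occupied orbital), which is not the tetragonal ligand field of a heme site and has no intermediate-spin state. The function names and the numerical values are hypothetical.

```python
def configuration_energy(n_t2g, n_eg, delta, pairing):
    """Energy of a t2g^a eg^b configuration in a simple octahedral crystal-field model.

    delta   : ligand-field splitting between the t2g and eg sets
    pairing : mean pairing energy per doubly occupied orbital (lumps together
              the extra Coulomb repulsion and the lost exchange stabilization)
    """
    # Electrons beyond one per orbital have to pair up
    n_pairs = max(0, n_t2g - 3) + max(0, n_eg - 2)
    return n_t2g * (-0.4 * delta) + n_eg * (0.6 * delta) + n_pairs * pairing


def d6_ground_state(delta, pairing):
    """Return the favored spin state and E(LS) - E(HS) for an octahedral d6 ion."""
    low_spin = configuration_energy(6, 0, delta, pairing)   # t2g^6, S = 0
    high_spin = configuration_energy(4, 2, delta, pairing)  # t2g^4 eg^2, S = 2
    label = "LS" if low_spin < high_spin else "HS"
    return label, low_spin - high_spin


# Hypothetical splittings at a fixed pairing energy (same arbitrary units)
for delta in (10.0, 20.0, 30.0):
    label, gap = d6_ground_state(delta, pairing=20.0)
    print(f"delta = {delta:5.1f}: ground state {label}, E(LS) - E(HS) = {gap:+.1f}")
```

In this model E(LS) − E(HS) = 2P − 2Δ, so the low-spin state wins only once the splitting exceeds the pairing energy. A quantitative treatment of real heme models requires the correlated methods discussed in this chapter.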
As a rule of thumb, a high-spin ground state is expected for metal complexes
with relatively small splitting of the nd orbitals (weak ligand fields). In contrast, in
complexes with significant splitting of the metal nd orbitals (stronger ligand fields)
an intermediate- or low-spin state has lower energy, and occasionally becomes the
ground state. Iron sites in non-heme proteins (with amino acid ligands only) usually have
a high-spin (HS) ground state, while the intermediate-spin (IS) and low-spin
(LS) states lie too high in energy to be accessible at ambient temperatures. In
contrast, the porphyrin ligand gives rise to a larger splitting of the Fe 3d orbitals, by
strongly destabilizing the one pointing directly at the four N atoms of porphyrin
($3d_{xy}$). As such, the IS and LS states are stabilized (with respect to the HS state) and
placed relatively close in energy, which is believed to be crucial for many biological
functions of iron porphyrin systems [74, 172, 197].
As the relative energy and even the ordering of the spin states is strongly depen-
dent on the coordination environment, the ground spin state may change in the course
of biochemical reactions, frequently proceeding with attachment or release of a lig-
and from a metal coordination sphere. In fact, a number of biologically relevant
transformations involve a change of spin state on the metal. These processes pro-
ceed by crossing from the energy surface of one spin state (the ground spin state
of the reactants) to another one (the ground spin state of the products). Therefore
their reaction energies (thermodynamics) as well as activation energies (kinetics) can
be dominated by relative energy of the two involved spin states [132]. Shaik et al.
recently outlined the importance of exchange stabilization of the transition state for
many reactions proceeding on d-electron metal sites—which can considerably favor
the high-spin channel (the concept of exchange-enhanced reactivity) [166]. All these
arguments show that spin state energetics is a very important issue for understanding
the properties and reactivity of transition metal species.
Unfortunately, spin state energetics for many interesting systems (including some
heme species) cannot be directly obtained from experiment. The same obviously
holds true for the relative energies of spin states along the reaction pathways. Therefore,
much effort has been put into theoretical calculations of spin state energetics.
However, although the qualitative principles governing the relative energies of different
spin states (i.e., competition between a tendency to occupy the lower-lying MOs
and a tendency to maximize the number of exchange interactions) may seem intuitive,
quantitative computational prediction of the energy splitting between spin states
turns out to be surprisingly challenging. Notable difficulties are met both in DFT
calculations, in which spin state energetics is functional-dependent (highly variable
from one functional to another) and thus often inconclusive, and in correlated ab
initio methods, where a very high level of theory and a flexible basis set have to be
applied in order to obtain meaningful results [31, 42, 60]. Iron sites in heme and
heme-like coordinations are no exception to this general rule. In fact, the experience
gathered for these systems clearly indicates that they might be among the most
difficult problems for computational treatment.
As we shall see below, a remarkable example of these challenges is already provided by a basic
motif of many heme enzymes, an iron(II) porphyrin. This four-coordinated ferrous
porphyrin (with either a tetraphenylporphin or an octaethylporphin ring, but with no axial
ligands) has an experimentally established triplet (i.e., IS) ground state [27, 45, 81,
103]. In contrast, five-coordinated complexes in which iron(II) is axially ligated
by an N-donor imidazole ligand have a quintet (i.e., HS) ground state [66, 67], as do
the ferrous sites in deoxymyoglobin/deoxyhemoglobin (where Fe is axially
coordinated by the imidazole ring of the proximal histidine) and their functional
models [39, 44, 80, 98]. However, if iron(II) is coordinated by two such imidazole
(or histidine) ligands, the ground state changes to a singlet (i.e., LS), as in six-coordinated
heme sites involved in electron transfer processes [173]. The LS ground
state is also characteristic of six-coordinated complexes obtained by coordination of
CO, NO, or O2 diatomic molecules to the ferrous heme groups (see Sect. 3.2). By and
large, in ferrous heme groups all three possible spin states of Fe(II) can become
the ground state, depending on the number (and character) of the axial ligands.
As discussed in Sect. 2.4, the ferrous heme sites invoked here can be modeled
as FeP (four-coordinated Fe), FeP(Im) (five-coordinated Fe) and FeP(Im)$_2$ (six-coordinated
Fe) complexes. Figure 2 shows the principal electronic configurations for
the relevant spin states of the FeP and FeP(Im) models, which have been extensively studied
by means of DFT and CASSCF/CASPT2 calculations (see below). The orbital
occupancies shown in this figure stem not only from theoretical calculations, but
may also be extracted from the interpretation of Mössbauer, magnetic resonance (EPR,
ENDOR) or Raman spectra (see references above). In the case of FeP, neither experimental
results nor theoretical calculations are fully conclusive with regard to the precise
identity of the lowest triplet state: it can be either $^3E_g$ or $^3A_{2g}$ depending on the
method used; both these triplet states are shown in Fig. 2 (see footnote 2).
Figure 2 shows that the degeneracy of the five Fe 3d orbitals is removed
by the ligands. In FeP, due to its high $D_{4h}$ symmetry, the degeneracy of Fe $3d_{xz}$ and
$3d_{yz}$ is retained, but it is removed in the less symmetric FeP(Im). The Fe $3d_{xy}$ orbital
points directly at the porphyrin N atoms and is therefore destabilized most
strongly in both systems. Analogously, the Fe $3d_{z^2}$ orbital is destabilized by the axial
imidazole in FeP(Im). In fact, going back to Sect. 2.2.2, one should rather say that
these two orbitals are involved in covalent, σ-type bonding interactions with the P and
Im nitrogens, but since this mixing of metal- and ligand-based contributions is not
very large in this case, it is customary to denote the antibonding orbitals as Fe $3d_{xy}$
or $3d_{z^2}$. In contrast, Fe $3d_{x^2-y^2}$ is an essentially nonbonding, iron-centered orbital. The
same essentially holds true for the remaining two Fe $3d_{xz,yz}$ orbitals, although they
have appropriate symmetry for some mixing with vacant porphyrin π orbitals.
Although the ground spin state can be identified experimentally for these ferrous
porphyrin systems (based on their Fe−N bond lengths from the crystal structures,
from the interpretation of magnetic properties and spectra, etc.), little is known about
the energetics of their excited spin states. Such information is available for the ferric heme
site in cytochrome P450, where the LS (doublet) and the HS (sextet) states of Fe(III)
lie so close in energy that their spin equilibrium is observed [174]. The ferric site
of P450 is an iron(III) porphyrin complex with an axially coordinated cysteine (Cys),
which corresponds to the FeP(SH) model used in many theoretical calculations. In the
resting state a water molecule is additionally coordinated to Fe (as the sixth ligand)
and the LS state is slightly favored (with a low-lying HS state). In contrast, if this
water is removed (e.g., by a P450 substrate bound in the distal pocket), the ferric
complex becomes HS (with a low-lying LS state). Spin state energetics can be obtained
experimentally for spin-crossover systems, like the one just invoked, where two
(or more) spin states lie so close in energy that their relative populations can be varied

2 The reader should also notice that the LS state is not shown for FeP in Fig. 2. The LS state
discussed below for this complex is a closed-shell singlet, $(d_{x^2-y^2})^2(d_{xz},d_{yz})^4$, which has an
electronic structure analogous to the singlet state in five- [FeP(Im)] and six-coordinated heme species.
However, this is not the lowest singlet state of FeP, the latter being instead an open-shell singlet,
$(d_{x^2-y^2})^2(d_{z^2})^2(d_{xz},d_{yz})^2$ [12, 136].

Fig. 2 Scheme of orbital occupancies in the principal configuration for (a) the S = 1, 2 spin states
of FeP and (b) the S = 0, 1, 2 spin states of FeP(Im), along with (c) the orientation of these models
in the coordinate system. Symmetry labels are given for the electronic states in accord with the
symmetry group of the models ($D_{4h}$ for FeP, $C_s$ for FeP(Im)). The splitting of the d orbitals is shown
only schematically and does not reflect their actual orbital energies

considerably by changing the temperature [56, 186]. In this way, an estimate of
$\Delta G_{\mathrm{HS \to LS}}$ = 1.6 kcal/mol was obtained for the ferric state of P450 (without the axial
water ligand) [87, 174].
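To give a feeling for what a free energy difference of this size means, the sketch below converts it into equilibrium populations of the two spin states. It assumes a simple two-state Boltzmann equilibrium, reads the quoted value as the LS state lying 1.6 kcal/mol above the HS state, and treats the free energy difference as temperature-independent, which is a simplification. The function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def state_fractions(delta_g_kcal, temperature=298.15):
    """Equilibrium fractions of two states separated by a free energy difference.

    delta_g_kcal is G(upper) - G(lower) in kcal/mol; degeneracy factors are
    assumed to be contained in the free energies already.
    Returns (fraction_lower, fraction_upper).
    """
    ratio = math.exp(-delta_g_kcal / (R_KCAL * temperature))  # [upper]/[lower]
    return 1.0 / (1.0 + ratio), ratio / (1.0 + ratio)


# Assumed reading of the quoted value: LS lies 1.6 kcal/mol above HS
for t in (200.0, 298.15, 350.0):
    hs, ls = state_fractions(1.6, t)
    print(f"T = {t:6.1f} K: HS fraction {hs:.3f}, LS fraction {ls:.3f}")
```

Under these assumptions roughly 6% of the sites are in the LS state at room temperature, which is why both states are observable in a spin equilibrium.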
Numbers like the one above, obtained experimentally from spin equilibria,
correspond of course to the free energy difference between the two spin states. This is
not the same as the electronic energy difference typically obtained from quantum chemical
calculations. The difference between these two quantities lies mainly in the different
zero-point and thermal vibrational energies, and vibrational entropies, for the spin
states being compared [117]. Consequently, the free energy of the HS state with
respect to the LS state is most typically lowered by a few kcal/mol as compared to
the purely electronic energy difference. If important for an accurate comparison of theory
with experiment, these vibrational corrections to the free energy can be easily and credibly
modeled on the basis of DFT-computed frequencies [117, 138]. However, it is much
more challenging and usually more important to obtain a correct prediction of the
electronic energy difference.
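The vibrational correction mentioned above can be sketched within the harmonic-oscillator approximation, in which each mode contributes its zero-point energy plus a thermal free energy term. The frequencies and the electronic gap below are hypothetical placeholders, not computed values for any heme model, and electronic (spin-degeneracy), rotational and translational contributions are left out of this sketch.

```python
import math

CM1_TO_KCAL = 2.859144e-3  # 1 cm^-1 expressed in kcal/mol
HC_OVER_KB = 1.438777      # hc/k_B in cm K
R_KCAL = 1.987204e-3       # gas constant in kcal/(mol K)


def vibrational_free_energy(freqs_cm1, temperature=298.15):
    """Harmonic vibrational free energy (zero-point + thermal part) in kcal/mol."""
    total = 0.0
    for nu in freqs_cm1:
        x = HC_OVER_KB * nu / temperature  # h*nu / (k_B*T)
        total += 0.5 * nu * CM1_TO_KCAL    # zero-point energy
        # thermal part: R*T*ln(1 - exp(-x)), always negative
        total += R_KCAL * temperature * math.log1p(-math.exp(-x))
    return total


def free_energy_gap(delta_e_el, freqs_initial, freqs_final, temperature=298.15):
    """Free energy difference = electronic gap + change in vibrational free energy."""
    return (delta_e_el
            + vibrational_free_energy(freqs_final, temperature)
            - vibrational_free_energy(freqs_initial, temperature))


# Hypothetical metal-ligand modes: softer in the HS state than in the LS state
ls_modes = [450.0, 420.0, 380.0]
hs_modes = [300.0, 280.0, 250.0]
delta_e = 3.0  # hypothetical electronic LS->HS gap, kcal/mol
print(f"Free energy LS->HS gap: {free_energy_gap(delta_e, ls_modes, hs_modes):.2f} kcal/mol")
```

Because metal–ligand vibrations are typically softer in the HS state, the vibrational term lowers the HS state relative to the LS state, in line with the trend described above.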
Table 1 summarizes purely electronic spin promotion energies for ferrous [FeP,
FeP(Im)] and ferric [FeP(SH)] heme models obtained with various computational
methods. The table gives electronic energy differences between the ground state and
the other two spin states for each of the considered complexes, computed at the optimum
geometries of the spin states (i.e., adiabatic energies). A number of papers have investigated
the spin state energetics of the heme models at the DFT level; Table 1 includes only
the most recent results [136, 189], which agree well with the older ones [85, 90,
151, 180, 187]. The most evident conclusion drawn from the DFT results shown in
Table 1 is that they are strongly functional-dependent, with the exchange part of
the functional seeming to matter most. In summary, the non-hybrid functionals (e.g.,
BP86, PBE) favor the low- and intermediate-spin states with respect to the high-spin
state, whereas hybrid functionals (B3LYP*, B3LYP, PBE0) behave in the opposite
way. A more detailed analysis (not shown) would reveal that the lowest triplet state
found by non-hybrid functionals is $^3E_g$, while the hybrid functionals (like the ab

Table 1 Relative spin state energetics (kcal/mol) for selected heme models with respect to their
experimental ground states: ³IS (triplet) for Fe(II)P, ⁵HS (quintet) for Fe(II)P(Im), and ⁶HS
(sextet) for Fe(III)P(SH), calculated at the DFT level (with various choices of exchange-correlation
functional) and at the ab initio (CASPT2) level

                 Fe(II)P                 Fe(II)P(Im)             Fe(III)P(SH)
Method       ³IS→⁵HS    ³IS→¹LS      ⁵HS→³IS    ⁵HS→¹LS      ⁶HS→²LS    ⁶HS→⁴IS
PBE0           0.8a      36.6a         4.1a      14.3a        11.0        7.0
B3LYP          5.7a      34.4a        −0.8a       7.2a         3.8b       2.0b
B3LYP*         8.6       34.3         −3.6b       1.1b        −1.2b      −1.0b
OLYP           7.7a      36.3a        −2.1a       5.1a         1.9b       2.6b
BP86          16.7a      35.9a       −11.3a     −11.6a       −16.3b      −9.5b
TPSS          18.2       34.4        −12.1b     −13.1b       −15.8b      −8.2b
TPSSh         11.7       34.8         −6.0b      −2.5b        −6.1b      10.3b
M06           −9.5       31.6         10.7b      17.7b        24.1b      14.9b
M06-L          1.6       29.7          4.8b       5.6b        14.5b      12.0b
CASPT2        −5.2a      35.2a         9.0a      13.6a         3.6b       7.5b
               0.9c,†    37.8c,†

a Ref. [136]
b Ref. [189]
c Ref. [190]
Other values were obtained with the larger basis set (B) used in Ref. [136] for the
structures optimized at the B3LYP/def2-TZVP level
† Based on the extended (16in15) active space additionally containing Fe 3s, 3p orbitals (see text)

initio calculations, discussed below) point to $^3A_{2g}$. Nonetheless, both triplet
states of FeP are in all cases close to degeneracy. Interestingly, the OLYP functional
behaves differently from classical non-hybrid functionals (BP86), yielding
results often closer to those of hybrid functionals. This outstanding performance is attributed
to the exchange part of this functional, OPTX [58]. It is also useful to notice that
TPSS, a meta-GGA functional, behaves for this type of problem very much like
an ordinary GGA (BP86) [137]. It is also quite intriguing that (at least for ferrous
systems) the HS–LS and the HS–IS energy differences are more sensitive to the
choice of functional than the IS–LS one. It seems that transitions involving an electron
promotion from the Fe $3d_{x^2-y^2}$ (nonbonding) to the Fe $3d_{xy}$ orbital (with antibonding
Fe–N(porphyrin) character) are more sensitive to the choice of exchange functional than
other types of spin transitions, like those between the IS and LS states in the ferrous
complexes [136, 138].
Even if for FeP the true (triplet) ground state is correctly recovered in all DFT
calculations (except M06), the spin promotion energies differ considerably from one
functional to another, and (due to the lack of an experimental IS→HS promotion energy)
it is virtually impossible to tell which functional performs best. For FeP(Im)
only some functionals recover the experimental quintet ground state (after accounting
for thermodynamics these are the functionals yielding a ⁵HS→³IS energy larger
than ∼−2 kcal/mol). For FeP(SH) the correct result is quasi-degeneracy of the ⁶HS
and ²LS spin states, which is properly recovered by the hybrid functionals (PBE0,
B3LYP, B3LYP*) and OLYP. Since DFT calculations of spin state energetics are
considerably functional-dependent, and therefore not conclusive, while experiment
does not provide sufficient information, there has been a strong drive to perform ab
initio calculations for the heme models [24, 25, 127, 136, 138, 189, 190].
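The statements above about Table 1 can be checked mechanically. The sketch below hard-codes the DFT rows of Table 1 and applies the two criteria stated in the text: a positive promotion energy out of the triplet for FeP, and a ⁵HS→³IS energy above about −2 kcal/mol for FeP(Im). The helper names and the exact threshold value are illustrative assumptions; functionals near the threshold (such as OLYP at −2.1 kcal/mol) are borderline rather than clear failures.

```python
# DFT entries of Table 1 (kcal/mol), columns in table order:
# FeP: IS->HS, IS->LS | FeP(Im): HS->IS, HS->LS | FeP(SH): HS->LS, HS->IS
TABLE1_DFT = {
    "PBE0":   (0.8, 36.6, 4.1, 14.3, 11.0, 7.0),
    "B3LYP":  (5.7, 34.4, -0.8, 7.2, 3.8, 2.0),
    "B3LYP*": (8.6, 34.3, -3.6, 1.1, -1.2, -1.0),
    "OLYP":   (7.7, 36.3, -2.1, 5.1, 1.9, 2.6),
    "BP86":   (16.7, 35.9, -11.3, -11.6, -16.3, -9.5),
    "TPSS":   (18.2, 34.4, -12.1, -13.1, -15.8, -8.2),
    "TPSSh":  (11.7, 34.8, -6.0, -2.5, -6.1, 10.3),
    "M06":    (-9.5, 31.6, 10.7, 17.7, 24.1, 14.9),
    "M06-L":  (1.6, 29.7, 4.8, 5.6, 14.5, 12.0),
}


def recovers_fep_triplet(row):
    """Triplet is the ground state if both promotions out of it cost energy."""
    return row[0] > 0.0 and row[1] > 0.0


def recovers_fepim_quintet(row, threshold=-2.0):
    """Criterion quoted in the text: HS->IS energy above about -2 kcal/mol."""
    return row[2] > threshold


def column_spread(col):
    """Range (max - min) of one table column across all listed functionals."""
    values = [row[col] for row in TABLE1_DFT.values()]
    return max(values) - min(values)


fep_ok = [name for name, row in TABLE1_DFT.items() if recovers_fep_triplet(row)]
fepim_ok = [name for name, row in TABLE1_DFT.items() if recovers_fepim_quintet(row)]
print("FeP triplet ground state recovered by:", ", ".join(fep_ok))
print("FeP(Im) quintet ground state recovered by:", ", ".join(fepim_ok))
print(f"Spread of FeP IS->HS: {column_spread(0):.1f} kcal/mol")
print(f"Spread of FeP IS->LS: {column_spread(1):.1f} kcal/mol")
```

The column spreads make the functional dependence explicit: for FeP the IS→HS energy varies far more across functionals than the IS→LS energy.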
Along this line, the second part of Table 1 shows selected CASSCF/CASPT2
results for the considered heme models [136, 189]. The choice of active space in these
calculations conformed to the standard rules for transition metal species (cf. Sect. 2.2.2; see footnote 3).
CASPT2 calculations correctly indicated the HS ground state in the case of FeP(Im) and
FeP(SH). Surprisingly, however, they failed to predict the correct (i.e., IS) ground
state for FeP, favoring the HS one by about 5 kcal/mol (∼7 kcal/mol when taking
ZPVE into account) [136]. Although this result may not seem particularly appealing,
it represents a substantial improvement over all preceding ab initio calculations of
the IS–HS energy gap in FeP. Indeed, in 1998–1999 Choe et al. obtained the
quintet state favored over the triplet state by 8.5 and 19.6 kcal/mol, based on their
3 Specifically, the active space for FeP was obtained by distributing the six d electrons of Fe(II) in its 3d
orbitals with an added double shell (4d), and a doubly occupied σ Fe–N(porphyrin) orbital to account for
the covalency of the iron–N(porphyrin) bonding. This selection led to an active space of 8 electrons
in 11 orbitals (8in11) for FeP. In the case of FeP(Im), this active space was augmented with a doubly
occupied σ Fe–N(Im) orbital to account for the covalency of the Fe–N(imidazole) bond, thus yielding
a total active space of 10 electrons in 12 orbitals (10in12) [136]. For the ferric model, the active
space was obtained by distributing the five electrons of Fe(III) in its 3d orbitals with an added double shell
(4d) and three doubly occupied orbitals: σ Fe–N(porphyrin) (which plays the same role as in the ferrous
complexes), together with σ Fe–S and π Fe–S (describing the covalency of the Fe–SH bond), yielding
a total active space of 11 electrons in 13 active orbitals (11in13) [189].

MRMP2 [24] and CASPT2 [25] calculations, respectively. Given that their calculations
unambiguously pointed to the HS ground state, these authors even suggested
that this could be the actual spin state of FeP and that the experiments had been misinterpreted
[25]. However, in 2003 Pierloot [127] identified the main source of error in the
previous calculations: the σ Fe–N(porphyrin) bonding orbital had not been made active
(which is necessary to describe nondynamical correlation effects associated with
the iron–porphyrin covalent bonding, see Sect. 2.2.2). Based on the “correct” active
space (8in11), she estimated the $^5A_{1g}$–$^3A_{2g}$ gap at 10.1 kcal/mol [127]. This result
was later refined by using a newer form of the zero-order Hamiltonian in CASPT2
(the so-called IPEA-shifted Hamiltonian, which addressed an imbalance in the
original formulation) and a larger basis set [136], providing the number quoted in
Table 1. The large role played by the σ Fe–N(porphyrin) orbital in the active space illustrates
well that spin state energetics in transition metal species has very much to do with
nondynamical (left–right) correlation connected with covalent metal–ligand bonds. In
fact, spin states which differ in the occupation of the antibonding metal–ligand
orbitals (here: σ* Fe–N(porphyrin), i.e., Fe $3d_{xy}$) contain different amounts of left–right
correlation. This is because the antibonding orbital, once it becomes singly occupied
in the HS state, can no longer serve as a correlating orbital for the electrons
paired in the bonding orbital, as it does in the IS/LS states.
One might ask whether the CASPT2 error in the IS–HS energy gap in FeP is
specific only to this system, or whether it carries over to the other heme models as well.
This question was addressed in a subsequent paper from the Leuven group [189],
in which the authors compared the behavior of CASPT2 (and DFT) calculations with
very accurate coupled cluster results for the small heme-like models shown in Fig. 3: the
first two roughly mimicking FeP(SH) and the third one FeP(Im). As explained in
Sect. 2.4, the CCSD(T) method is too expensive to be applied to real iron porphyrin
systems, but as the small models capture the main features of iron coordination in the
heme sites, they are suitable for validating the performance of computational methods
against CCSD(T) calculations. Based on the comparison for these small models,
Vancoillie et al. concluded that the CASPT2 calculations are in close agreement
with the CCSD(T) calculations for the ferric heme models, and thus can presumably
be trusted. However, as for FeP, CASPT2 seems to overstabilize the HS state
in the ferrous model by about 4–6 kcal/mol [189]. This effect is also similar to

Fig. 3 Small models of ferric (a, b) and ferrous (c) heme groups with chelating amidine ligands
(CN$_2$H$_3^-$, C$_3$N$_2$H$_5^-$) as mimics of the porphyrin ring, which were studied by Harvey and Olah [107]
at the CCSD(T) level and subsequently by Vancoillie et al. [189] at the CASPT2 level

the CASPT2 behavior for the LS→HS promotion in a spin crossover Fe(salen)NO
complex [138]. Although salen [i.e., N,N′-ethylenebis(salicylimine)] is a non-heme
ligand, this tetradentate ligand binds iron in a mode quite similar to that of porphyrin (with its two
O and two N atoms instead of four N atoms). Moreover, as in the ferrous heme
complexes, the ²LS→⁴HS spin conversion in Fe(salen)NO comes down to an electron
promotion from the nonbonding ($3d_{x^2-y^2}$) to the antibonding ($3d_{xy}$) Fe 3d orbital (cf. the
$^3A_{2g} \to {}^5A_{1g}$ transition in FeP or the $^3A \to {}^5A$ transition in FeP(Im), as shown in Fig. 2).
Thus, given the similarity of the electronic structure, the ²LS→⁴HS spin promotion
in the Fe(salen)NO complex can be regarded as very similar to analogous transitions in
heme models with the axial nitrosyl ligand (discussed in Sect. 3.2). From the spin-crossover
experiment, a purely electronic energy difference of 2.2 kcal/mol between
the LS (S = 1/2) and the HS (S = 3/2) state of Fe(salen)NO can be obtained [138],
whereas CASPT2 gives −4.9 kcal/mol, predicted with a basis set and active space
comparable to those used for the heme systems [138].
Even though the CASPT2 spin state energetics was found somewhat deficient for
the ferrous heme systems, this is by no means general: the error is not present for the
ferric heme models, nor for many other transition metal species for which CASPT2
performs very well [128, 129]. In fact, across the small ferrous and ferric models
studied in [189], CASPT2 provided better accuracy than any of the tested DFT
methods. Furthermore, it was pointed out that even the heavily parametrized Minnesota
functionals (M06, M06-L) [203–205] did not improve over traditional functionals
like B3LYP and OLYP, which (across the studied systems) led to more systematic
behavior with respect to CCSD(T) [189]. In fact, the M06-L functional is quite
accurate for the ferrous heme but (like M06) it does not behave correctly for
the ferric heme: the LS state is placed nearly 15 kcal/mol above the HS state, whereas
experiment indicates that both spin states should be close to degeneracy. In
addition, the aforementioned CASPT2 error for the spin crossover Fe(salen)NO
complex should be compared with the equally large (or even larger) errors of common
density functionals. Actually, only the OPTX-based functionals (e.g., OLYP) seem
to deal with this difficult case, providing an LS–HS gap close to experiment [29,
138].
In the most recent paper from the Leuven group, the FeP system was re-examined
in more detail [190]. It was shown that the triplet ground state ($^3A_{2g}$) can be correctly
recovered in CASPT2 calculations for FeP by augmenting the active space
with the iron semicore orbitals, i.e., 3s and 3p (providing the last CASPT2 number in
Table 1). Intershell correlation between the (3s,3p) and 3d orbitals of first-row
transition metals has long been recognized as important, but in most studies this
effect was treated only at the CASPT2 level, i.e., as purely dynamical correlation. The
result quoted above, however, suggests that it cannot be properly dealt with by
CASPT2 alone, meaning that the semicore 3s,3p orbitals should preferably be included in
the active space. However, making these four extra orbitals (3s, $3p_{x,y,z}$) active means
a significant enlargement of the active space, which (for species more complicated
than FeP) may quickly become too large for performing CASSCF/CASPT2 calculations.
Fortunately, Pierloot et al. have also demonstrated that the effect of (3s,3p)–3d
correlation in FeP can be adequately treated at the less expensive RASSCF/RASPT2
level by keeping the 3s,3p orbitals in the RAS1 subspace with up to double excitations
allowed [190]. Thus the development of the RASSCF/RASPT2 approach has brought
an optimistic forecast that highly accurate (properly correlated) multireference calculations
can be carried out not only for the small FeP, but also for more complicated heme
models including axial ligands.
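The bookkeeping behind the active spaces quoted in this section, and the reason why adding four semicore orbitals is costly, can be sketched as follows. The electron and orbital counts are those given in the text; the count of configuration state functions uses the standard Weyl dimension formula and ignores spatial symmetry, so it is only an upper-bound estimate of the size of a full CAS expansion. The helper names are hypothetical.

```python
from math import comb


def weyl_csf_count(n_orbitals, n_electrons, two_s):
    """Number of spin-adapted CSFs for N electrons in n orbitals with spin S = two_s/2."""
    if two_s > n_electrons or (n_electrons - two_s) % 2:
        raise ValueError("electron count and spin are incompatible")
    a = (n_electrons - two_s) // 2      # N/2 - S
    b = (n_electrons + two_s) // 2 + 1  # N/2 + S + 1
    n1 = n_orbitals + 1
    return (two_s + 1) * comb(n1, a) * comb(n1, b) // n1


def combine(*parts):
    """Sum (electrons, orbitals) contributions into one active space."""
    return sum(p[0] for p in parts), sum(p[1] for p in parts)


# (electrons, orbitals) building blocks as described in the text
FE2_3D, FE3_3D = (6, 5), (5, 5)
DOUBLE_SHELL_4D = (0, 5)
SIGMA_BOND = (2, 1)   # one doubly occupied metal-ligand bonding orbital
SEMICORE_3S3P = (8, 4)

active_spaces = {
    "FeP": combine(FE2_3D, DOUBLE_SHELL_4D, SIGMA_BOND),
    "FeP(Im)": combine(FE2_3D, DOUBLE_SHELL_4D, SIGMA_BOND, SIGMA_BOND),
    "FeP(SH)": combine(FE3_3D, DOUBLE_SHELL_4D, SIGMA_BOND, SIGMA_BOND, SIGMA_BOND),
    "FeP + 3s3p": combine(FE2_3D, DOUBLE_SHELL_4D, SIGMA_BOND, SEMICORE_3S3P),
}

for name, (n_el, n_orb) in active_spaces.items():
    two_s = 2 if n_el % 2 == 0 else 1  # triplet for even, doublet for odd counts
    count = weyl_csf_count(n_orb, n_el, two_s)
    print(f"{name:12s} ({n_el}in{n_orb}), 2S = {two_s}: {count} CSFs")
```

For the triplet state of FeP the count grows from tens of thousands of CSFs in the (8in11) space to tens of millions in the extended (16in15) space, which illustrates why restricting the excitations out of the 3s,3p orbitals, as in RASSCF/RASPT2, is attractive.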
In the cited work [190] the effects of spin–orbit coupling (SOC) and the effective
magnetic moment of FeP at the equilibrium structure of $^3A_{2g}$ were also investigated
within a sum-over-states approach [188] for a manifold of low-lying ligand field
states. Including SOC led to a mixture of $^3A_{2g}$ (68%), $^3E_g$ (13%), and $^5A_{1g}$ (18%)
as the ground state at the equilibrium structure. The coupling with the orbitally
degenerate triplet state ($^3E_g$) and the quintet state ($^5A_{1g}$) increases the magnetic
moment substantially with respect to the spin-only value for the triplet state
($2\sqrt{2}\,\mu_B \approx 2.83\,\mu_B$), yielding $\mu_{\mathrm{eff}} = 4.43\,\mu_B$. This value is in excellent agreement
with experimental estimates of 4.4–4.7 $\mu_B$ [35, 179], supporting the high quality
of the underlying electronic structure and energetics. Upon considering SOC, the
electronic state corresponding to the $^5A_{1g}$ structure remains predominantly a quintet
and has $\mu_{\mathrm{eff}} = 5.57\,\mu_B$. While the magnetic moments of the electronic states
change considerably, the energies of both SOC eigenstates are only slightly changed
compared to the spin–orbit-free states. The final estimate for the adiabatic ⁵HS→³IS
gap (including the nondynamical effects attributed to the Fe 3s3p electrons and the SOC)
becomes −1.8 kcal/mol [190], which agrees with the experimental state ordering,
demonstrating that the spin state energetics can be predicted by CASPT2/RASPT2
with chemical accuracy.
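The spin-only reference value quoted above follows from the standard expression $\mu = g\sqrt{S(S+1)}\,\mu_B$ with $g \approx 2$. The short sketch below evaluates it for the triplet and quintet states; it contains no orbital or spin–orbit contributions, so it reproduces only the spin-only baseline and not the SOC-corrected moments reported in [190]. The function name is hypothetical.

```python
import math


def spin_only_moment(spin, g=2.0):
    """Spin-only effective magnetic moment in Bohr magnetons: g * sqrt(S(S+1))."""
    return g * math.sqrt(spin * (spin + 1.0))


for label, spin in (("triplet (S = 1)", 1.0), ("quintet (S = 2)", 2.0)):
    print(f"{label}: {spin_only_moment(spin):.2f} Bohr magnetons")
```

The SOC-corrected value for the triplet structure (4.43 $\mu_B$) lies well above the spin-only triplet value, which is why the admixture of the $^3E_g$ and $^5A_{1g}$ states is needed to match the experimental estimates.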
In conclusion, the calculation of spin state energetics in heme systems remains a very
challenging problem for QC. At the moment no theoretical method applicable to heme
models can be fully trusted in this regard. CASSCF/CASPT2 calculations with the
standard choice of active space may (in some cases) overstabilize the HS state;
enlargement of the active space with the Fe 3s,3p semicore orbitals has been suggested as
a solution. On the other hand, CCSD(T) calculations, though highly accurate, are
applicable only to small mimics. By contrast, DFT methods are very sensitive to the
arbitrary choice of exchange–correlation functional, and no single functional can be
singled out as performing uniformly best. The hybrid functionals (B3LYP, B3LYP*)
and the OPTX-based ones (e.g., OLYP) perform reasonably well as compared to
CCSD(T) benchmarks, although it is difficult to predict the proper ratio of exact
exchange admixture in the hybrid functionals. From the long story of FeP, one
should perhaps learn that multireference ab initio calculations have large potential,
even though they may go astray if an inadequate (too small) active space is used.
We note that a number of new methods are currently being developed that may
improve the description of spin state energetics in transition metal systems. For
instance, the NEVPT2 method, with an entirely different construction of the zero-order
Hamiltonian, has been suggested as a possible alternative to CASPT2 (RASPT2) [6]. In
the DFT domain one should note promising developments too, for instance the appearance
of double-hybrid functionals (cf. Sect. 2.3, [200]), localized, semiempirical corrections
to hybrid functionals [68], and DFT + U methods [2, 86, 155].

3.2 Binding of CO, NO, and O2 to Heme

Binding of diatomic molecules (XO = O2, CO, and NO) to iron sites in heme proteins
is important for respiration, sensing, and regulatory processes. The ferrous heme
sites in myoglobin (Mb) and hemoglobin (Hb) are employed by all vertebrates for
storage (Mb) and transport (Hb) of molecular oxygen (O2) [109]. This function is
inhibited by poisonous carbon monoxide (CO), which binds to ferrous heme much
more strongly than O2 (and practically irreversibly). Nitric oxide (NO) can also be
poisonous, especially at high concentrations, while at low concentrations it plays an
important biological role, being involved in intercellular signaling, smooth muscle
relaxation, and other regulatory functions [185]. Importantly, sensing of NO (relevant
for most of its biological roles) comes down to its binding to a ferrous site in soluble
guanylate cyclase (sGC) [185], which initiates subsequent cleavage of the axial
Fe–N(histidine) bond (at the axial position trans to NO), thus leading to an allosteric
transition of the sensing protein [18]. There is also much interest in the bonding of O2
to a heme site in oxygenases, such as cytochrome P450 [33], heme oxygenase (HO) [94],
and nitric oxide synthase (NOS) [149]. In these catalytic cycles the O2 coordination
(and its activation, a prerequisite for the subsequent reactions) is preceded by a
reduction of the iron from the initial ferric to the ferrous state [33, 94, 149]. The
binding of O2 and NO by native (i.e., not reduced) ferric sites, such as in the resting
state of P450, deserves no less interest [38, 178, 182]; nevertheless, for the sake of
brevity the discussion below will be restricted mostly to ligand binding to the
ferrous sites.

3.2.1 Ligand Binding Energies

For efficient discrimination between CO and O2, the oxygen carriers (Mb/Hb) rely
on their different affinities to the ferrous site, because otherwise (in terms of shape,
size, polarity, and diffusion rate) these ligands are very similar and cannot be
efficiently discriminated [109, 177]. In the above terms, NO is also very similar to CO
and O2. But NO, having a single unpaired electron, is more reactive and prone to
engage in specific interactions, which may seriously affect its mobility and lifetime in
biological systems. The strikingly different magnetic properties (CO diamagnetic,
NO and O2 paramagnetic) are not expected to affect permeation of these ligands
towards the binding site because magnetic interactions do not play a major role in
this process. Given the arguments above, it is interesting and important for QC to
reliably reproduce the binding energies of the three XO ligands to heme by means of
electronic structure calculations. Moreover, it is equally crucial to quantify the
role of weaker interactions of the XO ligands with distal residues in the protein [11]. As
we shall see below, even if the latter interactions are rather well understood, it is still
very challenging to calculate the XO bonding energies to heme in good agreement
with experimental data.

Structural features of the XO bonding to ferrous heme are well known and have
already been thoroughly discussed in the literature [26, 85, 151, 152]. All three XO
ligands coordinate to Fe(II) in an end-on manner (CO and NO via their X atom). The
CO molecule coordinates linearly (the optimal Fe–C–O angle is ∼180°), while O2
and NO prefer bent coordination (with the Fe–O–O angle ∼120° and the Fe–N–O
angle ∼140°). The structures of the heme–XO complexes, known from the crystal
structures of proteins [13, 102, 122–124, 160, 161] and their functional models [26,
28, 72], are well reproduced by DFT [11, 91, 136, 142, 152] and DFT/MM [20, 150,
171] calculations. Some functionals (e.g., BP86) were claimed to perform quantita-
tively better than the others in reproducing the experimental structures [142], but it
is noteworthy that various DFT methods predict very similar and actually reliable
structures for the heme–XO models [2].⁴
Although it is relatively easy to obtain reliable structures of the heme–XO complexes
from DFT calculations, the same is not true of the energetics of ligand binding. As will
be shown below, the Fe–XO bond dissociation energies are strongly functional-dependent
[136] and thus not really conclusive [11, 152, 169, 181]. Two main reasons were pointed
out to rationalize the origin of these difficulties [136]. First, nondynamical correlation
and dispersion effects [169] play a major role in the formation of the weak yet strongly
covalent Fe–XO bond, and the description of both effects within DFT is questionable.
Second, the bonding of all three ligands is accompanied by a change of the spin state on
the iron: from the HS quintet to the LS state, either singlet (diamagnetic) in the Fe–O2
and Fe–CO complexes or doublet in the Fe–NO complexes. As discussed in the previous
section, a reliable theoretical description of spin state energetics poses a big challenge
for QC. The same problems arise for the XO binding energies to heme, to which the energy
of spin conversion on iron contributes a significant part.
Focusing on the advances in the quantum-chemical description of the heme–XO bonding,
one should not forget about the effects of distal protein residues on the ligand
bonding. Historically, these “protein effects” were recognized prior to any quantum
calculations: by inspection of protein crystal structures and from experiments
comparing the binding properties of wild-type Mb with those of its mutants
(where certain amino acid residues were selectively changed) and of simple heme
compounds (chelated protoheme) [109, 110, 176]. Subsequent DFT and DFT/MM
calculations [11, 20, 150] confirmed (and slightly corrected) the mechanism inferred
from earlier experiments. According to present knowledge, in the case of myoglobin
the predominant effect is caused by a distal histidine (His64, shown in Fig. 4). In
deoxymyoglobin this histidine pulls a water molecule into the distal pocket, thus
binding of the XO ligands first requires displacing this water. This costs around
1–2 kcal/mol and gives rise to an inhibition effect of this size for all three ligands.
However, the distal histidine also stabilizes the bound XO molecules as
compared to free heme. The CO ligand is only very weakly stabilized in the protein
and the inhibitory effect dominates. It was even postulated earlier that CO may be

⁴ A minor exception to this rule, reported for structures obtained with some hybrid
functionals for the Fe–O2 complexes, is discussed in Ref. [136].

Fig. 4 View of the active site of oxymyoglobin (heme–O2 complex) from Physeter catodon (PDB
code: 1MBO), showing the distal histidine (His64). The dashed lines indicate a possible
hydrogen bond (the hydrogen atom is not shown in the X-ray structure) between the distal His64
and the bound O2 ligand, which bears partial superoxide character due to participation of the
Weiss-type resonance structure, Fe(III)–O2⁻ (see Sect. 3.2.2)

destabilized in the protein by distortion of the linear Fe–C–O structure, but a recent
DFT/MM study did not confirm this effect [150]. For NO there is a weak stabilizing
effect that almost exactly compensates the energy cost of removing the water. By
contrast, bound O2 forms a strong hydrogen bond with the distal histidine (cf.
Fig. 4), which overall gives rise to a significant stabilization. It is thus well established
that O2 binds much more strongly to heme proteins than to free heme. Nevertheless, some
controversy still exists about the size of this protein effect for O2 binding: the
theoretical calculations (either DFT or DFT/MM) consistently predict a much larger
stabilization (8–10 kcal/mol) [11, 170, 171] than is actually observed experimentally
(only 2.5–3.8 kcal/mol) [109, 176]. This disagreement can be due either to limitations
of the theoretical calculations (e.g., too small models, entropic effects not considered
properly) or to erroneous or misinterpreted experiments. Therefore, it was proposed to
estimate the protein effect on the O2 binding as an average of the available theoretical
and experimental results, i.e., to assume a value of ∼6 kcal/mol [136]. Effects
qualitatively similar to those in myoglobin are expected to play a role also in
hemoglobin [161].
In brief, the predominant protein effect shows up as a change in the ligand binding
energies as compared to free heme, due to hydrogen bonding with the distal histidine.
Consequently, the heme–O2 complexes acquire a significant extra stabilization in the
protein environment (∼6 kcal/mol), the CO bonding is inhibited by ∼1 kcal/mol,
while the NO bonding energy is almost unaffected by the protein environment as
compared to protoheme. Moreover, the protein also has some effect on the molecular
structures of the Fe–XO complexes, most pronounced for the labile degrees of freedom
(rotation of the XO group, orientation of the histidine imidazole), which are very
sensitive to weak interactions; nevertheless, it does not change the general features
of XO coordination to heme.
A comparative study of XO bonding to heme models was carried out by Radoń and
Pierloot [136] at the DFT level (with several exchange–correlation functionals) and with
the CASPT2 method. In this study, the heme group was modeled as a porphin ring (P) with
an axially coordinated imidazole ligand (Im), giving the FeP(Im)(XO) models
(X = C, N, O). In addition to the six-coordinated complexes, five-coordinated FeP(XO)
complexes (i.e., without the axial Im) were also studied (Fig. 5). Geometries of the
complexes were optimized at the DFT level (PBE0 and BP86 functionals) and used in
subsequent CASSCF/CASPT2 calculations. At both levels of theory the largest
basis set used corresponded to polarized quadruple-ζ quality on Fe and polarized
triple-ζ quality on the ligands.⁵
The bond dissociation energies (BDEs), i.e., energies of the reaction

heme–XO → heme + XO, (14)

where “heme” stands for either FeP or FeP(Im), were calculated assuming the ground
spin state for all molecules: high-spin for FeP(Im), intermediate-spin for FeP, and
low-spin for all the heme–XO complexes. The computed BDEs were corrected for the
difference in zero-point vibrational energies between the products and the reactant in
(14) as well as for basis set superposition error (BSSE). Despite the large basis set,
the BSSE correction at the CASPT2 level was still considerable, 7–9 kcal/mol,
much larger than at the DFT level. In addition, since the iron spin state changes upon
ligand binding, the binding energies with respect to the low-spin state (not the actual
ground state) of the respective heme were also calculated. The relation between
the binding energy with respect to the ground state ($\Delta E_{\mathrm{BDE}}$) and that with
respect to the singlet state ($\Delta E_{\mathrm{BDE}}^{(0)}$) of heme is given by:

⁵ In CASPT2, due to its large computational cost, higher polarization functions were removed for
the H and C atoms of the P and Im ligands, keeping the fully polarized triple-ζ quality only in the
first coordination sphere of Fe and on the XO ligand.

Fig. 5 Structures of the ferrous heme models (a, e) and of their complexes with the XO ligands
(b–d, f–h)

$$\Delta E_{\mathrm{BDE}} = \Delta E_{\mathrm{BDE}}^{(0)} - \Delta E_{\mathrm{sp}}, \qquad (15)$$

where $\Delta E_{\mathrm{sp}}$ is the adiabatic spin-pairing energy, i.e., the energy difference between the
low-spin and the actual ground state of the respective heme model (see Table 1 in
Sect. 3.1). The $\Delta E_{\mathrm{BDE}}^{(0)}$ contribution can be interpreted as the “intrinsic” binding energy
with respect to heme promoted to the spin state with the electronic structure most similar
to that in the heme–XO adduct [136].
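The bookkeeping behind Eq. (15) can be sketched in a few lines of Python. The function names and the numerical inputs below are hypothetical placeholders chosen only to illustrate the arithmetic; they are not values from Ref. [136].

```python
def intrinsic_bde(e_heme_low_spin: float, e_xo: float, e_complex: float) -> float:
    """Binding energy relative to low-spin heme: E(heme, LS) + E(XO) - E(heme-XO)."""
    return e_heme_low_spin + e_xo - e_complex


def ground_state_bde(intrinsic: float, spin_pairing: float) -> float:
    """Eq. (15): BDE relative to the ground state = intrinsic BDE - spin-pairing energy."""
    return intrinsic - spin_pairing


# Hypothetical relative energies in kcal/mol (placeholders, not literature data).
e_heme_ls, e_xo, e_complex = 10.0, 0.0, -20.0
spin_pairing = 10.0  # E(heme, LS) - E(heme, ground state), assumed value

bde0 = intrinsic_bde(e_heme_ls, e_xo, e_complex)
bde = ground_state_bde(bde0, spin_pairing)
print(f"intrinsic BDE = {bde0:.1f}, ground-state BDE = {bde:.1f} kcal/mol")
```

Because the spin-pairing energy enters with a minus sign, any overestimation of that term lowers the ground-state BDE unless it is offset by a comparable overestimation of the intrinsic term.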
The active space for the CASSCF/CASPT2 calculations chosen in Ref. [136] conformed
to the general rules for transition metal compounds (outlined in Sect. 2.2.2).
The spin-pairing energy ($\Delta E_{\mathrm{sp}}$) in FeP and FeP(Im) was obtained using the same
active spaces as described in Sect. 3.1. In the FeP(XO) and FeP(Im)(XO) complexes,
a new covalent metal–ligand bond is formed (i.e., Fe–XO), giving rise to additional
nondynamical correlation effects, which can be accounted for by extending the active
space with the appropriate valence orbitals on XO: σ, π, and π*. However, including
all of these orbitals on top of the active spaces of the parent heme complexes turned out
to be impossible, not only because of the unacceptable computational cost of the CASPT2
calculations with such a large active space, but also because of unfavorable orbital
rotations experienced already during the CASSCF step (where orbitals with occupation
numbers close to either two or zero tended to rotate out of the active space in favor
of Fe 3s,3p core orbitals or antibonding ligand-centered orbitals). Therefore, after
many test calculations, some of these less important orbitals were removed
from the active space (made either inactive or virtual), which made it possible to find
tractable and computationally stable active spaces for the description of the heme–XO
complexes [136].⁶

⁶ First, since the occupation of the Fe 3dxy orbital is always small, the corresponding double-shell
Fe 4dxy orbital was removed from the active space. For XO = CO the Fe 3dz² was also practically
unoccupied, thus the corresponding Fe 4dz² was not active either. In contrast, all five CO orbitals
(σ, π, and π*) appeared necessary, which led to an active space of 14 electrons in 14 orbitals
(14in14) for FeP(CO) and 16 electrons in 15 orbitals for FeP(Im)(CO). For XO = NO or O2, the
explicit σ orbital to describe σ-donation became less important, thus only the π and π* orbitals of
NO and O2 were

This example illustrates that although the general principles governing the choice of
active orbitals (Sect. 2.2.2) are rather clear and quite intuitive, it is often neither
trivial nor straightforward to find a well-balanced active space for larger complexes.
The active space constructed for each heme–XO species was used to calculate the
bonding energy $\Delta E_{\mathrm{BDE}}^{(0)}$ by subtracting the CASPT2 energy of the respective complex
from the sum of the CASPT2 energies of the isolated XO and the proper heme species
(in the closed-shell singlet state). Subsequently, the $\Delta E_{\mathrm{sp}}$ term was subtracted to
yield the bonding energy $\Delta E_{\mathrm{BDE}}$ by Eq. (15).
Table 2 shows the ligand binding energies calculated in Ref. [136] versus their
experimental estimations. The experimental BDE in gas phase is directly available
only for the FeP(NO) complex [23]. By contrast, the experimental BDEs given for
the FeP(Im)(XO) complexes are those estimated earlier by Blomberg et al. [11]
from kinetic (dissociation barriers) and thermodynamical (equilibrium constants)
data for either chelated protoheme or myoglobin [109, 176]; the latter are consistently
corrected for the absence of the protein environment in the present computational
model (vide supra).
Already a first look at Table 2 shows that the BDEs calculated at DFT level are
very sensitive to the exchange-correlation functional, particularly to its exchange
part. For all three ligands, the hybrid functionals (PBE0, B3LYP, B3LYP*) give
much lower bonding energies than the nonhybrid functionals (BP86, PBE, OLYP).
The BDEs from hybrid DFT methods clearly correlate with the amount of exact
exchange included in various functionals; in general the BDEs from PBE0 (25%)
are smaller than the BDEs from B3LYP (20%), which are smaller than the BDEs from
B3LYP* (15%). This simple trend holds for all complexes except FeP(O2). It can
be observed that classical nonhybrid functionals (e.g., BP86) strongly overbind in
all cases, which is rather typical behavior for them. By contrast, the hybrid functionals
(except B3LYP*) greatly underbind all three ligands. The rather poor performance
of the well-known B3LYP functional is partially corrected by its reparametrized version,
B3LYP*. While the binding energies are clearly improved for CO and O2 , B3LYP*
still underbinds NO, both in FeP(NO) and in FeP(Im)(NO). In sum, the best DFT
BDEs were obtained with OLYP and B3LYP* functionals, but even these two func-
tionals have still noticeable difficulties: OLYP in reproducing the Fe–O2 BDE, while
B3LYP* in reproducing the Fe–NO BDE. In case of OLYP and O2 , the discrepancy
might, in fact, arise from limited accuracy of the experimental data (i.e., problems in
estimation of the protein effect for O2 , vide infra). Nonetheless, in case of B3LYP*
the problems appearing for both FeP(NO) and FeP(Im)(NO) indicate a failure of this
functional in providing the correct Fe–NO BDE.

made active. On the other hand, in both the oxyheme and nitrosylheme complexes the Fe 3dz² was at
least partially occupied, thus the corresponding Fe 4dz² double-shell orbital was found important
and was added directly for the FeP(NO) and FeP(O2) complexes, which led to active spaces of
(13in14) and (14in14), respectively. By contrast, for FeP(Im)(NO) and FeP(Im)(O2), adding it on top
of Im σ turned out to be unfeasible. Thus, here the effect of Fe 4dz² on the binding energy
$\Delta E_{\mathrm{BDE}}^{(0)}$ had to be estimated from separate calculations with Fe 4dz² either active or not,
but without Im σ active, and used as a mere correction to the results obtained with Im σ active and
Fe 4dz² virtual [i.e., employing (15in14) for FeP(Im)(NO) or (16in14) for FeP(Im)(O2)].

Table 2 BDEs (kcal/mol) for heme–XO complexes (X = C, N, or O)

Method          CO        NO        O2
FeP(XO) complexes
PBE0            4.0       5.8       0.8
B3LYP           2.8       6.7       −1.6
B3LYP*          9.4       16.1      0.4
OLYP            16.7      28.8      7.5
BP86            26.5      38.1      16.1
CASPT2          16.0      31.7      9.9
Exptl [23]                26.6 a
                          28.9 b
FeP(Im)(XO) complexes
PBE0            7.7       2.9       −0.8
B3LYP           9.9       7.3       3.8
B3LYP*          19.0      17.4      11.7
OLYP            17.4      20.3      5.1
BP86            40.6      42.7      27.4
CASPT2          19.4      21.6      9.9
Exptl [11]      18.1 d              12.3 c
                19.5 e    22.8 e    10.1 e

Adapted from Ref. [136]
a Radiative association
b Associative equilibrium
c Dissociation barrier
d Estimated from c and the ratio of the CO/O2 equilibrium constants
e Dissociation barriers in Mb (corrected for protein effect)

By contrast, the CASPT2 method provides satisfactory results for all three ligands and
for both five- and six-coordinated complexes. While in the case of FeP(NO) the CASPT2
BDE is too large by 2–5 kcal/mol, the CASPT2 results for the six-coordinated
complexes are very close to the experimental estimates (and systematically somewhat
smaller). The overall good performance of CASPT2 is achieved because this
multireference method (with the present, balanced choice of active spaces) provides
a correct description of the heme–XO bonding, in which static correlation plays a major
role. A particularly good (nearly quantitative) agreement is obtained for CO and NO
bonding, but this might be (in part) due to error cancellation (vide infra).
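The comparison drawn from Table 2 can be made explicit with a short script. The sketch below takes the FeP(Im)(XO) values from Table 2 and, as an assumption made here for illustration, uses the experimental estimates marked "e" (dissociation barriers in Mb, corrected for the protein effect) as the reference; the dictionary and function names are illustrative.

```python
# Calculated BDEs (kcal/mol) for the FeP(Im)(XO) complexes, from Table 2.
calculated = {
    "PBE0":   {"CO": 7.7,  "NO": 2.9,  "O2": -0.8},
    "B3LYP":  {"CO": 9.9,  "NO": 7.3,  "O2": 3.8},
    "B3LYP*": {"CO": 19.0, "NO": 17.4, "O2": 11.7},
    "OLYP":   {"CO": 17.4, "NO": 20.3, "O2": 5.1},
    "BP86":   {"CO": 40.6, "NO": 42.7, "O2": 27.4},
    "CASPT2": {"CO": 19.4, "NO": 21.6, "O2": 9.9},
}
# Reference: experimental estimates marked "e" in Table 2 (assumed choice).
reference = {"CO": 19.5, "NO": 22.8, "O2": 10.1}


def signed_errors(method: str) -> dict:
    """Calculated minus experimental BDE for each ligand (negative = underbinding)."""
    return {lig: calculated[method][lig] - reference[lig] for lig in reference}


def mean_absolute_error(method: str) -> float:
    """Mean absolute deviation from the reference over the three ligands."""
    errors = signed_errors(method)
    return sum(abs(err) for err in errors.values()) / len(errors)


for method in sorted(calculated, key=mean_absolute_error):
    print(f"{method:7s} MAE = {mean_absolute_error(method):5.1f} kcal/mol")
```

With this reference, the signed errors of CASPT2 are all slightly negative, in line with the remark that the CASPT2 values are systematically somewhat smaller than the experimental estimates.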
A more in-depth discussion provided in Ref. [136] reveals the role of both contributions,
namely $\Delta E_{\mathrm{BDE}}^{(0)}$ and $\Delta E_{\mathrm{sp}}$, in determining the actual value of the BDE for various
computational methods. In brief, the hybrid DFT methods lead to a much lower $\Delta E_{\mathrm{BDE}}^{(0)}$
(i.e., weaker bonding) than the nonhybrid ones (i.e., stronger bonding), consistently
for both five- and six-coordinated complexes. In contrast, the behavior of the spin-pairing
energy ($\Delta E_{\mathrm{sp}}$) differs between the two types of complexes (in line with the discussion in
Sect. 3.1): for FeP the spin-pairing energy is nearly constant across functionals
(as it corresponds to an IS → LS promotion), while for FeP(Im) this term is
considerably functional-dependent (since it corresponds to a HS → LS promotion).
Interestingly, for six-coordinated complexes the inaccuracies of the $\Delta E_{\mathrm{BDE}}^{(0)}$ and $\Delta E_{\mathrm{sp}}$
terms within one exchange–correlation functional tend to accumulate rather than to
cancel out, explaining why the results obtained for FeP(Im)(XO) are more sensitive
to the applied DFT method than those for FeP(XO). The analysis of these two contributions
to the BDE also sheds some light on the good agreement with experiment of
the CASPT2 BDEs for the FeP(Im)(XO) complexes. In this case the nearly perfect
agreement may be rooted, to some extent, in fortuitous error cancellation. Indeed,
according to the discussion in Sect. 3.1, the CASPT2 method (with the present
choice of the active space) is expected to exaggerate the $\Delta E_{\mathrm{sp}}$ term for FeP(Im) by
a few kcal/mol. Thus, there must be a comparable error in $\Delta E_{\mathrm{BDE}}^{(0)}$ (i.e., bond
overstabilization), since the BDE (the difference of the two terms, cf. Eq. 15) is nearly
correct. This conjecture is in line with the BDE for the five-coordinated FeP(NO) complex,
where the spin-pairing energy should be accurately reproduced by CASPT2;
the overestimation of the BDE for this complex thus mainly reflects a CASPT2 tendency to
overbind (in $\Delta E_{\mathrm{BDE}}^{(0)}$) by a few kcal/mol.
Among the three XO ligands, O2 is found to form the weakest bond in all cases.
The Fe–O2 BDEs predicted with hybrid functionals are particularly low, at times
even negative (after applying the corrections for BSSE and ZPVE). It must be
added here that the DFT Fe–O2 BDEs reported in [136] were consistently corrected
for spin contamination. This correction was applied since the DFT calculations for
these complexes yield a broken-symmetry wave function with significant antiferromagnetic
coupling between Fe and O2 (more about the electronic structure of the heme–O2
complexes in Sect. 3.2.2). This correction, amounting to several kcal/mol, increases
the O2 binding energy. It is noteworthy that without the correction all the DFT methods
(except BP86 and PBE, which strongly overbind for all complexes) actually
fail to provide a positive value of the Fe–O2 BDE! Thus, the DFT Fe–O2 binding
energies in fact rely entirely on an approximate correction term. This was pointed
out as an argument in favor of multireference approaches (like CASSCF/CASPT2),
where the problem of spin contamination does not exist [136].
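To illustrate how a spin-contamination correction of this kind can raise the binding energy, the sketch below implements a common two-state approximate spin projection (of the Yamaguchi type). This is an assumption made for illustration: the text does not state which correction formula was used in Ref. [136], and all numerical inputs are hypothetical placeholders.

```python
def projected_singlet_energy(e_bs: float, e_triplet: float,
                             s2_bs: float, s2_triplet: float = 2.0) -> float:
    """Approximate pure-singlet energy from a broken-symmetry (BS) solution.

    Assumes the BS state is a mixture of a pure singlet (<S^2> = 0) and a
    triplet, with the triplet weight given by s2_bs / s2_triplet.
    """
    weight = s2_bs / s2_triplet  # estimated triplet contamination
    if not 0.0 <= weight < 1.0:
        raise ValueError("triplet weight must lie in [0, 1)")
    return (e_bs - weight * e_triplet) / (1.0 - weight)


# Hypothetical relative energies (kcal/mol) and <S^2> value, not literature data.
e_bs, e_triplet, s2_bs = 0.0, 5.0, 1.0
e_singlet = projected_singlet_energy(e_bs, e_triplet, s2_bs)
correction = e_bs - e_singlet  # amount by which the complex is stabilized
print(f"projected singlet: {e_singlet:.1f}, BDE increase: {correction:.1f} kcal/mol")
```

When the triplet lies above the broken-symmetry solution, the projected singlet is lower in energy than the broken-symmetry state, so the complex is stabilized and the computed BDE increases, which is the direction of the effect described above.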
Another interesting observation is that NO binds more strongly than CO in both five- and
six-coordinated complexes (for CO, there is no experimental evidence for
the existence of five-coordinated Fe–CO complexes, which indirectly implies that they are
less stable than the Fe–NO complexes). This trend is well reproduced by all methods
for the five-coordinated complexes. However, for the six-coordinated complexes, the
tested hybrid functionals in fact predict CO to bind more strongly than NO. This problem
is not alleviated by the reparametrized B3LYP* functional. In contrast, the CASPT2
method and all of the tested nonhybrid functionals predict the relative BDEs of these
two ligands in good agreement with experimental data.
Yet another interesting phenomenon is the trans effect of the axial Im on the
ligand binding to heme. In contrast to CO, O2 (and other ligands), NO binds more strongly
to four-coordinated heme [FeP] than to five-coordinated heme [FeP(Im)] [46, 185].
Based on the experimental data in Table 2, the negative trans effect amounts to
4–6 kcal/mol, which is overestimated by the present CASPT2 and OLYP calculations
(8–10 kcal/mol; BP86 providing in this case a better agreement), and underestimated
by the hybrid DFT calculations, with B3LYP and B3LYP* incorrectly pointing to a
positive trans effect for NO. A reliable description of the negative trans effect is
quite important because of its biological context. It has been argued that the negative
trans effect for NO is a driving force for the subsequent release of the axial histidine
ligand from the six-coordinated NO complex; the latter reaction is, in turn, believed to
be an initial step in the activation of soluble guanylate cyclase (sGC) by NO, and thus
relevant for the involvement of NO in signaling and vascular smooth muscle relaxation [185].
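The trans effect discussed here is simply the difference between the Fe–NO BDEs of the complexes without and with the axial imidazole. The sketch below evaluates it from Table 2 for a subset of the methods; the sign convention (a positive number means that Im weakens NO binding, i.e., a negative trans effect) and the names are choices made here for illustration.

```python
# Fe-NO BDEs (kcal/mol) from Table 2: FeP(NO) and FeP(Im)(NO).
bde_fep_no = {"PBE0": 5.8, "B3LYP": 6.7, "B3LYP*": 16.1, "OLYP": 28.8, "CASPT2": 31.7}
bde_fepim_no = {"PBE0": 2.9, "B3LYP": 7.3, "B3LYP*": 17.4, "OLYP": 20.3, "CASPT2": 21.6}


def bond_weakening_by_im(method: str) -> float:
    """BDE[FeP(NO)] - BDE[FeP(Im)(NO)]; positive means a negative trans effect."""
    return bde_fep_no[method] - bde_fepim_no[method]


# Experimental range: two FeP(NO) values [23] against the FeP(Im)(NO) estimate [11].
experimental = [round(value - 22.8, 1) for value in (26.6, 28.9)]
print("experiment:", experimental)
for method in bde_fep_no:
    print(f"{method:7s} {bond_weakening_by_im(method):5.1f} kcal/mol")
```

The experimental differences reproduce the 4–6 kcal/mol range quoted in the text, the CASPT2 and OLYP values fall in the 8–10 kcal/mol range, and B3LYP and B3LYP* give the wrong sign.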
The too weak bonding of NO to ferrous heme predicted by hybrid functionals
was also noted by Siegbahn and coworkers [11], who considered it a troublesome
issue for further investigation. Olah and Harvey further investigated this
problem for small models of both ferrous and ferric heme complexes (obtained by
coordination of NO to the models previously shown in Fig. 3), using coupled-cluster
CCSD(T) and DFT methods [107]. It was found that for both the Fe(II) and
the Fe(III) species, the hybrid functionals (B3LYP, B3PW91) underestimate the Fe–NO
BDEs by as much as 8–11 kcal/mol in comparison to CCSD(T) calculations. The
large error in the BDEs from hybrid functionals was attributed to an improper description
of the left–right correlation energy connected to the Fe–NO π bonding [107]. This
viewpoint is generally in line with a subsequent study on the electronic structure of
the Fe–NO complexes [138], selected results of which will be summarized in the next
section.
In a recent paper, Siegbahn et al. [169] pointed out the importance of van der
Waals (dispersion) effects for the problem of heme–XO bonding. The dispersion
effects are certainly included in CASPT2 calculations, but their description in DFT
is more problematic (see Sect. 2.3). The authors applied the empirical van der Waals
correction of Grimme [50] on top of the B3LYP and B3LYP* functionals, yielding
the dispersion-corrected (DFT-D) methods B3LYP-D and B3LYP*-D. The attractive
van der Waals effect (estimated as the difference between B3LYP-D and B3LYP or
B3LYP*-D and B3LYP*) was found to be significant: 9.7 kcal/mol for CO, 9.3 kcal/mol
for NO, and 7.7 kcal/mol for O2. Such a sizable effect is attributed to the interaction of
the bound XO molecule with a number of atoms of the porphyrin ring [169]. The
authors found the B3LYP-D energies in good agreement with experimental data
for CO and O2, but not for NO, for which a better result was obtained with B3LYP*-D.
Thus, although the correction for van der Waals interactions clearly improves
the hybrid DFT results, neither the B3LYP-D nor the B3LYP*-D method predicted correct
results for all three XO ligands. Apparently, the correct description of the Fe–NO
BDE requires a smaller amount of exact exchange (15%) than is needed for the Fe–CO
and Fe–O2 complexes (20%).
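The size of the dispersion correction relative to the hybrid-DFT underbinding can be appreciated with a rough, purely illustrative calculation. The sketch below adds the van der Waals terms quoted above to the B3LYP and B3LYP* BDEs of the FeP(Im)(XO) complexes from Table 2. Combining numbers from two different studies is an assumption made here; Ref. [169] used its own models and computational settings, so the sums are not the values reported there.

```python
# Uncorrected FeP(Im)(XO) BDEs (kcal/mol) from Table 2 (Ref. [136]).
uncorrected = {
    "B3LYP":  {"CO": 9.9,  "NO": 7.3,  "O2": 3.8},
    "B3LYP*": {"CO": 19.0, "NO": 17.4, "O2": 11.7},
}
# Attractive van der Waals contributions (kcal/mol) quoted from Ref. [169].
dispersion = {"CO": 9.7, "NO": 9.3, "O2": 7.7}
# Experimental estimates marked "e" in Table 2, used here as the reference.
reference = {"CO": 19.5, "NO": 22.8, "O2": 10.1}


def with_dispersion(functional: str) -> dict:
    """Illustrative DFT-D estimate: uncorrected BDE plus the dispersion term."""
    return {lig: round(uncorrected[functional][lig] + dispersion[lig], 1)
            for lig in dispersion}


for functional in uncorrected:
    corrected = with_dispersion(functional)
    errors = {lig: round(corrected[lig] - reference[lig], 1) for lig in reference}
    print(functional + "-D", corrected, "errors:", errors)
```

Even in this crude combination, the pattern described in the text is visible: adding dispersion to B3LYP brings CO and O2 close to the reference but leaves NO underbound, while the B3LYP*-based sum improves NO at the price of overbinding CO and O2.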
We should also mention here a study by Ribas-Ariño and Novoa [142], who performed
CASSCF/CASPT2 calculations of the energy profile for O2 coordination to
an isolated heme site. The model employed was FeP(Im) + O2, with BP86 geometries
used along the energy profile. Consistent with a change of the ground spin state upon
ligand bonding, the lowest-energy pathway was found to involve a crossover between
the singlet spin state (characterizing the ground state of the oxyheme complex) and
the triplet, quintet, and septet spin states (characterizing the asymptotic limits with
isolated quintet FeP(Im) and triplet O2 species as dissociation products). From the
energy of the short-distance minimum (Fe–O2 distance of 1.8 Å) on the singlet curve,
the authors estimated the Fe–O2 BDE as 14.9 kcal/mol.⁷ However, the energy curves
for the higher spin states (triplet, septet) exhibited additional minima at longer
Fe–O2 distances, corresponding to weakly bound deoxyheme···O2 complexes. These van
der Waals minima were not observed in the previous DFT calculations, presumably
because DFT does not describe the dispersion forces properly [22]. The existence of
the long-distance minima was pointed out as important for the kinetic reversibility
of the heme–O2 binding.

3.2.2 Electronic Structure of the Oxyheme (Fe–O2) Species

Concerning the ligand binding to heme, an additional challenge for theory is to
properly describe and understand the electronic structures of the heme–XO adducts.
While the Fe–CO bonding is well described by an interplay of the standard σ-donation
and π*-backdonation mechanisms (very similar to those extensively discussed for
typical metal–carbonyl complexes) [41, 92], the bonding mechanism is much more
interesting for the Fe–O2 and Fe–NO species. Both O2 and NO are open-shell and
noninnocent ligands, whose partially occupied π* orbitals strongly mix with the Fe
3d orbitals, thus giving rise to complicated electronic structures.
Long before any calculations could be performed for heme-like systems, there
were many attempts to rationalize the electronic structure of oxyheme (Fe–O2) by
means of simple models. Already in 1936 Pauling and Coryell proposed that the iron
site in oxyheme may be viewed as a pseudo-octahedral complex in which Fe(II) is
in the LS (singlet) state, coordinated by five N atoms (four from porphyrin and one
from the axial imidazole) and by one O atom from the O2 molecule. The O2 fragment
is promoted to its excited (singlet) state and the proximal O atom coordinates to Fe
through a doubly occupied sp2 hybrid [116]. This picture of bonding in oxyheme thus
corresponds to the following resonance structure:

$$\mathrm{Fe^{II}}\,(S_1 = 0) - \mathrm{O_2}\,(S_2 = 0).$$

The Pauling model is consistent with the diamagnetism of oxyheme and explains the
bent geometry of the Fe–O–O fragment. Nonetheless, based on spectroscopic data
and chemical properties of synthetic Fe–O2 complexes, Weiss argued [194] that the
iron in oxyheme is most likely oxidized to Fe(III), while O2 is reduced to the superoxide
form (O2⁻). The resonance structure proposed by Weiss to better describe the electronic
properties of oxyheme was thus

$$\mathrm{Fe^{III}}\,(S_1 = 1/2) - \mathrm{O_2^-}\,(S_2 = 1/2),$$

⁷ It should be noted that this number does not include a BSSE correction, which, in view of
Ref. [136], is expected to reduce the BDE considerably.

where the two unpaired electrons (one on the low-spin Fe(III) and the other
on superoxide) couple antiferromagnetically to yield the global singlet state. Yet
another formulation of oxyheme as

$$\mathrm{Fe^{II}}\,(S_1 = 1) - \mathrm{O_2}\,(S_2 = 1),$$

i.e., Fe in its IS (triplet) state antiferromagnetically coupled with O2 in its triplet
state, was suggested by McClure [95]; this model was later used by Goddard and
Olafson to establish their ozone-like model [43]. The appearance of these different
bonding models fueled a long-lasting debate on which of them provides the most
correct description of bonding in oxyheme. As we shall see, theoretical calculations
were (until very recently) not conclusive in supporting any particular bonding model,
leaving these questions even more problematic.
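All three bonding models have in common that the fragment spins S1 and S2 can couple to the overall singlet observed for oxyheme. The short sketch below enumerates the total spins allowed by the usual angular momentum coupling rule, S = |S1 − S2|, ..., S1 + S2; the function name and the dictionary of models are illustrative.

```python
from fractions import Fraction


def allowed_total_spins(s1, s2) -> list:
    """Total spin values from coupling two spins: |s1 - s2|, ..., s1 + s2 in steps of 1."""
    s1, s2 = Fraction(s1), Fraction(s2)
    lowest, highest = abs(s1 - s2), s1 + s2
    return [lowest + step for step in range(int(highest - lowest) + 1)]


# Fragment spins (S1 on Fe, S2 on the O2 unit) in the three resonance structures.
models = {
    "Pauling": (0, 0),
    "Weiss": (Fraction(1, 2), Fraction(1, 2)),
    "McClure": (1, 1),
}
for name, (s1, s2) in models.items():
    spins = allowed_total_spins(s1, s2)
    print(name, [str(s) for s in spins], "singlet allowed:", 0 in spins)
```

The same rule applied to the dissociation limit, quintet FeP(Im) (S = 2) plus triplet O2 (S = 1), gives S = 1, 2, 3, i.e., the triplet, quintet, and septet states; the singlet ground state of the complex is not among them, hence the spin crossover along the binding pathway.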
The DFT and DFT/MM calculations on oxyheme models are commonly interpreted
in favor of the Weiss model [150, 151]. Indeed, the (spin-unrestricted) DFT
calculations point to an open-shell singlet electronic structure in which the
antiferromagnetic coupling between the Fe and O2 fragments is clearly present. A spin
population of about one (spin-up) electron is typically found on the Fe and about one
(spin-down) electron on the O2 fragment, strongly suggesting that these fragments are in
their Fe(III) and O2⁻ oxidation states, i.e., precisely as postulated by Weiss. Note,
however, that the actual spin populations are somewhat functional-dependent, as pointed
out in Ref. [136]. Moreover, rather inconsistently with the Weiss model, the charge on O2
is always found to be less negative than −1e (Mulliken charges from −0.4e to −0.3e were
found using different functionals [75]).
Since DFT does not deal particularly well with the description of the Fe–O2 bonding
energy (vide supra), it is natural to seek a critical assessment of DFT with ab initio
calculations also in regard to the bonding picture. However, early calculations with the
wave-function-based methods [99, 198, 199] did not agree with each other, and were
claimed to support either the Pauling or the McClure picture [162]. In a more recent study
by Jensen et al. [77], CASSCF calculations were carried out for the FeP(Im)(O2 )
model. Based on analysis of the principal contributions to the CASSCF wave function
and a very small negative charge on the O2 moiety (−0.2e), the authors concluded
that the multiconfigurational wave function was dominated by the Pauling resonance
structure, whereas the contribution of the Weiss structure was found negligible. This
result was used to question the quality of DFT calculations, since in the latter the
Weiss structure is dominant. However, the original conclusion was soon revised in the
erratum to the mentioned paper [76]. In short, after noticing complications stemming
from inherently delocalized character of Fe–O2 molecular orbitals obtained from
the CASSCF calculations, the authors admitted that the multiconfigurational wave
function of oxyheme can be read as a mixture of Pauling and Weiss resonance
structures, but they did not attempt to quantify the role of the two contributions [76].
A consistent description of the Fe–O2 bonding in oxy-Mb was provided by Chen
et al. [20]. The significance of this study was further underlined in a recent commentary
by Shaik and Chen [162]. Unlike most other studies, this one rigorously
took the protein environment into account by means of CASSCF/MM calculations for
Electronic Properties of Iron Sites … 793

a realistic oxymyoglobin model. The QM region in the latter calculations was lim-
ited to FeP(Im)(O2 ) and the distal histidine (modeled as imidazole), whereas the
rest of the protein was simulated by point charges obtained from DFT:B3LYP/MM
calculations. The DFT/MM calculations were also used to optimize the structures for
subsequent use in the CASSCF/MM energy calculation. The (14in12) active space
was used, similar to the active spaces used in the other studies of oxyheme com-
plexes [77, 142, 198] and the one described in the previous Sect. [136], however,
without the σ Fe–NIm orbital and including only two Fe double-shell orbitals.
Figure 6 shows the three key orbitals describing the Fe–O2 bond in oxy-Mb,
reprinted from the study of Chen et al. [20]. The middle one (labeled φ4 in Fig. 6) is
a bonding orbital corresponding to overlap of the in-plane O2 $\pi^*$ orbital (i.e., the one
lying in the Fe–O–O plane) with Fe $3d_{z^2}$. This molecular orbital thus describes a σ bonding between
the Fe and the proximal O atom. The corresponding antibonding orbital ($\sigma^*_{\mathrm{Fe-O}}$) is
not shown in Fig. 6 for clarity, although both orbitals were active in CASSCF. The
other two orbitals (labeled φ3 and φ8 in Fig. 6) are the bonding and the antibonding
combinations arising from interaction of O2 $\pi^*_\perp$ (i.e., the one perpendicular to the
Fe–O–O plane) with Fe $3d_{yz}$. These two orbitals thus describe a π bonding between
the iron and the O2 fragment. The CASSCF wave function of the oxy-Mb model is
dominated (∼90%) by two main configurations, both having the $\sigma_{\mathrm{Fe-O}}$ orbital
doubly occupied (and the corresponding antibonding $\sigma^*_{\mathrm{Fe-O}}$ vacant), but differing in
occupation of the $\pi$(Fe–O2) and $\pi^*$(Fe–O2) orbitals:
   
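\[
\Psi_{\text{oxy-Mb}} = C_1 \left| \ldots (\pi_{\mathrm{Fe-O_2}})^2 (\pi^*_{\mathrm{Fe-O_2}})^0 \right| - C_2 \left| \ldots (\pi_{\mathrm{Fe-O_2}})^0 (\pi^*_{\mathrm{Fe-O_2}})^2 \right| \tag{16}
\]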

(where three dots represent the closed-shell part of the wave function, common for
the two configurations). We notice that a qualitatively similar, two-configurational
description of oxyheme was also established in the earlier CASSCF studies [77,
198], prior to the cited work by Chen et al.; however, it was not always interpreted
correctly. In particular, although the wave function in (16) is dominated by a closed-
shell configuration (|C1 | > |C2 |), it does not describe the closed-shell electronic
structure and one must not automatically view this wave function as conforming to
the Pauling model (cf. Ref. [76, 77]).
Fig. 6 The three key orbitals describing the Fe–O2 bond in oxy-Mb. Plots of the natural orbitals
obtained from CASSCF/MM calculations by Chen et al. Reprinted with permission from [20] and kindly
provided by Sason Shaik and Hui Chen. Copyright (2008) American Chemical Society

Being well aware of these interpretational caveats, Chen et al. transformed the
two-configurational wave function of (16) into a generalized valence bond (GVB)-type
wave function with two new orbitals, obtained as combinations of $\pi$(Fe–O2) and
$\pi^*$(Fe–O2). These transformed (GVB-type) orbitals, shown in Fig. 7, turned out to be
rather well localized Fe $d_{yz}$ and O2 $\pi^*_\perp$ fragment orbitals. After the transformation, the
two-configurational wave function (16) becomes an open-shell singlet configuration
(composed of two Slater determinants):

\[
\Psi_{\text{oxy-Mb}} = \frac{1}{\sqrt{2\,(1 + S^2)}} \left| \ldots (d_{yz})^{\uparrow} (\pi^*_\perp)^{\downarrow} \right| \tag{17}
\]

where the arrows (↑, ↓) represent singlet-pairing (antiferromagnetic coupling)
between the two singly-occupied orbitals, Fe $3d_{yz}$ and O2 $\pi^*_\perp$. In Eq. 17, the quantity S
in the normalization constant is the overlap integral between the GVB transformed
orbitals, $S = \langle d_{yz} | \pi^*_\perp \rangle$, which amounts to 0.25 in oxy-Mb and somewhat more (0.33)
for the gas-phase FeP(Im)(O2) model. In the latter case the delocalization tails of
the Fe $3d_{yz}$ and O2 $\pi^*_\perp$ fragment orbitals are more pronounced, giving rise to larger
overlap.
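
The link between the two-configurational form (16) and the GVB form (17) can be made concrete with a short Python sketch. It relies on the textbook perfect-pairing relation for a singlet pair built on orthonormal bonding/antibonding orbitals, S = (C1 − C2)/(C1 + C2), which is an assumption brought in here for illustration and is not quoted from Ref. [20]. The function names are hypothetical, and the resulting weights refer only to an idealized, renormalized two-configuration model, not to the CASSCF weights reported by Chen et al.

```python
import math


def gvb_overlap(c1, c2):
    """Overlap S of the two GVB orbitals for Psi = c1|(pi)^2| - c2|(pi*)^2|, c1 >= c2 >= 0."""
    return (c1 - c2) / (c1 + c2)


def ci_coefficients(s):
    """Normalized (c1, c2) of the two-configuration wave function for a GVB overlap s."""
    ratio = (1.0 - s) / (1.0 + s)           # c2 / c1, from inverting gvb_overlap
    c1 = 1.0 / math.sqrt(1.0 + ratio ** 2)  # enforce c1^2 + c2^2 = 1
    return c1, ratio * c1


def gvb_norm(s):
    """Normalization constant 1 / sqrt(2 (1 + S^2)) appearing in Eq. 17."""
    return 1.0 / math.sqrt(2.0 * (1.0 + s * s))


# Overlaps quoted in the text for the two models.
for label, s in [("oxy-Mb (CASSCF/MM)", 0.25), ("gas-phase FeP(Im)(O2)", 0.33)]:
    c1, c2 = ci_coefficients(s)
    print(f"{label}: S = {s:.2f}, C1 = {c1:.3f}, C2 = {c2:.3f}, "
          f"relative weights = {c1**2:.2f}/{c2**2:.2f}, N = {gvb_norm(s):.3f}")
```

Within this idealized model a larger overlap S corresponds to a smaller admixture of the second configuration, i.e., to a more strongly paired (less diradical-like) π bond.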
Thus, the GVB-type analysis of the CASSCF wave function revealed that the
bonding mechanism in oxyheme is essentially of Weiss type, involving a charge
transfer ($\mathrm{Fe^{II} + O_2 \rightarrow Fe^{III} + O_2^-}$) and a subsequent π coupling between the radicals
generated on Fe and O2 [20]. This general view holds true for oxy-Mb as well as
for the gas-phase FeP(Im)(O2) model; nevertheless, two significant effects of the protein
environment could be identified. First, due to larger overlap between the GVB
orbitals for the gas-phase model, the pairing of electrons in the π component of the
Fe–O2 bonding is stronger in the gas phase than in oxy-Mb. Second, the protein
notably polarizes the σ component of the Fe–O bond: whereas in the gas phase the
$\sigma_{\mathrm{Fe-O}}$ bonding orbital contains comparable contributions of Fe 3d and O 2p, it
is more like an oxygen-based orbital in oxy-Mb (simultaneously, the $\sigma^*_{\mathrm{Fe-O}}$ antibonding
orbital is dominated by Fe $3d_{z^2}$). Consequently, the character of the σ component of
the Fe–O2 bond changes from typically covalent in the gas-phase model to nearly
dative in oxy-Mb [20].

Fig. 7 Plots of the two GVB-type orbitals describing the $\pi$(Fe–O2) bond (a) in CASSCF/MM
calculations for oxy-Mb, (b) in gas-phase CASSCF calculations for FeP(Im)(O2). Reprinted
with permission from [20] and kindly provided by Sason Shaik and Hui Chen.
Copyright (2008) American Chemical Society
By considering not only the π (Fe–O2 ) and π ∗ (Fe–O2 ) orbitals, but also the
doubly-occupied σFe−O orbital involving a combination of Fe- and O2 -centered
orbitals (cf. Fig. 6), Chen et al. refined the GVB description of Fe–O2 bonding.
The authors obtained a more complicated GVB-type wave function with the three
leading terms, which were identified as the three (Pauling, Weiss, and McClure) res-
onance structures. Nevertheless, the Weiss structure was still identified as the most
important one for oxy-Mb, with a noticeable admixture of the McClure structure and
only a small contribution of the Pauling structure [20]. Chen et al. also focused on
the charge on the O2 fragment. This charge was found to vary from −0.2e in the gas
phase (CASSCF) to about −0.5e in oxy-Mb (CASSCF/MM), but never reaching the
value of −1e, which might be (naively) expected for the Weiss-type bonding. Quite
similar fractional negative charges were also found earlier in DFT and DFT/MM
calculations. This effect was attributed to the σ component of the Fe–O2 bonding,
which results in partial back-donation of electrons from $\mathrm{O_2^-}$ to $\mathrm{Fe^{III}}$, thereby reducing
the negative charge on O2 [20].
The CASSCF description of the Fe–O2 bonding thus turned out to be quite close
to the earlier DFT suggestions (vide supra). Chen et al. have also demonstrated a good
correspondence between natural orbitals from DFT and from CASSCF as well as
between natural spin orbitals from DFT and GVB transformed CASSCF orbitals [20].
The similarities do not seem accidental. In fact, both computational approaches
point to essentially the same bonding picture in the case of oxyheme complexes, although
pinpointing this similarity requires a proper “reading” of the multiconfigurational wave
function in a VB-type language [162].

3.2.3 Electronic Structure of {FeNO}7 Complexes

Interpretational problems similar to those with the electronic structure of oxyheme arise
also for nitrosylheme as well as for other iron(II)–nitrosyl complexes, denoted by
Enemark and Feltham as {FeNO}7 complexes [40, 196]. In this notation the superscript
7 denotes a sum (6+1) of six 3d-electrons from Fe and one unpaired electron
from NO, which are duly distributed in combinations of the Fe 3d and the NO $\pi^*$
orbitals. Therefore, the Enemark–Feltham notation does not attempt to specify the
oxidation state on Fe and NO, which correctly pinpoints the covalent character of
the Fe–NO bonding, and still leaves room for more precise assignments. Various formulations
of {FeNO}7 complexes, ranging from FeI−NO+ through FeII−NO0 up
to FeIII−NO−, appeared in the literature. Remarkably different views are supported
by various spectroscopic data and by various interpretations of the calculations (e.g.,
[46, 49, 140, 191]). In these studies, the {FeNO}7 complexes with various ligands
were considered, with both heme and nonheme architectures. Moreover, some of
these {FeNO}7 complexes have a low-spin (S = 1/2) ground state, like the heme
complexes, whereas others adopt a high-spin (S = 3/2) ground state. Therefore, it is
not a priori clear whether this rich variety of diverse assignments should be attributed
to actual differences between various {FeNO}7 complexes (caused by different lig-
ation of the iron and/or different spin state), or simply reflect usage of different
experimental or theoretical methodologies.
The problem may be partly rooted in the fact that standard QC does not provide a precise
assignment of the oxidation state of an atom or fragment (as is the case for other electronic
properties that are not uniquely defined for atoms or fragments in a molecule), which makes
the oxidation state a somewhat elusive concept. Nevertheless, one must admit that the notion
of oxidation state is a powerful and very useful concept in chemistry, providing a convenient
framework for interpretation and rationalization of many experimental data. Thus, a firmly
established assignment of effective oxidation states in {FeNO}7 complexes that conforms to
chemical intuition seems highly desirable for understanding the bonding
mechanism in these species (even more so since the oxidation state of a fragment is
also not directly measurable by experiment).
The Fe and NO oxidation states in {FeNO}7 complexes are often judged by
inspection of the spin densities (or Mulliken spin populations) on the Fe atom
and NO fragment. However, as Ghosh et al. pointed out, the spin density distribu-
tions obtained from DFT calculations for the {FeNO}7 complexes are considerably
functional-dependent [29, 30, 65]. This is illustrated in Fig. 8, which shows the DFT
spin densities for five- and six-coordinated heme-nitrosyl complexes, FeP(NO) and
FeP(NH3 )(NO), as well as a non-heme Fe(salen)NO complex (all in the S = 1/2
state). In case of five-coordinated complexes [FeP(NO), Fe(salen)NO] the nonhy-
brid functionals (e.g., BP86) predict the spin density almost entirely localized on Fe,
leaving only very little spin population for NO. In contrast, hybrid functionals (e.g.,
B3LYP) point to significant spin polarization in five-coordinated complexes, with
an excess spin-up density localized on Fe and a compensatory spin-down density
localized on NO, qualitatively similar to the one observed for oxyheme species
(Sect. 3.2.2), which suggests a noticeable antiferromagnetic coupling between the Fe
and NO unpaired electrons. However, in contrast to the Fe–O2 species, this antiferromagnetic
coupling is suggested solely by the hybrid functionals. Oddly enough,
coordination of the axial N-donor ligand (like NH3 or Im) in the six-coordinated heme
model causes a significant reorganization of the spin density. The spin density is
pumped from Fe to NO and the spin polarization disappears (in case of B3LYP).
Although the B3LYP and BP86 spin densities for FeP(NH3)(NO) in Fig. 8 may look
qualitatively similar to each other, they are in fact very different: the hybrid functional
points to a much higher spin population (∼80%) on NO than the nonhybrid
one (∼40%).

Fig. 8 Contour plots of spin densities obtained from B3LYP (hybrid DFT) and BP86 (nonhybrid
DFT) methods for FeP(NO), Fe(salen)NO, and FeP(NH3)(NO) complexes (all in the S = 1/2 spin
state). The red/green color indicates excess spin-up/spin-down density. The annotated values are
Mulliken spin populations on Fe and NO fragments. Based on Refs. [29, 138]
In order to rationalize this complicated picture, Radoń et al. compared the DFT
spin densities with the CASSCF ones [138]. The authors considered the model complexes
in both the doublet (S = 1/2) and the quartet (S = 3/2) spin states. Figure 9,
adapted from [138], gives a comparison of the Mulliken spin populations on Fe and
NO calculated with a variety of DFT methods and CASSCF. As observed previously,
various functionals point to diverse spin distributions for the S = 1/2 spin state. In
contrast, for the S = 3/2 spin state of the studied complexes, all tested functionals
consistently provide a polarized spin density, suggesting a noticeable antiferromagnetic
coupling. However, hybrid functionals again point to larger spin polarizations
than those from nonhybrid methods. For all three complexes and both spin states
included in Fig. 9, the nonhybrid functionals (BP86, OLYP) reproduce the CASSCF
spin populations most closely. This similarity holds true not only for Mulliken spin
populations, but also for the contour plots of the respective spin densities (which
are provided in Ref. [138]). Notably, the spin density distributions emerging
from CASSCF and nonhybrid DFT calculations for FeP(NO) and FeP(NH3)(NO)
species agree well with interpretations of EPR and MCD spectra for similar (i.e., five-
and six-coordinated) heme-nitrosyl species [138]. In contrast, the hybrid functionals
(B3LYP, B3LYP*) point to excessive spin polarization, which is more pronounced
for B3LYP than for B3LYP* since the former contains more exact exchange than the
latter. This can be regarded as a way in which the hybrid functionals try to simulate nondynamical
(left-right) correlation [138].

[Figure 9: scatter plot of NO spin population versus Fe spin population (horizontal axis from 0 to 4.5,
vertical axis from −1.5 to 1) for Fe(P)NO, Fe(P)(NH3)NO, and Fe(salen)NO in the S = 1/2 and S = 3/2
states; data points are labeled B3LYP, B3LYP*, OLYP, BP86, and CASSCF]

Fig. 9 CASSCF and DFT Mulliken spin populations on NO and Fe for FeP(NO), FeP(NH3)(NO),
and Fe(salen)NO. The points for S = 1/2 complexes gather along the x + y = 1 line indicating that
both populations sum up to one unpaired electron; in contrast, the points for S = 3/2 complexes lie
slightly below the x + y = 3 line since a part of the iron spin population leaks to the macrocycle
via a covalent σ Fe–(macrocycle) bonding. Adapted with permission from [138]. Copyright (2010)
American Chemical Society
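
The sum rule mentioned in the Fig. 9 caption (Fe and NO spin populations adding up to 2S when no spin density resides elsewhere) can be checked with a few lines of Python. This is only a sketch: the function name `unassigned_spin` is hypothetical, and the example populations below are invented for illustration and are not read from Fig. 9 or from Ref. [138].

```python
def unassigned_spin(fe_pop, no_pop, total_spin):
    """Spin population not located on Fe or NO, i.e. 2S - (Fe + NO)."""
    return 2.0 * total_spin - (fe_pop + no_pop)


# Hypothetical (Fe, NO) Mulliken spin populations, NOT taken from Fig. 9.
examples = [
    ("S = 1/2, weakly polarized", 0.5, 0.9, 0.1),
    ("S = 1/2, strongly polarized", 0.5, 1.6, -0.6),
    ("S = 3/2, polarized", 1.5, 3.4, -0.6),
]

for label, spin, fe_pop, no_pop in examples:
    rest = unassigned_spin(fe_pop, no_pop, spin)
    print(f"{label}: Fe + NO = {fe_pop + no_pop:.2f}, located elsewhere = {rest:.2f}")
```

A point lying on the x + y = 2S line gives zero for the last quantity, whereas a positive value corresponds to spin population that has leaked to other ligands, as described for the S = 3/2 complexes.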
The work cited above [138] provided a detailed description of several {FeNO}7
complexes at multiconfigurational level. In addition to CASSCF studies on the
heme models [FeP(NO), FeP(NH3 )(NO)] and the Fe(salen)NO complex discussed
so far, two other, experimentally characterized {FeNO}7 species were discussed,
namely: [Fe(T)NO]− [where T is tris(carbamoylmethyl)amine, a “tripodal ligand”]
and [Fe(H2 O)5 NO]2+ (a complex obtained in the “brown-ring” reaction) [140, 191].
For all these complexes the active spaces were chosen according to the standard rules
(Sect. 2.2.2) and composed of Fe 3d, double-shell 4d, NO π ∗ orbitals, supplemented
with up to two σ -type orbitals to describe the covalent bonding with equatorial and
axial ligands.
Figure 10 shows—on the example of FeP(NO)—the key molecular orbitals
involved in description of the Fe–NO bonding in the studied {FeNO}7 complexes.
Typically of heme complexes, one of the Fe 3d orbitals ($d_{x^2-y^2}$) essentially does
not interact with the ligand orbitals, while the other one ($d_{xy}$) is strongly destabilized
by the equatorial ligands. The remaining three Fe 3d orbitals ($d_{z^2}$, $d_{xz}$, $d_{yz}$)
are involved in grossly covalent interactions with NO $\pi^*_{x,y}$ orbitals, leading to two
bonding orbitals $(d, \pi^*_{x,y})_b$, two antibonding orbitals $(d, \pi^*_{x,y})_a$, and one orbital in
the middle with a nonbonding character, $(d, \pi^*_x)_n$. A qualitatively similar bonding
picture was established for all studied complexes, with minor differences, e.g., in
the shape of the $(d, \pi^*_x)_n$ nonbonding orbital, slightly depending on the ligands and
the spin state [138].8 Apart from the two-component π -type bonding between Fe
and NO, there is an additional σ -type bonding, described by a weakly covalent (pre-
dominantly dative) interaction of the nitrogen lone-pair orbital with the Fe 3dz 2 . To
account for slight covalent character of this interaction, the occupied σ Fe–Naxial
orbital, with predominant nitrogen lone-pair character (not shown in Fig. 10), was
included in the active space in Ref. [138].
Figure 10 also gives the principal electronic configurations appearing in the
CASSCF calculations for the LS (S = 1/2) and the HS (S = 3/2) state of the studied
complexes. These configurations cover about 75–80% of the CASSCF wave function
for the LS state and only ∼60% for the HS state. The remaining 20–40% part
of the wave function is distributed over many configurations, none of them reaching
a contribution larger than 10%. Among the most important ones are the doubly-excited
configurations with the electron pair transferred from one of the bonding $(d, \pi^*_{x,y})_b$
orbitals to one of the antibonding ones, $(d, \pi^*_{x,y})_a$. The large role played by these
configurations is reflected in natural occupation numbers of the involved orbitals
(only about 1.7–1.8e for the bonding orbital and 0.2–0.3e for the antibonding one)
and clearly indicates a significant nondynamical left–right correlation appearing for
the Fe–NO bond [138]. One may notice many similarities between the electronic
structure of {FeNO}7 species and that of oxyheme discussed in the previous section.
However, while the Fe–O2 bonding is fairly well described with a two-configurational
wave function (Eq. 16 in Sect. 3.2.2), the situation in Fe–NO complexes is more complicated.
Shaik et al. pointed out that it may be viewed as a pair of two-configurational
wave functions, each based on either the $(d, \pi^*_x)$ or $(d, \pi^*_y)$ bonding and antibonding
combination, and thus corresponding to one of the two π components (the xz- or
yz-one) of the Fe–NO bonding [162].

8 The difference is most pronounced for the [Fe(H2O)5NO]2+ complex, in which, due to the linear
Fe–N–O coordination, the character of the nonbonding orbital is changed to pure Fe $3d_{z^2}$.

Fig. 10 CASSCF natural orbitals and principal electronic configurations for the S = 1/2 and S = 3/2
spin states of FeP(NO). Reprinted with permission from [138]. Copyright (2010) American Chemical
Society
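
The natural occupation numbers quoted above can be condensed into a single number per π component with a small Python sketch. It uses the effective bond order, (n_bonding − n_antibonding)/2, a measure commonly applied to multiconfigurational wave functions; its use here is an illustrative assumption and is not taken from Ref. [138]. The function name is hypothetical and the inputs are just the end points of the ranges given in the text.

```python
def effective_bond_order(n_bonding, n_antibonding):
    """Effective bond order contribution of one bonding/antibonding orbital pair."""
    return 0.5 * (n_bonding - n_antibonding)


# End points of the occupation ranges quoted in the text (in electrons).
for n_b, n_a in [(1.7, 0.3), (1.8, 0.2)]:
    ebo = effective_bond_order(n_b, n_a)
    print(f"n_b = {n_b:.1f}, n_a = {n_a:.1f} -> effective bond order = {ebo:.2f}")
```

A value clearly below 1 for each π component is another way of expressing the significant left–right correlation noted above; a single closed-shell determinant would correspond to exactly 1.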
In order to highlight the electronic structure of {FeNO}7 species at multicon-
figurational level, Radoń et al. reinterpreted the CASSCF wave function by using
localized active orbitals. The authors found that a standard localization procedure (a
Cholesky algorithm) transformed the five bonding/antibonding/nonbonding $(d, \pi^*)$
orbitals (shown in Fig. 10) into nearly pure Fe $3d_{xz}$, $3d_{yz}$, $3d_{z^2}$ and NO $\pi^*_{x,y}$ fragment
orbitals. Since these fragment orbitals come from a unitary transformation of the
active orbitals, the CASSCF wave function is unchanged. However, its interpretation
is greatly simplified: after the transformation to the localized active orbitals, the
principal contributions to the multiconfigurational function can be easily identified
as the VB-type structures.
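After the transformation, the wave function for the HS (quartet) state is dominated
by the following three configurations

\begin{align}
{}^4\Phi_1 &= \left| (d_{x^2-y^2})^{\uparrow} (d_{xy})^{\uparrow} (d_{z^2})^{\uparrow} (d_{xz})^{\uparrow} (d_{yz})^{\uparrow} (\pi^*_x)^{\downarrow} (\pi^*_y)^{\downarrow} \right|, \tag{18a} \\
{}^4\Phi_2 &= \left| (d_{x^2-y^2})^{\uparrow} (d_{xy})^{\uparrow} (d_{z^2})^{\uparrow} (d_{xz})^{\uparrow} (d_{yz})^{2} (\pi^*_x)^{\downarrow} (\pi^*_y)^{0} \right|, \tag{18b} \\
{}^4\Phi_3 &= \left| (d_{x^2-y^2})^{\uparrow} (d_{xy})^{\uparrow} (d_{z^2})^{\uparrow} (d_{xz})^{2} (d_{yz})^{\uparrow} (\pi^*_x)^{0} (\pi^*_y)^{\downarrow} \right|, \tag{18c}
\end{align}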

the first one with Fe(III)–NO− character and the other two with Fe(II)–NO0 charac-
ter.9 All the three quartet VB-type resonance structures given in (18) involve a pairing
between the singly-occupied orbitals with Fe 3d and NO π ∗ character. The antifer-
romagnetic coupling of HS Fe(III)/Fe(II) with NO− /NO0 explains the origin of
significant spin polarization in the HS state (with majority spin-density on Fe and
minority spin-density on NO, cf. Fig. 9). It must be stressed that the spin polarization
in the quartet state cannot be understood by referring to only the principal electronic
configuration (cf. Fig. 10), since it has just three unpaired electrons on Fe and no
singly-occupied orbitals on NO. As for the Fe–O2 species (see above), the antiferromagnetic
coupling arises from admixture of other configurations and becomes
more evident after transforming the wave function to the localized orbitals.
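
The electron-counting rule used to label the configurations (18a–c) can be written as a short Python sketch. It assumes the usual d-electron count (Fe(II) is d6, so the oxidation state is 8 minus the number of Fe 3d electrons) and that neutral NO carries one $\pi^*$ electron; the orbital keys, function names, and dictionary layout are hypothetical choices made for illustration.

```python
D_ORBITALS = ("dx2-y2", "dxy", "dz2", "dxz", "dyz")
PI_STAR = ("pi*_x", "pi*_y")
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV"}
NO_CHARGE_LABEL = {1: "+", 0: "0", -1: "-", -2: "2-"}


def classify(occupations):
    """Return (Fe oxidation state, NO charge) from localized-orbital occupations."""
    n_d = sum(occupations[o] for o in D_ORBITALS)
    n_pi = sum(occupations[o] for o in PI_STAR)
    fe_oxidation_state = 8 - n_d  # Fe(II) is d6, Fe(III) is d5, ...
    no_charge = 1 - n_pi          # neutral NO carries one pi* electron
    return fe_oxidation_state, no_charge


def label(fe_oxidation_state, no_charge):
    """Format a resonance-structure label such as 'Fe(III)-NO-'."""
    return f"Fe({ROMAN[fe_oxidation_state]})-NO{NO_CHARGE_LABEL[no_charge]}"


# Occupations of the localized orbitals in the three quartet configurations (18a-c).
ORBITALS = D_ORBITALS + PI_STAR
CONFIGURATIONS = {
    "18a": dict(zip(ORBITALS, (1, 1, 1, 1, 1, 1, 1))),
    "18b": dict(zip(ORBITALS, (1, 1, 1, 1, 2, 1, 0))),
    "18c": dict(zip(ORBITALS, (1, 1, 1, 2, 1, 0, 1))),
}

for name, occupations in CONFIGURATIONS.items():
    assert sum(occupations.values()) == 7  # every structure is {FeNO}7
    print(name, "->", label(*classify(occupations)))
```

The same counting applies to any other configuration expressed in the localized orbitals, which is what makes the VB-type reading of the CASSCF wave function possible.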
The analogous analysis of the CASSCF wave function for the LS (doublet) state of
the studied complexes produced even more configurations with comparable weights
than found for the quartet states; most of these configurations had either Fe(III)–NO−
or Fe(II)–NO0 character. As for the HS state, some of these configurations
were found to describe the antiferromagnetic coupling (bond pair) of the
Fe(III) (SFe = 3/2) with the NO− (SNO = 1) fragments or the Fe(II) (SFe = 1) with
the NO0 (SNO = 1/2) fragments. However, other configurations described a (local)
singlet state of Fe(II) and an unpaired electron localized on NO0 [138]. To deal with
the large number of configurations, collective weights of all configurations belonging
to a given resonance structure (e.g., all the configurations with Fe(III)–NO− character)
were calculated, thereby yielding the weights of the various participating resonance
structures. In addition to the two already mentioned (FeIII −NO− , FeII −NO0 ), the
other two resonance structures were also identified (FeI −NO+ , FeIV −NO2− ), albeit
with very small weights. The summary of this analysis is depicted as a histogram
plot in Fig. 11. It turned out (rather surprisingly) that all studied {FeNO}7 complexes
are best described as roughly equal mixtures of the FeIII −NO− and FeII −NO0 res-
onance structures. This conclusion qualitatively agrees with Mössbauer spectra of
a number of known {FeNO}7 (S = 3/2) complexes, where the iron isomer shift is
placed consistently between the values characteristic of Fe(II) and Fe(III) states [144].
This wave function composition only slightly depends on the iron ligation; among
complexes with various ligands only the “brown ring” complex [Fe(H2 O)5 NO]2+
appears to have a predominant FeII − NO0 character (still, though, with considerable
participation of the FeIII−NO− structure). Moreover, the contribution of the previously
suggested Fe(I)–NO+ structure [14, 49, 134] is very small, just a few percent. In
other words, the “Fe(I)–NO+” description should merely be regarded as a formal
one, whereas in fact strong $d \rightarrow \pi^*$ backdonation repopulates the “empty” NO $\pi^*$
orbitals (of a hypothetical NO+), placing the effective iron oxidation number between
II and III.

9 The assignment of oxidation states for a given VB-type structure comes down to counting the
number of electrons in the Fe 3d and NO $\pi^*$ fragment orbitals.

[Figure 11: histogram with panels S = 1/2 and S = 3/2; vertical axis: Contribution, % (0 to 100);
one bar per {FeNO}7 complex; legend: Fe(IV)–NO2−, Fe(III)–NO−, Fe(II)–NO0, Fe(I)–NO+, rest]

Fig. 11 Composition of the CASSCF wave function of various {FeNO}7 species in terms of VB-type
resonance structures. Adapted with permission from [138]. Copyright (2010) American Chemical
Society
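
The bookkeeping behind the collective weights can be sketched in Python as follows. The configuration list and its weights are hypothetical numbers chosen only to mimic the qualitative outcome described above (roughly equal Fe(III)–NO− and Fe(II)–NO0 contributions and a few percent of the others); they are not data from Ref. [138]. The weighted-average "effective oxidation number" is likewise an assumed, simple way of condensing the weights into one number, since the text only states that the value lies between II and III.

```python
from collections import defaultdict

# Formal Fe oxidation number attached to each resonance structure.
FE_OXIDATION_NUMBER = {
    "Fe(I)-NO+": 1,
    "Fe(II)-NO0": 2,
    "Fe(III)-NO-": 3,
    "Fe(IV)-NO2-": 4,
}


def collective_weights(configurations):
    """Sum the weights of all configurations sharing a resonance-structure label."""
    totals = defaultdict(float)
    for structure, weight in configurations:
        totals[structure] += weight
    return dict(totals)


def effective_fe_oxidation_number(weights):
    """Weight-averaged Fe oxidation number over the assigned structures only."""
    assigned = {s: w for s, w in weights.items() if s in FE_OXIDATION_NUMBER}
    total = sum(assigned.values())
    return sum(FE_OXIDATION_NUMBER[s] * w for s, w in assigned.items()) / total


# Hypothetical (structure, weight) pairs, NOT taken from Ref. [138].
configurations = [
    ("Fe(III)-NO-", 0.30),
    ("Fe(II)-NO0", 0.25),
    ("Fe(II)-NO0", 0.20),
    ("Fe(III)-NO-", 0.15),
    ("Fe(I)-NO+", 0.03),
    ("Fe(IV)-NO2-", 0.02),
    ("rest", 0.05),
]

weights = collective_weights(configurations)
for structure, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{structure:12s} {100 * weight:5.1f} %")
print(f"effective Fe oxidation number: {effective_fe_oxidation_number(weights):.2f}")
```

With nearly equal Fe(III)–NO− and Fe(II)–NO0 weights the average necessarily falls between II and III, in line with the conclusion drawn in the text.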
The authors of Ref. [138] also pointed out that large differences between the spin
densities of the studied complexes are not reflected in changes of the effective Fe and
NO oxidation states. For instance, the effective oxidation states of Fe and NO groups
are sensitive neither to coordination of the axial ligand [FePNO → FeP(NH3 )NO] nor
to the change of the spin state, although these modifications change the spin density
distribution drastically. The latter fact is fully understandable since the doublet →
quartet transition rests simply on a spin promotion on Fe ($d_{x^2-y^2} \rightarrow d_{xy}$, cf. Fig. 10),
with little participation of NO, which correlates well with very similar N–O distances
and the NO stretching frequencies in both spin states. Based on the summarized
results, it must be noted that a common practice of taking spin densities as a measure
of oxidation states may be unjustified for the studied {FeNO}7 complexes, even if
the spin densities from modern ab initio methods can nowadays be trusted. Finally,
we notice that the presented approach to the assignment of the Fe and NO oxidation
states in {FeNO}7 complexes may be generalized to other metal–nitrosyl complexes.
For instance, an approach analogous to that of Ref. [138] was used recently by Wieghardt
and coworkers for Tp*M(NO) species (where M = Co, Ni) [184].

3.3 High-Valent Iron-Oxo Species

High-valent iron-oxo porphyrins are involved in catalytic oxidation reactions performed
by cytochrome P450 enzymes in nature and by some synthetic iron-porphyrin
complexes in the laboratory [54]. Cytochrome P450 enzymes are by far the most
important examples of oxygenases, performing several types of oxidative processes
(hydroxylation of alkanes and arenes, epoxidation of alkenes, and sulfoxidation of
alkyl sulfides), often with remarkably high regio- and stereoselectivity [111, 164].
It is estimated that ∼75% of drugs in clinical use are metabolized by P450—as well
as steroids, carcinogens, and many other xenobiotics [89]. Apart from the great biological
importance, the interest of the scientific community in P450 enzymes and the
related iron porphyrin complexes stems also from their unique catalytic properties
(e.g., a propensity to activate inert C–H bonds in aliphatic systems), which may
provide inspirations for designing new types of catalysts (synthetic complexes or
enzyme mutants) capable of performing organic transformations that are currently
considered very difficult or impossible.
During the catalytic cycle, the active site of a P450 enzyme is converted (in several
steps) from the resting (ferric) state into a very reactive, short-lived intermediate
that is capable of transferring the oxo group to an organic compound. This active
species is usually assumed to be a ferryl [iron(IV)-oxo] porphyrin π radical cation,
(FeIV O)(P.+ ), which is known in the literature as Compound I (Cpd I). Formation
of P450 Cpd I has never been observed under catalytic turnover conditions (where
the last observable species is a precursor of Cpd I, the so-called Cpd 0 [32]), although
an iron(IV)-oxo porphyrin cation radical species can be generated (via a peroxide
shunt pathway) and detected spectroscopically by mixing an enzyme with a proper
oxidant under stopped-flow conditions [37, 79, 175]. The failure to observe P450 Cpd I in
action constitutes indirect evidence for its high oxidative reactivity (which precludes
its accumulation during the catalytic cycle). On the other hand, however, synthetic
iron(IV)-oxo porphyrin cation radical complexes were, paradoxically, characterized
as sluggish oxidants in hydroxylation of C–H bonds [113]. Moreover, the Cpd I
species of a chloroperoxidase (CPO) enzyme (with very similar active site as in
P450s)—which is more stable than Cpd I of P450 enzymes and was obtained in
higher yields—turned out to be much less reactive towards C–H oxidation than
might be implied given the high reactivity of the P450 enzymes in analogous reactions
[143, 202]. These ambiguous kinetic results, together with the elusive character
of the active intermediate, led some authors to suppose that Cpd I might not be
the actual active species of P450 enzymes. Instead, a ferric hydroperoxy species [19]
or a perferryl [i.e., iron(V)-oxo] electromer of Cpd I (more on which below) [115,
167, 192] have been proposed. Whereas the first possibility has been essentially ruled
out by experimental results [32] and theoretical calculations [164], the second one
is still intriguing, and to some extent supported by recent calculations (vide infra).
Only very recently Rittle and Green obtained a P450 Cpd I (from CYP 119 enzyme)
in much higher yields than in all previous studies with the aid of rapid freeze-quench
technique [143]. They proved that the spectral properties (UV/Vis, EPR, Mössbauer)
of the captured active species are consistent with its ferryl porphyrin cation radical
formulation [(FeIV O)(P.+ )]. Moreover, the authors were able to demonstrate that
the chemically generated Cpd I can hydroxylate aliphatic C–H bonds with high
efficiency, as obviously required for the active species of cytochrome P450 [143].
Due to the elusive character of Cpd I and difficulties in providing experimental data
(especially before the cited work by Rittle and Green), much knowledge about the
physical and chemical properties of this active species was derived from theoretical
calculations [163, 164]. The calculations were extremely helpful already in identifying
the electronic structure of the active oxidant species as an iron(IV) porphyrin
radical cation. Moreover, the calculations fundamentally contributed to formulating
the mechanism of catalytic oxidation for different types of organic substrates. A number
of authors currently attempt to apply DFT calculations on P450 models (more or
less directly) in predictive analyses of drug metabolism [87, 89, 154]. Although theoretical
modeling of enzymatic activity is certainly a fascinating subject, it is beyond the
scope of this contribution and cannot be covered here (comprehensive reviews can be
found elsewhere [112, 164, 165]). Instead, this section is focused on the description
of electronic properties of Cpd I and of the related model complexes with high-valent
iron. The primary purpose here is to summarize recent advances on this front and to
highlight new and important conclusions obtained from ab initio methods.

3.3.1 Electronic Structure of Cpd I: A Triradicaloid Species

In the active site of P450 or CPO enzymes the iron-porphyrin system is axially
coordinated by the cysteine ligand trans to the oxo group. The Cpd I species can be
thus modeled as Fe(=O)P(SH) or Fe(=O)P(SCH3 ) complexes, in which the axial
cysteine is truncated to $\mathrm{SH^-/SCH_3^-}$ (see, e.g., [104, 153]). More extensive models
with better representation of the cysteine and the porphyrin side chains were also
used in DFT and DFT/MM calculations (see, e.g., [156, 157]). However, the basic
electronic features of the Cpd I electronic structure are qualitatively reproduced for
the simplified models, albeit with noticeable effect of enzyme environment on the
electronic structure (vide infra).
In DFT calculations the ground state of Cpd I appears as a triradicaloid species:
two unpaired electrons (residing in quasi-degenerate πFe=O orbitals) are coupled to
a local triplet state on the ferryl group, whereas the remaining unpaired electron is
delocalized on the ligands and is only weakly coupled with the FeO triplet [163,
164]. The character of the ligand radical turned out to depend so strongly on the
environment that Cpd I was called a chameleon species [105, 106]. For the small
models in the gas phase the unpaired electron is described by a molecular orbital that is
a combination of the cysteine-based σS and the porphyrin a2u orbitals. However, after
extending the models to account for the enzyme environment, the singly occupied
ligand orbital changes to mostly porphyrin a2u with a small σS admixture. This results
in a noticeable transfer of spin density from the thiolate sulfur to the porphyrin
ring and, simultaneously, an increase of the negative charge on sulfur. A qualitatively
similar effect has been found not only in QM/MM calculations for realistic enzyme
models [156], but also for simple model complexes [Fe(=O)P(SH)] with two ammonia
molecules added in the vicinity of the cysteine group (to model the S···H−N hydrogen
bonding interactions occurring in the enzyme environment) or even with the
model complex embedded in a continuum solvation medium (with dielectric constant
ε = 5.7 or larger) [104, 105, 135]. In any case, hydrogen bonding and/or a polar
environment reduces the electron-donor properties of the cysteine ligand and thus enhances
the stability of the porphyrin π radical species as compared to the situation in the gas phase.
Several alternative electronic states of Cpd I were also identified in the DFT studies.
These states showed a similar local triplet on the ferryl group, but varied in the
character of the ligand radical, which was located either in the a1u porphyrin π orbital or
in the πS cysteine-based orbital (where πS is a nonbonding orbital of the sulfur atom
perpendicular to the Fe–S axis, in contrast to the parallel σS one) [106, 135].
As already mentioned, the local triplet state on FeO (S1 = 1) is only weakly
coupled with the ligand radical (S2 = 1/2), producing a pair of close-lying states:
with the total quartet (S = 1 + 1/2 = 3/2) or doublet (S = 1 − 1/2 = 1/2) spin,
corresponding to either parallel or antiparallel alignment of the S1 , S2 spins. This
magnetic coupling can be described phenomenologically by an effective Heisenberg–
Dirac–van Vleck spin Hamiltonian

\[ \hat{H}_{\mathrm{HDvV}} = -J\,\hat{S}_1 \cdot \hat{S}_2 , \tag{19} \]

where the J parameter is a measure of exchange interactions: J > 0 corresponds to
ferromagnetic coupling (i.e., quartet below doublet), J < 0 corresponds to antiferromagnetic
coupling (i.e., doublet below quartet). From (19) one can derive a simple
relation between the J parameter and the quartet–doublet splitting, extractable from
quantum calculations:

\[ J = \tfrac{2}{3}\left[ E(S{=}1/2) - E(S{=}3/2) \right] . \tag{20} \]
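
As a brief check of Eq. (20), the eigenvalues of the Hamiltonian (19) follow from the standard identity for the scalar product of two coupled spins. The sketch below assumes ideal spin eigenstates with S1 = 1 and S2 = 1/2, as in the text; it is a worked derivation rather than a statement about any particular calculation.

```latex
\begin{align*}
\hat{S}_1\cdot\hat{S}_2 &= \tfrac{1}{2}\left[\hat{S}^2 - \hat{S}_1^2 - \hat{S}_2^2\right], \\
E(S) &= -\tfrac{J}{2}\left[S(S+1) - S_1(S_1+1) - S_2(S_2+1)\right]
      = -\tfrac{J}{2}\left[S(S+1) - \tfrac{11}{4}\right], \\
E(S{=}3/2) &= -\tfrac{J}{2}, \qquad E(S{=}1/2) = +J, \\
E(S{=}1/2) - E(S{=}3/2) &= \tfrac{3}{2}J
\quad\Longrightarrow\quad
J = \tfrac{2}{3}\left[E(S{=}1/2) - E(S{=}3/2)\right].
\end{align*}
```

With this sign convention a positive J places the quartet below the doublet, in agreement with the definition of ferromagnetic coupling given above.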

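The principal electronic configuration for the S = 3/2 state (assuming the a2u
radical) is a single Slater determinant:

\[ {}^{4}A_{2u} = \left|\ldots(\pi^*_{xz})^{\uparrow}(\pi^*_{yz})^{\uparrow}(a_{2u})^{\uparrow}\right\rangle
 = \left|\ldots(\pi^*_{xz})^{\alpha}(\pi^*_{yz})^{\alpha}(a_{2u})^{\alpha}\right| . \tag{21} \]

However, the principal configuration for the S = 1/2 state (again, assuming the a2u
radical) is a combination of the three Slater determinants given below:

\[ \begin{aligned}
{}^{2}A_{2u} &= \left|\ldots(\pi^*_{xz})^{\uparrow}(\pi^*_{yz})^{\uparrow}(a_{2u})^{\downarrow}\right\rangle
 = \sqrt{\tfrac{2}{3}}\,\left|\ldots(\pi^*_{xz})^{\alpha}(\pi^*_{yz})^{\alpha}(a_{2u})^{\beta}\right| \\
 &\quad - \tfrac{1}{\sqrt{6}}\,\left|\ldots(\pi^*_{xz})^{\alpha}(\pi^*_{yz})^{\beta}(a_{2u})^{\alpha}\right|
 - \tfrac{1}{\sqrt{6}}\,\left|\ldots(\pi^*_{xz})^{\beta}(\pi^*_{yz})^{\alpha}(a_{2u})^{\alpha}\right| .
\end{aligned} \tag{22} \]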

Unfortunately, only the first one (i.e., $|\ldots(\pi^*_{xz})^{\alpha}(\pi^*_{yz})^{\alpha}(a_{2u})^{\beta}|$) is routinely employed
as a Kohn-Sham wave function in DFT calculations. Thus, DFT does not provide
a rigorously correct description of the antiferromagnetic state (S = 1/2), and hence
the formula for obtaining the J value (20) must be modified in order to account for
spin contamination of the S = 1/2 state in DFT calculations [135].
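
One common way to make such a correction is a broken-symmetry (Noodleman-type) argument. The sketch below assumes that the single determinant $|\ldots\alpha\alpha\beta|$ behaves as an ideal mixture of the pure S = 1/2 and S = 3/2 states of the three-spin model and that the quartet determinant is spin-pure. The symbol $E_{\mathrm{BS}}$ is a label introduced here for the energy of the broken-symmetry determinant; the result is an illustration and not necessarily the exact expression used in Ref. [135].

```latex
\begin{align*}
\left|\ldots\alpha\alpha\beta\right| &= \sqrt{\tfrac{2}{3}}\,\Psi(S{=}1/2) + \sqrt{\tfrac{1}{3}}\,\Psi(S{=}3/2), \\
\langle \hat{S}^2 \rangle_{\mathrm{BS}} &= \tfrac{2}{3}\cdot\tfrac{3}{4} + \tfrac{1}{3}\cdot\tfrac{15}{4} = \tfrac{7}{4}, \\
E_{\mathrm{BS}} &= \tfrac{2}{3}\,E(S{=}1/2) + \tfrac{1}{3}\,E(S{=}3/2), \\
E(S{=}1/2) - E(S{=}3/2) &= \tfrac{3}{2}\left[E_{\mathrm{BS}} - E(S{=}3/2)\right], \\
J &= \tfrac{2}{3}\left[E(S{=}1/2) - E(S{=}3/2)\right] = E_{\mathrm{BS}} - E(S{=}3/2).
\end{align*}
```

Under these assumptions, inserting the broken-symmetry energy directly into Eq. (20) in place of E(S=1/2) would underestimate the magnitude of J by a factor of 2/3.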
The anticipated problems of DFT in this area fueled great interest in performing
correlated ab initio calculations as soon as they became technically feasible for
Cpd I models. Here, one should recognize the serious methodological challenges that such
calculations on Cpd I species face, related to a variety of electronic configurations close
in energy and the great role played by electron correlation. The first study at the
multiconfigurational ab initio level was carried out by Schöneboom, Neese and Thiel [157].
They performed CASSCF calculations with the smallest possible active space (3in3)
composed of the two πFe=O orbitals ($\pi^*_{xz}$, $\pi^*_{yz}$) and the porphyrin a2u, followed by
difference-dedicated CI (DDCI2) (see footnote 10). This CASSCF-DDCI approach was applied to
[Fe(=O)P(X)]+ cations (where P was porphin or tetraphenylporphin and the axial
ligand X was H2O or absent), similar to model Cpd I-like complexes studied experimentally,
and to a model of the Cpd I species in P450cam. The latter model (with an
extended cysteine ligand) was considered both in the gas phase and embedded in the
electrostatic charges taken from the preceding DFT:B3LYP/MM study [156] to simulate
the enzyme environment. The work was focused on theoretical prediction of
spectroscopic (EPR, ENDOR, Mössbauer) parameters of Cpd I that would allow
a proper interpretation of the present and prospective experimental results for this
elusive compound. Selected conclusions from this study will be discussed below.
Independently, the authors of this chapter applied the CASSCF/CASPT2 method
to the Fe(=O)P(SH) model in the gas phase [135], with considerably more active orbitals
on the ferryl group (the bonding $\pi_{xz,yz}$ and the antibonding $\pi^*_{xz,yz}$; the bonding $\sigma_{z^2}$ and
antibonding $\sigma^*_{z^2}$) as well as on the porphin ring (a2u, a1u, eg) and on the thiolate ligand
(σS, πS). This study pointed out the importance of sulfur-centered active orbitals (σS,
πS), since in the gas phase (prior to including dynamical correlation) the sulfur-based
triradicaloid states ($^{2,4}\Sigma_S$, $^{2,4}\Pi_S$) had much lower energy at the CASSCF level than the
porphyrin-based states ($^{2,4}A_{2u}$). Moreover, the A2u and ΣS configurations mixed strongly
with each other at the CASSCF level, necessitating the use of a state-averaged (SA)
approach (to capture both simultaneously) and giving rise to large computational
difficulties. As a remedy, the multi-state (MS) CASPT2 approach was used, in which
the states from SA-CASSCF were allowed to re-mix in response to dynamical correlation.
Finally, the ground state with mixed porphyrin–sulfur cation radical character
(a2u–σS) was obtained, having a rather similar character to the ground state obtained
from DFT (B3LYP) calculations. In the next relevant paper, Thiel and coworkers [4]
performed QM/MM calculations for P450cam with the CASSCF-DDCI+Q method,
but with a more extensive active space than in the previous study from the same
group (Ref. [157]). Besides Cpd I, its precursor (Cpd 0) and a hydroxy intermediate
in catalytic camphor hydroxylation were also studied. The active space for Cpd I was
composed of the relevant orbitals on the ferryl group ($\pi_{xz,yz}$, $\pi^*_{xz,yz}$, $\sigma_{z^2}$, $\sigma^*_{z^2}$) and on
the porphyrin ring (a1u, a2u), along with the pair of $\sigma_{xy}$, $\sigma^*_{xy}$ orbitals (to describe the
covalency of the Fe–N(porphyrin) bonding), and two “double-shell”-like orbitals for $3d_{xz,yz}$.
Unfortunately, this (13in12) active space led to convergence only for the quartet, but
not for the doublet states of Cpd I.

Footnote 10: However, the active orbitals in this study were not obtained self-consistently at the CASSCF level but were taken from restricted open-shell DFT (BP86) calculations.

Besides answering many specific questions, a common goal of the three aforementioned
ab initio studies [4, 135, 157] was to validate the picture of the electronic structure
stemming from DFT calculations and, in particular, the description of the magnetic
coupling. Clearly, owing to the single-configurational Kohn-Sham wave function, the DFT
method runs into problems for the antiferromagnetically coupled doublet state (22),
which are reflected in the calculated spin density distributions and in the spin populations
on the ferryl group and on the ligands. While DFT calculations predict a spin
population of ∼+2 on FeO and ∼−1 on the ligands (porphyrin and cysteine counted
together), the ab initio calculations point to spin populations with much smaller absolute
values. In fact, since three different determinants contribute to the correct (spin-adapted)
wave function in Eq. (22), the correct spin population for the ferryl group
should be close to +4/3 (i.e., 2/3 for each of the $\pi^*_{xz}$ and $\pi^*_{yz}$ orbitals), and −1/3 for the
porphyrin and cysteine ligands [22]. Indeed, similar spin populations are obtained in ab
initio calculations that correctly take into account the multiconfigurational character of
the doublet state [22, 135, 157]. Owing to the clear deficiency of the single-determinantal
description, it was suggested that spin-unrestricted DFT calculations should not be
used to calculate spin-dependent properties of Cpd I and the related systems, at least
not without applying an appropriate spin-projection procedure [4, 157].
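
The spin populations quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes an idealized three-orbital model ($\pi^*_{xz}$, $\pi^*_{yz}$, a2u) with no spin polarization or delocalization; because the spin density is a one-electron operator and the three determinants of the doublet differ from each other in two spin orbitals, cross terms vanish and the populations are simply weighted sums over determinants. All names in the code are illustrative.

```python
from fractions import Fraction

ORBITALS = ("pi*_xz", "pi*_yz", "a2u")

# Weights (squared coefficients) of the determinants in the spin-adapted
# doublet; "a"/"b" are the alpha/beta spins of the three singly occupied orbitals.
SPIN_ADAPTED_DOUBLET = {
    ("a", "a", "b"): Fraction(2, 3),
    ("a", "b", "a"): Fraction(1, 6),
    ("b", "a", "a"): Fraction(1, 6),
}

# Single broken-symmetry determinant, as used for the Kohn-Sham doublet.
BROKEN_SYMMETRY = {("a", "a", "b"): Fraction(1)}


def spin_populations(weights):
    """Return <n_alpha - n_beta> for each orbital as a weighted sum over determinants."""
    pops = {name: Fraction(0) for name in ORBITALS}
    for spins, weight in weights.items():
        for name, spin in zip(ORBITALS, spins):
            pops[name] += weight if spin == "a" else -weight
    return pops


def summarize(weights):
    """Return (ferryl spin population, ligand spin population)."""
    pops = spin_populations(weights)
    return pops["pi*_xz"] + pops["pi*_yz"], pops["a2u"]


for label, wavefunction in (("spin-adapted doublet", SPIN_ADAPTED_DOUBLET),
                            ("broken-symmetry determinant", BROKEN_SYMMETRY)):
    ferryl, ligand = summarize(wavefunction)
    print(f"{label}: ferryl {ferryl}, ligand {ligand}")
```

In this model the spin-adapted doublet gives +4/3 on the ferryl group and −1/3 on the ligand, whereas the single determinant gives +2 and −1, which is the contrast described in the text.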
In spite of these problems, the DFT calculations yield rather realistic energetics
relevant to the description of the magnetic coupling in Cpd I. In the mentioned
CASSCF-DDCI/MM study, Neese and Thiel et al. found that this level of theory
gives relative energies of the S = 3/2 and S = 1/2 spin states of Cpd I very similar
to those obtained in the previous B3LYP/MM calculations [157]. Both methods
predict a small antiferromagnetic coupling in P450cam Cpd I (J < 0) and give very
similar values of the doublet–quartet splitting. This state ordering is in agreement
with experimental data for CPO Cpd I. In contrast, for the six-coordinated model
complex [Fe(=O)(TPP•+)(H2O)]+ both B3LYP and DDCI point to a quartet ground
state (J > 0), also in agreement with experimental data. The exchange coupling is
relatively small in both situations, and the EPR spectra of the Cpd I species should
be dominated by the relatively large zero-field splitting (ZFS), which is rooted in the
electronic structure of the ferryl group [157]. It was concluded that, taking spin–orbit
interaction into account, Cpd I should have three close-lying Kramers doublets
(arising from mixing between the S = 1/2 and S = 3/2 spin states), all of which are
populated at room temperature and may potentially contribute to the reactivity [157].
The difference in the sign of J between enzymes and thiolate-ligated complexes
(J < 0), on the one hand, and simple model complexes without a thiolate ligand (J > 0),
on the other, is attributed to the occurrence of an additional, weakly bonding interaction
in the former species: between one of the singly occupied ferryl orbitals ($\pi^*_{xz}$) and the
ligand-radical orbital (a2u–σS) [195]. Without this effect, the quartet configuration (21)
would always have lower energy than the corresponding doublet one (22), since the
parallel spin alignment in the quartet state benefits from a larger exchange stabi-
lization (see Sect. 2). As argued by Weiss et al., this bonding stabilization of the
doublet state can only be effective if the symmetry is lowered (to Cs or C1 ) and
if the ligand radical is at least partially delocalized toward the ferryl group [195].
The presence of a soft sulfur atom provides the necessary delocalization of spin
density, absent from the non-thiolated Cpd I analogues. Green used natural mag-
netic orbitals to provide computational validation of this model on the basis of DFT
calculations [47, 48]. We found, however, that the nonhybrid DFT methods (BLYP,
BP86) point to a much larger (and presumably overestimated) bonding effect than the
hybrid functional (B3LYP), which is in line with the tendency of the former functionals
to overestimate covalent bonding [121]. Consequently, the nonhybrid DFT methods
yield a much larger splitting between the S = 1/2 and S = 3/2 electronic states with
the same radical character (i.e., a2u–σS or πS in Ref. [135]). J values as large
as ∼−2 kcal/mol can be obtained with these functionals, in contrast to J values on
the order of −0.1 kcal/mol obtained from hybrid DFT (the latter being more consistent
with experimental data). Neese and Thiel et al. noticed that the considered energy
lowering of the doublet state is essentially a multiconfigurational effect, being
described in CASSCF by mixing of the antiferromagnetically coupled doublet
configuration (22) with other doublet configurations [4].
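
To give a feeling for the magnitudes involved, the J values quoted above can be converted into doublet–quartet gaps through the ideal relation of Eq. (20), E(S=1/2) − E(S=3/2) = (3/2)J. The snippet below is a minimal sketch that assumes pure spin states and uses the standard conversion 1 kcal/mol ≈ 349.755 cm⁻¹; the function name is illustrative.

```python
KCAL_PER_MOL_TO_WAVENUMBER = 349.755  # cm^-1 per kcal/mol (standard conversion factor)


def doublet_quartet_gap(j):
    """E(S=1/2) - E(S=3/2) implied by the ideal Heisenberg relation, in the units of j."""
    return 1.5 * j


# J values (kcal/mol) of the magnitude quoted in the text for the two classes of functionals.
for label, j in (("hybrid DFT", -0.1), ("nonhybrid DFT", -2.0)):
    gap = doublet_quartet_gap(j)
    print(f"{label}: J = {j:+.1f} kcal/mol -> gap = {gap:+.2f} kcal/mol "
          f"({gap * KCAL_PER_MOL_TO_WAVENUMBER:+.0f} cm^-1)")
```

A negative gap means that the doublet lies below the quartet, i.e., antiferromagnetic coupling in the convention of Eq. (19).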

3.3.2 The Iron Spin State and Perferryl Versus Ferryl-Radical Electromery in Cpd I

The most recent theoretical studies of Cpd I [21] or the related, Cpd I-like mod-
els [139] are largely focused on the energetics of the variety of their electromeric
states (i.e., electronic states with different iron spin or different charge distribution).
This problem is illustrated in Fig. 12. Starting from the “traditional” triradicaloid
electronic structure of iron(IV)-oxo porphyrin radical cation (top–left part of the
scheme), one may consider that: (a) the iron(IV)-oxo group is promoted to its high-
spin state (S = 2), yielding pentaradicaloid states (shown in the top–right part of
Fig. 12) or (b) the open-shell porphyrin radical cation receives one electron from the iron,
becoming a closed-shell porphyrin, whereas the iron is oxidized to the perferryl state
[Fe(V)-oxo] (the bottom part of Fig. 12). With its Fe d3 configuration, the hypothetical
perferryl electromer of Cpd I may exist in the low- and high-spin states.
Moreover, the tri- and pentaradicaloid states may have a radical either in a2u or a1u
(or, in case of thiolate-ligated systems, alternatively in σS or πS ). In addition, for each
type of radical the ferryl multiplet and the ligand radical doublet may couple either to
ferro- or antiferromagnetic states, thus yielding in each case a pair of close lying spin
states: quartet/doublet for the triradicaloids, sextet/quartet for the pentaradicaloids.
Thus, there are plenty of possible iron(IV) and iron(V) electronic states with different
spin to be considered for Cpd I and Cpd I-related complexes. Radoń et al. pointed
out [139] that a reliable theoretical description of these electromeric states will have
to deal with at least two tricky issues: spin state energetics of the iron (see Sect. 3.1)
and the question of ligand noninnocence (see below) [42].
The perferryl versus ferryl-radical electromery here is, indeed, an example
of a more general issue of ligand innocence/noninnocence in transition metal
complexes [42, 133]. In this view, the porphyrin is “noninnocent” in the tri-
/pentaradicaloid states, but it would be “innocent” in the (hypothetical) perferryl
state. Although true Fe(V) complexes are exceedingly rare [34], some macrocycles—
notably TAML [108] and possibly also corrole [59] —can stabilize this high oxidation
state. A natural question thus arises whether porphyrin can do the same without being
immediately oxidized. On the basis of laser flash photolysis (LFP) experiments it
was speculated that the perferryl electromer of a Cpd I-like complex is initially formed
during photochemically induced oxidation and that it is stable enough to be “seen”
in UV/Vis spectroscopy before it isomerizes to a more stable ferryl-radical form by
an intramolecular electron transfer [114, 115, 201].

Fig. 12 Electronic structure of an iron-oxo porphyrin compound in its various possible electromeric states. The electronic states are distinguished by specifying the oxidation state (IV or V) and the local spin on the iron (e.g., ³Fe(IV), ⁴Fe(V)), along with the radical character and the local spin on the porphyrin: either closed-shell ¹P or radical cation ²P•+ of the two types (a2u or a1u). For simplicity the scheme is shown for a model iron-oxo porphyrin complex [FeO(P)+] (considered in Ref. [139]); hence it does not show electronic states with a sulfur-based radical, which may additionally appear in thiolate-ligated Cpd I models. Reprinted with permission from [139]. Copyright (2011) American Chemical Society

Low-lying electronic states of Cpd I in P450cam and CPO enzymes were studied
by Chen et al. [21]. The QM part of both models [Fe(=O)P(SH)] was surrounded by
electrostatic charges from the preceding B3LYP/MM simulations of the respective
enzyme (these calculations also provided the structure of the QM part). Radoń et al. [139],
in turn, studied the electromeric states for model (Cpd I-like) complexes,
Fe(=O)P+ and Fe(=O)P(Cl), in the gas phase and in a continuum solvation model
(to estimate the effect of polarization). The models of Radoń et al. were focused
on the description of the oxo-iron-porphyrin fragment and hence intentionally did not
include a thiolate ligand, because its soft character (in particular the mixing of σS with
porphyrin a2u) makes the electronic structure significantly more complicated. The
authors of Ref. [139] were able to perform state-specific calculations for each state
considered at its equilibrium structure (obtained from DFT calculations), and
thereby to provide the adiabatic relative energies. In contrast, Chen et al. obtained the
vertical energies at the common structure of the triradicaloid state (obtained from
DFT/MM optimization of each enzyme model).
Both mentioned studies employed large active spaces to describe the Fe atom
(with the double-shell effect), the covalent Fe=O and Fe–N(porphyrin) bonding, as well as
the ligand radical on porphyrin (or, optionally, on sulfur in Ref. [21]). However, by
making active not only the porphyrin HOMOs (a2u and a1u) but also the (degenerate)
LUMO (eg), Radoń et al. managed to describe all the states considered with
a common active space (15in16). In contrast, Chen et al. used slightly different
active spaces for different states, adapting them to the orbital occupancies of the state
being calculated in state-specific calculations. The electronic states with porphyrin-
or sulfur-radical character were described with an active space (13in13) containing
only one of the relevant ligand orbitals (a2u, a1u, σS, or πS), i.e., the one that is
approximately singly occupied in the electronic state considered, whereas for the
energies of the perferryl states the authors used a smaller active space (11in12), with
none of these ligand orbitals included. Such a procedure was (in part) validated by
performing selected calculations with an active space (15in14) containing both the a2u
and a1u orbitals. Nonetheless, despite these notable differences in the computational
methodology and different choices of the model systems, both studies
arrived at quite similar conclusions regarding the stability of the various electromeric
states.
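
To illustrate why the choice of active space matters for feasibility, the number of spin-adapted configurations spanned by N electrons in n orbitals with total spin S can be estimated from the Weyl dimension formula. The sketch below reads "(NinM)" as N electrons in M orbitals, ignores spatial symmetry (so it gives an upper bound for any one irreducible representation), and uses the quartet only as an example spin state; the function name and the argument `twice_s` (equal to 2S) are illustrative.

```python
from math import comb


def n_csf(n_electrons, n_orbitals, twice_s):
    """Number of configuration state functions from the Weyl dimension formula.

    twice_s is 2S, so that half-integer spins can be passed as integers.
    """
    if twice_s < 0 or twice_s > n_electrons or (n_electrons - twice_s) % 2:
        raise ValueError("electron count and spin are incompatible")
    a = (n_electrons - twice_s) // 2       # N/2 - S
    b = (n_electrons + twice_s) // 2 + 1   # N/2 + S + 1
    n = n_orbitals + 1
    # (2S+1)/(n+1) * C(n+1, N/2 - S) * C(n+1, N/2 + S + 1); the product is divisible by n
    return (twice_s + 1) * comb(n, a) * comb(n, b) // n


# Active spaces mentioned in the text, written as (electrons, orbitals).
for electrons, orbitals in ((3, 3), (11, 12), (13, 12), (13, 13), (15, 14), (15, 16)):
    print(f"({electrons}in{orbitals}) quartet: {n_csf(electrons, orbitals, 3)} CSFs")
```

The count grows steeply with the number of active orbitals, which is one reason why a restricted active space partitioning becomes attractive for still larger spaces.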
Perhaps the most intriguing result from [21, 139] is that the perferryl states were
found at surprisingly low energies in both sets of calculations. For the enzyme
models, the LS Fe(V) state was found only 6–7 kcal/mol above the a2u-based Fe(IV)
triradicaloid, and the HS Fe(V) state at an even lower relative energy (∼2 kcal/mol).
Considering the adiabatic energies, Radoń et al. found the analogous perferryl states
even below the triradicaloid states for both Fe(=O)P+ and Fe(=O)P(Cl). However,
this is the situation in the gas phase, whereas upon considering the effect of a polarizable
medium these states are shifted up in energy by a few kcal/mol (see footnote 11). Nonetheless,
even in a polar medium the perferryl states are still low-lying (within 10 kcal/mol
of the triradicaloid states). These results are surprising given that in all previous
DFT calculations the perferryl states of Cpd I were found very high in energy
(16–26 kcal/mol already in the gas phase, and presumably even higher in a polar medium or
protein environment) [3, 106]. However, all DFT studies so far were based on the B3LYP
functional, whereas Radoń et al. found that the relative energies of the perferryl
Fe(V)(P) and ferryl-radical Fe(IV)(P•+) states are considerably functional dependent.
The hybrid functionals (B3LYP, B3LYP*) place the perferryl states at high energy,
while the nonhybrid ones (BP86, OLYP) place them at much lower energy, in accord with the ab
initio predictions. This issue is an intriguing and potentially important one because
most DFT calculations for Cpd I systems and of their reactivity are based solely on
the B3LYP functional.

Footnote 11: The medium effect ranges from 2–4 kcal/mol for the five-coordinated Fe(=O)P+ to 4–8 kcal/mol for the six-coordinated Fe(=O)P(Cl) complex, with only a slight dependence on the dielectric constant (ε = 5.7 and 78 were tested) and almost no dependence on the exchange-correlation functional or on the specific solvent model (PCM, COSMO) used in the calculations. This suggests that the effect is rooted simply in the larger electrostatic stabilization of the iron(IV)-oxo porphyrin cation radical states as compared to the iron(V)-oxo states with a closed-shell porphyrin.

In both mentioned papers many cross-checks were carried out to test whether the
perferryl states are artificially lowered in energy [21, 139]. Initially suspecting
that the original active space (15in16) might not be extensive enough for a balanced
description of the iron-to-porphyrin charge transfer, the authors of Ref. [139] tested the
effect of enlarging the active space on the porphyrin ring. Their suspicion was based
on previous experience with copper corroles [131], for which there is a copper-to-corrole
charge transfer analogous to the iron-to-porphyrin charge transfer here. For this
purpose, the restricted active space
approach (RASSCF/RASPT2) was used, since the resulting active spaces were too
large to be handled in conventional CASSCF/CASPT2. The final active space proposed for
RASSCF/RASPT2 calculations was based on as many as 28 active orbitals, including
16 π, π* orbitals on the porphyrin ring (see footnote 12). However, this substantial enlargement of
the active space led to an even slightly larger stabilization of the Fe(V) states, confirming
that they indeed lie low in energy, as the CASPT2 calculations had already suggested.
Moreover, Radoń et al. also carried out benchmark CCSD(T) calculations for a small
model complex Fe(=O)(L2)+ (where L = C3N2H5−, analogous to the small models shown
in Fig. 3), which can be considered a mimic of Fe(=O)P+. Interestingly, the relative
energies of the perferryl and ferryl-radical states for this model are very similar at the
RASPT2 and CCSD(T) levels, suggesting that the perferryl–ferryl gaps for larger
models are also correctly reproduced at the CASPT2 or RASPT2 level.
If, as the ab initio methods indicate, the perferryl states lie just a few kcal/mol
above the triradicaloid states, they should be thermally accessible at room temperature
and they might, indeed, appear in the mentioned LFP experiments (vide supra).
It is also possible that the metastable perferryl state can contribute to the reactivity
of Cpd I, for instance through the previously suggested mechanism [59, 167],
in which the perferryl electromer of Cpd I is initially formed (by a heterolytic FeO–O
bond cleavage in Cpd 0, the Cpd I precursor) and quickly oxidizes an organic
substrate (owing to a small activation energy on this pathway) before it can isomerize to
a more stable ferryl-radical form of Cpd I. We notice that very recent DFT calculations
by Isobe and coworkers seem to support the possibility of this intriguing scenario
in the hydroxylation of aliphatic C–H bonds [70, 71].
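
A rough sense of what thermal accessibility means for gaps of this size can be obtained from a Boltzmann factor. The sketch below is an order-of-magnitude illustration only: it assumes equilibrium between two states at 298.15 K and treats the quoted electronic energy gaps as free-energy differences, ignoring degeneracies, zero-point and entropic contributions, and environmental effects. The function name is illustrative.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal/(mol K)


def boltzmann_ratio(delta_e, temperature=298.15):
    """Equilibrium population of a state delta_e (kcal/mol) above the ground state,
    relative to the ground state: exp(-delta_e / RT)."""
    return math.exp(-delta_e / (GAS_CONSTANT * temperature))


# Energy gaps of the magnitude discussed in the text (kcal/mol).
for delta_e in (2.0, 6.0, 10.0):
    print(f"gap {delta_e:4.1f} kcal/mol -> relative population {boltzmann_ratio(delta_e):.1e}")
```

Such estimates concern equilibrium populations only; a state that is formed directly in a reaction step, as in the mechanism discussed above, does not need to be thermally populated from the ground state.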
In other respects (i.e., apart from predicting the existence of low-lying perferryl
states) the two mentioned ab initio studies generally support the view of the Cpd I
electronic structure established earlier on the basis of DFT and DFT/MM calculations
(with the hybrid B3LYP functional). As regards the triradicaloid states, many possibilities
were considered by Chen et al. The sulfur-based radicaloids (σS, πS) were
found very high in energy for both enzyme models because they are destabilized
in the protein environment. The a1u porphyrin-based radicaloid states were found
18–19 kcal/mol higher than the corresponding a2u-based radicaloids. This result falls
between the DFT/MM result (12 kcal/mol) [3] and the previous gas-phase CASPT2
estimate (∼23 to 25 kcal/mol) [135], but it is much higher than predicted by Altun et
al. in their CASSCF-DDCI+Q/MM study [4]. It was found that in correlated calculations
a reasonable splitting between the two types of states is recovered only after
including dynamical correlation (e.g., in the CASPT2 step) [22]. Moreover, Radoń et
al. noticed that for the model complexes the gap between the a1u- and a2u-based
radicaloids is significantly affected by the quality of the active space. The standard one
(with only four frontier orbitals on the porphyrin: a1u, a2u, and the degenerate eg) cannot
provide a realistic splitting at the CASPT2 level, but RASPT2 calculations with the
larger active space (vide supra) resolved this problem. Concerning the triradicaloid
$^{2,4}A_{2u}$ or $^{2,4}A_{1u}$ states, the standard active space also points to the doublet-below-quartet
state ordering, which is not correct for these model complexes without the
axial thiolate (see Ref. [139]). Again, this problem is corrected in the RASPT2 calculations
based on the extended active space.

Footnote 12: In these calculations only the 6 ferryl-based orbitals ($\pi_{xz,yz}$, $\pi^*_{xz,yz}$, $\sigma_{z^2}$, $\sigma^*_{z^2}$) and all (remaining) singly occupied orbitals (e.g., a2u in the tri-/pentaradicaloids) were placed in the RAS2 subspace (see Sect. 2.2), whereas the other active orbitals were placed in RAS1 (if nearly doubly occupied) or in RAS3 (if nearly empty).

Finally, both cited studies also considered the relative energy of the pentaradicaloid
Fe(IV) states (with the HS state on the iron) as compared to the triradicaloid Fe(IV)
states (with the IS state on the iron). Here, however, the CASPT2 or RASPT2 spin
state energetics is probably biased in favor of the HS state, as was found for the
ferrous complexes discussed in Sect. 3.1. Such behavior is indeed suggested by comparison
with the CCSD(T) calculations for the small mimicking complex (vide supra)
carried out by Radoń et al. [139]. Therefore, the present CASPT2 (or RASPT2) calculations
most likely place these pentaradicaloid states of Cpd I too low in energy. Taking
all this into account, Chen and Shaik suggested that the previous DFT (B3LYP) gap
of ∼12 kcal/mol between the tri- and pentaradicaloid states (see Ref. [3]) may in fact be a good
estimate [22]. Although the pentaradicaloid states have higher energy than the triradicaloid
ones at the equilibrium geometry of Cpd I, they are considerably stabilized
along the reaction pathway for C–H hydroxylation via exchange interactions [3].
This effect may potentially give rise to exchange-enhanced reactivity of Cpd I, provided
that the pentaradicaloid state is not too high in energy at the beginning of
the reaction [166]. A final resolution of this issue will require more reliable calculations
of spin state energetics, which are not currently possible (see Sect. 3.1).
In sum, correlated ab initio calculations for Cpd I revealed a number of low-lying
electromeric states, with different chemical character, some of which were missed in
previous DFT calculations. The pentaradicaloid and perferryl states may potentially
contribute to multi-state reactivity of Cpd I, and this issue certainly deserves careful
investigation in the near future.

4 Concluding Remarks

This chapter summarized recent advances in the quantum chemical description of iron
porphyrin complexes and models of heme proteins (e.g., myoglobin, hemoglobin,
cytochromes P450). Making conclusive calculations for these species requires a balanced
treatment of electron correlation, including nondynamical correlation, which
is presently a great challenge for computational methods. After introducing the
DFT and correlated ab initio methods, including single-reference [e.g., CCSD(T)]
and multireference approaches [e.g., CASSCF/CASPT2], several case studies were
reviewed to show how these various computational methods deal with description
of electronic structure and properties of the biologically relevant iron porphyrins.
The discussion was particularly focused on spin state energetics of ferrous and ferric
complexes, binding properties of CO, NO, and O2 ligands to heme, and electronic
structure of P450 Cpd I and similar systems.
While DFT calculations for heme and heme-related complexes have been
carried out for a long time, performing ab initio calculations has become feasible only
in the last few years. Although correlated ab initio calculations are more time-consuming
than “ordinary” DFT calculations, they have already turned out to be indispensable in
a number of respects. Firstly, owing to their systematic character and high accuracy, ab initio
calculations made it possible to resolve many ambiguous results stemming from the use of
approximate exchange–correlation functionals in DFT. It was shown, for instance,
that CASPT2 accurately reproduces experimental binding energies of CO, NO, and
O2 to heme, unlike the majority of widely used DFT methods. For Cpd I and similar
high-valent iron-oxo porphyrins, ab initio calculations predict low-lying electromeric
states (with Fe(V)=O character) that previously went unnoticed in
DFT calculations, but are now recognized as potentially important in multi-state
reactivity. Secondly, multiconfigurational ab initio calculations offer a new and in-depth
insight into the complicated bonding mechanism in oxyheme and nitrosylheme,
which has been a subject of long-standing debate. Concerning this point, it was
shown that reinterpretation of the CASSCF wave function in valence bond (VB)-type
language can be used to describe the Fe–O2 and Fe–NO bonding situations in terms
of chemically intuitive resonance structures. However, despite the unquestionable
achievements, certain problems and limitations have also been identified in the course
of the ab initio calculations, obliging researchers to use these methods carefully
and in line with the experience and methodological guidance gathered so far.
In some of the cases reviewed, comparison with ab initio calculations essentially
confirmed the adequacy of the DFT description. In other cases, however, important
discrepancies between DFT and ab initio calculations have been found, concerning
both the accuracy of approximate exchange–correlation functionals and interpre-
tation of the DFT results. A general rule for dealing with these difficult cases is
to cross-check various computational approaches before reaching the final conclu-
sions. It is noteworthy that ab initio methods, even if sometimes too expensive to be
applied to realistic enzyme models, are suitable for studying smaller model systems
with analogous electronic structure; such calculations can be used to benchmark
the quality of DFT description and to suggest the choice of the most appropriate
exchange–correlation functional. Therefore, while DFT remains the main computational
approach, particularly for modeling reaction mechanisms, ab initio methods
turn out to be very useful in resolving exceptionally intricate electronic structures,
such as those found in many iron porphyrin and porphyrin-type complexes reviewed in
this chapter.

Acknowledgements This research project was supported by grant no. UMO-2011/01/B/ST4/02620
from the National Science Centre (Poland) and by grant no. IP2011 044471 from the Ministry of
Science and Higher Education (Poland). This scholarly work was made possible thanks to the POWIEW project,
which is co-funded by the European Regional Development Fund (ERDF) as a part of the Innovative
Economy program. This publication was made possible through the financial support from the
Foundation for Polish Science (START scholarship provided for M.R.). We also acknowledge
computational grants from Academic Computer Center CYFRONET AGH in Kraków, WCSS in
Wroclaw (grant no. 181), and CI TASK in Gdańsk.

References

1. Adler, T.B., Knizia, G., Werner, H.J.: A simple and efficient CCSD(T)-F12 approximation. J.
Chem. Phys. 127(22), 221106 (2007). https://doi.org/10.1063/1.2817618
2. Ali, M.E., Sanyal, B., Oppeneer, P.M.: Electronic structure, spin-states, and spin-crossover
reaction of heme-related Fe-porphyrins: a theoretical perspective. J. Phys. Chem. B 116(20),
5849–5859 (2012). https://doi.org/10.1021/jp3021563
3. Altun, A., Shaik, S., Thiel, W.: What is the active species of cytochrome P450 during camphor
hydroxylation? QM/MM studies of different electronic states of compound I and of reduced
and oxidized iron-oxo intermediates. J. Am. Chem. Soc. 129(29), 8978–8987 (2007). https://
doi.org/10.1021/ja066847y
4. Altun, A., Kumar, D., Neese, F., Thiel, W.: Multireference ab initio quantum mechan-
ics/molecular mechanics study on intermediates in the catalytic cycle of cytochrome P450cam.
J. Phys. Chem. A 112, 12904–12910 (2008). https://doi.org/10.1021/jp802092w
5. Andersson, K., Malmqvist, P.Å., Roos, B.O.: Second-order perturbation theory with a complete
active space self-consistent field reference function. J. Chem. Phys. 96(2), 1218–1226 (1992)
6. Angeli, C., Borini, S., Cavallini, A., Cestari, M., Cimiraglia, R., Ferrighi, L., Sparta, M.:
Developments in the N-electron valence state perturbation theory. Int. J. Quantum. Chem.
106(3), 686–691 (2006). https://doi.org/10.1002/qua.20831
7. Aquilante, F., Malmqvist, P.Å., Pedersen, T.B., Ghosh, A., Roos, B.O.: Cholesky
decomposition-based multiconfiguration second-order perturbation theory (CD-CASPT2):
application to the spin-state energetics of CoIII(diiminato)(NPh). J. Chem. Theory Comput.
4(5), 694–702 (2008). https://doi.org/10.1021/ct700263h
8. Balabanov, N.B., Peterson, K.A.: Systematically convergent basis sets for transition metals.
I. All-electron correlation consistent basis sets for the 3d elements Sc–Zn. J. Chem. Phys.
123, 064107 (2005). https://doi.org/10.1063/1.1998907
9. Bartlett, R.J., Musial, M.: Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys. 79,
291–352 (2007). https://doi.org/10.1103/RevModPhys.79.291
10. Barysz, M.: Two-component relativistic theories. In: Barysz, M., Ishikawa, Y. (eds.) Relativistic
Methods for Chemists, Challenges and Advances in Computational Chemistry and
Physics, vol. 10, pp. 165–190. Springer, The Netherlands (2010). https://doi.org/10.1007/978-1-
4020-9975-5_4
11. Blomberg, L.M., Blomberg, M.R., Siegbahn, P.E.: A theoretical study of the binding of O2,
NO and CO to heme proteins. J. Inorg. Biochem. 99, 949–958 (2005). https://doi.org/10.
1016/j.jinorgbio.2005.02.014
12. Blomberg, M.R., Johansson, A.J., Siegbahn, P.E.: O–O bond cleavage in dinuclear peroxo
complexes of iron porphyrins: a quantum chemical study. Inorg. Chem. 46, 7992–7997 (2007)
13. Brucker, E., Olson, J., Ikeda-Saito, M., Phillips Jr., G.: Nitric oxide myoglobin: crystal structure
and analysis of ligand geometry. Proteins 30, 352–356 (1998). https://doi.org/10.1002/(SICI)1097-
0134(19980301)30:4<352::AID-PROT2>3.0.CO;2-L
14. Burlamacchi, L., Martini, G., Tiezzi, E.: Electron spin resonance of iron-nitric oxide com-
plexes. Iron-nitrosyl-halide compounds. Inorg. Chem. 8(9), 2021–2025 (1969). https://doi.
org/10.1021/ic50079a047
814 M. Radoń and E. Broclawik

15. Caffarel, M.: Quantum Monte Carlo in chemistry. In: Engquist, B. (ed.) Encyclopedia of
Applied and Computational Mathematics. Springer, Berlin (2011)
16. Caffarel, M., Daudey, J.P., Heully, J.L., Ramírez-Solís, A.: Towards accurate all-electron
quantum Monte Carlo calculations of transition-metal systems: spectroscopy of the copper
atom. J. Chem. Phys. 123, 094102 (2005). https://doi.org/10.1063/1.2011393
17. Cao, X., Dolg, M.: Relativistic pseudopotentials. In: Barysz, M., Ishikawa, Y. (eds.) Rela-
tivistic Methods for Chemists, Challenges and Advances in Computational Chemistry and
Physics, vol. 10, pp. 215–277. Springer, The Netherlands (2010). https://doi.org/10.1007/
978-1-4020-9975-5_6
18. Capece, L., Estrin, D.A., Marti, M.A.: Dynamical characterization of the heme NO oxygen
binding (HNOX) domain. Insight into soluble guanylate cyclase allosteric transition. Bio-
chemistry 47(36), 9416–9427 (2008). https://doi.org/10.1021/bi800682k
19. Chandrasena, R.E.P., Vatsis, K.P., Coon, M.J., Hollenberg, P.F., Newcomb, M.: Hydroxylation
by the hydroperoxy-iron species in cytochrome P450 enzymes. J. Am. Chem. Soc. 126(1),
115–126 (2004). https://doi.org/10.1021/ja038237t
20. Chen, H., Ikeda-Saito, M., Shaik, S.: Nature of the Fe-O2 bonding in oxy-myoglobin: effect
of the protein. J. Am. Chem. Soc. 130(44), 14778–14790 (2008). https://doi.org/10.1021/
ja805434m
21. Chen, H., Song, J., Lai, W., Wu, W., Shaik, S.: Multiple low-lying states for compound I of
P450cam and chloroperoxidase revealed from multireference ab initio QM/MM calculations.
J. Chem. Theory Comput. 6(3), 940–953 (2010). https://doi.org/10.1021/ct9006234
22. Chen, H., Lai, W., Shaik, S.: Multireference and multiconfiguration ab initio methods in heme-
related systems: what have we learned so far? J. Phys. Chem. B 115(8), 1727–1742 (2011).
https://doi.org/10.1021/jp110016u
23. Chen, O., Groh, S., Liechty, A., Ridge, D.P.: Binding of nitric oxide to iron(II) porphyrins: radiative
association, blackbody infrared radiative dissociation, and gas-phase association equilibrium.
J. Am. Chem. Soc. 121, 11910–11911 (1999). https://doi.org/10.1021/ja991477h
24. Choe, Y.K., Hashimoto, T., Nakano, H., Hirao, K.: Theoretical study of the electronic ground
state of iron(II) porphine. Chem. Phys. Lett. 295, 380–388 (1998)
25. Choe, Y.K., Nakajima, T., Hirao, K., Lindh, R.: Theoretical study of the electronic ground
state of iron(II) porphine. J. Chem. Phys. 111(9), 3837–3845 (1999). https://doi.org/10.1063/
1.479687
26. Collman, J.P.: Functional analogs of heme protein active sites. Inorg. Chem. 36(23), 5145–
5155 (1997). https://doi.org/10.1021/ic971037w
27. Collman, J.P., Hoard, J.L., Kim, N., Lang, G., Reed, C.A.: Synthesis, stereochemistry, and
structure-related properties of α,β,γ,δ-tetraphenylporphinatoiron(II). J. Am. Chem. Soc.
97, 2676–2681 (1975). https://doi.org/10.1021/ja00843a015
28. Collman, J.P., Brauman, J.I., Iverson, B.L., Sessler, J.L., Morris, R.M., Gibson, Q.H.: O2
and CO binding to iron(II) porphyrins: a comparison of the “picket fence” and “pocket”
porphyrins. J. Am. Chem. Soc. 105, 3052–3064 (1983)
29. Conradie, J., Ghosh, A.: DFT calculations on the spin-crossover complex Fe(salen)(NO): a
quest for the best functional. J. Phys. Chem. B 111, 12621–12624 (2007). https://doi.org/
10.1021/jp074480t
30. Conradie, J., Quarless, D., Hsu, H.F., Harrop, T., Lippard, S., Koch, S., Ghosh, A.: Electronic
structure and FeNO conformation of nonheme iron-thiolate-NO complexes: an experimental
and DFT study. J. Am. Chem. Soc. 129(34), 10446–10456 (2007). https://doi.org/10.1021/
jp076979t
31. Cramer, C.J., Truhlar, D.G.: Density functional theory for transition metals and transition
metal chemistry. Phys. Chem. Chem. Phys. 11, 10757–10816 (2009). https://doi.org/10.
1039/b907148b
32. Davydov, R., Makris, T.M., Kofman, V., Werst, D.E., Sligar, S.G., Hoffman, B.M.: Hydrox-
ylation of camphor by reduced oxy-cytochrome P450cam: mechanistic implications of EPR
and ENDOR studies of catalytic intermediates in native and mutant enzymes. J. Am. Chem.
Soc. 123(7), 1403–1415 (2001). https://doi.org/10.1021/ja003583l
33. Denisov, I.G., Makris, T.M., Sligar, S.G., Schlichting, I.: Structure and chemistry of
cytochrome P450. Chem. Rev. 105, 2253–2278 (2005). https://doi.org/10.1021/cr0307143
34. Dey, A., Ghosh, A.: “True” iron(V) and iron(VI) porphyrins: a first theoretical exploration. J.
Am. Chem. Soc. 124(13), 3206–3207 (2002). https://doi.org/10.1021/ja012402s
35. Dolphin, D., Sams, J.R., Tsin, T.B., Wong, K.L.: Synthesis and Moessbauer spectra of
octaethylporphyrin ferrous complexes. J. Am. Chem. Soc. 98, 6970–6975 (1976). https://
doi.org/10.1021/ja00438a037
36. Dunning, T.H.: Gaussian basis sets for use in correlated molecular calculations. I. The atoms
boron through neon and hydrogen. J. Chem. Phys. 90(2), 1007–1023 (1989). https://doi.org/
10.1063/1.456153
37. Egawa, T., Shimada, H., Ishimura, Y.: Evidence for compound I formation in the reaction
of cytochrome-P450cam with m-chloroperbenzoic acid. Biochem. Biophys. Res. Commun.
201(3), 1464–1469 (1994). https://doi.org/10.1006/bbrc.1994.1868
38. van Eldik, R.: Fascinating inorganic/bioinorganic reaction mechanisms. Coord. Chem. Revs.
251(13–14), 1649–1662 (2007). https://doi.org/10.1016/j.ccr.2007.02.004. (37th Interna-
tional Conference on Coordination Chemistry, Cape Town, South Africa)
39. Ellison, M., Schulz, C., Scheidt, W.: Structure of the deoxymyoglobin model [Fe(TPP)(2-
MeHIm)] reveals unusual porphyrin core distortions. Inorg. Chem. 41(8), 2173–2181 (2002).
https://doi.org/10.1021/ic020012g
40. Enemark, J., Feltham, R.: Principles of structure, bonding, and reactivity for metal nitro-
syl complexes. Coord. Chem. Revs. 13(4), 339–406 (1974). https://doi.org/10.1016/S0010-
8545(00)80259-3
41. Frenking, G., Fröhlich, N.: The nature of the bonding in transition-metal compounds. Chem.
Rev. 100, 717–774 (2000)
42. Ghosh, A.: Transition metal spin state energetics and noninnocent systems: challenges for
DFT in the bioinorganic area. J. Biol. Inorg. Chem. 11, 712–724 (2006)
43. Goddard III, W.A., Olafson, B.D.: Ozone model for bonding of an O2 to heme in oxyhe-
moglobin. Proc. Nat. Acad. Sci. 72, 2335–2339 (1975)
44. Goff, H., La Mar, G.N.: High-spin ferrous porphyrin complexes as models for deoxymyo-
globin and -hemoglobin: a proton nuclear magnetic resonance study. J. Am. Chem. Soc. 99,
6599–6606 (1977). https://doi.org/10.1021/ja00462a022
45. Goff, H., La Mar, G.N., Reed, C.A.: Nuclear magnetic resonance investigation of magnetic
and electronic properties of “intermediate spin” ferrous porphyrin complexes. J. Am. Chem.
Soc. 99, 3641–3646 (1977). https://doi.org/10.1021/ja00453a022
46. Goodrich, L.E., Paulat, F., Praneeth, V.K.K., Lehnert, N.: Electronic structure of heme-
nitrosyls and its significance for nitric oxide reactivity, sensing, transport, and toxicity in bio-
logical systems. Inorg. Chem. 49(14), 6293–6316 (2010). https://doi.org/10.1021/ic902304a
47. Green, M.T.: Evidence for sulphur-based radicals in thiolate compound I intermediates. J.
Am. Chem. Soc. 121, 7939–7940 (1999)
48. Green, M.T.: The structure and spin coupling of catalase compound I: a study of noncovalent
effects. J. Am. Chem. Soc. 123(37), 9218–9219 (2001). https://doi.org/10.1021/ja010105h
49. Griffith, W.P., Lewis, J., Wilkinson, G.: Some nitric oxide complexes of iron and copper. J.
Chem. Soc. 1958, 3993–3998 (1958). https://doi.org/10.1039/JR9580003993
50. Grimme, S.: Accurate description of van der Waals complexes by density functional theory
including empirical corrections. J. Comp. Chem. 25(12), 1463–1473 (2004). https://doi.org/
10.1002/jcc.20078
51. Grimme, S.: Semiempirical hybrid density functional with perturbative second-order correlation.
J. Chem. Phys. 124, 034108 (2006)
52. Grimme, S., Antony, J., Schwabe, T., Mück-Lichtenfeld, C.: Density functional theory
with dispersion corrections for supramolecular structures, aggregates, and complexes of
(bio)organic molecules. Org. Biomol. Chem. 5, 741–758 (2007). https://doi.org/10.1039/
b615319b
53. Grimme, S., Antony, J., Ehrlich, S., Krieg, H.: A consistent and accurate ab initio parametriza-
tion of density functional dispersion correction (DFT-D) for the 94 elements H–Pu. J. Chem.
Phys. 132(15), 154104 (2010). https://doi.org/10.1063/1.3382344
54. Groves, J.: Models and mechanisms of cytochrome P450 action. In: Ortiz de Montellano,
P. (ed.) Cytochrome P450: Structure, Mechanism and Biochemistry, pp. 1–43. Kluwer Aca-
demic/Plenum Publishers, Dordrecht (2005). https://doi.org/10.1007/0-387-27447-2_1
55. Guallar, V., Olsen, B.: The role of the heme propionates in heme biochemistry. J. Inorg.
Biochem. 100(4), 755–760 (2006). https://doi.org/10.1016/j.jinorgbio.2006.01.019. (High-valent
iron intermediates in biology)
56. Gütlich, P., Goodwin, H.A.: Spin crossover-an overall perspective. In: Gütlich, P., Goodwin,
H. (eds.) Spin Crossover in Transition Metal Compounds I, Topics in Current Chemistry, vol.
233, pp. 1–47. Springer, Berlin (2004). https://doi.org/10.1007/b13527
57. Hampel, C., Werner, H.J.: Local treatment of electron correlation in coupled cluster theory.
J. Chem. Phys. 104(16), 6286–6297 (1996). https://doi.org/10.1063/1.471289
58. Handy, N.C., Cohen, A.J.: Left-right correlation energy. Mol. Phys. 99(5), 403–412 (2001)
59. Harischandra, D., Zhang, R., Newcomb, M.: Photochemical generation of a highly reactive
iron-oxo intermediate. A true iron(V)-oxo species? J. Am. Chem. Soc. 127(40), 13776–13777
(2005)
60. Harvey, J.N.: On the accuracy of density functional theory in transition metal chemistry.
Annu. Rep. Prog. Chem. Sect. C: Phys. Chem. 102, 203–226 (2006). https://doi.org/10.1039/
b419105f
61. Harvey, J.N.: The coupled-cluster description of electronic structure: perspectives for bioinor-
ganic chemistry. J. Biol. Inorg. Chem. 16, 831–839 (2011). https://doi.org/10.1007/s00775-
011-0786-7
62. Helgaker, T., Klopper, W., Koch, H., Noga, J.: Basis-set convergence of correlated calculations
on water. J. Chem. Phys. 106, 9639–9646 (1997). https://doi.org/10.1063/1.473863
63. Henderson, T.M., Janesko, B.G., Scuseria, G.E.: Range separation and local hybridization in
density functional theory. J. Phys. Chem. A 112(49), 12530–12542 (2008). https://doi.org/
10.1021/jp806573k
64. Hirao, K.: Multireference Møller-Plesset method. Chem. Phys. Lett. 190(3–4), 374–380
(1992). https://doi.org/10.1016/0009-2614(92)85354-D
65. Hopmann, K.H., Conradie, J., Ghosh, A.: Broken-symmetry DFT spin densities of iron nitrosyls,
including Roussin’s red and black salts: striking differences between pure and hybrid functionals.
J. Phys. Chem. B 113(30), 10540–10547 (2009). https://doi.org/10.1021/jp904135h
66. Hu, C., Roth, A., Ellison, M., An, J., Ellis, C., Schulz, C., Scheidt, W.: Electronic configuration
assignment and the importance of low-lying excited states in high-spin imidazole-ligated
iron(II) porphyrinates. J. Am. Chem. Soc. 127(15), 5675–5688 (2005). https://doi.org/10.
1021/ja044077p
67. Hu, C., An, J., Noll, B.C., Schulz, C.E., Scheidt, W.R.: Electronic configuration of high-spin
imidazole-ligated iron(II) octaethylporphyrinates. Inorg. Chem. 45(10), 4177–4185 (2006).
https://doi.org/10.1021/ic052194v
68. Hughes, T.F., Friesner, R.A.: Correcting systematic errors in DFT spin-splitting energetics for
transition metal complexes. J. Chem. Theory Comput. 7(1), 19–32 (2011). https://doi.org/10.
1021/ct100359x
69. Hughes, T.F., Harvey, J.N., Friesner, R.A.: A B3LYP-DBLOC empirical correction scheme
for ligand removal enthalpies of transition metal complexes: parameterization against exper-
imental and CCSD(T)-F12 heats of formation. Phys. Chem. Chem. Phys. 14, 7724–7738
(2012). https://doi.org/10.1039/c2cp40220c
70. Isobe, H., Yamanaka, S., Okumura, M., Yamaguchi, K., Shimada, J.: Unique structural and
electronic features of perferryl-oxo oxidant in cytochrome P450. J. Phys. Chem. B 115(36),
10730–10738 (2011). https://doi.org/10.1021/jp206004y
71. Isobe, H., Yamaguchi, K., Okumura, M., Shimada, J.: Role of perferryl-oxo oxidant in alkane
hydroxylation catalyzed by cytochrome P450: a hybrid density functional study. J. Phys.
Chem. B 116(16), 4713–4730 (2012). https://doi.org/10.1021/jp211184y
72. Jameson, G.B., Rodley, G.A., Robinson, W.T., Gagne, R.R., Reed, C., Collman,
J.P.: Structure of a dioxygen adduct of (1-methylimidazole)-meso-tetrakis(α,α,α,α-o-
pivalamidophenyl)porphinatoiron(II). An iron dioxygen model for the heme component of
oxymyoglobin. Inorg. Chem. 17(4), 850–857 (1978). https://doi.org/10.1021/ic50182a012
73. Jensen, F.: Introduction to Computational Chemistry, 2nd edn. Wiley, New York (2007)
74. Jensen, K.P., Ryde, U.: Comparison of the chemical properties of iron and cobalt porphyrins
and corrins. ChemBioChem 4, 413–424 (2003). https://doi.org/10.1002/cbic.200200449
75. Jensen, K.P., Ryde, U.: How O2 binds to heme: reasons for rapid binding and spin inversion.
J. Biol. Chem. 279, 14561–14569 (2004)
76. Jensen, K.P., Roos, B., Ryde, U.: Erratum to “O2-binding to heme: electronic structure and
spectrum of oxyheme, studied by multiconfigurational methods”. J. Inorg. Biochem. 99, 978
(2005). https://doi.org/10.1016/j.jinorgbio.2005.02.013
77. Jensen, K.P., Roos, B., Ryde, U.: O2-binding to heme: electronic structure and spectrum of
oxyheme, studied by multiconfigurational methods. J. Inorg. Biochem. 99(1), 45–54 (2005).
https://doi.org/10.1016/j.jinorgbio.2004.11.008
78. Jiang, W., DeYonker, N.J., Wilson, A.K.: Multireference character for 3d transition-metal-
containing molecules. J. Chem. Theory Comput. 8, 460–468 (2011)
79. Kellner, D.G., Hung, S.C., Weiss, K.E., Sligar, S.G.: Kinetic characterization of compound I
formation in the thermostable cytochrome P450 CYP119. J. Biol. Chem. 277(12), 9641–9644
(2002)
80. Kent, T.A., Spartalian, K., Lang, G.: High magnetic field Mössbauer studies of deoxymyo-
globin, deoxyhemoglobin, and synthetic analogues: theoretical interpretations. J. Chem. Phys.
71(12), 4899–4908 (1979). https://doi.org/10.1063/1.438303
81. Kitagawa, T., Teraoka, J.: The resonance Raman spectra of intermediate-spin ferrous por-
phyrin. Chem. Phys. Lett. 63, 443–446 (1979). https://doi.org/10.1016/0009-2614(79)80685-
5
82. Knizia, G., Adler, T.B., Werner, H.J.: Simplified CCSD(T)-F12 methods: theory and benchmarks.
J. Chem. Phys. 130(5), 054104 (2009). https://doi.org/10.1063/1.3054300
83. Koch, W., Holthausen, M.C.: A Chemist’s Guide to Density Functional Theory, 2nd edn.
Wiley-VCH Verlag GmbH, Weinheim (2001)
84. Koseki, J., Maezono, R., Tachikawa, M., Towler, M.D., Needs, R.J.: Quantum Monte Carlo
study of porphyrin transition metal complexes. J. Chem. Phys. 129(8), 085103 (2008). https://
doi.org/10.1063/1.2966003
85. Kozlowski, P.M., Spiro, T.G., Zgierski, M.Z.: DFT study of structure and vibrations in low-lying
spin states of five-coordinated deoxyheme model. J. Phys. Chem. B 104(45), 10659–10666
(2000). https://doi.org/10.1021/jp001463u
86. Kulik, H.J., Cococcioni, M., Scherlis, D.A., Marzari, N.: Density functional theory in transition
metal chemistry: a self-consistent Hubbard U approach. Phys. Rev. Lett. 97, 103001
(2006)
87. Lee, J.Y., Kang, N.S., Kang, Y.K.: Binding free energies of inhibitors to iron porphyrin
complex as a model for cytochrome P450. Biopolymers 97, 219–228 (2012). https://doi.org/
10.1002/bip.22009
88. Lee, T.J., Taylor, P.R.: A diagnostic for determining the quality of single-reference electron
correlation methods. Int. J. Quantum Chem. 36(S23), 199–207 (1989)
89. Li, D., Wang, Y., Han, K.: Recent density functional theory model calculations of drug
metabolism by cytochrome P450. Coord. Chem. Revs. 256(11–12), 1137–1150 (2012). https://
doi.org/10.1016/j.ccr.2012.01.016
90. Liao, M.S., Scheiner, S.: Electronic structure and bonding in metal porphyrins, metal = Fe, Co,
Ni, Cu, Zn. J. Chem. Phys. 117(1), 205–219 (2002). https://doi.org/10.1063/1.1480872
91. Liao, M.S., Huang, M.J., Watts, J.D.: Iron porphyrins with different imidazole ligands. A
theoretical comparative study. J. Phys. Chem. A 114(35), 9554–9569 (2010). https://doi.org/
10.1021/jp1052216
92. Lupinetti, A.J., Fau, S., Frenking, G., Strauss, S.H.: Theoretical analysis of the bonding
between CO and positively charged atoms. J. Phys. Chem. A 101, 9551–9559 (1997)
93. Malmqvist, P.Å., Pierloot, K., Shahi, A.R.M., Cramer, C.J., Gagliardi, L.: The restricted
active space followed by second-order perturbation theory method: theory and application to
the study of CuO2 and Cu2O2 systems. J. Chem. Phys. 128, 204109 (2008). https://doi.org/
10.1063/1.2920188
94. Matsui, T., Unno, M., Ikeda-Saito, M.: Heme oxygenase reveals its strategy for catalyzing
three successive oxygenation reactions. Acc. Chem. Res. 43(2), 240–247 (2010). https://doi.
org/10.1021/ar9001685
95. McClure, D.S.: Electronic structure of transition-metal complex ions. Radiation Res. Suppl.
2, 218–242 (1960)
96. Miralles, J., Daudey, J.P., Caballol, R.: Variational calculation of small energy differences.
The singlet-triplet gap in [Cu2Cl6]2−. Chem. Phys. Lett. 198(6), 555–562 (1992). https://doi.
org/10.1016/0009-2614(92)85030-E
97. Miralles, J., Castell, O., Caballol, R., Malrieu, J.P.: Specific CI calculation of energy differ-
ences: transition energies and bond energies. Chem. Phys. 172(1), 33–43 (1993). https://doi.
org/10.1016/0301-0104(93)80104-H
98. Momenteau, M., Scheidt, W.R., Eigenbrot, C.W., Reed, C.A.: A deoxymyoglobin model with
a sterically unhindered axial imidazole. J. Am. Chem. Soc. 110, 1207–1215 (1988). https://
doi.org/10.1021/ja00212a032
99. Nakatsuji, H., Hasegawa, J., Ueda, H., Hada, M.: Ground and excited states of oxyheme:
SAC/SAC-CI study. Chem. Phys. Lett. 250(3–4), 379–386 (1996). https://doi.org/10.1016/
0009-2614(96)00033-4
100. Neese, F.: A spectroscopy oriented configuration interaction procedure. J. Chem. Phys.
119(18), 9428–9443 (2003). https://doi.org/10.1063/1.1615956
101. Neese, F., Valeev, E.F.: Revisiting the atomic natural orbital approach for basis sets: robust
systematic basis sets for explicitly correlated and conventional correlated ab initio methods?
J. Chem. Theory Comput. 7, 33–43 (2011). https://doi.org/10.1021/ct100396y
102. Norvell, J., Nunes, A., Schoenborn, B.: Neutron diffraction analysis of myoglobin: structure
of the carbon monoxide derivative. Science 190(4214), 568–570 (1975). https://doi.org/10.
1126/science.1188354
103. Obara, S., Kashiwagi, H.: Ab initio MO studies of electronic states and Mössbauer spectra
of high-, intermediate-, and low-spin Fe(II)-porphyrin complexes. J. Chem. Phys. 77, 3155
(1982). https://doi.org/10.1063/1.444239
104. Ogliaro, F., Cohen, S., Filatov, M., Harris, N., Shaik, S.: The high-valent compound of
cytochrome P450: the nature of the Fe–S bond and the role of the thiolate ligand as an internal
electron donor. Angew. Chem. Int. Ed. 39(21), 3851–3855 (2000)
105. Ogliaro, F., Cohen, S., de Visser, S.P., Shaik, S.: Medium polarization and hydrogen bonding
effects on compound I of cytochrome P450: what kind of radical is it really? J. Am. Chem.
Soc. 122, 12892–12893 (2000)
106. Ogliaro, F., de Visser, S.P., Groves, J.T., Shaik, S.: Chameleon states: high-valent metal-oxo
species of cytochrome P450 and its ruthenium analogue. Angew. Chem. Int. Ed. 40, 2874–2878
(2001). https://doi.org/10.1002/1521-3773(20010803)40:15<2874::AID-ANIE2874>3.0.CO;2-9
107. Olah, J., Harvey, J.: NO bonding to heme groups: DFT and correlated ab initio calculations.
J. Phys. Chem. A 113, 7338–7345 (2009). https://doi.org/10.1021/jp811316n
108. de Oliveira, F.T., Chanda, A., Banerjee, D., Shan, X., Mondal, S., Que Jr., L., Bominaar,
E.L., Münck, E., Collins, T.J.: Chemical and spectroscopic evidence for an Fe(V)-oxo
complex. Science 315, 835–838 (2007). https://doi.org/10.1126/science.1133417
109. Olson, J.C., Phillips, G.N.: Myoglobin discriminates between O2, NO and CO by electrostatic
interactions with the bound ligand. J. Biol. Inorg. Chem. 2, 544–552 (1997)
110. Olson, J.S., Mathews, A.J., Rohlfs, R.J., Springer, B.A., Egeberg, K.D., Sligar, S.G., Tame,
J., Renaud, J.P., Nagai, K.: The role of the distal histidine in myoglobin and haemoglobin.
Nature 336(6196), 265–266 (1988). https://doi.org/10.1038/336265a0
111. Ortiz de Montellano, P., James, J., De Voss, J.: Substrate oxidation by cytochrome P450
enzymes. In: Ortiz de Montellano, P. (ed.) Cytochrome P450: Structure, Mechanism and
Biochemistry, pp. 183–245. Kluwer Academic/Plenum Publishers, Dordrecht (2005). https://
doi.org/10.1007/0-387-27447-2_6
112. Ortiz de Montellano, P.R.: Hydrocarbon hydroxylation by cytochrome P450 enzymes. Chem.
Rev. 110, 932–948 (2010). https://doi.org/10.1021/cr9002193
113. Pan, Z., Zhang, R., Newcomb, M.: Kinetic studies of reactions of iron(IV)-oxo porphyrin
radical cations with organic reductants. J. Inorg. Biochem. 100(4), 524–532 (2006). https://
doi.org/10.1016/j.jinorgbio.2005.12.022
114. Pan, Z., Zhang, R., Fung, L.W.M., Newcomb, M.: Photochemical production of a highly
reactive porphyrin-iron-oxo species. Inorg. Chem. 46(5), 1517–1519 (2007). https://doi.org/
10.1021/ic061972w
115. Pan, Z., Wang, Q., Sheng, X., Horner, J.H., Newcomb, M.: Highly reactive porphyrin-iron-
oxo derivatives produced by photolyses of metastable porphyrin-iron(IV) diperchlorates. J.
Am. Chem. Soc. 131(7), 2621–2628 (2009). https://doi.org/10.1021/ja807847q
116. Pauling, L., Coryell, C.D.: The magnetic properties and structure of hemoglobin, oxyhe-
moglobin and carbonmonoxyhemoglobin. Proc. Nat. Acad. Sci. 22, 210–216 (1936)
117. Paulsen, H., Trautwein, A.X.: Density functional theory calculations for spin crossover com-
plexes. Top. Curr. Chem. 235, 197–219 (2004). https://doi.org/10.1007/b95428
118. Perdew, J.P.: The functional zoo. In: Geerlings, P., DeProft, F., Langenaeker, W. (eds.) Density
Functional Theory: A Bridge Between Chemistry and Physics, pp. 87–109. Vrije Universiteit
Brussel Press, Brussels (1999)
119. Perdew, J.P., Kurth, S.: Density functionals for non-relativistic Coulomb systems in the new
century. In: Fiolhais, C., Nogueira, F., Marques, M. (eds.) A Primer in Density Functional Theory,
Lecture Notes in Physics, vol. 620, pp. 1–55, Chap. 1. Springer, Berlin (2003). https://doi.org/
10.1007/3-540-37072-2_1
120. Perdew, J.P., Ernzerhof, M., Burke, K.: Rationale for mixing exact exchange with density
functional approximations. J. Chem. Phys. 105(22), 9982–9985 (1996). https://doi.org/10.
1063/1.472933
121. Perdew, J.P., Ruzsinszky, A., Constantin, L.A., Sun, J., Csonka, G.I.: Some fundamental issues
in ground-state density functional theory: a guide for the perplexed. J. Chem. Theory Comput.
5, 902–908 (2009). https://doi.org/10.1021/ct800531s
122. Phillips, S.E.: Structure and refinement of oxymyoglobin at 1.6 Å resolution. J. Mol. Biol.
142(4), 531–554 (1980). https://doi.org/10.1016/0022-2836(80)90262-4
123. Phillips, S.E.V.: Structure of oxymyoglobin. Nature 273(5659), 247–248 (1978)
124. Phillips, S.E.V., Schoenborn, B.P.: Neutron diffraction reveals oxygen-histidine hydrogen
bond in oxymyoglobin. Nature 292, 81–82 (1981)
125. Piela, L.: Ideas of Quantum Chemistry. Elsevier (2006). Polish edition: Idee Chemii Kwantowej,
PWN (2005)
126. Pierloot, K.: Nondynamic correlation effects in transition metal coordination compounds. In:
Cundari, T.R. (ed.) Computational Organometallic Chemistry. Marcel Dekker Inc., New York
(2001)
127. Pierloot, K.: The CASPT2 method in inorganic electronic spectroscopy: from ionic transition
metal to covalent actinide complexes. Mol. Phys. 101(13), 2083–2094 (2003)
128. Pierloot, K., Vancoillie, S.: Relative energy of the high-(5T2g) and low-(1A1g) spin states of
[Fe(H2O)6]2+, [Fe(NH3)6]2+, and [Fe(bpy)3]2+: CASPT2 versus density functional theory.
J. Chem. Phys. 125, 124303 (2006). https://doi.org/10.1063/1.2353829
129. Pierloot, K., Vancoillie, S.: Relative energy of the high-(5T2g) and low-(1A1g) spin states of
the ferrous complexes [Fe(L)(NHS4)]: CASPT2 versus density functional theory. J. Chem.
Phys. 128, 034104 (2008)
130. Pierloot, K., Dumez, B., Widmark, P.O., Roos, B.: Density matrix averaged atomic natural
orbital (ANO) basis sets for correlated molecular wave functions. IV. Medium size basis sets
for the atoms H-Kr. Theor. Chim. Acta. 90, 87–114 (1995)
131. Pierloot, K., Zhao, H., Vancoillie, S.: Copper corroles: the question of non-innocence. Inorg.
Chem. 49, 10316–10329 (2010). https://doi.org/10.1021/ic100866z
132. Poli, R., Harvey, J.N.: Spin forbidden chemical reactions of transition metal compounds. New
ideas and new computational challenges. Chem. Soc. Rev. 32, 1–8 (2003)
133. Popescu, D.L., Chanda, A., Stadler, M., de Oliveira, F.T., Ryabov, A.D., Münck, E., Bominaar,
E.L., Collins, T.J.: High-valent first-row transition-metal complexes of tetraamido (4N) and
diamidodialkoxido or diamidophenolato (2N/2O) ligands: synthesis, structure, and magneto-
chemistry. Coord. Chem. Revs. 252, 2050–2071 (2008)
134. Praneeth, V., Neese, F., Lehnert, N.: Spin density distribution in five- and six-coordinate
iron(II)-porphyrin NO complexes evidenced by magnetic circular dichroism spectroscopy.
Inorg. Chem. 44, 2570–2572 (2005)
135. Radoń, M., Broclawik, E.: Peculiarities of the electronic structure of cytochrome P450 com-
pound I: CASPT2 and DFT modeling. J. Chem. Theory Comput. 3(3), 728–734 (2007). https://
doi.org/10.1021/ct600363a
136. Radoń, M., Pierloot, K.: Binding of CO, NO, and O2 to heme by density functional and
multireference ab initio calculations. J. Phys. Chem. A 112(46), 11824–11832 (2008). https://
doi.org/10.1021/jp806075b
137. Radoń, M., Srebro, M., Broclawik, E.: Conformational stability and spin states of cobalt(II)
acetylacetonate: CASPT2 and DFT study. J. Chem. Theory Comput. 5(5), 1237–1244 (2009).
https://doi.org/10.1021/ct800571y
138. Radoń, M., Broclawik, E., Pierloot, K.: Electronic structure of selected {FeNO}7 complexes in
heme and non-heme architectures: A density functional and multireference ab initio study. J.
Phys. Chem. B 114(3), 1518–1528 (2010). https://doi.org/10.1021/jp910220r
139. Radoń, M., Broclawik, E., Pierloot, K.: DFT and Ab Initio study of iron-oxo porphyrins: may
they have a stable iron(V)-oxo electromer? J. Chem. Theory Comput. 7, 898–908 (2011).
https://doi.org/10.1021/ct1006168
140. Ray, M., Golombek, A.P., Hendrich, M.P., Yap, G.P.A., Liable-Sands, L.M., Rheingold, A.L.,
Borovik, A.S.: Structure and magnetic properties of trigonal bipyramidal iron nitrosyl com-
plexes. Inorg. Chem. 38, 3110–3115 (1999)
141. Reiher, M., Salomon, O., Hess, B.A.: Reparameterization of hybrid functionals based on
energy differences of states of different multiplicity. Theor. Chem. Acc. 107(1), 48–55 (2001).
https://doi.org/10.1007/s00214-001-0300-3
142. Ribas-Ariño, J., Novoa, J.J.: The mechanism for the reversible oxygen addition to heme.
A theoretical CASPT2 study. Chem. Commun. 2007, 3160–3162 (2007). https://doi.org/10.
1039/b704871h
143. Rittle, J., Green, M.T.: Cytochrome P450 compound I: Capture, characterisation, and C–
H bond activation kinetics. Science 330, 933–937 (2010). https://doi.org/10.1126/science.
1193478
144. Rodriguez, J.H., Xia, Y.M., Debrunner, P.G.: Mössbauer spectroscopy of the spin coupled
Fe2+-{FeNO}7 centers of nitrosyl derivatives of deoxy hemerythrin and density functional
theory of the {FeNO}7 (S = 3/2) motif. J. Am. Chem. Soc. 121(34), 7846–7863 (1999). https://
doi.org/10.1021/ja990129c
145. Roos, B.O.: Multiconfigurational self consistent field theory. In: Roos, B.O., Widmark, P.O.
(eds.) European Summerschool in Quantum Chemistry, vol. 2, pp. 287–360. Lund University,
Lund (2003)
146. Roos, B.O., Taylor, P.R., Siegbahn, P.E.M.: A complete active space SCF method (CASSCF)
using a density matrix formulated super-CI approach. Chem. Phys. 48(2), 157–173 (1980)
147. Roos, B.O., Andersson, K., Fulscher, M., Malmqvist, P.Å., Serrano-Andres, L., Pierloot,
K., Merchan, M.: Multiconfigurational perturbation theory: applications in electronic spec-
troscopy. In: Prigogine, I., Rice, S.A. (eds.) Advances in Chemical Physics: New Methods in
Computational Quantum Mechanics, vol. 93, pp. 219–331. Wiley, New York (1996)
148. Roos, B.O., Lindh, R., Malmqvist, P.Å., Veryazov, V., Widmark, P.O.: New relativistic ANO
basis sets for transition metal atoms. J. Phys. Chem. A 109, 6575–6579 (2005)
149. Rosen, G.M., Tsai, P., Pou, S.: Mechanism of free-radical generation by nitric oxide synthase.
Chem. Rev. 102(4), 1191–1200 (2002). https://doi.org/10.1021/cr010187s
150. Rovira, C.: Role of the His64 residue on the properties of the Fe-CO and Fe-O2 bonds
in myoglobin. A CHARMM/DFT study. J. Mol. Struc. (Theochem) 632, 309–321 (2003).
https://doi.org/10.1016/S0166-1280(03)00308-7
151. Rovira, C., Kunc, K., Hutter, J., Ballone, P., Parrinello, M.: Equilibrium geometries and
electronic structure of iron-porphyrin complexes: A density functional study. J. Phys. Chem.
A 101(47), 8914–8925 (1997). https://doi.org/10.1021/jp9722
152. Rovira, C., Kunc, K., Hutter, J., Ballone, P., Parrinello, M.: A comparative study of O2 , CO,
and NO binding to iron-porphyrin. Int. J. Quantum. Chem. 69(1), 31–35 (1998)
153. Rydberg, P., Sigfridsson, E., Ryde, U.: On the role of the axial ligand in heme proteins: a
theoretical study. J. Biol. Inorg. Chem. 9, 203–223 (2004). https://doi.org/10.1007/s00775-
003-0515-y
154. Rydberg, P., Gloriam, D.E., Olsen, L.: The SMARTCyp cytochrome P450 metabolism pre-
diction server. Bioinformatics 26, 2988–2989 (2010). https://doi.org/10.1093/bioinformatics/
btq584
155. Scherlis, D.A., Cococcioni, M., Sit, P., Marzari, N.: Simulation of heme using DFT + U: a
step toward accurate spin-state energetics. J. Phys. Chem. B 111, 7384–7391 (2007). https://
doi.org/10.1021/jp070549l
156. Schöneboom, J.C., Lin, H., Reuter, N., Thiel, W., Cohen, S., Ogliaro, F., Shaik, S.: The
elusive oxidant species of cytochrome P450 enzymes: characterisation by combined quantum
mechanical/molecular mechanical (QM/MM) calculations. J. Am. Chem. Soc. 124, 8142–
8151 (2002). https://doi.org/10.1021/ja026279w
157. Schöneboom, J.C., Neese, F., Thiel, W.: Toward identification of the compound I reactive
intermediate in cytochrome P450 chemistry: a QM/MM study of its EPR and Mössbauer
parameters. J. Am. Chem. Soc. 127(16), 5840–5853 (2005)
158. Schütz, M., Werner, H.J.: Low-order scaling local electron correlation methods. IV. Linear
scaling local coupled-cluster (LCCSD). J. Chem. Phys. 114(2), 661–681 (2001). https://doi.
org/10.1063/1.1330207
159. Schwarz, W.H.E.: An introduction to relativistic quantum chemistry. In: Barysz, M., Ishikawa,
Y. (eds.) Relativistic Methods for Chemists, Challenges and Advances in Computational
Chemistry and Physics, vol. 10, pp. 1–62. Springer, The Netherlands (2010). https://doi.org/
10.1007/978-1-4020-9975-5_1
160. Shaanan, B.: The iron-oxygen bond in human oxyhaemoglobin. Nature 296, 683–684 (1982).
https://doi.org/10.1038/296683a0
161. Shaanan, B.: Structure of human oxyhaemoglobin at 2.1 Å resolution. J. Mol. Biol. 171(1),
31–59 (1983). https://doi.org/10.1016/S0022-2836(83)80313-1
162. Shaik, S., Chen, H.: Lessons on O2 and NO bonding to heme from ab initio multirefer-
ence/multiconfiguration and DFT calculations. J. Biol. Inorg. Chem. 16, 841–855 (2011).
https://doi.org/10.1007/s00775-011-0763-1
163. Shaik, S., De Visser, S.: Computational approaches to cytochrome P450 function. In: Ortiz de
Montellano, P. (ed.) Cytochrome P450: Structure, Mechanism and Biochemistry, pp. 45–
85. Kluwer Academic/Plenum Publishers, Dordrecht (2005). https://doi.org/10.1007/0-387-
27447-2_2
164. Shaik, S., Kumar, D., de Visser, S.P., Altun, A., Thiel, W.: Theoretical perspective on the struc-
ture and mechanism of cytochrome P450 enzymes. Chem. Rev. 105(6), 2279–2328 (2005)
165. Shaik, S., Cohen, S., Wang, Y., Chen, H., Kumar, D., Thiel, W.: P450 enzymes: their structure,
reactivity, and selectivity-modeled by QM/MM calculations. Chem. Rev. 110(2), 949–1017
(2010)
166. Shaik, S., Chen, H., Janardanan, D.: Exchange-enhanced reactivity in bond activation by
metal-oxo enzymes and synthetic reagents. Nat. Chem. 3, 19–27 (2011). https://doi.org/10.
1038/nchem.943
167. Sheng, X., Horner, J.H., Newcomb, M.: Spectra and kinetic studies of the compound I deriva-
tive of cytochrome P450 119. J. Am. Chem. Soc. 130(40), 13,310–13,320 (2008). https://doi.
org/10.1021/ja802652b
168. Siegbahn, P.E.M., Himo, F.: The quantum chemical cluster approach for modeling enzyme
reactions. Wiley Interdisc Rev: Comput Mol Sci 1, 323–336 (2011)
169. Siegbahn, P.E.M., Blomberg, M.R.A., Chen, S.L.: Significant van der Waals effects in tran-
sition metal complexes. J. Chem. Theory Comput. 6, 2040–2044 (2010). https://doi.org/10.
1021/ct100213e
170. Sigfridson, E., Ryde, U.: On the significance of hydrogen bonds for the discrimination between
CO and O2 by myoglobin. J. Biol. Inorg. Chem. 4(1), 99–110 (1999)
171. Sigfridson, E., Ryde, U.: Theoretical study of the discrimination between O2 and CO by
myoglobin. J. Inorg. Biochem. 91(1), 101–115 (2002)
172. Sigfridsson, E., Ryde, U.: The importance of porphyrin distortions for the ferrochelatase
reaction. J. Biol. Inorg. Chem. 8, 273–282 (2003)
173. Sigfridsson, E., Olsson, M.H.M., Ryde, U.: A comparison of the inner-sphere reorganization
energies of cytochromes, iron-sulfur clusters, and blue copper proteins. J. Phys. Chem. B
105(23), 5546–5552 (2001). https://doi.org/10.1021/jp0037403
174. Sligar, S.G.: Coupling of spin, substrate, and redox equilibriums in cytochrome P450. Bio-
chemistry 15(24), 5399–5406 (1976)
175. Spolitak, T., Dawson, J.H., Ballou, D.P.: Reaction of ferric cytochrome P450cam with
peracids: kinetic characterization of intermediates on the reaction pathway. J. Biol. Chem.
280, 20,300–20,309 (2005). https://doi.org/10.1074/jbc.M501761200
176. Springer, B.A., Egeberg, K.D., Sligar, S.G., Rohlfs, R.J., Mathews, A.J., Olson, J.C.:
Discrimination between oxygen and carbon monoxide and inhibition of autooxidation by
myoglobin. J. Biol. Chem. 264(6), 3057–3060 (1989)
177. Springer, B.A., Sligar, S.G., Olson, J.S., Phillips, G.N.J.: Mechanisms of ligand recognition
in myoglobin. Chem. Rev. 94(3), 699–714 (1994). https://doi.org/10.1021/cr00027a007
178. Stawoska, I., Orzel, Ł., Łabuz, P., Stochel, G., van Eldik, R.: Application of high pressure
laser flash photolysis in studies on selected hemoprotein reactions. Biochim. Biophys. Acta
1784(11), 1481–1492 (2008). https://doi.org/10.1016/j.bbapap.2008.08.006
179. Strauss, S.H., Silver, M.E., Long, K.M., Thompson, R.G., Hudgens, R.A.,
Spartalian, K., Ibers, J.A.: Comparison of the molecular and electronic struc-
tures of (2,3,7,8,12,13,17,18-octaethylporphyrinato)iron(II) and (trans-7,8-dihydro-
2,3,7,8,12,13,17,18-octaethylporphyrinato)iron(II). J. Am. Chem. Soc. 107(14), 4207–4215
(1985). https://doi.org/10.1021/ja00300a021
180. Strickland, N., Harvey, J.N.: Spin-forbidden ligand binding to the ferrous-heme group: Ab
initio and DFT studies. J. Phys. Chem. B 111, 841–852 (2007)
181. Strickland, N., Mulholland, A.J., Harvey, J.N.: The Fe-CO bond energy in myoglobin: A
QM/MM study of the effect of tertiary structure. Biophys. J. 90, 27–29 (2006). https://doi.
org/10.1529/biophysj.105.078097
182. Sun, X., Wang, H., Feng, D.: Binding properties of CO, NO, and O2 to P450 heme: a density
functional study. Chin. J. Phys. Chem. 20, 552–556 (2007). https://doi.org/10.1088/1674-
0068/20/05/552-556
183. Szabo, A., Ostlund, N.S.: Modern quantum chemistry. In: Introduction to Advanced Electronic
Structure Theory. Dover Publications Inc, New York (1989)
184. Tomson, N.C., Crimmin, M.R., Petrenko, T., Rosebrugh, L.E., Sproules, S., Boyd, W.C.,
Bergman, R.G., DeBeer, S., Toste, F.D., Wieghardt, K.: A step beyond the Feltham-Enemark
notation: spectroscopic and correlated ab initio computational support for an antiferromag-
netically coupled M(II)-(NO)− description of Tp*M(NO) (M = Co, Ni). J. Am. Chem. Soc.
133(46), 18,785–18,801 (2011). https://doi.org/10.1021/ja206042k
185. Traylor, T.G., Sharma, V.S.: Why no? Biochemistry 31(11), 2847–2849 (1992). https://doi.
org/10.1021/bi00126a001
186. Turner, J.W., Schultz, F.A.: Coupled electron-transfer and spin-exchange reactions. Coord.
Chem. Revs. 219, 81–97 (2001). https://doi.org/10.1016/S0010-8545(01)00322-8
187. Ugalde, J.M., Dunietz, B., Dreuw, A., Head-Gordon, M., Boyd, R.J.: The spin dependence of
the spatial size of Fe(II) and of the structure of Fe(II)-porphyrins. J. Phys. Chem. A 108(21),
4653–4657 (2004). https://doi.org/10.1021/jp0489119
188. Vancoillie, S., Malmqvist, P.Å., Pierloot, K.: Calculation of EPR g tensors for transition-metal
complexes based on multiconfigurational perturbation theory (CASPT2). ChemPhysChem
8(12), 1803–1815 (2007)
189. Vancoillie, S., Zhao, H., Radoń, M., Pierloot, K.: Performance of CASPT2 and DFT for relative
spin-state energetics of heme models. J. Chem. Theory Comput. 6(2), 576–582 (2010). https://
doi.org/10.1021/ct900567c
190. Vancoillie, S., Zhao, H., Tran, V.T., Hendrickx, M.F.A., Pierloot, K.: Multiconfigurational
second-order perturbation theory restricted active space (RASPT2) studies on mononuclear
first-row transition-metal systems. J. Chem. Theory Comput. 7, 3961–3977 (2011). https://
doi.org/10.1021/ct200597h
191. Wanat, A., Schneppensieper, T., Stochel, G., van Eldik, R., Bill, E., Wieghardt, K.: Kinetics,
mechanism, and spectroscopy of the reversible binding of nitric oxide to aquated iron(II). An
undergraduate text book reaction revisited. Inorg. Chem. 41, 4–10 (2002). https://doi.org/10.
1021/ic010628q
192. Wang, Q., Sheng, X., Horner, J.H., Newcomb, M.: Quantitative production of compound I
from a cytochrome P450 enzyme at low temperatures. kinetics, activation parameters, and
kinetic isotope effects for oxidation of benzyl alcohol. J. Am. Chem. Soc. 131(30), 10629–
10636 (2009). https://doi.org/10.1021/ja9031105
193. Weigend, F., Häser, M., Patzelt, H., Ahlrichs, R.: RI-MP2: Optimized auxiliary basis sets and
demonstration of efficiency. Chem. Phys. Lett. 294, 143–152 (1998)
194. Weiss, J.J.: Nature of the iron-oxygen bond in oxyhaemoglobin. Nature 202, 83–84 (1964).
https://doi.org/10.1038/202083b0
195. Weiss, R., Mandon, D., Wolter, T., Trautwein, A.X., Müther, M., Bill, E., Gold, A., Jayaraj,
K., Terner, J.: Delocalization over the heme and the axial ligands of one of the two oxidizing
equivalents stored above the ferric state in the peroxidase and catalase compound-I
intermediates: indirect participation of the proximal axial ligand of iron in the oxidation reactions
catalyzed by heme-based peroxidases and catalases? J. Biol. Inorg. Chem. 1(4), 377–383
(1996). https://doi.org/10.1007/s007750050069
196. Westcott, B.L., Enemark, J.L.: Transition metal nitrosyls. In: Solomon, E.I., Lever, A.B.P.
(eds.) Inorganic Electronic Structure and Spectroscopy, vol. 2, pp. 403–450. Wiley, New York
(1999)
197. Williams, R.: Metallo-enzyme catalysis: the entatic state. J. Mol. Catal. A 30, 1–26 (1985).
https://doi.org/10.1016/0304-5102(85)80013-4
198. Yamamoto, S., Kashiwagi, H.: CASSCF study on the Fe-O2 bond in a dioxygen heme complex.
Chem. Phys. Lett. 161(1), 85–89 (1989)
199. Yamamoto, S., Teraoka, J., Kashiwagi, H.: Ab initio RHF and CASSCF studies on Fe–O bond
in high-valent iron-oxoporphyrins. J. Chem. Phys. 88, 303–312 (1988)
200. Ye, S., Neese, F.: Accurate modeling of spin-state energetics in spin-crossover systems with
modern density functional theory. Inorg. Chem. 49(3), 772–774 (2010). https://doi.org/10.
1021/ic902365a
201. Zhang, R., Newcomb, M.: Laser flash photolysis generation of high-valent transition metal-
oxo species: insights from kinetic studies in real time. Acc. Chem. Res. 41(3), 468–477 (2008).
https://doi.org/10.1021/ar700175k
202. Zhang, R., Nagraj, N., Lansakara-P, D.S.P., Hager, L.P., Newcomb, M.: Kinetics of two-
electron oxidations by the compound I derivative of chloroperoxidase, a model for cytochrome
P450 oxidants. Org. Lett. 8(13), 2731–2734 (2006). https://doi.org/10.1021/ol060762k
203. Zhao, Y., Truhlar, D.G.: Density functional for spectroscopy: No long-range self-interaction
error, good performance for Rydberg and charge-transfer states, and better performance on
average than B3LYP for ground states. J. Phys. Chem. A 110(49), 13,126–13,130 (2006a).
https://doi.org/10.1021/jp066479k. (PMID: 17149824)
204. Zhao, Y., Truhlar, D.G.: A new local density functional for main-group thermochemistry,
transition metal bonding, thermochemical kinetics, and noncovalent interactions. J. Chem.
Phys. 125(194), 101 (2006b). https://doi.org/10.1063/1.2370993
205. Zhao, Y., Truhlar, D.G.: The M06 suite of density functionals for main group thermochemistry,
thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two
new functionals and systematic testing of four M06-class functionals and 12 other functionals.
Theor. Chem. Acc. 120, 215–241 (2008). https://doi.org/10.1007/s00214-007-0310-x
Bioinorganic Reaction
Mechanisms—Quantum Chemistry
Approach

Tomasz Borowski and Ewa Broclawik

Abstract This chapter focuses on applications of quantum chemical (QC) DFT
methodology to the study of reaction mechanisms of metalloenzymes, emphasising the new
insights that can be obtained from computations and showing the limitations
of the QC approach. Several case studies taken from the Authors’ research serve
to explain and rationalize modelling protocols and to underline information provided
by computations that is not accessible from experiment. The case studies are
selected to illustrate how the most likely mechanisms may be identified among
mechanistic proposals. It is also highlighted how deliberate model construction and
the probing of various scenarios and/or electronic states help in identifying the key factors
governing enzymatic reactions. It is hoped that this contribution clarifies that the credibility of the
results relies heavily on the chemical knowledge, intuition and experience of
the researcher.

1 Introduction

A considerable fraction of the chemistry of life involves reactions that are catalysed by
metalloenzymes [24]. The scope of chemistry performed by these bioinorganic catalysts
is very broad and spans from hydrolysis [37], e.g. of urea to CO2 and NH3
by urease, to closing the penicillin ring by isopenicillin-N synthase [11]. Similarly,
the range of biological roles played by the reactions catalysed by metalloenzymes
is vast and encompasses such processes as DNA repair, photosynthetic oxygen evolution,
respiration, and the synthesis and degradation of a plethora of metabolites, including
neurotransmitters, hormones, antibiotics, and many others.

T. Borowski (✉) · E. Broclawik (✉)
Jerzy Haber Institute of Catalysis and Surface Chemistry,
Polish Academy of Sciences, ul. Niezapominajek 8, 30-239 Krakow, Poland
e-mail: ncborows@cyf-kr.edu.pl
E. Broclawik
e-mail: broclawi@chemia.uj.edu.pl

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio- and
Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9_24

In the majority of cases where enzymes binding transition metal ions in their active
sites are recruited as catalysts, the enzymatic reaction involves redox steps that are
inconceivable without the participation of specialized cofactors. The latter can usually
adopt several oxidation states and stabilize various reactive forms of substrates during
the catalytic cycle, and hence provide a low-energy path for otherwise chemically
demanding transformations. This type of catalysis poses serious challenges to
research on such systems. The reactions considered here are usually multi-step processes
and involve short-lived and often highly reactive intermediates that can frequently
decay along various pathways. This makes studies of bioinorganic catalytic
mechanisms highly demanding and requires the use of various
research techniques. One of them is computational quantum chemistry (QC), whose
unique feature is that it can describe all species along the catalytic cycle
on an equal footing, including transition states and short-lived intermediates [3], which
are frequently out of reach of contemporary experimental techniques. In this chapter
we discuss several instructive examples illustrating how QC can be applied
to study reaction mechanisms of metalloenzymes, emphasizing the new insights that
were obtained thanks to the computations and showing the limitations of the current
approach. The examples are taken from our research work on mononuclear nonheme
iron enzymes.

2 Methodology

2.1 Active Sites of Enzymes and Their QC Models

The usual starting point for computational studies of an enzymatic reaction mechanism
is a crystal structure, preferably solved for an enzyme-substrate/product complex.
When a structure of this type is available, construction of a QC active site model
becomes relatively straightforward; however, several more or less arbitrary decisions
must still be taken by the researcher. The (S)-2-hydroxypropylphosphonic acid epoxidase
(HppE) depicted in Fig. 1 may serve as an illustrative example [23]. In the
active site of HppE a single metal ion, Fe(II), is coordinated
by three protein ligands (two histidines and one glutamate) and by the organic substrate,
HPP. In the immediate vicinity of the first coordination shell there are several
polar residues (displayed in ball-and-stick representation for the X-ray structure) and
hydrophobic ones, the latter close to the methyl group of HPP.
A whole enzyme or an extended active site region is hardly tractable by QC, and
therefore the model must be considerably reduced, unless one chooses to use a hybrid
QM/MM method where the active site of the protein is described by a QC method and
the remaining part of the system by molecular mechanics [39]. The first limitation
concerns the size and composition of the QC model of an active site, which is usually
a compromise between its completeness and the computational cost, which grows very
fast with the model size. In our example, a minimal model would include only the

Fig. 1 Active site region of the X-ray structure of HppE-Fe(II)-HPP (PDB: 1ZZ8 [23]) and its model
(with bound O2 ) used in QC investigations of the catalytic reaction mechanism [32]. Asterisks mark
atoms with fixed coordinates

Fe(II) center and its first coordination shell with properly truncated protein residues
(e.g. with the cut bonds saturated with hydrogen atoms). Minimal models
were used routinely a decade ago when computer power was rather modest [42,
43], and they may still be of practical value today. For example, they
are useful in preliminary screening of prospective reaction paths or as a reference
for larger models when attempting to identify catalytic effects due to the second (or
a more distant) shell. In particular, polar residues forming hydrogen bonds (H-bonds)
with the first-shell ligands may be important, as they can modulate not only the
mobility of the ligands but also their proton affinity and the redox potential of the metal
or ligands. Thus, it is strongly recommended that these polar groups be explicitly
included in the QC model, as was done in the study on HppE. There, whole
side chains were retained in the model for Asn135, Asn197 and Tyr105, whereas for
Lys23 and Arg97 the fragments were truncated at the carbon neighbouring the basic
group. Moreover, a water molecule H-bonding to Arg97 and HPP was included in
the QC model. Hydrophobic residues from the second coordination shell would be
included in a still larger (and more complete) model [17, 45], even if they are not
expected to affect qualitative features of the reaction mechanism, owing to the weak nature
of their non-bonded interactions.
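The truncation step can be illustrated with a minimal sketch, which is not the Authors' actual tooling. It places a capping hydrogen on the axis of a cut bond, assuming a C–H distance of 1.09 Å; the function name, the default bond length and the coordinates are all hypothetical.

```python
import math

def cap_with_hydrogen(kept_atom, removed_atom, bond_length=1.09):
    """Return coordinates of a hydrogen placed along the kept->removed bond axis.

    kept_atom, removed_atom: (x, y, z) tuples in angstrom.
    bond_length: assumed C-H distance in angstrom (illustrative default).
    """
    direction = [r - k for k, r in zip(kept_atom, removed_atom)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("kept and removed atoms coincide")
    # Move from the kept atom toward the removed atom by the assumed bond length.
    return tuple(k + bond_length * d / norm for k, d in zip(kept_atom, direction))

# Hypothetical coordinates: the C-beta atom is kept, the C-alpha atom is cut away.
c_beta = (0.0, 0.0, 0.0)
c_alpha = (1.53, 0.0, 0.0)
h_cap = cap_with_hydrogen(c_beta, c_alpha)
```

The capping hydrogen keeps the direction of the original bond, so the truncated side chain retains its orientation relative to the omitted backbone.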
Since X-ray protein structures are very rarely of sufficient resolution to reveal
positions of hydrogen atoms, the latter have to be added manually on top of the
selected model. This is a relatively straightforward step for hydrocarbon fragments
and amide, amine and guanidinium groups. On the other hand, for alcohol and phenol
OH groups one has to decide on the value of the H–O–C–C dihedral angles in the
initial model, usually chosen so that the H-bond network is optimal, i.e. the
maximum number of H-bonds is obtained. Histidine residues that are not coordinated
to the metal are the most difficult case. Firstly, the pKa of a histidine side
chain (free amino acid) is close to 7 and thus the group can be either neutral or
positively charged. Secondly, if the group is neutral, the single nitrogen-bound proton
can be placed on either of the nitrogen atoms of the residue. To resolve these issues
one usually relies on experimental data concerning the possible catalytic role of a
given His (proton acceptor/donor) and/or looks for H-bond partners in the immediate
surroundings of a given His side chain and makes an educated guess. The histidines
bound to the metal ion are usually assumed to be neutral (with a single N-bound
proton). In an alternative, and presumably less biased, approach one would use
a Poisson–Boltzmann titration method to determine the protonation states of protein
residues [1].
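The ambiguity of the histidine charge state can be illustrated with the Henderson–Hasselbalch relation. The sketch below is only a rough guide, assuming an isolated titratable group whose pKa is supplied by the user; it ignores the protein environment, which a Poisson–Boltzmann titration would account for. The function name is hypothetical.

```python
def protonated_fraction(pH, pKa):
    """Fraction of a titratable group in its protonated form.

    Uses the Henderson-Hasselbalch relation for an isolated group;
    the protein environment can shift the effective pKa considerably.
    """
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

# With pKa equal to pH, both charge states are equally populated.
half = protonated_fraction(7.0, 7.0)
# One pH unit below the pKa, the protonated (charged) form dominates.
mostly_protonated = protonated_fraction(6.0, 7.0)
```

When the pH is close to the pKa neither form dominates, which is why structural and experimental evidence is needed to choose the protonation state.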
In the actual protein the side chains are covalently bound to the backbone (main
chain), which is usually omitted from the active site model. The anchoring role of the
backbone is introduced into the model via constraints imposed on selected peripheral
atoms [35]. In our example, the hydrogen atoms introduced to saturate the cut bonds and
their bonding partners were constrained to mimic the anchoring role of the backbone
(Fig. 1). For the majority of the side chains this choice corresponds to fixing in space
the Cβ carbons and the hydrogens replacing Cα (one of the backbone atoms), i.e. the
model assumes a perfectly rigid backbone. This approximation is not necessary if a
QM/MM method is applied to the whole enzyme-substrate complex.
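The effect of such constraints on a geometry optimization can be sketched with a toy steepest-descent loop in which the fixed atoms are simply never moved. The quadratic toy energy, the step size and all names below are assumptions made for illustration; production QC codes use far more elaborate optimizers.

```python
def toy_gradient(coords):
    """Gradient of a toy energy E = sum_i |r_i|^2, which pulls atoms to the origin."""
    return [[2.0 * x for x in atom] for atom in coords]

def optimize_with_fixed_atoms(coords, gradient_fn, fixed, step=0.1, n_steps=200):
    """Steepest descent in Cartesian coordinates, skipping atoms listed in `fixed`."""
    coords = [list(atom) for atom in coords]
    for _ in range(n_steps):
        grad = gradient_fn(coords)
        for i, g in enumerate(grad):
            if i in fixed:
                continue  # constrained atom: its coordinates stay untouched
            for k in range(3):
                coords[i][k] -= step * g[k]
    return coords

# Hypothetical two-atom example: atom 0 is anchored, atom 1 is free to relax.
start = [(1.0, 2.0, 3.0), (1.0, 1.0, 1.0)]
relaxed = optimize_with_fixed_atoms(start, toy_gradient, fixed={0})
```

Only the free atom relaxes toward the minimum of the toy energy, while the anchored atom keeps its starting position, mimicking a rigid backbone.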
In cases when X-ray structures are not available for enzyme-substrate complexes,
appropriate macromolecular models of such species are usually built on the basis of
existing (fragmentary) structural data, aided with the use of empirical or semiempir-
ical force fields and docking or molecular dynamics simulation methods, which are
covered elsewhere in this book.
Once constructed, the active site model is used to explore the potential energy
surface (PES) with methods presented in the following subsections.

2.2 Electronic Structure Methods

Active site models typically include 50–300 atoms and serve for the exploration of
potential energy surfaces, which usually involves numerous repetitions of geometry
optimizations and frequency calculations. Thus the sheer amount of computation
to be done puts severe limitations on the QC method applicable to the problem. With
the presently available computer power the methods of density functional theory
(DFT) offer the best compromise between accuracy and computational demand. DFT
methodology was briefly introduced in the preceding chapter and shown to be
modest in its computing resource requirements and to perform reasonably
well.
However, the results of widely used DFT methods can depend considerably on the functional,
and when it comes to the choice of a particular exchange-correlation functional,
different philosophies can be found in the computational chemistry community. Some
researchers opt for a functional that best reproduces certain spectroscopic
features of a given or closely related system. Unfortunately, it is not at all certain that
an improvement in the description of spectral properties is paralleled by an improvement
in the relative energies of the chemical species along the reaction coordinate. Another
approach is to choose a functional that best reproduces barrier heights for a selected
reaction type; this approach might be very useful for studies on a series of reactions
involving a rate-limiting step of the same type. In the case of multi-step
mechanisms of metalloenzymes, where two or more chemically different steps with
comparable barriers are often present, a choice based on this criterion becomes
doubtful. Finally, one can choose a particular functional and keep using it in studies
on various enzymatic systems. The advantage of the last option is that the results
obtained in separate studies, e.g. on various types of enzymes, can be compared and
reactivity trends may be easily analysed; the potential disadvantage, however, is
the possibility of overlooking unsystematic, and occasionally large, energy errors
pertinent to a particular enzyme and/or reaction type (cf. preceding chapter). In the
studies reported in this chapter the last protocol has been followed, that is, the same
functional, B3LYP, was consistently applied to all systems [2, 28]. Available
CCSD(T) and CASPT2 benchmark data have indicated B3LYP to be sufficiently reliable
for modelling enzymatic reactions catalysed by non-heme iron sites [7, 32, 49].
The size of the one-electron basis set is the next bottleneck in extensive DFT
modelling, where a compromise between accuracy and computational efficiency
must be struck. Previous experience has indicated that the quality of the one-electron
basis set used for the final electronic energy calculations should not be lower than
valence polarized triple-ζ, though for geometry optimizations and frequency analysis
a double-ζ basis set suffices. Indeed, case studies showed that using a larger basis set
for geometry optimization hardly affects the final energy profile [41].
In our modelling protocol we used the non-relativistic approximation, which is justified
since for first-row transition metals and the properties of interest here, i.e. relative
energies of species defining reaction paths, relativistic effects should be far less
important than the uncertainties due to the other approximations made. This issue is
discussed in Sect. 2.1 of the previous chapter, “Electronic Properties of Iron Sites and
Their Active Forms in Porphyrin-Type Architectures”.

2.3 Exploration of Potential Energy Surfaces

Geometry optimization of the initial model yields a single structure of the closest
stable chemical species, whereas the other intermediates and the transition states
joining them need to be found manually (automatic procedures for PES exploration do
exist, yet they are practical only for small systems [30]). Based on previous experience,
existing chemical knowledge and mechanistic hypotheses, one usually assumes
plausible mechanisms (and corresponding approximate reaction coordinates) for a
given reaction step and attempts to optimise the respective transition state by scanning
the PES along the appropriate reaction coordinate. Let us take as an example the
C–H bond cleavage by the Fe(IV)=O species taking place in the catalytic cycle of
HppE. Since in this reaction a hydrogen atom is transferred from the substrate to
the oxo ligand, an intuitive approximate reaction coordinate is the oxo-H distance
(Fig. 2). By systematically shortening the oxo-H distance and performing constrained
optimizations with this distance fixed, a series of geometries is obtained whose energies
are plotted as a function of the scanned distance. A characteristic profile with a
single maximum is usually obtained, and the geometry of a point close to the maximum
serves as a starting point for explicit transition state (TS) optimization. Our experience
shows that calculating the molecular Hessian matrix for a point close to the maximum
greatly facilitates optimization of the transition state. The optimised transition state
is subsequently verified with frequency calculations, which upon successful TS
optimization should reveal a single normal mode with an imaginary frequency
that describes the movement of the system over the barrier. It should be noted here
that TS optimization sometimes fails because the system moves down into the reactant
or product valley. In such cases one has to use a different geometry in the vicinity
of the energy maximum as a starting point for TS optimization. The next intermediate
along the reaction profile is searched for by starting a subsequent geometry optimization
at a point already beyond the barrier and removing the constraints imposed in
the scan. A more rigorous optimization of the intrinsic reaction coordinate (IRC) that
joins the reactants with a given transition state is computationally demanding, and thus
rarely done. In an alternative approach one optimises a set of geometries, including
the TS, defining a minimum energy path from the reactant to the product of a given reaction
step [40].
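The bookkeeping behind this protocol can be sketched in a few lines. The first function picks the highest interior point of a relaxed scan as the starting geometry for TS optimization; the second checks that a list of harmonic frequencies contains exactly one imaginary mode. The sketch assumes that imaginary frequencies are reported as negative numbers, a common output convention, and all function names and numbers are hypothetical.

```python
def ts_guess_from_scan(distances, energies):
    """Return (distance, energy) of the highest interior scan point, or None.

    None is returned when the maximum sits at either end of the scan,
    which means the scanned range did not bracket a barrier.
    """
    if len(distances) != len(energies) or len(energies) < 3:
        raise ValueError("need matching lists with at least three scan points")
    i_max = max(range(len(energies)), key=energies.__getitem__)
    if i_max in (0, len(energies) - 1):
        return None
    return distances[i_max], energies[i_max]

def is_first_order_saddle(frequencies):
    """True if exactly one mode is imaginary (reported here as a negative value)."""
    return sum(1 for f in frequencies if f < 0.0) == 1

# Hypothetical relaxed scan: oxo-H distance (angstrom) vs. relative energy (kcal/mol).
scan_d = [2.4, 2.0, 1.6, 1.3, 1.2, 1.1, 1.0]
scan_e = [0.0, 1.2, 3.9, 6.8, 7.1, 5.0, 2.3]
guess = ts_guess_from_scan(scan_d, scan_e)

# Hypothetical frequency list (cm^-1) of an optimized stationary point.
ts_ok = is_first_order_saddle([-1450.0, 35.2, 88.1, 1620.4])
```

If the maximum falls at an end point of the scan, the scanned range should be extended before a TS optimization is attempted.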

Fig. 2 Active site model with approximate reaction coordinate marked in green, and an energy
profile obtained in a relaxed energy scan (in red; the point marked in green depicts explicitly
optimized transition state)

This (partly manual) scenario for potential energy surface exploration is continued
until all reaction steps and conceivable side reactions are covered. One must admit,
however, that completeness of such a PES exploration relies heavily on chemical
knowledge, intuition as well as experience of the researcher. One should also note
here that once the active site model becomes larger than ca. 100 atoms, the PES
becomes complicated by the presence of multiple minima, and hence many transition
paths may be found. This problem is well recognised in the field of macromolecular
simulations [14] and, for example, methods of transition path sampling were designed
to provide a full multi-pathway picture of the transition (here chemical reaction) [13].
Unfortunately, such methods require generation of system trajectories of lengths
practically not available at the moment for DFT methods applied to active site models
of metalloenzymes. However, in the field of QM/MM, methods that take into account
appropriate sampling of the configurational space of the MM part of the system were
proposed [25, 26, 36], and as their computational demand is roughly of the same
order as for static QM/MM calculations, they should gain popularity soon.
The procedure described above works smoothly provided all critical points (minima
and saddle points) lie on the same adiabatic PES. However, it is not uncommon
in the reactions of metalloenzymes that either the spin of the system changes
in the course of the reaction or that two diabatic surfaces of the same spin form
a sharp crossing due to very distinct local symmetry of the occupied orbitals. In such
cases, instead of a TS one looks for a minimum energy crossing point (MECP), i.e.
a minimum energy point along the crossing seam (Fig. 3). As the energy separation
between the two diabatic surfaces is very sensitive to the choice of basis set, it is
recommended that for MECP optimization the part of the system that is the locus
Fig. 3 Example of two diabatic potential energy surfaces with their crossing seam and a minimum
energy crossing point (MECP)

of electronic structure changes is described with an extended, triple-ζ quality basis
set. It is worth noting here that automatic procedures searching for MECPs are
available [21], though in cases of PESs with the same spin a manual search, similar
to that presented in Fig. 3, may still be necessary.
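The idea of an MECP search can be illustrated on a two-dimensional toy model. The sketch below uses a gradient-projection scheme: one component of the step closes the energy gap between the two surfaces, and the other lowers the energy of one surface within the crossing seam. The two parabolic surfaces, the step size and the function names are assumptions made for illustration, and the scheme is not necessarily the procedure of ref. [21].

```python
import math

# Two toy diabatic surfaces; their crossing seam is the line x = 0.75.
def e1(x, y):
    return x * x + y * y

def g1(x, y):
    return (2.0 * x, 2.0 * y)

def e2(x, y):
    return (x - 2.0) ** 2 + y * y - 1.0

def g2(x, y):
    return (2.0 * (x - 2.0), 2.0 * y)

def find_mecp(x, y, step=0.05, n_steps=500):
    """Steepest descent on a composite gradient: gap-closing plus in-seam parts."""
    for _ in range(n_steps):
        ga, gb = g1(x, y), g2(x, y)
        diff = (ga[0] - gb[0], ga[1] - gb[1])      # gradient-difference vector
        norm = math.hypot(diff[0], diff[1])
        u = (diff[0] / norm, diff[1] / norm)       # unit vector normal to the seam
        gap = e1(x, y) - e2(x, y)
        f = (gap * diff[0], gap * diff[1])         # drives the energy gap to zero
        proj = ga[0] * u[0] + ga[1] * u[1]
        perp = (ga[0] - proj * u[0], ga[1] - proj * u[1])  # descent within the seam
        x -= step * (f[0] + perp[0])
        y -= step * (f[1] + perp[1])
    return x, y

x_mecp, y_mecp = find_mecp(2.0, 1.0)
```

For this toy model the search converges to the point on the seam where the energy of both surfaces is lowest, which is the MECP by definition.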
The total electronic energy of the optimized intermediates, TSs and MECPs is
subsequently computed with a triple-ζ basis set for all atoms; this energy should be
corrected with several other energy contributions to yield the final energies. The first
additional term accounts for the energy of zero-point vibrations and is calculated
from a full frequency analysis. The entropy (and free energy) contributions are
usually not computed for models with constrained atoms, as the entropy is mostly
affected by low-frequency modes and these are, in turn, very sensitive to the presence
of constraints.
The purpose of the next energy contribution is to cover the electrostatic effects
exerted by the part of the protein not explicitly included in the QC model. To this end,
a polarizable continuum model (PCM) is routinely used, in which the surroundings of
the QC model are described as a continuous dielectric, characterized by an appropriate
dielectric constant, usually assumed to be equal to 4 (mimicking the hydrophobic interior
of the protein). The last energy correction is introduced to correct for the deficiency
of DFT methods in describing the van der Waals interactions. This energy correction
may be computed at various levels, yet usually even the simplest empirical formula
gives satisfactory results [18].
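
The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, assuming that each correction (zero-point, PCM, dispersion) has already been obtained as a separate additive term in hartree; the class name, field names and all numerical values are hypothetical.

```python
from dataclasses import dataclass

HARTREE_TO_KCAL = 627.509  # kcal/mol per hartree

@dataclass
class EnergyTerms:
    """Additive contributions to the final energy of one stationary point (hartree)."""
    e_triple_zeta: float  # electronic energy with the triple-zeta basis set
    zpe: float            # zero-point vibrational energy
    e_pcm: float          # continuum (PCM) correction, assumed given as a difference
    e_disp: float         # empirical dispersion correction

def final_energy(t: EnergyTerms) -> float:
    """Sum of the electronic energy and the three corrections."""
    return t.e_triple_zeta + t.zpe + t.e_pcm + t.e_disp

def relative_energy_kcal(t: EnergyTerms, reference: EnergyTerms) -> float:
    """Final energy of a stationary point relative to a reference, in kcal/mol."""
    return (final_energy(t) - final_energy(reference)) * HARTREE_TO_KCAL

# Hypothetical numbers, for illustration only
reactant = EnergyTerms(-1500.000, 0.250, -0.020, -0.030)
ts = EnergyTerms(-1499.975, 0.246, -0.021, -0.031)
print(f"Barrier: {relative_energy_kcal(ts, reactant):.1f} kcal/mol")
```

Keeping the terms separate makes it easy to see which contribution dominates a given relative energy.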
Once the final energies of all intermediates, TS:s and MECP:s are computed, a
diagram presenting a profile of potential energy along the reaction coordinate can
be constructed. The diagram shown in Fig. 4 illustrates the example sketched briefly
in a preceding section: here the point labelled as ⁵TS5 (+6.5 kcal/mol), joining
intermediates ⁵10 and ⁵11, corresponds to the transition state for the C–H bond
cleavage, whose search was described above (the superscript is the spin multiplicity).
The analysis of this diagram allows for identification of the most likely mechanism

Fig. 4 Reaction energy profile for the HppE catalysed formation of fosfomycin
Bioinorganic Reaction Mechanisms—Quantum Chemistry Approach 833

(black solid line) leading to the product (⁵14), as this path involves the lowest activation
barriers. However, a variant of the mechanism proceeding through MECP_OXO, ⁵10,
⁵TS5, ⁵11 and MECP_HS cannot be excluded because the barrier connected with
MECP_OXO is negligible. On the path marked with the solid line, the ⁵8 → ⁵TS4
→ ⁵9 step is the rate-limiting one (with the highest energy barrier), whereas in the
alternative path the ⁵10 → ⁵TS5 → ⁵11 step is rate-limiting. For other
alternative scenarios (marked in grey) barriers are at least 4.6 kcal/mol higher
than on the preferred path, which indicates these alternative steps are less probable.

3 Case Studies

3.1 Activation of O2 by α-Ketoacid Dependent Dioxygenases -
Probing Reaction Channels in Multiple Electronic States

α-Ketoacid dependent dioxygenases (α-KDD) form a large and diverse group of
enzymes, whose representatives can be found in all forms of life [22, 38]. They
catalyse a plethora of different oxidation reactions encountered in such processes
as DNA and RNA repair [46], gene (de)activation [34], synthesis of collagen, syn-
thesis of antibiotics, molecular oxygen sensing [12], catabolism and anabolism of
various metabolites. Due to a very wide scope of chemical reactions catalysed by
α-KDD, involving aliphatic and aromatic hydroxylation, dehydrogenation, sulphox-
idation, oxidative demethylation, ring closing, ring expansion, halogenation, they
form, arguably, the most universal platform used by Nature for oxidative transfor-
mations [15].
The general catalytic cycle of an α-KDD, presented in Fig. 5, consists of two major
steps [20]. The first is oxidative decarboxylation of the α-ketoacid, leading to generation
of the Fe(IV)=O species and release of CO2. The second step involves oxidation of an
organic substrate to a product by the oxoferryl species. It is generally accepted that
the first half-cycle, i.e. generation of the Fe(IV)=O species, proceeds in the same way
in all α-KDD enzymes irrespective of the precise chemical identity of the organic
substrate, though its presence in the active site is a prerequisite for subsequent O2
binding and activation [50].
Concerning the mechanism of oxidative decarboxylation, one can find several
proposals in the literature (Fig. 6) consistent with the dioxygenase nature of α-KDD, which
requires that one oxygen atom of O2 ends up in the decarboxylated acid whereas
the second one is built into the major organic product. In the mechanism A, O2 is
2-electron reduced, at the expense of Fe(II) being oxidised to Fe(IV), when a bicyclic
peroxoketal intermediate forms. This intermediate is then supposed to undergo a
synchronous heterolytic O–O and C–C cleavage yielding the Fe(IV)=O species and
CO2. According to the mechanism B, the O2 activation is a single step where the
attack on the keto group and the cleavage of the O–O and C–C bonds are all coupled. Finally,
in the mechanism C the attack on the keto group elicits C–C cleavage, and thus CO2

Fig. 5 General catalytic cycle of an α-KDD

Fig. 6 Three alternative mechanisms for O2 activation by α-KDD

release, though the O–O bond is preserved in the resulting peracid intermediate.
Subsequent O–O bond heterolysis leads to the oxoferryl species. Thus, at least three
chemically distinct reaction paths can be envisioned for the process and the picture
is further complicated by the fact that several spin states can be involved here.
With the aim to provide some insights into this intricate process, QC studies on the
reaction mechanism were done with the use of an active site model for clavaminic acid
synthase (CAS, a representative α-KDD), depicted in Fig. 7 [5]. The organic substrate
was not included in the model since the O2 activation mechanism is supposed to be
independent of its identity (vide supra).
In the E-Fe(II)-α-KG complex the metal ion is in a high-spin (quintet) electronic
configuration with four singly (equispin) and one doubly occupied Fe 3d orbitals.
When this species starts to interact with triplet O2 (two unpaired electrons), pairing
of the electron spins on these two centres may lead to complexes with triplet, quintet
or septet total spin state. Interestingly, for the E-Fe(II)-α-KG-O2 complex all three
states are computed to lie within a 6 kcal/mol energy range, with triplet being the
ground state and quintet being the highest one (Fig. 8). As can be noticed in the
Figure, reaction energy profiles run along various scenarios in the three spin states,

Fig. 7 The active site model for CAS, i.e. a representative α-KDD, used in the study on dioxygen
activation [5]. Reprinted with permission from Ref. [5]

Fig. 8 Reaction energy profile for activation of O2 by α-KDD. Black solid line: the lowest activation
energy quintet path proceeding through Fe(II)-peracid complex; blue dotted line: the septet single-
step path; green line: the critical parts of the triplet path

with septet exhibiting the simplest course, with a single barrier, which corresponds
to the mechanism B from Fig. 6. Such behaviour of the system in the septet spin state
indicates that the bicyclic structure with high spin Fe(III) and superoxo ketal is not
stable; similarly, the Fe(II)-peracid complex cannot form in the septet state either. The
latter observation is not surprising since a septet state of the Fe(II)-peracid complex
would imply high spin Fe(II) coupled to triplet peracid (or other, similarly unstable
valence states).
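
The triplet, quintet and septet states considered for the E-Fe(II)-α-KG-O2 complex follow from the standard rule for coupling two spins, S = |S_A − S_B|, ..., S_A + S_B. A minimal sketch of that rule is given below; the function and variable names are illustrative, and the inputs are the spins quoted in the text (S = 2 for high-spin Fe(II), S = 1 for triplet O2).

```python
from fractions import Fraction

MULTIPLICITY_NAMES = {1: "singlet", 2: "doublet", 3: "triplet", 4: "quartet",
                      5: "quintet", 6: "sextet", 7: "septet"}

def coupled_spin_states(s_a, s_b):
    """List (S, 2S+1) for all total spins obtained by coupling spins s_a and s_b."""
    s_a, s_b = Fraction(s_a), Fraction(s_b)
    s = abs(s_a - s_b)
    states = []
    while s <= s_a + s_b:
        states.append((s, int(2 * s + 1)))
        s += 1
    return states

# High-spin Fe(II): four unpaired electrons, S = 2; triplet O2: S = 1
for total_spin, mult in coupled_spin_states(2, 1):
    print(f"S = {total_spin}: {MULTIPLICITY_NAMES[mult]}")
```

The rule only enumerates the possible total spin states; which of them is lowest in energy has to come from the QC calculations, as discussed above.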
In the triplet state, however, the bicyclic species with superoxo ketal, similar to
that proposed in the mechanism A from Fig. 6, is a stable structure, though it lies
at a considerably elevated (+24.0 kcal/mol) energy level. In this complex, interme-
diate spin (IS) Fe(III) (with one empty, three singly and one doubly occupied Fe
3d orbitals) couples antiferromagnetically (opposite spins) with the superoxo-ketal,
and the empty 3d orbital participates in a strong coordination bond with the two
negatively charged oxygens of the substrate. This bond is one of the factors stabiliz-
ing the bicyclic structure. The consecutive step, whereby CO2 is released and triplet
Fe(II)-peracid complex is formed, proceeds through a high energy transition state
(32.8 kcal/mol), which definitely precludes the triplet spin state from participation
in O2 activation by α-KDD.
From the QC results it follows that the oxidative decarboxylation proceeds on
the quintet PES, according to the mechanism C from Fig. 6. Thus, the attack of Fe-
bound oxygen on the keto group of the ketoacid triggers decarboxylation leading to
a high spin Fe(II)-peracid intermediate. This step involves a barrier of 16.1 kcal/mol,
considerably lower than the barrier on the triplet PES and somewhat smaller than
the barrier in the septet state. The quintet spin state seems optimal for oxidative
decarboxylation since, first, the process can proceed through a stable Fe(II)-peracid
intermediate, which means that only the C–C bond needs to be cleaved when oxygen
attacks the keto group, and, second, quintet is the ground spin state for the
Fe(II)-peracid complex featuring a high-spin Fe(II). In contrast, in the septet state
the O–O bond needs to be cleaved simultaneously with the C–C bond as the peracid
complex is not stable in this spin state, whereas in the triplet state iron must adopt a
high-energy IS electronic configuration when the bicyclic intermediate forms.
The Fe(II)-peracid complex is very short-lived as the barrier to its decay amounts
to only 5.7 kcal/mol. This barrier is connected with the first one-electron transfer
from the iron to the O–O bond occurring during the O–O bond cleavage whereas the
transfer of the second electron is practically spontaneous. Cleavage of the O–O bond
completes the oxygen activation stage of the catalytic cycle yielding the reactive
Fe(IV)=O species. As can be noticed in Fig. 8, quintet is the ground electronic state
also for the oxoferryl species, with triplet and septet lying considerably higher.
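
To get a feeling for what barriers of 16.1 and 5.7 kcal/mol mean in terms of time scales, one can insert them into the Eyring equation, k = (k_B T / h) exp(−ΔG‡ / RT). The sketch below is only an order-of-magnitude estimate: it assumes a transmission coefficient of one, a temperature of 298.15 K, and it uses the computed energy barriers in place of free energy barriers (the entropy contributions are usually not computed for such models, as noted earlier).

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
H = 6.62607015e-34     # Planck constant, J s
R_KCAL = 1.987204e-3   # gas constant, kcal/(mol K)

def eyring_rate(barrier_kcal, temperature=298.15):
    """First-order rate constant (1/s) from the Eyring equation."""
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-barrier_kcal / (R_KCAL * temperature))

for label, barrier in [("formation of the Fe(II)-peracid (16.1 kcal/mol)", 16.1),
                       ("decay of the Fe(II)-peracid (5.7 kcal/mol)", 5.7)]:
    k = eyring_rate(barrier)
    print(f"{label}: k ~ {k:.1e} 1/s, 1/k ~ {1.0 / k:.1e} s")
```

Under these assumptions the decay of the peracid complex is many orders of magnitude faster than its formation, which is consistent with describing the intermediate as very short-lived.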
In conclusion, the QC study on the O2 activation by α-KDD revealed that the
three proposed mechanisms (Fig. 6) are realized on three different PES:s differing in
total spin; this mechanism-spin dependence can be understood by taking into account
the electronic structure requirements imposed by a particular spin state. The lowest
activation energy path is located on the quintet PES and it proceeds through oxidative
decarboxylation yielding Fe(II)-peracid intermediate which subsequently undergoes
an easy O–O bond heterolysis leading to the reactive Fe(IV)=O species.

3.2 Aromatic Hydroxylation by Fe(IV)=O - the Role
of Second Shell Residues

4-Hydroxyphenylpyruvate dioxygenase (HPPD) belongs to the family of α-KDD and,
as it is involved in catabolism of tyrosine, it can be found in almost all forms of life
[33]. The organic substrate of HPPD contains a ketoacid group which is oxidatively
decarboxylated in the first catalytic half-cycle. In the resulting Fe(IV)=O species,
the decarboxylation product, i.e. 4-hydroxyphenylacetate (4-HPA), is bound directly
to the ferryl ion (Fig. 9). The unique feature of the catalytic reaction of HPPD is
the 100%-efficient migration of the -CH2–COO substituent that takes place during
hydroxylation of the ring. This process is initiated by the electrophilic attack of the
oxo group on the aromatic carbon to which the migrating substituent is attached. The
resulting intermediate is a so-called σ-complex, which involves either a ring radical
and Fe(III) or a ring cation and Fe(II). In the following step the substituent migrates
to the nearby carbon, yielding a keto form of the product, homogentisate.
In the first QC study on the mechanism of HPPD reaction a small model that
involved only the first shell ligands was used [4]. Back then, there were no structural
data suggesting how the substrate binds in the active site, and hence even the coordina-
tion site of 4-HPA carboxyl was uncertain. Recently an X-ray structure for E-Co(II)-
product complex was solved for a closely related enzyme - 4-hydroxymandelate
synthase [9], and it was used as a starting point for classical molecular dynamics
simulations that yielded a macromolecular model of the HPPD-Fe(IV)=O-4-HPA
complex. This, in turn, was used to construct a larger QC active site model, shown in
Fig. 10, employed in a second study on the HPPD reaction [48]. As can be noticed, in
addition to the first shell ligands (His161, His240, Glu322 and 4-HPA) the big model
includes several second shell groups, most notably Gln309 that forms H-bonds with
Fe-bound carboxyl groups of Glu322 and 4-HPA.
The study conducted with the first-shell-only model pointed to a two-step migration
mechanism, proceeding through a biradical intermediate (upper branch in Fig. 11).
More specifically, as a result of electrophilic attack of Fe(IV)=O on the ring a radical

Fig. 9 The reaction of aromatic ring hydroxylation coupled to side chain migration catalysed by
HPPD

Fig. 10 Active site model used in the recent study on the catalytic reaction mechanism of HPPD.
Second shell residues other than Gln309 are drawn in a simplified way

σ-complex forms and the following C–C bond cleavage is a homolytic process
yielding carboxymethyl and semiquinone radicals coordinated to Fe(II). Rebound of the
radicals completes the reaction. In contrast, when the big active site model was
employed either within a QC or QM/MM model, the mechanism simplified to a
single-step process (lower branch in Fig. 11). The key difference between these two
mechanisms is the electronic structure of the σ-complex, which seems to be decisive
for the migration path. Thus, in a radical σ-complex/Fe(III) species the C–C
cleavage is a homolytic reaction, whereas in the cation σ-complex/Fe(II) intermediate the
C–C cleavage is formally a heterolytic process coupled to formation of the new C–C
bond, i.e. a single-step 1,2-migration.
The fact that different mechanisms were obtained with the two models highlights
the important role played in this case by the second shell residues, which were
missing in the small model. In this respect, Gln309 is most important as it forms two
H-bonds with the first-shell (carboxylate) ligands (Fig. 10), and these bonds
substantially strengthen when the iron ion is reduced from Fe(IV) to Fe(II). As a result, the
electrophilic attack of the Fe(IV)=O on the ring is a two-electron process yielding a

Fig. 11 Two distinct reaction mechanisms for the hydroxylation/side chain migration step of the
HPPD catalytic cycle. In red: energies for the radical two-step mechanism found with the small
active site model; in green: energies for the single-step mechanism supported by the recent study
employing the big model. Energies (in kcal/mol) are relative values with respect to the corresponding
Fe(IV)=O species

cation σ-complex/Fe(II) intermediate, which subsequently undergoes a single-step
1,2-migration. Notably, all attempts to find a single-step pathway for the radical
σ-complex/Fe(III) species failed.
In summary, studies on the reaction mechanism of hydroxylation of the aromatic ring
coupled to 1,2-migration of the carboxymethyl substituent were done with QC active
site models of various sizes and they suggested two different mechanisms. The bigger
model included the critical residue (Gln309) that forms two H-bonds with the first
shell ligands and in this way tunes the redox potential of the metal ion by stabilising
the Fe(II) state. The change of the electronic structure of the σ-complex species
translates to the alteration of the mechanism for the 1,2-migration. Thus, the
take-home message is that one should always try to include in the QC model the second
shell residues that form H-bonds with the first shell ligands.

3.3 Formation of an Epoxide Ring in Fosfomycin - Reactivity
of Various Iron-Oxygen Species

Fosfomycin is a clinically useful antibiotic with a relatively simple chemical structure
comprising an epoxide ring. The last step of fosfomycin biosynthesis is catalysed
by (S)-2-hydroxypropylphosphonic acid epoxidase (HppE) and it involves a 4-electron
reduction of O2 to water with two electrons provided by the fosfomycin precursor and
two other electrons coming from a yet unknown external reductant [29]. Oxidation
of the precursor is coupled with closure of the epoxide ring. Concerning the reaction
mechanism, three different scenarios were proposed in the literature (Fig. 12). All
of them begin with binding of dioxygen to the Fe(II)-substrate complex (ES), where
the substrate chelates the metal ion, yet they differ in the identity of the reactive species
responsible for C–H bond cleavage and the stage at which the external electrons are
delivered. As it has been proposed that electrons are accompanied by protons, these
redox steps might proceed as proton-coupled electron transfer (PCET) reactions
[19]. In the mechanism A it is the Fe(III)-bound superoxide (ES-O2 → A1) that
effects the C–H bond cleavage and both external electrons are delivered later on.
In the mechanisms B and C, the ES-O2 is reduced to the Fe(III)-OOH species (B1) and
this intermediate is either responsible for C–H cleavage (mechanism B) or is further
reduced to yield the reactive Fe(IV)=O species C1 that subsequently cleaves the C–H
bond (mechanism C). Once a carbon radical is formed, the epoxide ring can be closed by an attack of
the alcoholate oxygen on the radical center (C2 → EP or B2 → EP). A QC study was
conducted to discriminate between the three suggested mechanisms
and to provide insights into the details of the reaction path [32].

Fig. 12 Mechanisms proposed for formation of an epoxide ring by HppE

Fig. 13 HppE reaction energy profile for the mechanism C (with the C–H cleavage via ⁵TS5) and
initial steps of the mechanisms A and B

The composition of the HppE active site model was described in Sect. 2.1. As
the external electrons are supposedly delivered to the HppE active site together with
protons, these steps can be described as H-atom uptakes, and their energy can be
reliably computed since the total electric charge of the active site remains unchanged.
To this end, one only needs to calculate a donor-H bond energy for a suitable external
electron donor, which in our case is a fully reduced flavin coenzyme (FMN) that was
used in experimental work on HppE.
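
The energy balance of such an H-atom uptake step can be written down explicitly. The sketch below treats the step as active site + donor-H → active site-H + donor and shows that its energy equals the difference between the donor-H bond energy and the H-atom affinity of the active site. The function names and all numbers are hypothetical and serve only to check the bookkeeping; they are not values from the study.

```python
def bond_energy(e_fragment, e_h_atom, e_fragment_h):
    """X-H bond energy: E(X) + E(H) - E(X-H)."""
    return e_fragment + e_h_atom - e_fragment_h

def h_atom_uptake_energy(e_site, e_site_h, e_donor_h, e_donor):
    """Energy of: site + donor-H -> site-H + donor."""
    return (e_site_h + e_donor) - (e_site + e_donor_h)

# Hypothetical energies in kcal/mol on an arbitrary common scale
E_H = -313.0
E_SITE, E_SITE_H = -1000.0, -1398.0
E_DONOR, E_DONOR_H = -500.0, -890.0

d_donor = bond_energy(E_DONOR, E_H, E_DONOR_H)   # donor-H bond energy
d_site = bond_energy(E_SITE, E_H, E_SITE_H)      # H-atom affinity of the active site
delta = h_atom_uptake_energy(E_SITE, E_SITE_H, E_DONOR_H, E_DONOR)

print(f"D(donor-H) = {d_donor:.1f}, D(site-H) = {d_site:.1f}, uptake energy = {delta:.1f}")
```

Because the energy of the free H atom cancels, only the donor-H bond energy has to be supplied from outside the active site model, which is the point made in the text.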
It follows from the computed reaction energy profile for the initial steps of the
three mechanisms (Fig. 13) that first, the barriers encountered in the mechanisms A
and B are considerably higher than in the mechanism C, and second, that only in the
mechanism C the energy of consecutive intermediates drops monotonically, whereas
in the mechanisms A and B some of the initial steps are energetically uphill. The
activation energies of the initial “chemical” steps are: 26.5, 33.8 and 12.5 kcal/mol,
for the mechanisms A, B and C, respectively (TS1, TS3 and TS4), which forms a
strong argument in favour of the mechanism C. It is assumed that proton and electron
uptake steps are faster than the first chemical steps of the mechanisms A and B, i.e.
the effective barrier to electron/proton transfer is lower than 26.5 kcal/mol, which
seems reasonable.
The detailed reaction energy profile (relative energies with respect to ⁵8) obtained
for the mechanism C is presented in Fig. 4, whereas the corresponding reaction
diagram is shown in Fig. 14. In the mechanism C, two electrons and two protons
are delivered to the active site prior to any bond cleavage step. In the resulting
intermediate ⁵8 a high-spin Fe(II) is coordinated by a hydroperoxo ligand and the
substrate protonated on the alcoholate oxygen (other positions for the added proton
were considered, yet they had higher energies). Heterolytic O–OH bond cleavage
is connected with protonation of the leaving OH by the substrate’s OH group, and

Fig. 14 Details of the most likely HppE mechanism - C

it leads to a highly reactive Fe(III)-O• species (⁵9), which is an excited state form
of the more common Fe(IV)=O intermediate (⁵10). Either of these reactive species
elicits cleavage of the C1–H bond, which is exposed towards the oxyl/oxo group.
In the native substrate with S-configuration at C2, the C2–H bond points in the
direction opposite to the Fe–O bond and hence is not accessible for the reaction.
The C–H bond cleavage by ⁵9 is a barrierless process, as is the internal conversion
via MECP_OXO to the ground state form, i.e. ⁵10. The latter cleaves the C–H bond
with a barrier of 19.6 kcal/mol (⁵TS5), and this step yields the intermediate spin (IS)
Fe(III)-OH / carbon radical species ⁵11 that decays via MECP_HS to the ground
state with a high spin Fe(III) (⁵12). In the following step, the Fe(III)-bound OH
group is protonated at the expense of the phosphonic group of the substrate (⁵12
→ ⁵TS8 → ⁵16) and then the only remaining chemical step is closing the epoxide
ring. Concerning the stereochemistry of this step, HppE catalyses the conversion
to the cis-epoxide, i.e. fosfomycin (⁵14), yet in some cases with an accompanying
trans-epoxide (⁵17) byproduct. However, the active site model employed in the QC
study predicts that the transition state (⁵TS10) leading to the less hindered, and thus more
stable, trans-epoxide lies 3.8 kcal/mol below ⁵TS9 that leads to fosfomycin. In other
words, the model predicts that the trans-epoxide would be the major product.
This discrepancy can be easily understood if one recalls that the QC active site model
does not include any hydrophobic residues forming a niche for the methyl group of
the substrate (truncation of the model was a necessary simplification for this study

where numerous steps needed to be characterised). Thus, a steric hindrance imposed
by these groups is proposed to destabilize ⁵TS10 by ca. 5 kcal/mol with respect to
⁵TS9, which would guarantee the proper stereochemistry of the major product. Other
alternative paths (in grey in Figs. 4 and 14) involve barriers considerably higher than
corresponding steps on the proposed path and they are not discussed here.
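
The effect of these energy differences on the product distribution can be estimated with a Boltzmann factor. The sketch below assumes kinetic control, two competing transition states reached from a common precursor, T = 298.15 K, and energy differences used in place of free energy differences; none of these assumptions is stated in the study, so the numbers are only indicative. The inputs are the 3.8 kcal/mol preference for ⁵TS10 found with the truncated model and the proposed ca. 5 kcal/mol destabilization of ⁵TS10 by the missing residues.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal/(mol K)

def product_ratio(delta_ts_kcal, temperature=298.15):
    """Ratio favoured:disfavoured for two TSs separated by delta_ts_kcal."""
    return math.exp(delta_ts_kcal / (R_KCAL * temperature))

# Truncated model: TS10 (trans) lies 3.8 kcal/mol below TS9 (cis)
trans_over_cis = product_ratio(3.8)

# With the proposed ~5 kcal/mol destabilization of TS10, TS9 becomes lower by ~1.2
cis_over_trans = product_ratio(5.0 - 3.8)

print(f"truncated model:  trans:cis ~ {trans_over_cis:.0f}:1")
print(f"with steric term: cis:trans ~ {cis_over_trans:.1f}:1")
```

Under these assumptions a shift of about 5 kcal/mol is enough to invert the predicted selectivity, in line with the argument that the missing hydrophobic niche accounts for the discrepancy.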
In summary, investigations on the HppE reaction mechanism provided an example
of a case study where various Fe-oxygen forms were postulated as reactive species
responsible for C–H bond cleavage and where an external reductant was involved.
The QC results obtained for the three postulated mechanisms allowed us to identify
the most likely one (the mechanism C) and to provide insights into its details. Since
the QC model did not include protein residues forming a niche for the methyl group
of the substrate the stereochemistry of the major product could not be reproduced,
yet the model can be extended to eliminate this deficiency.

3.4 Alkenyl Migration Mechanism by Extradiol Dioxygenases -
the Value of Small Quantum Chemistry Models

Extradiol dioxygenases are typically found in soil bacteria capable of using aromatic
compounds as a carbon source. They catalyse oxidative cleavage of a catechol ring
leading to acyclic 2-hydroxymuconaldehyde acid product (see Fig. 15) [10]. The
catechol substrate binds to the active site Fe(II) ion in a bidentate mode with one
oxygen left protonated (a) [47]. Subsequent binding of O2 leads eventually to species
b with a hydroperoxo bridge between the ring and ferrous ion. For the following steps
two different mechanisms were proposed. In one scenario (upper branch in Fig. 15),
the O–O bond is cleaved homolytically with the aid of the metal ion, which provides one
electron to reduce the OH radical to HO− (c). In the next step the oxyl radical attacks
the ring yielding an epoxide radical (d), which in a few subsequent fast steps
is transformed to the product complex f. This mechanism is supported by several
QC studies [6, 16, 44] and an X-ray structure of a species with the O–O bond cleaved,
analogous to c [27]. The second mechanism assumes that the O–O bond cleavage is a
heterolytic process coupled to ring expansion, leading in one step from b to a lactone
intermediate e [31]. Hydrolysis of the lactone completes the reaction. Notably, this
mechanism assumes that the metal ion does not change its oxidation state between
species b and f. An indirect argument in support of this mechanism was obtained in
a study where a mechanistic probe was used. The probe substrate has a -CH2-OH
group instead of the -OOH present in the intermediate b, and it was reported to be
transformed to 2-tropolone, i.e. a 7-membered ring derivative of species analogous
to e.
With the aim to test if the proposed mechanism would be energetically viable
for the mechanistic probe, a QC study was undertaken with the use of an active site
model for extradiol dioxygenase [8]. Moreover, as the mechanism assumes that the
metal ion is merely a Lewis acid, a much simpler model was considered, i.e. a model

Fig. 15 Two mechanisms proposed for ring cleavage reaction catalysed by extradiol dioxygenases.
Reprinted with permission from Ref. [8]
(Branch labels in the figure: upper, homolytic O–O cleavage, Fe redox active; lower, heterolytic O–O cleavage, Fe as Lewis acid.)

where the whole active site is replaced by a single molecule of formic acid. The two
models are presented in Fig. 16. In the reduced model the formic acid is placed so
that it donates its acidic hydrogen to form an H-bond with the leaving OH group of the
probe substrate. The second oxygen accepts an H-bond from the ring-bound OH. Such an
arrangement enables the formic acid to shuttle a proton from O5 to O1 during the
reaction. For the actual active site model various protonation patterns were probed,
and the one that gave the lowest activation energy is presented in Fig. 16. Moreover,
for comparative purposes a hypothetical Fe(III) oxidation state was also considered
for the E-Fe-substrate complex, as Fe(III) is a stronger Lewis acid than Fe(II).
The major results of the study are summarised in Fig. 17, where the
relative energies and key bond lengths are reported for the critical stationary points.
Two general conclusions can be drawn from the analysis of the Figure. First, that

Fig. 16 Models used in the study on the alkenyl migration mechanism. Reprinted with permission
from Ref. [8]

Fig. 17 Reaction mechanisms obtained for the alkenyl migration with three different models.
Reprinted with permission from Ref. [8]
Values recoverable from the figure panels (bond lengths a, b, c in Å; ΔE in kcal/mol; sFe as printed in the figure):
(a) formic acid model: a 2.20, b 1.91, c 1.70; ΔE = 54.1 and 2.5
(b) Fe(III) model: a 2.49, b 1.76, c 1.60; sFe = 4.05, 4.07, 4.09; ΔE = 48.4 and 10.9
(c) Fe(II) model: a 2.22, b 1.98, c 1.51; sFe = 3.74, 4.09, 4.11; ΔE = 35.5 and 14.9

in this example formic acid is already quite a reasonable model for the Fe(III) form
of the active site. Second, that even for the probe substrate the redox activity of the
metal ion is a key catalytic factor. More specifically, the critical bond lengths for
the transition states in panels A and B are very similar and the chemical structure of
the organic product is the same in both cases. In parallel, the computed activation
energies are also close to each other, though very high. Importantly, in both cases the
transition state is for a heterolytic alkenyl migration mechanism, which we attempted
to test (no spin polarization along the cleaved C–O bond). On the other hand, for
a model with a native Fe(II) oxidation state of the metal a different mechanism
with lower activation energy was found. As shown in Fig. 17c, in this case the C–
O bond cleavage is a homolytic reaction uncoupled from the ring expansion. Just
as in the radical mechanism proposed for the native substrate (Fig. 15), the leaving

OH radical is one-electron reduced by the ferrous ion, whereas the -CH2 radical
attacks the nearby atom from the aromatic ring. Importantly, the calculated barrier is
significantly lower than that for the (hypothetical) Fe(III)-bound model, where the
proposed mechanism is the heterolytic alkenyl migration. In light of these facts, and
taking into account the enormous height of the barrier, it is safe to conclude that the
heterolytic ring cleavage mechanism is very unlikely for extradiol dioxygenases.
In summary, the study on the mechanism of ring expansion for a mechanistic
probe for extradiol dioxygenases showed that significant insights into catalytic factors
governing an enzymatic reaction can be obtained by deliberately constructing and testing a range
of QC models differing in size and composition. First, a minimal model, where a
molecule of formic acid was used in place of the whole active site, turned out to give
geometries and energies similar to those of the active site model. Such a small model could
be used in benchmark energy computations with, for example, the CCSD(T) method.
Second, an active site model with a non-native Fe(III) state of the metal cofactor was
tested and compared to the model with the normal Fe(II) cofactor. Comparison of
the results allowed us to rule out the heterolytic ring expansion mechanism.

4 Concluding Remarks

In this chapter we showed how the DFT methods applied to active site models can be
used to test mechanistic hypotheses for metalloenzymes. As exemplified by the case
studies summarised here, for such systems one often needs to test several plausible
spin states or various reactive species that can be formed in the course of a redox
reaction. In some cases the second shell residues need to be included in the model
as their omission may even lead to a completely altered reaction mechanism. On the
other hand, in some other instances the whole active site model may be replaced
by a single molecular fragment and still yield valuable information. In yet another
case, the active site model needs to be rather large to reproduce the stereospecificity
of the enzyme. One should also note here that the accuracy of the DFT methods
is inevitably limited, and computed reaction energies and barriers can sometimes
be burdened by an error of up to ca. 10 kcal/mol. However, the aim of the
mechanistic studies is usually not to reproduce an experimental value of a barrier
height or reaction energy, but rather to suggest the most likely mechanism. Thus,
if two mechanisms differ in the rate-limiting barriers by more than 10 kcal/mol the
lower barrier mechanism is to be selected as the most likely and the other is ruled
out. When the difference is below 5–6 kcal/mol, which is a typical magnitude of
error of DFT methods, one can attempt to construct a reduced model and use it
in, e.g. CCSD(T) benchmark calculations, though this procedure may not always
succeed. With all these caveats, we believe the DFT modelling of enzymatic reaction
mechanisms will continue to be a valuable complement to experimental techniques.
The experimental identification and characterization of transition states is out of
reach.
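
The rule of thumb stated above can be illustrated with a short Python sketch. The 10 kcal/mol and 5–6 kcal/mol thresholds are taken from the text. Everything else is an assumption made only for illustration: the function names, the "tentative" label for the intermediate range that the text does not address, and the transition-state-theory estimate of the rate ratio, which assumes equal prefactors for both pathways.

```python
import math

R_KCAL = 1.987204e-3  # molar gas constant in kcal/(mol*K)


def rate_ratio(delta_barrier_kcal, temperature_k=298.15):
    """Ratio of rate constants for two pathways whose free-energy barriers
    differ by delta_barrier_kcal, assuming equal prefactors (an assumption)."""
    return math.exp(delta_barrier_kcal / (R_KCAL * temperature_k))


def compare_mechanisms(barrier_a, barrier_b, decisive=10.0, ambiguous=6.0):
    """Classify two computed rate-limiting barriers (kcal/mol).

    decisive  : gap above which the higher-barrier mechanism is ruled out
    ambiguous : gap below which DFT alone cannot discriminate
    Gaps in between are labelled 'tentative' (a hypothetical category,
    not discussed in the text). For equal barriers 'A' is returned by
    convention and the verdict is 'ambiguous'.
    """
    gap = abs(barrier_a - barrier_b)
    favoured = "A" if barrier_a <= barrier_b else "B"
    if gap > decisive:
        verdict = "decisive"
    elif gap < ambiguous:
        verdict = "ambiguous"
    else:
        verdict = "tentative"
    return favoured, gap, verdict


if __name__ == "__main__":
    # Hypothetical barriers, not values from any of the case studies.
    print(compare_mechanisms(12.0, 25.0))
    print(compare_mechanisms(18.0, 15.0))
    print(f"rate ratio for a 10 kcal/mol gap at 298 K: {rate_ratio(10.0):.1e}")
```

Under these assumptions a 10 kcal/mol difference in barrier height corresponds to a rate ratio of roughly seven orders of magnitude at room temperature, which is why such a gap is treated as decisive even when each individual barrier carries a sizeable error.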

Acknowledgements This research project was supported by grant No. UMO-2011/01/B/ST4/02620
from the National Science Centre, Poland, and partly supported by grants POKL.04.0101-00-434/08-00
and 2011/01/N/ST4/02330, and by the “Kraków Interdisciplinary Ph.D.-Project in Nanoscience
and Advanced Nanostructures” operated within the Foundation for Polish Science MPD Programme
co-financed by the EU European Regional Development Fund.

References

1. Antosiewicz, J., Shugar, D.: Poisson–Boltzmann continuum-solvation models: applications to


pH-dependent properties of biomolecules. Mol. Biosyst. 7, 2923–2949 (2011)
2. Becke, A.D.J.: Density-functional thermochemistry. III. The role of exact exchange. Chem.
Phys. 98, 5648–5652 (1993)
3. Blomberg, M.R.A., Siegbahn, P.E.M.: A quantum chemical approach to the study of reaction
mechanisms of redox-active metalloenzymes. J. Phys. Chem. B 105, 9375–9386 (2001). https://
doi.org/10.1021/jp010305f
4. Borowski, T., Bassan, A., Siegbahn, P.E.M.: 4-Hydroxyphenylpyruvate dioxygenase: a hybrid
density functional study of the catalytic reaction mechanism. Biochemistry 43, 12,331–12,342
(2004)
5. Borowski, T., Bassan, A., Siegbahn, P.E.M.: Mechanism of dioxygen activation in 2-
oxoglutarate-dependent enzymes: a hybrid DFT study. Chem. Eur. J. 10(4), 1031–1041 (2004).
https://doi.org/10.1002/chem.200305306
6. Borowski, T., Georgiev, V., Siegbahn, P.E.M.: On the observation of a gem diol intermediate
after O–O bond cleavage by extradiol dioxygenases: a hybrid DFT study. J. Mol. Model 16(11),
1673–1677 (2010). https://doi.org/10.1007/s00894-010-0652-5
7. Borowski, T., Noack, H., Radoń, M., Zych, K., Siegbahn, P.E.M.: Mechanism of selective
halogenation by SyrB2: a computational study. J. Am. Chem. Soc. 132(37), 12887–12898
(2010b). https://doi.org/10.1021/ja101877a
8. Borowski, T., Wójcik, A., Miłaczewska, A., Georgiev, V., Blomberg, M.R.A., Siegbahn, P.E.M.:
The alkenyl migration mechanism catalyzed by extradiol dioxygenases: a hybrid DFT study.
J. Biol. Inorg. Chem. (2012). https://doi.org/10.1007/s00775-012-0904-1
9. Brownlee, J., He, P., Moran, G.R., Harrison, D.H.T.: Two roads diverged: the struc-
ture of hydroxymandelate synthase from Amycolatopsis orientalis in complex with 4-
hydroxymandelate. Biochemistry 47(7), 2002–2013 (2008). https://doi.org/10.1021/bi701438r
10. Bugg, T.D., Sanvoisin, J., Spence, E.L.: Exploring the catalytic mechanism of the extradiol
catechol dioxygenases. Biochem. Soc. Trans. 25(1), 81–85 (1997)
11. Burzlaff, N.I., Rutledge, P.J., Clifton, I.J., Hensgens, C.M., Pickford, M., Adlington, R.M.,
Roach, P.L., Baldwin, J.E.: The reaction cycle of isopenicillin N synthase observed by X-ray
diffraction. Nature 401(6754), 721–724 (1999). https://doi.org/10.1038/44400
12. Dann, C.E., Bruick, R.K., Deisenhofer, J.: Structure of factor-inhibiting hypoxia-inducible
factor 1: an asparaginyl hydroxylase involved in the hypoxic response pathway. Proc. Nat.
Acad. Sci. U.S.A 99(24), 15,351–15,356 (2002). https://doi.org/10.1073/pnas.202614999
13. Dellago, C., Bolhuis, P.G.: Transition path sampling simulations of biological systems. Top.
Curr. Chem. 268, 291–317 (2007)
14. Evans, D.A., Wales, D.J.: Free energy landscapes of model peptides and proteins. J. Chem.
Phys. 118, 3891 (2003)
15. Flashman, E., Schofield, C.J.: The most versatile of all reactive intermediates? Nat. Chem.
Biol. 3(2), 86–87 (2007). https://doi.org/10.1038/nchembio0207-86
16. Georgiev, V., Borowski, T., Blomberg, M.R.A., Siegbahn, P.E.M.: A compartison of the reaction
mechanisms of iron- and manganese-containing 2,3-HPCD: an important spin transition for
manganese. J. Biol. Inorg. Chem. 13, 929–940 (2008)
848 T. Borowski and E. Broclawik

17. Georgieva, P., Himo, F.: Quantum chemical modeling of enzymatic reactions: the case of
histone lysine methyltransferase. J. Comput. Chem. 31(8), 1707–1714 (2010). https://doi.org/
10.1002/jcc.21458
18. Grimme, S.: Semiempirical GGA-type density functional constructed with a long-range dis-
persion correction. J. Comput. Chem. 27, 1787–1799 (2006)
19. Hammes-Schiffer, S.: Theory of proton-coupled electron transfer in energy conversion pro-
cesses. Acc. Chem. Res. 42, 1881–1889 (2009)
20. Hanauske-Abel, H.M., Gnzler, V.: A stereochemical concept for the catalytic mechanism of
prolylhydroxylase: applicability to classification and design of inhibitors. J. Theor. Biol. 94(2),
421–455 (1982)
21. Harvey, J.N., Aschi, M., Schwarz, H., Koch, W.: The singlet and triplet states of phenyl cation. A
hybrid approach for locating minimum energy crossing points between non-interacting poten-
tial energy surfaces. Theor. Chem. Acc. 99, 95–99 (1998)
22. Hausinger, R.P.: Fe(II)/α-ketoglutarate-dependent hydroxylases and related enzymes. Crit. Rev.
Biochem. Mol. Biol. 39(1), 21–68 (2004). https://doi.org/10.1080/10409230490440541
23. Higgins, L.J., Yan, F., Liu, P., Liu, H., Drennan, C.L.: Structural insight into antibiotic fos-
fomycin biosynthesis by a mononuclear iron enzyme. Nature 437(7060), 838–844 (2005).
https://doi.org/10.1038/nature03924
24. Holm, R.H., Kennepohl, P., Solomon, E.S.: Structural and functional aspects of metal sites in
biology. Chem. Rev. 96, 2239–2314 (1996). https://doi.org/10.1021/cr9500390
25. Hu, H., Lu, Z., Parks, J., Burger, S., Yang, W.: Quantum mechanics/molecular mechanics
minimum free-energy path for accurate reaction energetics in solution and enzymes: sequential
sampling and optimization on the potential of mean force surface. J. Chem. Phys. 128(034),
105 (2008)
26. Kawatsu, T., Lundberg, M., Morokuma, K.: Protein free energy corrections in ONIOM QM:
MM modeling: A case study for isopenicillin N synthase (IPNS). J. Chem. Theory Comput. 7,
390–401 (2011)
27. Kovaleva, E.G., Lipscomb, J.D.: Intermediate in the O–O bond cleavage reaction of an extradiol
dioxygenase. Biochemistry 47, 11168–11170 (2008)
28. Lee, C., Yang, W., Parr, R.G.: Development of the Colle-Salvetti correlation energy formula
into a functional of the electron density. Phys. Rev. B 37, 785–789 (1988)
29. Liu, P., Murakami, K., Seki, T., He, X., Yeung, S.M., Kuzuyama, T., Seto, H., Liu, H.: Protein
purification and function assignment of the epoxidase catalyzing the formation of fosfomycin.
J. Am. Chem. Soc. 123(19), 4619–4620 (2001)
30. Maeda, S., Ohno, K., Morokuma, K.: Exploring multiple potential energy surfaces: photo-
chemistry of small carbonyl compounds. Adv. Phys. Chem. Article ID 268,124, 13 pages
(2012)
31. Mendel, S., Arndt, A., Bugg, T.D.H.: Acid-base catalysis in the extradiol catechol dioxygenase
reaction mechanism: site-directed mutagenesis of His-115 and His-179 in Escherichia coli
2,3-dihydroxyphenylpropionate 1,2-dioxygenase (MhpB). Biochemistry 43(42), 13390–13396
(2004). https://doi.org/10.1021/bi048518t
32. Miłaczewska, A., Broclawik, E., Borowski, T.L.: On the catalytic mechanism of (S)-2-
hydroxypropylphosphonic acid epoxidase (HppE): a hybrid DFT study. Chem. Eur. J. (2012).
https://doi.org/10.1002/chem.201202825
33. Moran, G.R.: 4-Hydroxyphenylpyruvate dioxygenase. Arch. Biochem. Biophys. 433(1), 117–
128 (2005). https://doi.org/10.1016/j.abb.2004.08.015
34. Ng, S.S., Kavanagh, K.L., McDonough, M.A., Butler, D., Pilka, E.S., Lienard, B.M.R., Bray,
J.E., Savitsky, P., Gileadi, O., von Delft, F., Rose, N.R., Offer, J., Scheinost, J.C., Borowski,
T., Sundstrom, M., Schofield, C.J., Oppermann, U.: Crystal structures of histone demethylase
JMJD2A reveal basis for substrate specificity. Nature 448(7149), 87–91 (2007). https://doi.
org/10.1038/nature05971
35. Pelmenschikov, V., Blomberg, M., Siegbahn, P.E.: A theoretical study of the mechanism for
peptide hydrolysis by thermolysin. J. Biol. Inorg. Chem. 7, 284–298 (2002)
Bioinorganic Reaction Mechanisms—Quantum Chemistry Approach 849

36. Rod, T., Ryde, U.: Accurate QM/MM free energy calculation of enzyme reactions: Methylation
by catechol O-methyltransferase. J. Chem. Theory Comput. 1, 1240–1251 (2005)
37. Schenk, G., Mitić, N., Gahan, L.R., Ollis, D.L., McGeary, R.P., Guddat, L.W.: Binuclear met-
allohydrolases: Complex mechanistic strategies for a simple chemical reaction. Acc. Chem.
Res. (2012). https://doi.org/10.1021/ar300067g
38. Schofield, C., Zhang, Z.: Structural and mechanistic studies on 2-oxoglutarate-dependent oxy-
genases and related enzymes. Curr. Opin. Struct. Biol. 9(6), 722–731 (1999)
39. Senn, H., Thiel, W.: QM/MM methods for biological systems. Top. Curr. Chem. 268, 173–290
(2007)
40. Sheppard, D., Terrell, R., Henkelman, G.: Optimization methods for finding minimum energy
paths. J. Chem. Phys. 128(134), 106 (2008)
41. Siegbahn, P.E.M.: Modeling aspects of mechanisms for reactions catalyzed by metalloenzymes.
J. Comput. Chem. 22, 1634–1645 (2001)
42. Siegbahn, P.E.M.: Mechanisms of metalloenzymes studied by quantum chemical methods. Q.
Rev. Biophys. 36, 91–145 (2003)
43. Siegbahn, P.E.M., Borowski, T.: Modeling enzymatic reactions involving transition metals.
Acc. Chem. Res. 39(10), 729–738 (2006). https://doi.org/10.1021/ar050123u
44. Siegbahn, P.E.M., Haeffner, F.: Mechanism for catechol ring-cleavage by non-heme iron extra-
diol dioxygenases. J. Am. Chem. Soc. 126(29), 8919–8932 (2004). https://doi.org/10.1021/
ja0493805
45. Siegbahn, P.E.M., Himo, F.: Recent developments of the quantum chemical cluster approach
for modeling enzyme reactions. J. Biol. Inorg. Chem. 14(5), 643–651 (2009). https://doi.org/
10.1007/s00775-009-0511-y
46. Trewick, S.C., Henshaw, T.F., Hausinger, R.P., Lindahl, T., Sedgwick, B.: Oxidative demethyla-
tion by Escherichia coli AlkB directly reverts DNA base damage. Nature 419(6903), 174–178
(2002). https://doi.org/10.1038/nature00908
47. Vaillancourt, F.H., Barbosa, C.J., Spiro, T.G., Bolin, J.T., Blades, M.W., Turner, R.F.B.,
Eltis, L.D.: Definitive evidence for monoanionic binding of 2,3- dihydroxybiphenyl to 2,3-
dihydroxybiphenyl 1,2-dioxygenase from UV resonance Raman spectroscopy, UV/Vis absorp-
tion spectroscopy, and crystallography. J. Am. Chem. Soc. 124(11), 2485–2496 (2002). https://
doi.org/10.1021/ja0174682
48. Wójcik, A., Broclawik, E., Siegbahn, P.E.M., Lundberg, M., Moran, G., Borowski, T.: Role
of Substrate Positioning in the Catalytic Reaction of 4-Hydroxyphenylpyruvate Dioxygenase -
A QM/MM Study. J. Am. Chem. Soc. 136(41), 14472–14485 (2014). https://doi.org/10.1021/
ja506378u
49. Ye, S., Riplinger, C., Hansen, A., Krebs, C., Bollinger, J.M., Neese, F.: Electronic structure
analysis of the oxygen-activation mechanism by Fe(II)- and α-ketoglutarate (αkg)-dependent
dioxygenases. Chemistry 18(21), 6555–6567 (2012). https://doi.org/10.1002/chem.201102829
50. Zhou, J., Kelly, W.L., Bachmann, B.O., Gunsior, M., Townsend, C.A., Solomon, E.I.: Spec-
troscopic studies of substrate interactions with clavaminate synthase 2, a multifunctional α-
KG-dependent non-heme iron enzyme: Correlation with mechanisms and reactivities. J. Am.
Chem. Soc. 123, 7388–7398 (2001)
Index

A
Amyloid formation and stability, 472, 551

B
Bioinformatics methods, 562
Boltzmann inversion, 121

C
Coarse-grained models of nucleic acids, 12, 13, 77, 117
Coarse-grained models of protein structure, 37, 258, 272, 308, 415, 522, 552
Coarse-graining, 121

D
Dynamics of nucleic acids, 30, 70, 117, 119, 137, 138, 140, 144, 335

E
Empirical force fields, 5, 382, 828

F
Force field, 120

G
Generalized ensembles, 262

I
Interstrand potential, 125
Intrastrand potential, 123

M
Markov chain, 264
Membrane proteins and lipids, 15, 61, 64, 72, 74, 332, 342, 357, 358, 372–374, 376,
377, 381–383, 385–388, 394, 395, 398, 401, 404, 405, 412, 416, 420, 425, 716
Metalloenzymes, 372, 773, 825, 826, 829, 831, 846
Molecular dynamics, 6, 9, 11, 15, 30, 34, 36, 48, 61, 68, 91, 95, 119, 120, 122, 163,
164, 169, 171, 175, 186, 191, 192, 259, 261, 263, 268, 270–272, 281, 286,
296–298, 308, 310, 319, 325, 339, 372, 385, 387, 394, 398, 399, 401, 402, 406,
410, 413, 414, 421, 467, 471, 476, 480, 491, 514, 541, 544, 682, 718, 720, 728,
729, 731, 734, 756, 828, 837
Molecular mechanics, 4, 40, 337, 423, 489, 725, 772, 826
Molecular quantum mechanics, 164, 168, 348, 722, 723, 725, 726, 757, 772
Molecular simulations and modeling, 62, 203, 253
Monte Carlo methods, 119, 186, 192, 227, 236, 242, 243, 259–261, 325, 387, 453, 454,
618, 735, 769

N
Nonbonded potential, 126

P
Parallel tempering, 264
Protein dynamics, 34, 38, 61–63, 72, 77, 412
Protein structure prediction, 28, 39, 46, 63, 76, 90, 109, 373, 388, 573, 603, 634, 659,
689, 700, 703

S
Simulated tempering, 263

© Springer Nature Switzerland AG 2019
A. Liwo (ed.), Computational Methods to Study the Structure and Dynamics
of Biomolecules and Biomolecular Processes, Springer Series on Bio-
and Neurosystems 8, https://doi.org/10.1007/978-3-319-95843-9
