Bioinformatics Techniques For Drug Discovery

SPRINGER BRIEFS IN COMPUTER SCIENCE

Aman Chandra Kaushik


Ajay Kumar · Shiv Bharadwaj
Ravi Chaudhary · Shakti Sahi

Bioinformatics
Techniques for
Drug Discovery
Applications for
Complex Diseases
SpringerBriefs in Computer Science

Series editors
Stan Zdonik, Brown University, Providence, Rhode Island, USA
Shashi Shekhar, University of Minnesota, Minneapolis, Minnesota, USA
Xindong Wu, University of Vermont, Burlington, Vermont, USA
Lakhmi C. Jain, University of South Australia, Adelaide, South Australia, Australia
David Padua, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
Xuemin Sherman Shen, University of Waterloo, Waterloo, Ontario, Canada
Borko Furht, Florida Atlantic University, Boca Raton, Florida, USA
V. S. Subrahmanian, University of Maryland, College Park, Maryland, USA
Martial Hebert, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Katsushi Ikeuchi, University of Tokyo, Tokyo, Japan
Bruno Siciliano, Università di Napoli Federico II, Napoli, Italy
Sushil Jajodia, George Mason University, Fairfax, Virginia, USA
Newton Lee, Newton Lee Laboratories, LLC, Burbank, California, USA
SpringerBriefs present concise summaries of cutting-edge research and practical
applications across a wide spectrum of fields. Featuring compact volumes of 50 to
125 pages, the series covers a range of content from professional to academic.
Typical topics might include:
• A timely report of state-of-the art analytical techniques
• A bridge between new research results, as published in journal articles, and a
contextual literature review
• A snapshot of a hot or emerging topic
• An in-depth case study or clinical example
• A presentation of core concepts that students must understand in order to make
independent contributions
Briefs allow authors to present their ideas and readers to absorb them with
minimal time investment. Briefs will be published as part of Springer’s eBook
collection, with millions of users worldwide. In addition, Briefs will be available for
individual print and electronic purchase. Briefs are characterized by fast, global
electronic dissemination, standard publishing contracts, easy-to-use manuscript
preparation and formatting guidelines, and expedited production schedules. We aim
for publication 8–12 weeks after acceptance. Both solicited and unsolicited
manuscripts are considered for publication in this series.

More information about this series at http://www.springer.com/series/10028


Aman Chandra Kaushik
School of Life Sciences and Biotechnology
Shanghai Jiao Tong University
Shanghai
China

Ajay Kumar
School of Engineering
Gautam Buddha University
Greater Noida, Uttar Pradesh
India

Shiv Bharadwaj
Nanotechnology Research and Application Center
Sabanci University
Tuzla, Istanbul
Turkey

Ravi Chaudhary
School of Biotechnology
Gautam Buddha University
Greater Noida, Uttar Pradesh
India

Shakti Sahi
School of Biotechnology
Gautam Buddha University
Greater Noida, Uttar Pradesh
India

ISSN 2191-5768 ISSN 2191-5776 (electronic)


SpringerBriefs in Computer Science
ISBN 978-3-319-75731-5 ISBN 978-3-319-75732-2 (eBook)
https://doi.org/10.1007/978-3-319-75732-2
Library of Congress Control Number: 2018932352

© The Author(s) 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This book is an outgrowth, or organized compilation, of the recent bioinformatics
approaches used for drug discovery, and is designed primarily for researchers and
academicians in the field. It is not, however, an elementary book: it presupposes
knowledge of computational biology at the level of postgraduates and research scholars.
The authors have long held the view that a lack of knowledge of the fundamental aspects
of the various computational tools is a serious shortcoming in postgraduate education
and among research scholars. Hence the inclusion of greater detail than is usually found
in texts for research scholars or postgraduates with presumed experience in
computational biology. With the current demand for new drugs for complex diseases, as
well as the development of resistance in these diseases, computational tools have been
recommended and successfully established as a solution to the growing demand for drugs
in the pharmaceutical industry as well as in research institutes. The chapters cover
recent computer-aided drug discovery and drug design approaches in expanded form. The
usually required material is presented in a concise form, and details on special aspects
are then described in the form of addenda. It is hoped that this approach will meet the
needs of beginners in the field of drug design and discovery, and also provide useful
information to research scholars and researchers for more advanced study.
The bioinformatics approach to drug design is an interdisciplinary field that
requires sophisticated techniques and software tools to elucidate hidden or complex
biological data. The intellectual challenge involved in the study of drug discovery has
attracted scientists from the fields of computer science, biology, mathematics and
engineering, and the field today constitutes a frontier of computational biology. Every
attempt has been made in the present work to provide an integrated approach covering all
the essential aspects of drug discovery using bioinformatics. If one visualizes drug
design as an organized collection of different interactions between the drug molecule or
inhibitor and the target of interest, most commonly a protein, then the emphasis given in
certain chapters to molecular docking, dynamics simulations and models that validate
inhibitory ability on the target molecule will be understandable. Research scholars will
be impressed by the fact that the fundamental strategy in drug discovery is the
inhibition of targets in any complex disease by blocking their active sites. This is to
be expected, since the evolutionary diversification and complexity found in different
diseases are much greater than those of the metabolic activities or biochemical pathways
of agents or molecules.
Chapter 2 gives insight into the ligand-based approach to drug design using
computational techniques. Chapter 3 describes the structure-based approach to drug
design using computational techniques, and Chap. 4 integrates the information on
three-dimensional (3D) pharmacophore modelling-based drug design by computational
techniques and other properties. Chapter 5 explains the molecular dynamics simulation
approach to investigating the dynamic behaviour of a system through the application of
Newtonian mechanics. Chapter 6 explains the receptor thermodynamics of ligand–receptor
or ligand–enzyme association, and Chap. 7 covers thermodynamic cycles and their
application in protein targets. Finally, Chap. 8 provides insights into different
computational approaches to genomics and proteomics that help to predict the target of
interest.

Shanghai, China Aman Chandra Kaushik


Greater Noida, India Ajay Kumar
Istanbul, Turkey Shiv Bharadwaj
Greater Noida, India Ravi Chaudhary
Greater Noida, India Shakti Sahi
Contents

1 Brief Introduction
  1.1 Brief Evolutionary History of In Silico Approaches
  1.2 Computational Drug Discovery and Design
  1.3 Epigenetics: Beyond the Sequence
  1.4 Histones Modification
  References
2 Ligand-Based Approach for In-silico Drug Designing
  2.1 Introduction
  2.2 Molecular Descriptors
    2.2.1 2D QSAR Descriptors
    2.2.2 3D QSAR Descriptors
    2.2.3 Multidimensional QSAR
  2.3 Constitutional Descriptors
  2.4 Quantitative Structure–Activity Relationships
  2.5 Molecular Fingerprint and Similarity Searches
  2.6 Similarity Searches in LB-CADD
  2.7 Similarity Networks and off Target Predictions
  2.8 Fingerprint Extensions
  2.9 Computational Methods for Biomolecular Docking
  References
3 Structure-Based Approach for In-silico Drug Designing
  3.1 Introduction
  3.2 Protein Docking
    3.2.1 Protein–Protein Docking
    3.2.2 Protein–Ligand Docking
  References
4 Three-Dimensional (3D) Pharmacophore Modelling-Based Drug Designing by Computational Technique
  4.1 Introduction
    4.1.1 Pharmacophore Model
  References
5 Molecular Dynamics Simulation Approach to Investigate Dynamic Behaviour of System Through the Application of Newtonian Mechanics
  5.1 Introduction
  5.2 Molecular Dynamics Simulations
  5.3 Monte Carlo Research with Metropolis Criterion
  References
6 Receptor Thermodynamics of Ligand–Receptor or Ligand–Enzyme Association
  6.1 Introduction
  6.2 Database Searching
    6.2.1 De Novo Drug Design
  6.3 State-of-the-Art Free Energy Calculations
  References
7 Thermodynamic Cycles and Their Application in Protein Targets
  7.1 Introduction
  7.2 Protein Targets and Applications
  7.3 4-Hydroxyphenylpyruvate Dioxygenase (HPPD)
  7.4 Oligopeptide-Binding Protein a (OppA)
  References
8 Genomics and Proteomics Using Computational Biology
  8.1 Introduction
  8.2 Peptide Identification
  8.3 De Novo and Hybrid Algorithms
  8.4 Sequence Database Search Algorithms
  8.5 Scoring of Peptide Identifications
  8.6 Peptide-Spectrum Match Scores and Common Thresholds
  8.7 Fundamentals of Gene Transcription and Translation
  8.8 Genome Sequencing
  8.9 Definition of Genome Annotation
  8.10 Genome Annotation Strategies
  8.11 Proteogenomics
  References
About the Authors

Aman Chandra Kaushik is a core computational biologist with a proclivity for
biological databases and nature-inspired algorithms. He holds a Bachelor’s degree in
Life Science (DDU University, India), a Master’s degree in Bioinformatics (CSJM
University, India) and a Ph.D. in Bioinformatics (Indo-Israel collaborative project),
and completed postdoctoral work in computational biology at Ben Gurion University,
Israel. Currently, he is a research assistant at Shanghai Jiao Tong University, China.
He was a research fellow in a project sponsored by the Indian Council of Medical
Research (ICMR). He has published research articles in various international journals
of repute, and has attended national and international conferences and presented
papers. He has also been awarded several scholarships and travel grants, including: a
postdoctoral scholarship from the Kreitman Postdoctoral Fellowship (PDF); a
postdoctoral scholarship from Shanghai Jiao Tong University sponsored by the Ministry
of Science and Technology, China; a travel grant and total expenses for the MCCMB 2017
Conference, Moscow, Russia, from Kreitman, the Israeli Ministry of Science and ISF; a
travel grant and total expenses for the “Worldwide innovative networking in
personalized cancer medicine” WIN 2017 Symposium, Paris, France; a travel grant and
total expenses for the Joint ICGEB-ICTP-APCTP Workshop from ICTP, which is governed by
UNESCO, IAEA and Italy; a 4-month scholarship from the Ministry of Science, Technology
and Space, Israel; and the “Young Researcher Scholar Award” from GRDS International.


Ajay Kumar is an M.Tech student at Gautam Buddha University; his research focuses
on mechanical engineering and cancer biology.

Shiv Bharadwaj is a postdoctoral scholar at Sabanci University, Istanbul, Turkey. He
holds a PhD in Biotechnology, and his research focuses on nanotechnology and
bioinformatics.

Ravi Chaudhary is a PhD scholar at Gautam Buddha University, Greater Noida, India. He
holds an M.Tech in Biotechnology, and his research focuses on biotechnology.

Shakti Sahi is an Assistant Professor at the School of Biotechnology, Gautam Buddha
University. She holds a PhD in Molecular Modelling and Drug Design from the Department
of Biophysics, All-India Institute of Medical Sciences. Prior to this, she completed a
Master’s in Pharmacy at the Institute of Technology, Banaras Hindu University. Her
research focuses on molecular modelling and drug design, with special emphasis on
G-protein-coupled receptors (GPCRs). Dr. Sahi has published many research articles in
reputed journals.
Chapter 1
Brief Introduction

Abstract Recent knowledge collected on drug molecules and their intermolecular
interactions can be used to predict the mechanisms underlying human physiological
processes. In this scenario, computer-aided drug design (CADD) is commonly employed to
speed up the identification of potential inhibitors. Amongst the various computational
approaches, pharmacophore modelling is regarded as a useful technique for identifying
lead inhibitors or drug molecules that belong to chemically different structural
classes. Besides, biological networks and biochemical mathematical models have been
employed to explore pharmacokinetics and pharmacodynamics in biological systems.
Moreover, molecular dynamics (MD) simulation, a broadly used computational approach
based on Newton’s equations of motion for a given system of atoms, delivers information
about protein–ligand interactions. Additionally, synthetic biology has been broadly
employed as a precise and robust approach, helped by rapidly growing genome sequence
data and falling DNA synthesis costs. Synthetic biology has also been used to
investigate biological circuits and the behaviour or role of the human physiological
system. Notably, the ability to design potential drugs depends heavily on a fundamental
understanding of drug molecules and their biochemical interactions. In this context,
the gap between the number of identified hit molecules and genuine drug molecules can
be bridged by recent bioinformatics approaches.

Keywords Biological networks · Pharmacophore · Pharmacokinetics · Systems biology · Drug discovery · Diseases

The recent application of computational approaches in pharmacy, termed in silico
pharmacology and sometimes referred to as computational therapeutics, is a
fast-developing area worldwide. It concerns the development of computer-aided programs
to capture, evaluate and combine the medical and biological information collected from
numerous sources. More precisely, it describes the use of this information to build
simulations or computational models that can be used to make predictions, suggest
hypotheses and eventually deliver improvements in therapeutics and medication. In
silico pharmacology thus treats drug development as a massively complicated exercise in
interpreting and handling information. Consequently, the available information provides
a way to find shortcuts and to guide drug design and commercialization [1]. In medical
science, drug development is a comprehensive study of the different types of
interaction between chemical compounds such as medicinal agents, also known as ligands,
and the macromolecules that are their respective targets. The search for drug-like
compounds that bind specifically and selectively to the target, i.e. to active sites in
the biomolecule of interest, and thereby interfere with its receptor function or
enzymatic activity, demands multi- and interdisciplinary approaches. Here,
computer-aided modelling tools play an important role in predicting and understanding
the relevant ligand–receptor or ligand–enzyme interactions [2].

A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_1

1.1 Brief Evolutionary History of In Silico Approaches

Drug development, and the associated analysis aimed at establishing potential drug
molecules, was pursued even in the absence of modern computational approaches.
Albert [3, 4] concluded that the first insights into structure–activity relationships
can be traced back to the nineteenth century. Meyer [5] and Overton [6] established
that depressant action increases with the partition coefficient between a lipid solvent
and water, and so reported a quantitative relation between activity and physicochemical
properties. These studies made it possible to identify electronic and lipophilicity
properties as important factors in pharmacodynamic (PD) and pharmacokinetic (PK)
responses. These factors were well demonstrated in the epoch-making and more recent
studies of Hansch [7, 8]. Also, the work of Crum Brown and Fraser, summarized by
Albert [3], documented the role of the 2D structure of drug molecules or compounds in
pharmacological activity. Cushny [9] proposed a three-dimensional (3D) structure for
drug molecules and reported relations between enantiomerism and bioactivity. Later, in
the mid-twentieth century, this concept was explored further with the finding of
conformational effects on bioactivity [10]. In the late nineteenth and early twentieth
centuries, John Langley, Paul Ehrlich and Alfred Clark (reviewed by Ariëns [11] and
Parascandola [12]) formulated the concept of receptors as the specific targets of drug
action, in step with the evolving understanding of molecular structure. The
similarities between receptors and enzymes were later described by Albert (1971).
The converging lines of development in biology and chemistry created a body of data
and understanding that exceeded the capacity of ‘in cerebro’ information handling, and
this has been driving the growth and emergence of computer sciences. In the early
1960s, Hansch was the first to use calculators and statistical methods to derive
quantitative relations between structure (variables, descriptors and process) and
activity. Between the 1980s and 1990s, such activities evolved into quantitative
structure–activity relationships (QSARs), which were implemented using computer
graphics and molecular modelling. Computational approaches then quickly became an
integral tool in drug design and development. The result is a triad of
chemistry–biology–informatics that has emerged as a unique system for bringing new
insights into pharmacology.
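
To make the idea of a quantitative structure–activity relation concrete, the following is a minimal sketch of a one-descriptor, Hansch-type linear fit, log(1/C) = a·log P + b, by ordinary least squares. It is not the procedure used by Hansch or by the authors; the function name `fit_line` and all data values are hypothetical and invented purely for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical descriptor/activity pairs (invented, not measured data):
# log P is the lipophilicity descriptor, activity is log(1/C).
log_p = [0.5, 1.0, 1.5, 2.0, 2.5]
activity = [1.1, 1.6, 2.1, 2.6, 3.1]

slope, intercept = fit_line(log_p, activity)
print(f"log(1/C) = {slope:.2f} * logP + {intercept:.2f}")
```

Real QSAR models typically use several descriptors and a validation step; a single-variable fit is shown only because it keeps the structure-to-activity mapping easy to follow.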

1.2 Computational Drug Discovery and Design

Recent drug discovery relies greatly on computational efforts, which provide insight
at the atomic level. Computational methods and experimental observations therefore
generally complement each other in an interdisciplinary mode [2]. The rationalization
of experimental findings at an atomistic level can provide general guidelines for the
synthesis of active compounds. Furthermore, assessments of the binding free energy (ΔG)
offer useful insights into the ligand binding process [13]. There is a consensus in the
scientific community that the major tasks in computer-aided drug design nowadays rely
primarily on the accurate and efficient calculation of binding free energies. The
binding free energy characterizes the equilibrium between the ligand in solution and
the ligand bound to its molecular target, e.g. a protein. It depends on the different
types of interaction that take place when the ligand binds to its target. To screen
ligands on the basis of their binding affinity, three main factors should be
considered: (i) the ligand, (ii) the protein and (iii) the solvent that contains both
species. Each factor may contribute to the binding free energy, which is the sum of all
contributions, as represented in Fig. 1.1.
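
The link between the binding equilibrium and the binding free energy can be sketched with the standard thermodynamic relation ΔG° = RT ln(K_d / c°), where K_d is the dissociation constant and c° = 1 M is the standard concentration. K_d is not named in the text; it is introduced here as an assumption, and the function name `binding_free_energy` is hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # molar gas constant in kJ/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kJ/mol) from a dissociation constant.

    Uses dG = R * T * ln(Kd / c0) with a 1 M standard state, so tighter
    binders (smaller Kd) give more negative values.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    if temperature_k <= 0:
        raise ValueError("temperature must be positive")
    return R_KJ * temperature_k * math.log(kd_molar)


# Illustrative values only: a nanomolar and a micromolar binder.
for kd in (1e-9, 1e-6):
    print(f"Kd = {kd:.0e} M -> dG = {binding_free_energy(kd):.1f} kJ/mol")
```

Because the relation is logarithmic, each tenfold improvement in K_d at 298 K changes ΔG° by the same fixed amount, RT ln 10, which is why small free-energy errors translate into large errors in predicted affinity.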
From the perspective of ligand binding, the formation of a ligand–protein complex
leads to favourable interactions such as hydrogen bond formation, electrostatic
attractions, sigma bond formation, etc. These interactions contribute to the enthalpy
of binding of the ligand to the target of interest. At the same time, the interaction
changes the conformational, translational and rotational freedom of the ligand, and
hence adds an unfavourable entropic contribution to the binding process. The
conformational selection model, however, states that a protein can adopt an ensemble of
different conformations [14]. This model describes well the frequent occurrence of
low-energy conformations and the rare occurrence of higher energy conformations.
Besides, it is often reported that a ligand binds to an unfavourable protein
conformation. In that case, the favourable protein–ligand enthalpy is sufficient to
stabilize the protein in its high-energy conformational state for tightly binding
ligands, and, from the perspective of the protein, there is an unfavourable enthalpic
contribution.
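
The enthalpic and entropic contributions discussed here combine through the standard Gibbs relation. The per-component split shown below is only a schematic bookkeeping of the three factors named in the text (ligand, protein, solvent); the subscripted symbols are notation introduced for illustration and are not taken from the original.

```latex
\Delta G_{\mathrm{bind}}
  = \Delta H_{\mathrm{bind}} - T\,\Delta S_{\mathrm{bind}}
  \approx \sum_{i \in \{\mathrm{ligand},\,\mathrm{protein},\,\mathrm{solvent}\}}
      \left( \Delta H_i - T\,\Delta S_i \right)
```

Binding is favourable when ΔG_bind < 0, so an unfavourable entropic term (ΔS_i < 0) in one component has to be outweighed by favourable terms elsewhere in the sum.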
By interacting with the protein, the ligand gains favourable enthalpic contributions
but loses both rotational and conformational freedom, which leads to an unfavourable
entropic contribution. On ligand binding, the protein may adopt a high-energy
conformational state, which adds unfavourable entropic and enthalpic contributions to
the whole system. Desolvation, in turn, makes an unfavourable enthalpic but a
favourable entropic contribution to the system. In this regard, in silico or
computational techniques help to simulate, and to make decisions on, nearly all the
elements involved in the drug development process [15]. For instance, as knowledge of
the human genome improves, computational and experimental information has to be
integrated, with in silico pharmacology linking all the critical types of
information [16]. Hence, structure-based techniques are largely employed in drug
discovery and development. Additionally, as an example from neuropharmacology, it is
anticipated that kinetics-based ligand–receptor models should be combined with
systems-level methods to fully understand neurological problems, and that the whole can
be employed in pharmacology [17]. Essentially, there are two main consequences whenever
bioactive compounds interact with biological systems, as shown in Fig. 1.2 [18].

Fig. 1.1 Contributions of the three main factors in ligand binding to the total binding free energy

A biological system is defined as an extremely complex network of biologically
relevant entities such as proteins and genes. For instance, unicellular organisms, cells
isolated from multicellular organisms, and populations of unicellular or multicellular
organisms each represent an individual biological system. When it comes to the
interaction of drug (or any xenobiotic) molecules with a biological system, the
phenomenon can be described as ‘what the drug molecule does to the biosystem’ and
‘what the biosystem does to the drug molecule’. A drug acting on a biological system
elicits a pharmacological and/or toxic response, classified as pharmacodynamic (PD)
events. Conversely, the biological system acts on the drug by absorbing, distributing,
metabolizing and eliminating drug molecules, and this response is classified as
pharmacokinetic (PK) events. It is important to mention, however, that these two facets
of the interaction between drug molecules and the biological system are inextricably
interdependent [20]. Absorption, distribution and elimination clearly have a decisive
impact on the intensity and level of PD events, and biotransformation consequently
gives rise to distinct PK behaviour. More precisely, it may be useful to regard as
targets the biological elements that generate PD events following their interaction
with a drug molecule or any other xenobiotic compound. Such elements include receptors,
ion channels, nucleic acids, and anabolic and catabolic enzymes. Likewise, one can speak
of the biological components that act on drugs, which include xenobiotic-metabolizing
enzymes, transporters, circulating proteins and membranes; they act on drugs by
metabolizing, transporting, distributing or excreting them out of the biological system.

Fig. 1.2 Two basic types of interaction between a drug and a biological system, termed
PD events (activity and toxicity) and PK events (ADME) (modified from [19] and reproduced
with the kind permission of the Verlag Helvetica Chimica Acta in Zurich). ADME, absorption,
distribution, metabolism and excretion; PD, pharmacodynamic; PK, pharmacokinetic

Designing and developing new medicines is a long, multifaceted, expensive and highly
risky process that has few peers in the commercial world. Therefore, computer-aided
drug design (CADD) approaches are being widely employed in the pharmaceutical industry
to speed up the drug development process [21]. Typically, it takes 10–15 years and
approximately US$500–800 million to bring a lead drug through synthesis and testing to
the market [22]. In this regard, it is advantageous to use computational tools in
hit-to-lead optimization to cover a large library of chemicals whilst decreasing the
number of compounds that must be designed and evaluated in in vitro studies. The
optimization of a screened ligand by computational tools can involve structure-based
analysis of docking energy profiles for the screened analogs, ligand-based evaluation
of compounds with analogous chemical structure or enhanced predicted biological
activity, calculation of favourable affinity, and improvement of drug metabolism and
pharmacokinetics (DMPK) and absorption, distribution, metabolism, excretion and
potential for toxicity (ADMET) properties. In contrast to the commercial method,
CADD-assisted synthesis and biological characterization of chemical compounds is
cheaper, more focused and less time-consuming, and diversifies the chemical space [23].
CADD can raise the screening hit rate for novel ligands or drug molecules because it
uses a target-specific search, in contrast to traditional high-throughput screening
(HTS) and combinatorial chemistry. It aims not only to reveal the molecular basis of
therapeutic activity but also to predict possible derivatives with enhanced activity.
Generally, CADD is used for three major purposes in a drug discovery campaign [24]:
(1) To narrow wide chemical compound libraries down to small sets of compounds
that can be experimentally evaluated.
(2) To guide the optimization of promising screened compounds so as to improve
their DMPK and ADMET properties.
(3) To assist in the design and development of novel inhibitors or drug molecules,
either by building up initial fragments one functional group at a time or by
joining fragments together into novel chemotypes.
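
The first purpose in the list above, narrowing a large library to a small set for experimental evaluation, can be sketched as a simple filter-and-rank step. This is an illustrative assumption about how such a shortlist might be built, not a method described by the authors; the `Compound` type, the `shortlist` function, the score threshold and every compound name and score are hypothetical.

```python
from typing import NamedTuple


class Compound(NamedTuple):
    name: str
    docking_score: float  # hypothetical score in kcal/mol; more negative is better


def shortlist(library, max_score=-8.0, top_n=3):
    """Keep compounds scoring at or below max_score, best first, at most top_n."""
    passing = [c for c in library if c.docking_score <= max_score]
    return sorted(passing, key=lambda c: c.docking_score)[:top_n]


# Invented library for illustration; the scores come from no real docking run.
library = [
    Compound("cmpd_A", -9.4),
    Compound("cmpd_B", -6.1),
    Compound("cmpd_C", -8.2),
    Compound("cmpd_D", -10.3),
    Compound("cmpd_E", -7.9),
]

hits = shortlist(library)
for compound in hits:
    print(compound.name, compound.docking_score)
```

In practice the ranking criterion would combine several predicted properties, for example affinity together with DMPK and ADMET estimates, rather than a single score.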

1.3 Epigenetics: Beyond the Sequence

Epigenetics is broadly defined as the study of heritable variations or adoptions gained


by the genes to the environment without inducing any change in the genome of
organism. Herein, basic properties of epigenetic marks are described.
1. DNA methylation
DNA methylation is defined as a covalent modification of DNA. The most well-studied
example is in vertebrates, where DNA methylation takes place at cytosine-guanine
dinucleotides (CpG sites). It involves adding a methyl (–CH3) group at the 5-carbon of
the pyrimidine ring of a cytosine base, converting this base from cytosine to
5-methylcytosine. This reaction is carried out by enzymes called DNA
methyltransferases (DNMT) and involves transferring a methyl group from
S-adenosyl methionine (SAM) to cytosine. Two main families of DNMT are known in
mammals: DNMT1 and DNMT3. The enzymes DNMT3a and DNMT3b are classified as de novo
methyltransferases, whilst DNMT1 plays an important role in maintaining methylation
after replication: it scans newly synthesized genomic DNA for methylated CpG sites in
the mother strand and adds methyl groups to the corresponding CpG sites in the daughter
strand. The existence and mechanism of active DNA demethylation is an even more active
area of research. The ten-eleven translocation (TET) proteins have been of great
interest because they can convert 5-methylcytosine to 5-hydroxymethylcytosine and
further to 5-formylcytosine and 5-carboxylcytosine, which can be excised via
base excision repair to revert to an unmethylated cytosine state [25, 26].
Moreover, CpG sites tend to be under-represented in genomes as a direct consequence
of their propensity for methylation at the cytosine and the vulnerability of
methylated cytosines to deamination, which results in a cytosine-to-thymine transition.
Additionally, methylation is the default state for a large proportion of cytosines
present in CpG pairs. The most important exception is CpG islands. The exact criteria
for defining these regions are open to differences of opinion, but in general they
consist of regions of several hundred to a few thousand base pairs with an enrichment
for CpG dinucleotides relative to the genome-wide average. The CpG sites in these
regions are generally not methylated; hence, the distribution of DNA methylation from a
sampling of CpG loci in a vertebrate genome is typically bimodal, with a low-methylation
mode corresponding to CpG sites within CpG islands and a high-methylation mode
corresponding to CpG sites elsewhere. A third mode, though much smaller than the
other two, can be assigned to hemi-methylated sites corresponding to imprinted
regions, wherein the maternally or paternally inherited copy of a locus is silenced
early in development via DNA methylation whilst the other copy remains unmethylated
[27].
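The under-representation of CpG sites and their enrichment in CpG islands can be
illustrated with an observed-to-expected CpG ratio. The following is a minimal sketch,
assuming the commonly used definition O/E = (number of CpG × length) / (number of C ×
number of G); the function name cpg_obs_exp and the toy sequences are hypothetical and
are not taken from this chapter.

```python
def cpg_obs_exp(sequence):
    """Return the observed/expected CpG ratio of a DNA sequence.

    Assumed definition (not given in the text):
        O/E = (#CpG * N) / (#C * #G), with N the sequence length.
    """
    seq = sequence.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap with itself, so str.count finds every CpG.
    cpg_count = seq.count("CG")
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * n / (c_count * g_count)


# Toy sequences: one CpG-rich (island-like), one without any CpG.
island_like = "CGCGCGATCGCGCG"
depleted = "CATGCATGCAGTCAGT"
print(cpg_obs_exp(island_like))
print(cpg_obs_exp(depleted))
```

A ratio well below 1 indicates CpG depletion, whereas island-like regions give
comparatively high values; any cut-off used to call an island is a matter of convention.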
Despite this ‘tri-modality’ of DNA methylation, it should be noted that a single
cell can be methylated at both, one or neither copy of a given locus. Not all cells in
an individual, or even in a single cell sample taken from an individual, are expected
to follow an identical pattern of methylation (or lack thereof) at a given CpG site.
Consequently, measuring the overall methylation level at a CpG site in a group of cells
from an individual yields a continuous measurement, which can be interpreted as the
fraction of CpG alleles that are methylated in that group of cells. These overall
methylation levels can also vary between individuals in a population.
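The interpretation of a methylation level as a fraction of methylated alleles can be
sketched as follows. This is an illustrative example only; the function name
methylation_level and the sample data are hypothetical.

```python
def methylation_level(allele_calls):
    """Fraction of methylated alleles at one CpG site.

    allele_calls: iterable of booleans, one per sampled allele
    (True = methylated, False = unmethylated).
    """
    calls = list(allele_calls)
    if not calls:
        raise ValueError("no alleles sampled at this site")
    return sum(calls) / len(calls)


# Hypothetical sample of four cells, each contributing two alleles of the locus:
# methylated at both copies, at one copy, at neither copy, and at one copy.
cells = [(True, True), (True, False), (False, False), (True, False)]
alleles = [allele for cell in cells for allele in cell]
print(methylation_level(alleles))  # 4 methylated alleles out of 8
```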
The list of biological functions that methylated CpG sites perform in a cell is
long. However, the function of a CpG site depends greatly on the context of the
genetic sequence and on other epigenetic modifications present in its vicinity.
Generally, a methylated site is regarded as repressive to transcription. It is also
important to mention that DNA methylation is not hypothesized to play a fully causative,
repressive role in all contexts; it may instead be a consequence of other factors that
determine transcriptional activity. Moreover, some experimental evidence indicates that
DNA methylation reinforces a transcriptionally inactive state under certain
circumstances rather than being a straightforward cause or consequence thereof [28].
Genomic imprinting and female X chromosome inactivation are two long-studied functions
of DNA methylation. In the former, either the maternally or paternally inherited copy
of a gene is silenced by extensive methylation of CpG sites in the promoter region [27].
In the latter, methylation of one copy of the female’s X chromosome makes it
transcriptionally inactive, achieving the same level of transcription as in males, who
carry a single copy of the X chromosome [29].
Transposable elements comprise a large fraction of the human genome. In terms
of absolute numbers of CpG sites involved, a large fraction of methylated sites lie
in the promoters of such elements [30, 31]. This methylation leads to transcriptional
inactivity as well as an increased likelihood of C→T mutagenesis over time, decreasing
the likelihood of transposable element mobilization in the genome and increasing
the overall genomic stability. Beyond the examples outlined above, CpG methylation
in mammals has been investigated in many genes, particularly in the context of
cancer, where aberrant methylation is linked to inappropriate activation or repression
of cell proliferation-related genes. Typically, promoter regions can be divided into
two categories based on the presence or absence of CpG islands. Genes whose promoters
contain CpG islands, which are most commonly in an unmethylated state, are generally
repressed via means other than DNA methylation, such as by binding of polycomb
proteins. However, methylation of CpG island promoters is seen where long-term
repression is required, such as in female X chromosome inactivation and imprinted
genes. Interestingly, genes whose promoter regions do not contain CpG islands show
much more variability in their DNA methylation [32].
CpG sites within gene bodies are also subject to variable DNA methylation. In
contrast to promoter methylation, this DNA methylation is typically positively
correlated with expression of a gene when present within the gene body rather than near
the transcription initiation site. The current hypothesis is that gene-body methylation
hinders spurious transcription initiation at sites within the gene body, allowing the
transcription machinery to bind and initiate transcription more effectively at true
start sites [33].
Enhancers are sites more distal (up to several hundred kb) from genes that also
participate in the process of transcriptional regulation. The functions and effects of
enhancer DNA methylation are less well researched than those for promoters. However,
recent efforts have found active enhancers to be neither completely unmethylated
nor fully methylated, but to exist as ‘low-methylation’ regions [34].
Research in past decades focused on DNA methylation, its patterns and its effects
at canonical genes or in the context of diseases such as cancer. High-throughput
methods for measuring DNA methylation at a wide range of CpG loci in the genome
have been used to quantify the distribution and variation of DNA methylation in
populations of healthy individuals, as well as its relationship to genetic variation,
gene expression and other epigenetic traits. Recent work investigating these
relationships in a set of primary untransformed human fibroblasts documented both
negative and positive correlations between DNA methylation and gene expression that
depend less on position with respect to the gene body or promoter and more on the
histone marks in the selected gene region.

1.4 Histone Modifications

The genetic material of eukaryotic cells, i.e. DNA, is packaged into nucleosomes,
which tends to reduce access to the DNA for the transcription machinery. Additional
modifications of the histones, i.e. the constituent proteins of nucleosomes, can
either further restrict or ease access to the DNA. Various amino acid residues within
the histones are subject to modifications, including methylation, ubiquitination,
acetylation and phosphorylation, which leads to many possible configurations of histone
modifications in each region. Recent efforts to study the distributions of these
modifications have shown that various transcriptional states, such as active or
inactive genes, promoters and enhancers, are correlated with combinations of certain
marks [35]. The distribution of individual marks, their functions and the implications
of a given combination of marks are a growing area of research [36, 37]. Marks are
named by histone, residue and modification: H3K4me2, for example, indicates
di-methylation (me2) of lysine 4 (K4) on histone 3 (H3). Methylation of lysine
residues 4, 27 and 36 on histone 3 is one type of modification for which data are
available in a wide variety of cell types. Whilst the following should be interpreted
as general guidelines rather than deterministic rules, H3K4me3 is typically associated
with promoters of active genes, H3K4me2 is found in active genes, and H3K4me1 is found
adjacent to active promoters in some cases and at more distal enhancers of genes
that are either active or poised for activation. Lysine 27 acetylation (H3K27ac) has
been shown to be a mark that, together with H3K4me1, signals active enhancers as
opposed to poised enhancers [38]. H3K27me3 is indicative of inactive promoters,
whilst H3K36me3 is indicative of active gene bodies.
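The naming convention for histone marks and the guideline associations listed above
can be captured in a small lookup. The sketch below is illustrative only: the function
parse_histone_mark, the regular expression and the dictionary names are hypothetical,
and the pattern covers only the mark names that appear in this section.

```python
import re

# Residue codes and modification suffixes that occur in this section.
RESIDUES = {"K": "lysine"}
MODIFICATIONS = {
    "me1": "mono-methylation",
    "me2": "di-methylation",
    "me3": "tri-methylation",
    "ac": "acetylation",
}

# Guideline associations from the text (not deterministic rules).
TYPICAL_CONTEXT = {
    "H3K4me3": "promoters of active genes",
    "H3K4me2": "active genes",
    "H3K4me1": "near active promoters or at distal enhancers (active or poised)",
    "H3K27ac": "active enhancers (together with H3K4me1)",
    "H3K27me3": "inactive promoters",
    "H3K36me3": "active gene bodies",
}

# Histone number, one-letter residue, residue position, modification suffix.
MARK_PATTERN = re.compile(r"^H(\d)([A-Z])(\d+)(me[123]|ac)$")


def parse_histone_mark(mark):
    """Split a mark name such as 'H3K4me2' into its components."""
    match = MARK_PATTERN.match(mark)
    if match is None:
        raise ValueError(f"unrecognised histone mark: {mark!r}")
    histone, residue, position, modification = match.groups()
    return {
        "histone": f"H{histone}",
        "residue": RESIDUES.get(residue, residue),
        "position": int(position),
        "modification": MODIFICATIONS[modification],
    }


for name, context in TYPICAL_CONTEXT.items():
    print(name, parse_histone_mark(name), "->", context)
```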

References

1. S. Ekins, J. Mestres, B. Testa, In silico pharmacology for drug discovery: methods for virtual
ligand screening and profiling. Br. J. Pharmacol. 152, 9–20 (2007)
2. S. Ekins, J. Mestres, B. Testa, In silico pharmacology for drug discovery: applications to targets
and beyond. Br. J. Pharmacol. 152, 21–37 (2007)
3. A. Albert, Relations between molecular structure and biological activity: stages in the evolution
of current concepts. Ann. Rev. Pharmacol. 11, 13–36 (1971)
4. A. Albert, Selective toxicity. The physico-chemical basis of therapy. Chapman and Hall:
London (1985)
5. H. Meyer, Zur Theorie der Alkoholnarkose. Arch. Exp. Pathol. Pharmakol. 42, 110–118 (1899)
6. E. Overton, Studien über die Narkose. Gustav Fischer: Jena (1901)
7. C. Hansch, T. Fujita, p-σ-π analysis. A method for the correlation of biological activity and
chemical structure. J. Am. Chem. Soc. 86, 1616–1626 (1964)
8. C. Hansch, Quantitative relationships between lipophilic character and drug metabolism. Drug
Metab. Rev. 1, 1–13 (1972)
9. A. Cushny, Biological Relations of Optical Isomeric Substances. Williams and Wilkins: Bal-
timore (1926)
10. A. Burgen, Conformational changes and drug action. Fed Proc, 2723–2728 (1981)
11. E.J. Ariëns, Receptors: from fiction to fact. Trends Pharmacol. Sci. 1, 11–15 (1979)
12. J. Parascandola, Origins of the receptor theory. Trends Pharmacol. Sci. 1, 189–192 (1979)
13. X. Du, Y. Li, Y.-L. Xia, S.-M. Ai, J. Liang, P. Sang, X.-L. Ji, S.-Q. Liu, Insights into
protein–ligand interactions: mechanisms, models, and methods. Int. J. Mol. Sci. 17, 144 (2016)
14. P. Csermely, R. Palotai, R. Nussinov, Induced fit, conformational selection and independent
dynamic segments: an extended view of binding events. Trends Biochem. Sci. 35, 539–546
(2010)
15. S. Ekins, P.W. Swaan, Development of computational models for enzymes, transporters, chan-
nels, and receptors relevant to ADME/Tox. Rev. Comput. Chem. 20, 333 (2004)
16. P.A. Whittaker, What is the relevance of bioinformatics to pharmacology? Trends Pharmacol.
Sci. 24, 434–439 (2003)
17. I. Aradi, P. Érdi, Computational neuropharmacology: dynamical approaches in drug discovery.
Trends Pharmacol. Sci. 27, 240–243 (2006)
18. S. Ekins, J. Mestres, B. Testa, In silico pharmacology for drug discovery: applications to targets
and beyond. Br. J. Pharmacol. 152, 21–37 (2007)
19. B. Testa, S.D. Krämer, The biochemistry of drug metabolism—an introduction. Chem.
Biodivers. 3, 1053–1101 (2006)
20. B. Testa, S.D. Kraemer, The biochemistry of drug metabolism—an introduction. Chem. Bio-
divers. 4, 257–405 (2007)
21. T. Katsila, G.A. Spyroulias, G.P. Patrinos, M.-T. Matsoukas, Computational approaches in
target identification and drug discovery. Comput. Struct. Biotechnol. J. 14, 177–184 (2016)
22. S.C. Basak, Editorial. Curr. Comput. Aided Drug Des. 8, 1–2 (2012)
23. I.J. Enyedy, W.J. Egan, Can we use docking and scoring for hit-to-lead optimization? J. Comput.
Aided Mol. Des. 22, 161–168 (2008)
24. A. Veselovsky, A. Ivanov, Strategy of computer-aided drug design. Curr. Drug Targets-Infect.
Disord. 3, 33–40 (2003)
25. Y.-F. He, B.-Z. Li, Z. Li, P. Liu, Y. Wang, Q. Tang, J. Ding, Y. Jia, Z. Chen, L. Li, Tet-mediated
formation of 5-carboxylcytosine and its excision by TDG in mammalian DNA. Science 333,
1303–1307 (2011)
26. S. Ito, L. Shen, Q. Dai, S.C. Wu, L.B. Collins, J.A. Swenberg, C. He, Y. Zhang, Tet pro-
teins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 333,
1300–1303 (2011)
27. E. Li, C. Beard, R. Jaenisch, Role for DNA methylation in genomic imprinting. Nature 366,
362–365 (1993)
28. A. Blattler, P.J. Farnham, Cross-talk between site-specific transcription factors and DNA methy-
lation states. J. Biol. Chem. 288, 34287–34294 (2013)
29. T. Mohandas, R. Sparkes, L. Shapiro, Reactivation of an inactive human X chromosome:
evidence for X inactivation by DNA methylation. Science 211, 393–396 (1981)
30. J.A. Yoder, C.P. Walsh, T.H. Bestor, Cytosine methylation and the ecology of intragenomic
parasites. Trends Genet. 13, 335–340 (1997)
31. C.P. Walsh, J.R. Chaillet, T.H. Bestor, Transcription of IAP endogenous retroviruses is con-
strained by cytosine methylation. Nat. Genet. 20, 116–117 (1998)
32. P.A. Jones, Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat.
Rev. Genet. 13, 484–492 (2012)
33. A.K. Maunakea, R.P. Nagarajan, M. Bilenky, T.J. Ballinger, C. D’Souza, S.D. Fouse, B.E.
Johnson, C. Hong, C. Nielsen, Y. Zhao, Conserved role of intragenic DNA methylation in
regulating alternative promoters. Nature 466, 253–257 (2010)
34. M.B. Stadler, R. Murr, L. Burger, R. Ivanek, F. Lienert, A. Schöler, E. van Nimwegen,
C. Wirbelauer, E.J. Oakeley, D. Gaidatzis, DNA-binding factors shape the mouse methylome
at distal regulatory regions. Nature (2011)
35. J. Ernst, M. Kellis, Discovery and characterization of chromatin states for systematic annotation
of the human genome. Nat. Biotechnol. 28, 817–825 (2010)
36. J.-S. Lee, E. Smith, A. Shilatifard, The language of histone crosstalk. Cell 142, 682–685 (2010)
37. T. Suganuma, J.L. Workman, Signals and combinatorial functions of histone modifications.
Annu. Rev. Biochem. 80, 473–499 (2011)
38. M.P. Creyghton, A.W. Cheng, G.G. Welstead, T. Kooistra, B.W. Carey, E.J. Steine, J. Hanna,
M.A. Lodato, G.M. Frampton, P.A. Sharp, Histone H3K27ac separates active from poised
enhancers and predicts developmental state. Proc. Natl. Acad. Sci. 107, 21931–21936 (2010)
Chapter 2
Ligand-Based Approach for
In-silico Drug Designing

Abstract In this chapter, a brief introduction to the ligand-based methodologies
employed for drug design is given. Generally, the ligand-based approach for drug
design (LB-CADD) is employed when the structure of the biological target is not known;
hence, this technique is considered an ancillary approach to drug design. The
theoretical basis of the ligand-based approach involves quantitative
structure–activity relationships (QSAR) and biomolecular docking studies, along with
molecular descriptors, molecular fingerprints, similarity searches, similarity networks
and off-target predictions. Finally, a brief description of the present work is given.

Keywords LB-CADD · 2D or 3D structure · QSAR · Molecular descriptors ·
Molecular fingerprint

2.1 Introduction

Ligand-based computer-aided drug discovery (LB-CADD) rests on the similar property
principle, which states that compounds with similar structures tend to exhibit similar
properties [1]. The LB-CADD method involves the evaluation of ligands that are known to
interact with the selected target. The technique aims to capture and retain the
physicochemical properties of the ligands that are most relevant to the desired
interactions with the target of interest, while irrelevant information is discarded.
To this end, these methods employ a set of reference structures collected from
compounds reported to interact with the target of interest, described by their 2D or
3D structures. LB-CADD is considered an indirect route to drug discovery because it
generally does not require knowledge of the structure of the target of interest [2].
The two basic techniques of LB-CADD are as follows: (i) selection of compounds based on
their chemical similarity to known ligands using a similarity measure; (ii) construction
of a QSAR model that predicts biological activity from chemical structure. Hence, this
approach is commonly used for in silico screening of novel ligands with the desired
biological activity, and for hit-to-lead and lead-to-drug optimization. It can also be
employed in optimization to improve drug metabolism and pharmacokinetics (DMPK) or
potential toxicity (ADMET) properties.

© The Author(s) 2018
A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_2

2.2 Molecular Descriptors

Molecular descriptors map the structure of a compound to a set of numerical or other
molecular characteristics that are essential for describing its activity [3]. These
descriptors are derived from subject knowledge or from quantum-mechanical tools [4, 5].
Descriptors are broadly classified according to the information they carry about the
3D position and conformation of the molecule [6]: one-dimensional (1D) descriptors,
which are scalar physicochemical properties such as molecular weight; two-dimensional
(2D) descriptors derived from the molecular constitution; and three-dimensional (3D)
descriptors derived from the molecular conformation. However, these levels of
descriptor complexity overlap, and the more complex descriptors frequently incorporate
information from the simpler ones.

2.2.1 2D QSAR Descriptors

The broad family of descriptors used in 2D-QSAR share the common property of being
independent of the 3D orientation of the compound. These descriptors range from simple
counts of the entities constituting the molecule, through its topological and
geometrical characteristics, to computed electrostatic and quantum-chemical descriptors
or higher level methods such as fragment counting [7].

2.2.2 3D QSAR Descriptors

Comparative molecular field analysis (CoMFA) is a 3D-QSAR technique and is
computationally more complex than the 2D-QSAR approach. It involves aligning the
molecules and extracting features of the aligned molecules that can be linked to
biological activity. Usually, several steps are needed to obtain numerical descriptors
of the compound structure. The conformers in the dataset first need to be aligned
uniformly in space. The space containing the immersed conformers is then probed
computationally for different descriptors. Some techniques that are independent of the
compound alignment have also been designed [8, 9].

2.2.3 Multidimensional QSAR

Multidimensional QSAR (mQSAR) covers the 4D and 5D descriptors. It aims to quantify
all the energy contributions of ligand binding, including removal of solvent
molecules, loss of conformational entropy and adaptation of the binding pocket [10].
4D-QSAR is an extension of 3D-QSAR that treats each molecule as an ensemble of
various tautomers, conformations, protonation states, orientations and stereoisomers.
The fourth dimension refers to the ensemble sampling of the spatial features of each
molecule. A receptor-independent (RI) 4D-QSAR technique has been proposed [11].
This technique places all molecules into a grid and assigns interaction elements to
every atom in the molecule (polar, nonpolar, hydrogen bond donor, etc.). Molecular
dynamics simulations (MDS) are used to create a Boltzmann-weighted conformational
ensemble of each molecule within the grid. Trial alignments are performed in the grid
across the different molecules, and descriptors are defined from the occupancy
frequencies within each of these alignments. These descriptors are called grid cell
occupancy descriptors. Herein, the conformational ensemble of every compound is
employed to generate the grid cell occupancy descriptors rather than a single
conformation.
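The idea of Boltzmann weighting an ensemble and accumulating grid cell occupancies can
be sketched as follows. This is a simplified illustration and not the published RI
4D-QSAR procedure: conformer energies and occupied grid cells are assumed to be given,
interaction element types are ignored, and the names boltzmann_weights and
occupancy_descriptors are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann (gas) constant in kcal/(mol*K)


def boltzmann_weights(energies, temperature=300.0):
    """Normalized Boltzmann weights for conformer energies given in kcal/mol."""
    kt = K_B * temperature
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def occupancy_descriptors(conformer_cells, weights):
    """Weighted occupancy of each grid cell over a conformational ensemble.

    conformer_cells: list of sets, the grid cells occupied by each conformer.
    weights: one weight per conformer, summing to 1.
    """
    occupancy = {}
    for cells, weight in zip(conformer_cells, weights):
        for cell in cells:
            occupancy[cell] = occupancy.get(cell, 0.0) + weight
    return occupancy


# Hypothetical ensemble of three conformers on a labelled grid.
energies = [0.0, 1.0, 2.0]
cells = [{"a", "b"}, {"a"}, {"a", "c"}]
weights = boltzmann_weights(energies)
print(occupancy_descriptors(cells, weights))
```

A cell occupied by every conformer receives an occupancy of 1, whereas a cell visited
only by high-energy conformers contributes little to the descriptor.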
5D-QSAR was developed to account for the local changes in the binding site that
contribute to the induced fit model of ligand binding, in a technique proposed by
Vedani [12]. The induced fit is simulated by mapping a ‘mean envelope’ for all ligands
in a training set onto the ‘inner envelope’ for every individual molecule. This
technique comprises several protocols for assessing the induced fit models, including
a linear scale based on the adaptation of topology, and adaptations based on property
fields, energy minimization and lipophilicity potential. By using this information,
the energetic cost of adapting the ligands to the binding site geometry is determined.

2.3 Constitutional Descriptors

Constitutional descriptors are simple and frequently used descriptors that reflect
the chemical composition of a compound without encoding any information on molecular
geometry or atom connectivity. Examples of constitutional descriptors include molecular
weight (MW), number of atoms (nAT), number of hydrogen atoms (nH), number of
carbon atoms (nC), number of nitrogen atoms (nN), number of oxygen atoms (nO)
and number of halogen atoms (nX). The number of rotatable bonds (RBN) is the number of
bonds that allow free rotation around their own axis. A rotatable bond is defined as
any single bond, not in a ring, that is attached to a nonterminal heavy atom. However,
owing to their high rotational energy barrier, amide bonds are omitted from the count.
The number of rings (or independent cycles, i.e. the number of non-overlapping cycles)
in a graph is commonly known as the cyclomatic number. The number of rings (nCIC) is
calculated as the cardinality of a set of independent rings called the smallest set of
smallest rings (SSSR). The number of donor atoms for H-bonds (nHDon) is a measure of the
hydrogen-bonding ability of a molecule expressed in terms of the number of possible
hydrogen bond donors. Specifically, it is calculated by adding up the hydrogens bonded
to any nitrogen or oxygen without negative charge in the molecule. Likewise, the
number of acceptor atoms for H-bonds (nHAcc) is a measure of the hydrogen-bonding
ability of a molecule expressed in terms of the number of possible hydrogen bond
acceptors. Specifically, it is calculated by adding up all nitrogen, oxygen and
fluorine atoms, excluding N with a positive formal charge, N in higher oxidation states
and the pyrrolyl form of nitrogen. Additionally, many descriptors associated with bonds
are employed, including the total numbers of single, double, triple or aromatic bonds,
in addition to the number of aromatic rings [13].
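Several of the constitutional descriptors named above can be computed from nothing
more than a list of atoms and a list of bonds. The sketch below is a minimal
illustration under stated assumptions: hydrogens are listed explicitly, each bond is
listed once, rounded standard atomic masses are used, and the ring count is taken as
the cyclomatic number (bonds − atoms + connected components). The function name
constitutional_descriptors and the data layout are hypothetical.

```python
# Rounded standard atomic masses for the elements used here (assumed values).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "F": 18.998, "Cl": 35.45, "Br": 79.904, "I": 126.904}
HALOGENS = ("F", "Cl", "Br", "I")


def constitutional_descriptors(atoms, bonds):
    """Compute simple constitutional descriptors.

    atoms: list of element symbols, hydrogens included explicitly.
    bonds: list of (i, j) atom index pairs, each bond listed once.
    """
    counts = {}
    for element in atoms:
        counts[element] = counts.get(element, 0) + 1

    # Union-find to count connected components of the molecular graph.
    parent = list(range(len(atoms)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in bonds:
        parent[find(i)] = find(j)
    components = len({find(i) for i in range(len(atoms))})

    return {
        "MW": sum(ATOMIC_MASS[element] for element in atoms),
        "nAT": len(atoms),
        "nH": counts.get("H", 0),
        "nC": counts.get("C", 0),
        "nN": counts.get("N", 0),
        "nO": counts.get("O", 0),
        "nX": sum(counts.get(x, 0) for x in HALOGENS),
        # Cyclomatic number: independent cycles in the molecular graph.
        "nCIC": len(bonds) - len(atoms) + components,
    }


# Benzene, C6H6: atoms 0-5 are ring carbons, atoms 6-11 their hydrogens.
benzene_atoms = ["C"] * 6 + ["H"] * 6
benzene_bonds = [(i, (i + 1) % 6) for i in range(6)] + [(i, i + 6) for i in range(6)]
print(constitutional_descriptors(benzene_atoms, benzene_bonds))
```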

2.4 Quantitative Structure–Activity Relationships

Quantitative structure–activity relationship (QSAR) models describe the mathematical
relation between the structural characteristics of the ligands in a compound library
and their activity against the target of interest [14]. These are regression or
classification models employed in the biological, chemical and engineering sciences.
QSAR regression models relate a set of ‘predictor’ variables (X) to the potency of the
response variable (Y), whereas classification QSAR models relate the predictor
variables to a categorical value of the response variable.
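The distinction between regression and classification QSAR models can be made concrete
with a toy example. The sketch below fits an ordinary least-squares line to a single
descriptor and then turns the prediction into a class label with a cut-off. The
descriptor values, activities, cut-off and function names are all hypothetical, and a
single-descriptor linear fit is an assumption chosen for brevity, not a method
prescribed by this chapter.

```python
def fit_linear_qsar(x, y):
    """Least-squares fit of y = slope * x + intercept for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x_new):
    """Regression output: a continuous predicted activity."""
    slope, intercept = model
    return slope * x_new + intercept


def classify(model, x_new, threshold):
    """Classification output: a categorical label derived from the prediction."""
    return "active" if predict(model, x_new) >= threshold else "inactive"


# Hypothetical training data: one descriptor (X) and a measured potency (Y).
descriptor = [1.0, 2.0, 3.0, 4.0]
potency = [5.1, 5.9, 7.1, 7.9]

model = fit_linear_qsar(descriptor, potency)
print(model)
print(predict(model, 2.5))
print(classify(model, 4.0, threshold=7.0), classify(model, 1.0, threshold=7.0))
```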
In silico pharmacology is considered to have begun in the early 1960s, when Hansch
and others started to establish QSAR models that related various molecular descriptors
to physical, chemical and biological properties, aiming to provide a computational
estimate of the bioactivity of molecules [15]. In 1964, Free and Wilson established a
mathematical model that relates the presence of different chemical substituents to
biological activity, and the two methods were later combined into the
Hansch/Free-Wilson method [16]. In the broadest sense, QSAR consists of constructing a
mathematical model that relates molecular structure to a chemical property or
biological effect by means of statistical techniques. The quality of such a model is
affected by the intrinsic noise associated with both the original data and the
methodological aspects of model construction [17]. Finally, when a statistically
significant correlation is obtained for a training set of molecules for which
biological data are available, the model can be used to predict the biological effect
of other molecules. During the last 40 years, these efforts have produced a large
number of QSAR models, many of which have been gathered and stored in the C-QSAR
database [18, 19].
A general workflow for a QSAR-based drug discovery project involves collecting a
group of active and inactive ligands and then designing a set of mathematical
descriptors that describe the physicochemical and structural properties of the selected
compounds. A model is then generated to identify the relationship between those
descriptors and the respective experimental activity, maximizing the predictive power.
Finally, the model is employed to predict the activity for a library of test compounds
that were encoded with the same descriptors. Hence, the success of a QSAR model relies
not only on the quality of the initial set of active/inactive ligands but also on the
selected descriptors and on the ability to establish an appropriate mathematical
relationship. One of the most relevant facts regarding this method is that every model
depends on the sampling space, i.e. the chemical diversity, of the initial set of
compounds with known activity. In brief, divergent scaffolds or functional groups that
are not represented within this ‘training’ set of compounds will not be represented in
the final model, and any potential hits within the screened library that contain these
groups will likely be missed. Hence, it is recommended to cover a wide chemical space
within the training set. The REACH regulation of the European Union has encouraged
experts and regulators to concentrate on developing specific validation principles for
QSAR models in the framework of chemical legislation, formerly known as the Setubal
principles and nowadays called the OECD principles.
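The dependence of a model on the chemical space of its training set can be illustrated
with a simple coverage check. The sketch below flags the functional groups of a test
compound that never occur in the training set; the group labels, the set-based
representation and the function name uncovered_groups are hypothetical simplifications,
not a validated applicability-domain method.

```python
def uncovered_groups(training_set, compound_groups):
    """Return the groups of a test compound that the training set never contains.

    training_set: list of sets of functional-group labels, one set per compound.
    compound_groups: set of functional-group labels of the test compound.
    """
    seen = set().union(*training_set)
    return set(compound_groups) - seen


# Hypothetical training compounds described by their functional groups.
training = [
    {"amide", "phenyl"},
    {"phenyl", "hydroxyl"},
    {"amide", "carboxyl"},
]

print(uncovered_groups(training, {"amide", "phenyl"}))      # fully covered
print(uncovered_groups(training, {"phenyl", "sulfonamide"}))  # one unseen group
```

A compound with unseen groups lies outside the sampled chemical space, so predictions
for it should be treated with caution.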

2.5 Molecular Fingerprint and Similarity Searches

Molecular fingerprint-based methods attempt to represent molecules in a way that
allows rapid structural comparison, in an effort to identify structurally similar
molecules or to cluster collections based on structural similarity. These methods are
computationally less expensive than pharmacophore mapping or QSAR models. They rely
entirely on chemical structure and omit the known biological activity of the
compounds, making this approach more qualitative in nature than many other LB-CADD
approaches [20].
Furthermore, fingerprint-based techniques consider all parts of the molecule
equally and avoid concentrating only on the parts of a molecule that are most
important for activity. This reduces the susceptibility to overfitting and requires
smaller datasets. Nevertheless, model performance suffers from the influence of
unnecessary features and from the often narrow chemical space that is assessed [20].
Irrespective of this disadvantage, 2D fingerprints continue to be the representation
of choice for similarity-based virtual screening [21].
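A fingerprint comparison of this kind can be sketched in a few lines. The example
below represents each fingerprint as the set of its on-bit indices and ranks a small
library against a query. The Tanimoto coefficient is used as the similarity measure,
which is an assumption at this point (it is named in Sect. 2.7); the bit patterns,
compound names and function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, top_n=3):
    """Rank library compounds by similarity to the query, most similar first."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    # Sort by descending score, then by name to keep ties deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


# Hypothetical 2D fingerprints (indices of bits that are set).
query = {1, 2, 3, 4}
library = {
    "cpd_a": {1, 2, 3, 4},
    "cpd_b": {1, 2, 3, 9},
    "cpd_c": {7, 8},
}

for score, name in similarity_search(query, library):
    print(name, round(score, 2))
```

Note that every bit contributes equally to the score, which reflects the property
discussed above: features irrelevant to activity influence the ranking as much as
relevant ones.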

2.6 Similarity Searches in LB-CADD

Fingerprint methods may be used to search databases for compounds that are similar in
structure to a lead query, providing an extended collection of compounds that can be
tested for improved activity over the lead. 2D similarity searches of databases often
use chemotype information from first-generation hits, and 2D fingerprint and 3D shape
similarity searches have been used to identify novel agonists. Oestrogen is an
essential hormone that is responsible for many aspects of developmental
physiology [22]. Cytohesins are small guanine nucleotide exchange factors that
stimulate Ras-like GTPases and control various regulatory networks implicated in a
variety of diseases [23].

2.7 Similarity Networks and Off-Target Predictions

Chemical similarity measures such as Tanimoto coefficients are now being used to
generate networks capable of clustering drugs that bind to multiple targets in order
to predict novel off-target effects. Keiser et al. [24] used a similarity ensemble
approach (SEA) to relate drug targets based on the similarity of their ligands. SEA
predicts whether a ligand and a target will interact using a statistical model that
controls for chemical similarity arising by chance. Sets of ligands that interact
with each target are compared by calculating Tanimoto coefficients, based on standard
2D Daylight fingerprints, for every pair of molecules between two sets [25]. The raw
similarity score between each pair of ligand sets is calculated as the sum of all
Tanimoto coefficients between the sets that are higher than 0.57. Since the probability
of obtaining Tanimoto coefficients higher than 0.57 by chance increases with set size,
this score must be normalized by the similarity expected by chance. The model for
random chemical similarity is obtained by randomly generating 300,000 pairs of molecule
sets with sizes spanning logarithmic intervals from 10 to 1000 molecules. Expectation
scores are calculated from the raw scores and those obtained by random chance, and are
used to link the ligand sets together in a clustered map by sequential linkage [25].
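The raw similarity score described above can be sketched as follows. The example
implements only the thresholded sum of pairwise Tanimoto coefficients between two
ligand sets; the normalization against the random background model and the clustering
step are not reproduced. Fingerprints are represented as sets of on-bit indices, and
the data and function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def raw_similarity_score(ligand_set_a, ligand_set_b, threshold=0.57):
    """Sum of all pairwise Tanimoto coefficients above the threshold."""
    total = 0.0
    for fp_a in ligand_set_a:
        for fp_b in ligand_set_b:
            coefficient = tanimoto(fp_a, fp_b)
            if coefficient > threshold:
                total += coefficient
    return total


# Hypothetical ligand sets for two targets.
target_1_ligands = [{1, 2, 3, 4}, {1, 2}]
target_2_ligands = [{1, 2, 3, 4}, {1, 2, 3}]

# Pairwise coefficients are 1.0, 0.75, 0.5 and 2/3; only 0.5 falls below 0.57.
print(raw_similarity_score(target_1_ligands, target_2_ligands))
```

Because larger sets produce more pairs, this raw sum grows with set size, which is why
the normalization by chance similarity described in the text is needed before sets can
be compared.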

2.8 Fingerprint Extensions

Present research focuses on improving fingerprint-based LB-CADD techniques.
As previously mentioned, one disadvantage is that all features of a molecule are
treated as equally important for ranking candidate molecules, irrespective of any
irrelevance of these features to the biological activity on the target. Hessler et
al. [26] proposed a technique that combines the benefits of similarity and
pharmacophore searching on the basis of 2D representations [27]. In this technique, a
set of query molecules is converted into a topological model (MTree) according to
chemically reasonable matching of corresponding functional groups. This produces a
topological map of the most similar fragments from a set of structurally diverse but
active molecules, and conserved features are characterized by high similarity scores
of the matching nodes. Because of its low dependence on chemical substructures, the
MTree model is particularly suited to recognizing alternative novel molecular
scaffolds or chemotypes [27].

2.9 Computational Methods for Biomolecular Docking

With the rapidly increasing quantity of molecular data, the computer-based analysis
of molecular interactions is becoming increasingly feasible. Methods for
computer-aided molecular docking must include a reasonably accurate model of energy
and must be able to cope with the combinatorial complexity that arises from the
molecular flexibility of the docking partners. In both areas, significant progress
has been made in the last few years. Interactions between biomolecules are the
foundation of all biological processes. Through these interactions, living organisms
maintain complex regulatory and metabolic interaction networks that together
constitute the processes of life. Experiments and computer simulations are the
primary scientific tools for finding molecules that can be used as bioactive
substances to change and control the processes of life. The analysis of specific
molecular interactions is a critical link in the chain of steps needed to understand
these processes. On the one hand, the assessment of molecular interactions requires
at least some, and ideally considerable, knowledge of the three-dimensional
structures involved. On the other hand, understanding specific molecular
interactions has been proposed as a prerequisite for developing a global model of
the biological processes inside an organism. Biomolecular structures are being
resolved at an explosive rate, and computer models for biomolecular docking can
consequently be calibrated on rapidly growing data sets. In addition, new algorithms
are being designed to tackle the considerable combinatorial complexity of the
conformational spaces in docking problems as well as the modelling of energy.
Docking methods may aim either at a very precise, detailed evaluation of a single
complex or at ranking many molecular structures against one another [28].

References

1. M.A. Johnson, G.M. Maggiora, Concepts and Applications of Molecular Similarity (Wiley,
USA, 1990)
2. J. Mestres, L. Martín-Couce, E. Gregori-Puigjané, M. Cases, S. Boyer, Ligand-based approach
to in silico pharmacology: nuclear receptor profiling. J. Chem. Inf. Model. 46, 2725–2736
(2006)
3. R.D. Cramer, D.E. Patterson, J.D. Bunce, Comparative molecular field analysis (CoMFA). 1.
Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 110, 5959–5967
(1988)
4. C. Acharya, A. Coop, J.E. Polli, A.D. MacKerell, Recent advances in ligand-based drug design:
relevance and utility of the conformationally sampled pharmacophore approach. Curr. Comput.
Aided Drug Des. 7, 10–22 (2011)
5. Y. Marrero-Ponce, O.M. Santiago, Y.M. López, S.J. Barigye, F. Torrens, Derivatives in discrete
mathematics: a novel graph-theoretical invariant for generating new 2/3D molecular descrip-
tors. I. Theory and QSPR application. J. Comput. Aided Mol. Des. 26, 1229–1246 (2012)
6. R. Todeschini, V. Consonni, Handbook of Molecular Descriptors (Wiley, USA, 2008)
7. Q. Du, P.G. Mezey, K.C. Chou, Heuristic molecular lipophilicity potential (HMLP): a 2D-
QSAR study to LADH of molecular family pyrazole and derivatives. J. Comput. Chem. 26,
461–470 (2005)
8. H. Kubinyi, 3D QSAR in Drug Design: Volume 1: Theory Methods and Applications (Springer
Science & Business Media, Germany, 1993)
9. V. Consonni, R. Todeschini, M. Pavan, P. Gramatica, Structure/response correlations and sim-
ilarity/diversity analysis by GETAWAY descriptors. 2. Application of the novel 3D molecular
descriptors to QSAR/QSPR studies. J. Chem. Inf. Comput. Sci. 42, 693–705 (2002)
10. A. Vedani, M. Dobler, Multidimensional QSAR: moving from three-to five-dimensional con-
cepts. Mol. Inform. 21, 382–390 (2002)
11. A. Hopfinger, S. Wang, J.S. Tokarski, B. Jin, M. Albuquerque, P.J. Madhav, C. Duraiswami,
Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. J. Am. Chem. Soc.
119, 10509–10524 (1997)
12. A. Vedani, M. Dobler, 5D-QSAR: the key for simulating induced fit? J. Med. Chem. 45,
2139–2149 (2002)
13. S. Gosav, M. Praisler, D. Dorohoi, G. Popa, Structure–activity correlations for illicit
amphetamines using ANN and constitutional descriptors. Talanta 70, 922–928 (2006)
14. Y. Zhang, I-TASSER server for protein 3D structure prediction. BMC Bioinform. 9, 40 (2008)
15. C. Hansch, T. Fujita, p-σ-π analysis. A method for the correlation of biological activity and
chemical structure. J. Am. Chem. Soc. 86, 1616–1626 (1964)
16. S.M. Free, J.W. Wilson, A mathematical contribution to structure-activity studies. J. Med.
Chem. 7, 395–399 (1964)
17. J. Polanski, A. Bak, R. Gieleciak, T. Magdziarz, Modeling robust QSAR. J. Chem. Inf. Model.
46, 2310–2318 (2006)
18. C. Hansch, D. Hoekman, A. Leo, D. Weininger, C.D. Selassie, Chem-bioinformatics: com-
parative QSAR at the interface between chemistry and biology. Chem. Rev. 102, 783–812
(2002)
19. A. Kurup, C-QSAR: a database of 18,000 QSARs and associated biological and physical data.
J. Comput. Aided Mol. Des. 17, 187–196 (2003)
20. J. Auer, J. Bajorath, Molecular similarity concepts and search calculations, in Bioinformatics:
Structure, Function and Applications (2008), pp. 327–347
21. P. Willett, Similarity-based virtual screening using 2D fingerprints. Drug Discov. Today 11,
1046–1053 (2006)
22. G.R. Sliwoski, 3D Enantioselective Descriptors for Ligand-based Computer-aided Drug
Design (Vanderbilt University, USA, 2012)
23. D. Stumpfe, A. Bill, N. Novak, G. Loch, H. Blockus, H. Geppert, T. Becker, A. Schmitz,
M. Hoch, W. Kolanus, Targeting multifunctional proteins by virtual screening: structurally
diverse cytohesin inhibitors with differentiated biological functions. ACS Chem. Biol. 5,
839–849 (2010)
24. M.J. Keiser, V. Setola, J.J. Irwin, C. Laggner, A.I. Abbas, S.J. Hufeisen, N.H. Jensen, M.B.
Kuijer, R.C. Matos, T.B. Tran, R. Whaley, R.A. Glennon, J. Hert, K.L.H. Thomas, D.D.
Edwards, B.K. Shoichet, B.L. Roth, Predicting new molecular targets for known drugs. Nature
462(7270), 175–181 (2009)
25. M.J. Keiser, B.L. Roth, B.N. Armbruster, P. Ernsberger, J.J. Irwin, B.K. Shoichet, Relating
protein pharmacology by ligand chemistry. Nat. Biotechnol. 25, 197–206 (2007)
26. G. Hessler, M. Zimmermann, H. Matter, A. Evers, T. Naumann, T. Lengauer, M. Rarey,
Multiple-ligand-based virtual screening: methods and applications of the MTree
approach. J. Med. Chem. 48(21), 6575–6584 (2005)
27. A. Evers, G. Hessler, H. Matter, T. Klabunde, Virtual screening of biogenic amine-binding
G-protein coupled receptors: comparative evaluation of protein-and ligand-based virtual
screening protocols. J. Med. Chem. 48, 5448–5465 (2005)
28. T. Lengauer, M. Rarey, Computational methods for biomolecular docking. Curr. Opin. Struct.
Biol. 6, 402–406 (1996)
Chapter 3
Structure-Based Approach for
In-silico Drug Designing

Abstract Structure-based drug design is a growing field that has achieved many
successes in recent years. Structure-based computer-aided drug design (SB-CADD)
depends on the ability to determine and analyse the 3D structure of the target of
interest. In other words, the SB-CADD approach rests on the premise that the ability
of a molecule to interact with a specific partner, which can be a chemical species or
a biomolecule such as a protein, and to exert a desired biological activity depends
on its ability to interact favourably at a binding site on the selected target. It
follows that molecules sharing those favourable interactions will show similar
biological effects. Novel ligands can therefore be predicted by careful analysis of
a protein's binding site. The structure-based approach also allows the rapid
selection of potential ligands from large and diverse compound libraries; these
ligands can later be validated through modelling/simulation and visualization
techniques.

Keywords SB-CADD · 3D structures · Protein’s binding site ·
Modelling/simulation and visualization techniques

3.1 Introduction

For a long time, rational drug design based on protein structure was regarded by
structural biologists as an unrealistic goal. However, the approach took hold from
the mid-1980s, and by the early 1990s the first success stories had been published
[1, 2]. Today, although considerable fine-tuning is still needed to predict and
optimize the process, structure-based drug design is an essential part of most
industrial drug discovery programs [4] and a key research topic in many academic
laboratories [3].
Recent developments in information technology have been applied to the large
amount of data generated in order to identify novel drug molecules and to improve
existing drugs [3]. Recently, advances in high-throughput crystallography, such as
automation at all stages, more intense synchrotron radiation, and new developments
© The Author(s) 2018 21


A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_3

in phase determination, have shortened the time needed to determine structures. In
this regard, structure determination by nuclear magnetic resonance (NMR) has also
been widely used in the last few years, helped by magnet and probe improvements,
automated assignment [4, 5], and new experimental methods for elucidating larger
structures [6]. Structure-based drug design is most powerful when it forms part of
an entire drug discovery process. It is also important to note that structure-based
drug design guides the discovery of a drug lead, which is not a drug product but
rather a compound with at least micromolar affinity for the selected target [7].

3.2 Protein Docking

Most computational investigators working on molecular docking assume that one of
the docking partners is a protein. This is because interactions involving proteins
are especially interesting and because comparatively many protein structures are
known. Proteins can bind to DNA, RNA, other proteins, and small organic or
metal-organic compounds. Depending on the molecular characteristics of the partner
that is docked to the protein, various computer models and algorithms have been
proposed.

3.2.1 Protein–Protein Docking

Several protein–protein docking techniques are based on the ‘rigid body’
assumption. In its simplest abstraction, this highly simplified model treats the two
proteins as two rigid solid bodies. Geometric surface models and data structures,
together with heuristic cost functions, are used to find reasonable binding modes.
To locate matching contact surfaces on the two proteins quickly within the
rigid-body method, a suitably simple yet informative description of the surface
structure is required. Several research reports have focused on this problem. Lin et
al. [8] and Norel et al. [9] provided a sparse global surface description, whereas
various other techniques use a grid-based representation of the protein surface
[10, 11]. Walls and Sternberg [12] described the protein surface as a
two-dimensional grid of geometric function values produced by projecting the surface
onto planes, and Helmer-Citterich and Tramontano [13] used a projection onto a
cylindrical surface. A novel and interesting concept is the use of spherical
harmonics to describe protein surfaces at various levels of accuracy [14]. Matching
the two point sets that represent the surfaces of the two docking partners requires
special algorithms. Shoichet et al. [15] used the well-known DOCK algorithm for this
purpose. Another paradigm that is especially helpful is geometric hashing, which has
been carried over from the field of computer vision [16, 17]. A further method uses
the fast Fourier transform to compute optimal translations efficiently, coupled with
rotational sampling [10, 18, 19]. Duncan and Olson [14] use an evolutionary
algorithm that improves the geometric fit between the two proteins. Docking can also
be performed with global optimization techniques on suitably defined conformational
spaces or with molecular dynamics methods. Totrov and Abagyan [20] present a
technique of this kind that is able to reproduce the complex between lysozyme and an
antibody from the coordinates of the uncomplexed molecules. These optimization
techniques, however, rely on the most complex energy models reported in the
literature and incorporate Monte Carlo methods.
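
The grid-based translational search mentioned above can be sketched in a few lines
of Python. This is a deliberately simplified two-dimensional illustration that
evaluates the correlation score directly by exhaustive scanning; it is the quantity
that fast Fourier transform methods compute more efficiently, not an FFT
implementation. The grid values, the core penalty of -15 and all function names are
assumptions made for this example.

```python
def correlation_score(receptor, ligand_cells, shift):
    """Sum receptor grid values under the ligand cells after a translation."""
    dx, dy = shift
    return sum(receptor.get((x + dx, y + dy), 0) for x, y in ligand_cells)


def best_translation(receptor, ligand_cells, shifts):
    """Exhaustively scan translations and return (score, shift) of the best one."""
    best = None
    for dx in shifts:
        for dy in shifts:
            score = correlation_score(receptor, ligand_cells, (dx, dy))
            if best is None or score > best[0]:
                best = (score, (dx, dy))
    return best


CORE_PENALTY = -15  # illustrative value that punishes overlap with the protein core

# Toy receptor: a 4 x 4 block of core cells with a 2 x 2 pocket of favourable cells.
receptor = {(x, y): CORE_PENALTY for x in range(4, 8) for y in range(4, 8)}
for cell in [(5, 5), (6, 5), (5, 6), (6, 6)]:
    receptor[cell] = 1

# Toy ligand: a rigid 2 x 2 block positioned at the origin.
ligand = [(0, 0), (1, 0), (0, 1), (1, 1)]

best_score, best_shift = best_translation(receptor, ligand, range(-2, 10))
print(best_score, best_shift)
```

In a full rigid-body search, this scan would be repeated for every sampled rotation
of the ligand, which is what makes an efficient translational step worthwhile.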
To limit computing time, the conformational flexibility of the protein is
restricted to the relevant motions of the side chains of surface residues.
Nonetheless, a substantial amount of computing time is needed for such
optimizations. Obtaining accurate results depends critically on an energy model that
correctly accounts for all the relevant enthalpic and entropic contributions.
Abagyan and Totrov [21] took a step in this direction by including terms for
electrostatics and side-chain entropy in the energy estimate. In general, however,
more research is needed in this field [14].

3.2.2 Protein–Ligand Docking

Docking small, mainly organic molecules to proteins is relevant to understanding
biological processes and is helpful in drug design. In recent years, a distinction
has emerged between screening ligand databases and precisely examining specific
molecular interactions. Such databases are available to researchers for conducting
more specific docking experiments. Compared with protein–protein docking, the
complementary contact surfaces between the ligand and the receptor are much less
discriminating. In fact, small ligands tend to be very flexible, which means that
they can adapt their shape to the receptor pocket. Consequently, the prime challenge
in protein–ligand docking is to model ligand flexibility accurately and to capture
the weak interactions between the ligand and the receptor. Progress along these
lines has been made by several research groups in the past few years. Miller et al.
[22] and Klebe and Mietzner [23] have developed different ways of generating small
conformational ensembles that can be used with rigid docking methods, and new energy
functions have been developed that include the contributions essential for docking
[24].
New combinatorial algorithms have been proposed to tackle the problem of ligand
flexibility directly, leading to the fastest methods available for flexible ligand
docking. Evolutionary algorithms have also been used to solve the flexible ligand
docking problem. Besides structural flexibility, other phenomena are important in
the formation of protein–ligand complexes but have so far received little attention.
For example, single water molecules between the ligand and the protein of interest
can play a crucial role in complex formation by mediating hydrogen bonds. Poornima
and Dean [25] analysed the water molecules that bind between the target protein and
its ligand. For instance, HIV-1 protease forms complexes with its inhibitors in
which a water molecule plays a crucial role in binding [26].
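
As an illustration of how an evolutionary algorithm can search the torsional space
of a flexible ligand, the following Python sketch evolves a population of torsion
angle vectors. It is a toy example under stated assumptions: the energy function is
hypothetical (it simply rewards closeness to a set of preferred angles and stands in
for a real docking score), and the population size, mutation width and elitist
selection scheme are arbitrary choices rather than the settings of any published
method.

```python
import math
import random


def wrap(angle):
    """Map an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def toy_energy(torsions, preferred):
    """Hypothetical energy: zero when every torsion equals its preferred value."""
    return sum(1.0 - math.cos(math.radians(t - p)) for t, p in zip(torsions, preferred))


def evolve(preferred, pop_size=20, generations=60, sigma=15.0, seed=0):
    """Elitist evolutionary search over torsion angles; returns (best, history)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-180.0, 180.0) for _ in preferred] for _ in range(pop_size)]
    history = []  # best energy at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda conformer: toy_energy(conformer, preferred))
        history.append(toy_energy(population[0], preferred))
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = [
            [wrap(t + rng.gauss(0.0, sigma)) for t in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = min(population, key=lambda conformer: toy_energy(conformer, preferred))
    return best, history


preferred_angles = [60.0, -60.0, 180.0]  # hypothetical optimum for three torsions
best_conformer, energy_history = evolve(preferred_angles)
print(round(energy_history[0], 3), round(toy_energy(best_conformer, preferred_angles), 3))
```

Because the better half of each generation is kept unchanged, the best energy can
never get worse from one generation to the next; real docking programs combine such
a search with a physically meaningful scoring function.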

References

1. N.A. Roberts, J.A. Martin, D. Kinchington, A.V. Broadhurst, J.C. Craig, I.B. Duncan,
S.A. Galpin, B.K. Handa, J. Kay, A. Krohn, Rational design of peptide-based HIV proteinase
inhibitors. Science 248, 358–361 (1990)
2. J. Erickson, D.J. Neidhart, J. VanDrie, D.J. Kempf, D.A. Paul, Design, activity, and 2.8
(angstrom) crystal structure of a C (2) symmetric inhibitor complexed to HIV-1 protease.
Science 249, 527 (1990)
3. A.C. Anderson, The process of structure-based drug design. Chem. Biol. 10, 787–797 (2003)
4. D. Zheng, Y.J. Huang, H.N. Moseley, R. Xiao, J. Aramini, G. Swapna, G.T. Montelione,
Automated protein fold determination using a minimal NMR constraint strategy. Protein Sci.
12, 1232–1246 (2003)
5. N. Oezguen, L. Adamian, Y. Xu, K. Rajarathnam, W. Braun, Automated assignment and 3D
structure calculations using combinations of 2D homonuclear and 3D heteronuclear NOESY
spectra. J. Biomol. NMR 22, 249–263 (2002)
6. K. Pervushin, R. Riek, G. Wider, K. Wüthrich, Attenuated T2 relaxation by mutual cancellation
of dipole–dipole coupling and chemical shift anisotropy indicates an avenue to NMR structures
of very large biological macromolecules in solution. Proc. Natl. Acad. Sci. 94, 12366–12371
(1997)
7. C.L. Verlinde, W.G. Hol, Structure-based drug design: progress, results and challenges. Struc-
ture 2, 577–587 (1994)
8. S.L. Lin, R. Nussinov, D. Fischer, H.J. Wolfson, Molecular surface representations by sparse
critical points. Proteins: Struct. Funct. Bioinform. 18, 94–101 (1994)
9. R. Norel, S.L. Lin, H.J. Wolfson, R. Nussinov, Molecular surface complementarity at protein-
protein interfaces: the critical role played by surface normals at well placed, sparse, points in
docking. J. Mol. Biol. 252, 263–273 (1995)
10. I.A. Vakser, C. Aflalo, Hydrophobic docking: a proposed enhancement to molecular recognition
techniques. Proteins: Struct. Funct. Bioinform. 20, 320–329 (1994)
11. F. Ackermann, G. Herrmann, F. Kummert, S. Posch, G. Sagerer, D. Schomburg, Protein dock-
ing: combining symbolic descriptions of molecular surfaces and grid-based scoring functions,
in ISMB (1995), pp. 3–11
12. P.H. Walls, M.J. Sternberg, New algorithm to model protein-protein recognition based on
surface complementarity: Applications to antibody-antigen docking. J. Mol. Biol. 228, 277–297
(1992)
13. M. Helmer-Citterich, A. Tramontano, PUZZLE: a new method for automated protein docking
based on surface shape complementarity. J. Mol. Biol. 235, 1021–1031 (1994)
14. T. Lengauer, M. Rarey, Computational methods for biomolecular docking. Curr. Opin. Struct.
Biol. 6, 402–406 (1996)
15. B.K. Shoichet, I.D. Kuntz, D.L. Bodian, Molecular docking using shape descriptors. J. Comput.
Chem. 13, 380–397 (1992)
16. D. Fischer, S.L. Lin, H.L. Wolfson, R. Nussinov, A geometry-based suite of molecular docking
processes. J. Mol. Biol. 248, 459–477 (1995)
17. H.-P. Lenhof, An Algorithm for the Protein Docking Problem (1995)
18. E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A.A. Friesem, C. Aflalo, I.A. Vakser, Molecu-
lar surface recognition: determination of geometric fit between proteins and their ligands by
correlation techniques. Proc. Natl. Acad. Sci. 89, 2195–2199 (1992)
19. I.A. Vakser, Protein docking for low-resolution structures. Protein Eng. Des. Sel. 8, 371–378
(1995)
20. M. Totrov, R. Abagyan, Detailed ab initio prediction of lysozyme–antibody complex with 1.6
Å accuracy. Nat. Struct. Mol. Biol. 1, 259–263 (1994)
21. R. Abagyan, M. Totrov, Biased probability Monte Carlo conformational searches and electro-
static calculations for peptides and proteins. J. Mol. Biol. 235, 983–1002 (1994)
22. M.D. Miller, S.K. Kearsley, D.J. Underwood, R.P. Sheridan, FLOG: a system to select
‘quasi-flexible’ ligands complementary to a receptor of known three-dimensional structure.
J. Comput. Aided Mol. Des. 8, 153–174 (1994)
23. G. Klebe, T. Mietzner, A fast and efficient method to generate biologically relevant conforma-
tions. J. Comput. Aided Mol. Des. 8, 583–606 (1994)
24. H.-J. Böhm, The development of a simple empirical scoring function to estimate the binding
constant for a protein-ligand complex of known three-dimensional structure. J. Comput. Aided
Mol. Des. 8, 243–256 (1994)
25. C. Poornima, P. Dean, Hydration in drug design. 1. Multiple hydrogen-bonding features of water
molecules in mediating protein-ligand interactions. J. Comput. Aided Mol. Des. 9, 500–512
(1995)
26. A. Wlodawer, Rational drug design: the proteinase inhibitors, Pharmacotherapy. J. Human
Pharmacol. Drug Ther. 14 (1994)
Chapter 4
Three-Dimensional (3D) Pharmacophore
Modelling-Based Drug Designing by
Computational Technique

Abstract Three-dimensional (3D) pharmacophore modelling is a modern approach used
to elucidate the intermolecular interactions of ligands with the target of interest.
In the past few years, pharmacophore models built from chemical features have proved
intuitively understandable and have been widely and successfully employed by
researchers in computational drug discovery. The performance and utility of
pharmacophore modelling are determined by two major factors: (i) the definition and
placement of pharmacophoric features and (ii) the alignment techniques used to
overlay 3D pharmacophore models and small molecules. This chapter provides a brief
account of recent technologies and models used in pharmacophore-based drug design.

Keywords 3D pharmacophore · Modelling · Pharmacophore models ·
Computational drug discovery

4.1 Introduction

Pharmacophore modelling is a simple technique that produces results that are
intuitive to an experienced medicinal chemist. It rigidly models the different
interactions that could occur between a ligand and its binding site in a specific
binding situation at the target of interest [1]. These interactions are described as
chemical features, and their three-dimensional (3D) spatial arrangement is derived
by algorithms that apply standard rules on chemical features. The resulting models,
known as 3D pharmacophores, can be used to search for similarities between binding
situations or between different molecules [1]. This standardization has both
advantages and disadvantages: (i) the rule-based design of chemical features
provides an ideal interface between medicinal chemistry and computer science,
because it allows the medicinal or computational chemist to add intentional and
necessary bias to the still imperfect representation of molecules in computers;
(ii) heuristic modelling is not a systematic approach: important interactions
may not be well represented in a specific chemical feature model, increasing the
© The Author(s) 2018 27


A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_4

likelihood of important information loss in the resulting 3D pharmacophore. As a
result, estimating the binding energy contributions of particular chemical features
is practically impossible. A pharmacophore model describing the interactions between
a ligand and the target of interest can be derived either in a structure-based
manner, by determining the complementarities between a ligand and its binding site
on the target, or in a ligand-based manner, by flexibly overlaying a set of active
molecules and determining the conformations that overlap geometrically in the
maximum number of important chemical features. The ligand-based method inherently
includes the flexible alignment of molecules, which can be done by considering only
atom contributions or by other methods that are unrelated to 3D pharmacophore
representations. Alternatively, all the chemical features of a molecule, together
with their geometric information, can be used as input for the flexible alignment.
The alignment itself is the most computationally expensive and algorithmically
challenging part. Some pharmacophore alignment approaches use algorithms whose
computing time grows exponentially with the number of chemical features involved
[2]. This limits their scalability and applicability, even for small molecules, as
soon as the describing chemical feature set contains a larger number of features. In
practice, on current hardware, these approaches are limited to small molecules and
simple chemical feature descriptions. Other approaches use algorithms that do not
cover the whole search space but deliver a single optimal solution and can therefore
be solved in polynomial time. Such common-feature-based approaches are more
flexible, since they can use more feature definitions and even place multiple
features on the same atom group, and they are more scalable, which allows their
application to larger and more feature-rich molecules such as peptides [3, 4]. The
challenge of molecular superpositioning (3D alignment) includes the problem of
conformational flexibility, which can be addressed either by pre-enumerating a
general-purpose conformational model or by changing the molecule coordinates as
required by the alignment algorithm. Both options have advantages and disadvantages:
pre-enumerating conformations requires less computational time during the alignment
process, at the cost of not obtaining a ‘tailored’ conformational model for the
specific problem. Current conformational model generators, however, seem to perform
sufficiently well in this respect [5, 6]. If a conformational model is
pre-generated, pattern-matching techniques can be applied to geometric pharmacophore
patterns, which brings significant performance and practical advantages in the
alignment step mentioned above. In general, this also applies to the various 3D
substructure searching and virtual screening techniques, and pre-generated
conformational models are therefore commonly used to meet the need for faster search
times.
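
Pattern matching against pre-enumerated conformations can be sketched as follows.
This minimal Python example assumes that a pharmacophore query and each conformer
are given as lists of (feature type, coordinates) pairs and that a match only
requires all inter-feature distances to agree within a tolerance; the data, the
tolerance and the function names are hypothetical. The brute-force assignment over
permutations also illustrates why exhaustive approaches scale poorly with the number
of features.

```python
from itertools import permutations
from math import dist


def matches(query, conformer, tol=1.0):
    """True if some type-consistent assignment reproduces all query distances."""
    n = len(query)
    for candidate in permutations(conformer, n):
        if any(q[0] != c[0] for q, c in zip(query, candidate)):
            continue  # feature types must agree position by position
        if all(
            abs(dist(query[i][1], query[j][1]) - dist(candidate[i][1], candidate[j][1])) <= tol
            for i in range(n)
            for j in range(i + 1, n)
        ):
            return True
    return False


def screen(query, conformers, tol=1.0):
    """Indices of the pre-generated conformers that match the query."""
    return [i for i, conformer in enumerate(conformers) if matches(query, conformer, tol)]


# Hypothetical three-point query (coordinates in angstroms).
query = [
    ("donor", (0.0, 0.0, 0.0)),
    ("acceptor", (3.0, 0.0, 0.0)),
    ("hydrophobic", (0.0, 4.0, 0.0)),
]

# Two pre-enumerated conformers of the same hypothetical molecule.
conformers = [
    [("donor", (1.0, 1.0, 1.0)), ("acceptor", (1.0, 1.0, 4.2)), ("hydrophobic", (1.0, 5.0, 1.0))],
    [("donor", (0.0, 0.0, 0.0)), ("acceptor", (1.5, 0.0, 0.0)), ("hydrophobic", (0.0, 2.0, 0.0))],
]

hits = screen(query, conformers, tol=0.5)
print(hits)
```

Distance matching of this kind is independent of how each conformer is positioned
in space, but it cannot distinguish a feature arrangement from its mirror image, so
a real implementation would follow it with an explicit 3D superposition.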

4.1.1 Pharmacophore Model

A pharmacophore model of the target binding site summarizes the steric and
electronic features required for the optimal interaction of a ligand with the target
of interest. The properties most frequently used to define pharmacophores are
hydrogen bond acceptors, hydrogen bond donors, basic groups, acidic groups, partial
charge, aliphatic hydrophobic moieties and aromatic moieties. Pharmacophore features
have been used in drug discovery for virtual screening, de novo design and lead
optimization [7]. A pharmacophore model of the target binding site can be used to
screen a compound library for putative hits. Apart from querying databases for
active compounds, pharmacophore models can also be used by de novo design algorithms
to guide the design of new compounds. Structure-based pharmacophore techniques rely
on an analysis of the target binding site or of a target–ligand complex structure.
LigandScout [8] uses protein–ligand complex data to map the interactions between
ligand and target. A knowledge-based rule set derived from the PDB is used to detect
the interactions automatically and classify them into hydrogen bond interactions,
charge transfers and lipophilic regions [8]. The Pocket v.2 algorithm [9] can
automatically develop a pharmacophore model from a target–ligand complex. The
algorithm creates regularly spaced grids around the ligand and the surrounding
residues. Probe atoms that represent a hydrogen bond donor, a hydrogen bond acceptor
and a hydrophobic group are used to scan the grids. An empirical scoring function,
SCORE, is used to describe the binding constant between the probe atoms and the
target.
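
The idea of scanning a regular grid with probe atoms can be illustrated with a short
Python sketch. It is not the Pocket v.2 algorithm and does not implement SCORE: the
grid spacing, the padding, the distance window and the simple complementarity rule
are assumptions chosen only to show the mechanics of building a grid around a ligand
and flagging the grid points where a probe would find a complementary target atom.

```python
from math import dist

# Which target atom type each probe is looking for (illustrative pairing).
COMPLEMENT = {"donor": "acceptor", "acceptor": "donor", "hydrophobic": "hydrophobic"}


def build_grid(atoms, spacing=1.0, padding=1.0):
    """Regularly spaced grid points covering the atoms' bounding box plus padding."""
    lows = [min(a[i] for a in atoms) - padding for i in range(3)]
    highs = [max(a[i] for a in atoms) + padding for i in range(3)]
    counts = [int((highs[i] - lows[i]) / spacing + 1e-9) + 1 for i in range(3)]
    return [
        (lows[0] + ix * spacing, lows[1] + iy * spacing, lows[2] + iz * spacing)
        for ix in range(counts[0])
        for iy in range(counts[1])
        for iz in range(counts[2])
    ]


def scan(grid, target_atoms, probe, window=(2.5, 3.5)):
    """Grid points where the probe sees a complementary target atom in the window."""
    wanted = COMPLEMENT[probe]
    low, high = window
    return [
        point
        for point in grid
        if any(kind == wanted and low <= dist(point, xyz) <= high
               for kind, xyz in target_atoms)
    ]


# Hypothetical one-atom ligand and one-atom target (coordinates in angstroms).
ligand_atoms = [(3.0, 0.0, 0.0)]
target_atoms = [("acceptor", (0.0, 0.0, 0.0))]

grid = build_grid(ligand_atoms)
donor_sites = scan(grid, target_atoms, "donor")
print(len(grid), len(donor_sites))
```

In a real workflow, each flagged point would be scored with an empirical function
and neighbouring high-scoring points would be clustered into pharmacophore centres.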

4.1.1.1 Virtual Screening Using a Pharmacophore Model

These pharmacophore models represent the binding modes of steroidal compounds and
of small hybrid compounds containing a steroidal component and adenosine,
respectively. The 1I5R-based pharmacophore model was used to screen the NCI and
SPECS databases for new inhibitors using CATALYST. The best-scoring hit compounds
were docked into the binding pocket of 1EQU using GOLD, and the final selection for
in vitro testing was based on the best fit value, visual inspection of the predicted
docking poses and the ChemScore scoring function value in GOLD. Four out of 14
compounds tested in vitro showed an IC50 value of less than 50 µM, with the most
potent being 5.7 µM [10].

4.1.1.2 Multitarget Inhibitors Using Common Pharmacophore Models

Wei et al. (2008) used Pocket v.2 to identify common pharmacophores for two targets
involved in inflammatory signalling: human leukotriene A4 hydrolase (LTA4H-h) and
non-pancreatic secretory phospholipase A2 (PLA2). The co-crystal structure (PDB code
1HS6) of LTA4H-h with 2-(3-amino-2-hydroxy-4-phenylbutyrylamino)-4-methylpentanoic
acid (bestatin) and the structure (PDB code 1DB4) of PLA2 with
[3-(1-benzyl-3-carbamoylmethyl-2-methyl-1H-indol-5-yloxy)propyl]phosphonic acid
(indole 8) were used to derive the pharmacophores of the two targets. For LTA4H-h,
six pharmacophore centres were identified: four hydrophobic centres, one hydrogen
bond acceptor and one zinc coordination pharmacophore. Within the binding pocket of
PLA2, three hydrophobic centres, one hydrogen bond acceptor and calcium ion
coordination centres were identified [11]. Comparison of the two sets of
pharmacophore models revealed that two hydrophobic pharmacophores and the
pharmacophore coordinating the metal were common to both proteins. The authors
proposed that compounds satisfying the common pharmacophores would inhibit both
proteins. The MDL compound database was screened virtually against LTA4H-h and PLA2
using Dock4.0, and the binding conformations of the top 150,000 compounds (60% of
the database, ranked by Dock score) were extracted and checked for conformity with
the common pharmacophores. The best inhibitor, compound 10, inhibited LTA4H-h in the
submicromolar range and PLA2 with an IC50 value of 7.3 µM.

4.1.1.3 Dynamic Pharmacophore Models That Account for Protein Flexibility

The overexpression of the murine double minute 2 oncoprotein (MDM2), which
inhibits the p53 tumour suppressor, is responsible for approximately one-half
of all human cancers. Reactivating p53 by disrupting the p53–MDM2
interaction has been shown to be a novel approach for enhancing cancer
cell death [12]. The linking strategy is comparable, in that several
small fragments are docked into adjacent binding pockets of the target. Subsequently,
the fragments are connected to one another to form a single compound [12].
This method is a computational version of the well-known structure–activity
relationship (SAR) by NMR method introduced by Shuker et al. [13]. Several techniques have
been developed that can be applied to both ligand growing and ligand linking
at a given target. LigBuilder [14] builds ligands in a step-by-step
manner from a library of fragments. The design process can be carried out
by different operations such as ligand growing and linking, while the construction is
controlled by a genetic algorithm. The binding affinity of the target–ligand complex is
evaluated using an empirical scoring function. The program first reads
the target protein and analyses the binding pocket. Depending on
the user's choice, it then applies either a growing or a
linking strategy. In the growing strategy, a seed structure is placed inside
the binding pocket, after which the program replaces the user-defined growing
sites with candidate fragments. This yields a new seed structure
that can then be used in further rounds of growth. In the linking
strategy, several fragments placed at different sites of the target protein serve as the seed
structure. The growing procedure takes place simultaneously on each fragment.

References

1. G. Wolber, T. Seidel, F. Bendix, T. Langer, Molecule-pharmacophore superpositioning and
pattern matching in computational drug design. Drug Discov. Today 13, 23–29 (2008)
2. Y. Patel, V.J. Gillet, G. Bravi, A.R. Leach, A comparison of the pharmacophore identification
programs: catalyst, DISCO and GASP. J. Comput. Aided Mol. Des. 16, 653–681 (2002)
3. G. Wolber, A.A. Dornhofer, T. Langer, Efficient overlay of small organic molecules using 3D
pharmacophores. J. Comput. Aided Mol. Des. 20, 773–788 (2006)
4. G. Wolber, R. Kosara, Pharmacophores from macromolecular complexes with LigandScout.
Pharmacophores Pharmacophore Searches 32, 131–150 (2006)
5. J. Kirchmair, C. Laggner, G. Wolber, T. Langer, Comparative analysis of protein-bound lig-
and conformations with respect to catalyst’s conformational space subsampling algorithms. J.
Chem. Inf. Model. 45, 422–430 (2005)
6. J. Kirchmair, G. Wolber, C. Laggner, T. Langer, Comparative performance assessment of the
conformational model generators omega and catalyst: a large-scale survey on the retrieval of
protein-bound ligand conformations. J. Chem. Inf. Model. 46, 1848–1861 (2006)
7. S.-Y. Yang, Pharmacophore modeling and applications in drug discovery: challenges and recent
advances. Drug Discov. Today 15, 444–450 (2010)
8. G. Wolber, T. Langer, LigandScout: 3-D pharmacophores derived from protein-bound ligands
and their use as virtual screening filters. J. Chem. Inf. Model. 45, 160–169 (2005)
9. D.S.H. Chan, H.M. Lee, F. Yang, C.M. Che, C.C. Wong, R. Abagyan, C.H. Leung, D.L. Ma,
Structure-based discovery of natural-product-like TNF-α inhibitors. Angew. Chem. Int. Ed. 49,
2860–2864 (2010)
10. M. Brvar, A. Perdih, M. Oblak, L.P. Mašič, T. Solmajer, In silico discovery of 2-amino-4-(2,
4-dihydroxyphenyl) thiazoles as novel inhibitors of DNA gyrase B. Bioorg. Med. Chem. Lett.
20, 958–962 (2010)
11. D. Wei, X. Jiang, L. Zhou, J. Chen, Z. Chen, C. He, K. Yang, Y. Liu, J. Pei, L. Lai, Discov-
ery of multitarget inhibitors by combining molecular docking with common pharmacophore
matching. J. Med. Chem. 51, 7882–7888 (2008)
12. A.L. Bowman, Z. Nikolovska-Coleska, H. Zhong, S. Wang, H.A. Carlson, Small molecule
inhibitors of the MDM2-p53 interaction discovered by ensemble-based receptor models. J.
Am. Chem. Soc. 129, 12809–12814 (2007)
13. S.B. Shuker, P.J. Hajduk, R.P. Meadows, S.W. Fesik, Discovering high-affinity ligands for
proteins: SAR by NMR. Science, 274(5292), 1531–1534 (1996)
14. R. Wang, X. Fang, Y. Lu, S. Wang, The PDBbind database: collection of binding affinities
for protein–ligand complexes with known three-dimensional structures. J. Med. Chem. 47,
2977–2980 (2004)
Chapter 5
Molecular Dynamics Simulation
Approach to Investigate Dynamic
Behaviour of System Through
the Application of Newtonian Mechanics

Abstract Molecular dynamics simulations have been successfully incorporated into a
variety of pharmaceutical research programs and have evolved into a mature technique
for studying complex biological and chemical systems. Broadly used in modern
drug design, molecular docking methods can be used effectively to understand
macromolecular structure-to-function relationships and the ligand conformations
adopted within the binding sites of macromolecular targets. Information about
the dynamic properties of ligand–receptor binding, such as the free energy, can be
gathered by evaluating the critical phenomena involved in the intermolecular recognition
process. These results can be employed to shift the usual paradigm of structural
bioinformatics from studying single structures to analysing conformational ensembles. Today,
as a variety of docking algorithms are available, an understanding of the advantages
and limitations of each method is of fundamental importance in the development of
effective strategies and the generation of relevant results. The purpose of this chapter
is to examine the current molecular docking strategies used in drug discovery and
medicinal chemistry, exploring the advancements in the field and the role played by
the integration of structure- and ligand-based methods.

Keywords Molecular dynamics simulations · Ligand–receptor binding ·
Structural bioinformatics · Docking algorithms · Drug discovery

5.1 Introduction

The term molecular mechanics (MM) refers to the use of simple potential energy
functions (e.g. harmonic oscillator or Coulombic potentials) to model molecular
systems. Molecular mechanics approaches are widely applied in molecular structure
refinement, molecular dynamics (MD) simulations, Monte Carlo (MC) simulations
and ligand-docking simulations [1].
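As a minimal sketch of what such simple potential energy functions look like, the
following Python snippet evaluates a harmonic bond term and a Coulombic term. The
functional forms are the textbook ones; the 0.5 prefactor convention, the force
constant, the reference length, the charges and the conversion factor are illustrative
assumptions and are not taken from any specific force field.

```python
# Coulomb conversion factor in kcal*Angstrom/(mol*e^2); a commonly quoted value,
# used here only for illustration.
COULOMB_CONSTANT = 332.0636


def harmonic_bond_energy(r, r0, k):
    """Harmonic-oscillator bond term, E = 0.5 * k * (r - r0)**2."""
    return 0.5 * k * (r - r0) ** 2


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Coulombic interaction between two point charges (units of e) at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


# Hypothetical parameters chosen only to exercise the two functions.
bond = harmonic_bond_energy(r=1.10, r0=1.00, k=600.0)
elec = coulomb_energy(q1=0.4, q2=-0.4, r=3.0)
print(f"bond term: {bond:.3f} kcal/mol, Coulomb term: {elec:.3f} kcal/mol")
```

A full molecular mechanics energy would sum many such terms (bonds, angles, torsions,
van der Waals and electrostatics) over all atoms of the system.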
Dynamic simulation methods are widely used to obtain information on the
time evolution of the conformations of proteins and other biological macromolecules
[2, 3], as well as kinetic and thermodynamic information [1]. Simulations can provide
fine details concerning the motions of individual particles as a function of time.

© The Author(s) 2018 33


A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_5
They can be utilized to quantify the properties of a system with precision and on a
timescale that is otherwise inaccessible, and simulation is, therefore, a valuable tool
in extending our understanding of model systems. Theoretical consideration of a
system additionally allows one to investigate the specific contributions to a property
through ‘computational alchemy’, that is, by modifying the simulation in such a way
that it is nonphysical but nonetheless allows a model’s characteristics to be probed.
One example is the artificial conversion of the energy function from that of one system to
that of another during a simulation. This is an important technique in free energy calculations
[4]. Thus, molecular dynamics simulations, along with a range of complementary
computational approaches, have become valuable tools for investigating the basis of
protein structure and function.

5.2 Molecular Dynamics Simulations

MD methods were originally conceived within the theoretical physics community
during the 1950s [1]. In 1957, Alder and Wainwright [5] performed the earliest MD
simulation using the so-called hard-sphere model, in which atoms interacted only
through perfect collisions. Rahman [6] subsequently applied a smooth, continuous
potential to mimic real atomic interactions. During the 1970s, as computers became
more widespread, MD simulations were developed for more complex systems,
culminating in 1976 with the first simulation of a protein [7] using an empirical energy
function constructed using physics-based first-principle assumptions. MD simulations
are now widely and routinely applied and are especially popular in the fields of
materials science and biophysics [1].
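At its core, an MD simulation with a smooth, continuous potential advances the atoms by
numerically integrating Newton's equations of motion. The sketch below illustrates this
for a single particle on a one-dimensional spring. The velocity Verlet integrator, the
spring constant, the mass and the time step are assumptions chosen for illustration; the
text does not prescribe a particular integration scheme.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * a = F(x) with the velocity Verlet scheme; returns [(x, v), ...]."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


K_SPRING = 1.0  # hypothetical force constant


def spring_force(x):
    """Force derived from the harmonic potential U(x) = 0.5 * K_SPRING * x**2."""
    return -K_SPRING * x


def total_energy(x, v, mass=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * K_SPRING * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, n_steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
print(f"energy drift over the run: {e_end - e_start:.2e}")
```

The near-constant total energy is the usual sanity check for such an integrator; in a
biomolecular simulation the force would come from a full empirical energy function.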
Molecular dynamics (MD) simulation calculates the trajectory of a
system through the application of Newtonian mechanics. Nevertheless, standard MD
techniques depend strongly on the starting conformation, which limits their reliability for the
simulation of ligand–target interactions. By its nature, MD struggles to cross
high-energy barriers within the lifetime of a simulation and is therefore not efficient at
traversing the rugged hypersurface of protein–ligand interactions. Techniques such as
simulated annealing have been proposed to make more efficient use of
MD in docking. Mangoni et al. (1999) described an MD protocol for docking small
flexible ligands to flexible targets in water [8]. They separated the
centre-of-mass motion of the ligand from its internal and rotational motions and
coupled these motions to different temperature baths, enabling separate
control of each type of motion. Appropriate values of the temperatures and
coupling constants allowed a flexible or rigid ligand and/or receptor [8]. The
McCammon group developed a 'relaxed-complex' method that explores binding
conformations that may occur only rarely in the unbound target protein. Ligands are
docked into target conformations taken as snapshots at various
time intervals of an MD run. This relaxed-complex technique was
used to identify novel modes of inhibition for HIV integrase and contributed
to the discovery of the first clinically approved HIV integrase
inhibitor, raltegravir. This MD technique has also been used in other
studies to identify inhibitors of targets of interest [9]. Metadynamics is
an MD-based method for predicting and scoring ligand binding. This
technique maps the entire free energy landscape in an accelerated
manner because it keeps track of the history of already sampled regions. During the
MD simulation of a protein–ligand complex, a repulsive Gaussian potential is added
to explored regions, steering the simulation towards new regions of the free energy
landscape of the protein–ligand complex. Millisecond-timescale MD simulations are now
feasible with special-purpose machines like Anton. Such long simulations permit the study of
drug binding events at the target protein. Anton has been used
successfully to simulate protein folding at full atomic resolution. Improvements in
computer hardware mean that protein flexibility can be routinely
accessed on longer timescales. This will provide more extensive information on
conformational flexibility.
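The history-dependent Gaussian bias used in metadynamics can be sketched in a few
lines. The class below is a hypothetical, one-dimensional illustration: the collective
variable `s`, the hill height and the hill width are assumed values, and no MD engine is
attached. It only shows how deposited hills raise the energy of visited regions and
push the system away from them.

```python
import math


class GaussianBias:
    """History-dependent bias: a sum of Gaussian hills placed at visited values of s."""

    def __init__(self, height, width):
        self.height = height
        self.width = width
        self.centers = []

    def deposit(self, s):
        """Record a visited value of the collective variable."""
        self.centers.append(s)

    def potential(self, s):
        """Bias energy at s from all hills deposited so far."""
        return sum(
            self.height * math.exp(-((s - c) ** 2) / (2.0 * self.width ** 2))
            for c in self.centers
        )

    def force(self, s):
        """Bias force -dV/ds; it points away from previously visited values."""
        return sum(
            self.height * (s - c) / self.width ** 2
            * math.exp(-((s - c) ** 2) / (2.0 * self.width ** 2))
            for c in self.centers
        )


# Hypothetical walker that has lingered around s = -1.0.
bias = GaussianBias(height=0.5, width=0.2)
for visited in (-1.0, -0.95, -1.05):
    bias.deposit(visited)
print(bias.potential(-1.0), bias.potential(0.0), bias.force(-0.9))
```

In an actual metadynamics run this bias force would be added to the physical forces at
every step, and the accumulated hills would later be used to estimate the free energy
surface.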

5.3 Monte Carlo Search with Metropolis Criterion

Stochastic algorithms make random changes either to the ligand being
docked or to the target binding site. These random changes may
be translational or rotational in the case of the ligand, or random
conformational changes assigned to the residue side chains in the binding site of the
selected target. Whether a step of the stochastic search is accepted or rejected is decided by
the Metropolis criterion. This criterion always accepts steps that lower the overall energy
and occasionally accepts steps that raise the energy, making it
possible to escape from local energy minima. The probability of accepting
an uphill step decreases with increasing energy gap and depends on
the 'temperature' of the Monte Carlo with Metropolis (MCM) simulation. MCM simulations have
been used in flexible docking programs such as MCDOCK [10]. MCM
samples conformational space faster than molecular dynamics because it requires only
evaluation of the energy function rather than the derivative of the energy function. A Monte
Carlo-based energy minimization scheme reduces the number of conformations
that must be sampled and can provide a faster search than that supplied by
molecular mechanics force fields.
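The acceptance rule described above can be written down compactly. The following sketch
applies the Metropolis criterion to a toy one-dimensional "pose energy"; the energy
function, step size, temperature and random seed are hypothetical and stand in for the
scoring function and pose perturbations of a real docking program.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def acceptance_probability(delta_e, temperature):
    """Metropolis probability of accepting a move with energy change delta_e."""
    if delta_e <= 0.0:
        return 1.0  # downhill moves are always accepted
    return math.exp(-delta_e / (K_B * temperature))


def metropolis_accept(delta_e, temperature, rng=random):
    """Draw a uniform random number and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(delta_e, temperature)


def toy_energy(x):
    """Hypothetical double-well energy standing in for a docking score."""
    return (x * x - 1.0) ** 2


def monte_carlo_search(n_steps, temperature, step_size=0.5, seed=0):
    """Random-walk search that keeps or discards each trial move by the Metropolis rule."""
    rng = random.Random(seed)
    x = -1.0
    energy = toy_energy(x)
    accepted = 0
    for _ in range(n_steps):
        trial = x + rng.uniform(-step_size, step_size)
        trial_energy = toy_energy(trial)
        if metropolis_accept(trial_energy - energy, temperature, rng):
            x, energy = trial, trial_energy
            accepted += 1
    return x, energy, accepted


x_final, e_final, n_accepted = monte_carlo_search(n_steps=200, temperature=300.0)
print(x_final, e_final, n_accepted)
```

Note that only energies are evaluated; no forces (derivatives) are needed, which is the
practical advantage over MD mentioned in the text.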
ROSETTALIGAND incorporates side-chain and ligand flexibility in a high-resolution
refinement step through Monte Carlo-based sampling of torsion angles.
All torsion angles of the protein and ligand are then optimized
through gradient-based minimization, mimicking an induced-fit
situation. MCDOCK uses two stages of docking followed by a final energy
minimization step to produce the target–ligand structure. In the first docking stage,
the ligand and docking site are held rigid while the ligand is placed into the
binding site of the target. Scoring is performed entirely on the basis of short
contacts. This permits identification of poses free of steric clashes that are passed
on to the next stage, in which energy-based Metropolis sampling is performed to sample the
binding pocket [10]. QXP optimizes the grid map energy and the internal ligand energy when
searching for the ligand–target structure. The algorithm carries out a rigid-body
alignment of the ligand–target complex followed by MCM translation and rotation
of the ligand. This step is followed by another rigid-body alignment,
and scoring uses the energy grid map. The relative positions of the ligand
and target molecule make up the internal variables of the method.
Internal variables are subjected to random change followed by local energy
minimization and selection by the Metropolis criterion. ICM performed satisfactorily in
generating protein–ligand complexes for 68 diverse, high-resolution X-ray complexes
present in DUD.

References

1. S.A. Adcock, J.A. McCammon, Molecular dynamics: survey of methods for simulating the
activity of proteins. Chem. Rev. 106, 1589–1615 (2006)
2. T.E. Cheatham III, P.A. Kollman, Molecular dynamics simulation of nucleic acids. Annu. Rev.
Phys. Chem. 51, 435–471 (2000)
3. M. Karplus, J.A. McCammon, Molecular dynamics simulations of biomolecules. Nat. Struct.
Mol. Biol. 9, 646–652 (2002)
4. T. Simonson, G. Archontis, M. Karplus, Protein–ligand recognition: free energy simulations
come of age. Acc. Chem. Res. 35, 430–437 (2002)
5. H. Longuet-Higgins, B. Widom, A rigid sphere model for the melting of argon. Mol. Phys. 8,
549–556 (1964)
6. A. Rahman, J. Chem. Phys. 45, 2585 (1966); A. Rahman, Phys. Rev. 136, 405 (1964)
7. J.A. McCammon, B.R. Gelin, M. Karplus, Dynamics of folded proteins. Nature 267, 585–590
(1977)
8. M. Mangoni, D. Roccatano, A. Di Nola, Docking of flexible ligands to flexible receptors in
solution by molecular dynamics simulation. Proteins: Struct. Funct. Bioinform. 35, 153–162
(1999)
9. M.R. Landon, R.E. Amaro, R. Baron, C.H. Ngan, D. Ozonoff, J. Andrew McCammon, S. Vajda,
Novel druggable hot spots in avian influenza neuraminidase H5N1 revealed by computational
solvent mapping of a reduced and representative receptor ensemble. Chem. Biol. Drug Des.
71, 106–116 (2008)
10. M. Liu, S. Wang, MCDOCK: a Monte Carlo simulation approach to the molecular docking
problem. J. Comput. Aided Mol. Des. 13, 435–451 (1999)
Chapter 6
Receptor Thermodynamics
of Ligand–Receptor or Ligand–Enzyme
Association

Abstract Experimental techniques that directly assess the thermodynamics of
ligand–receptor or ligand–enzyme association, such as isothermal titration calorimetry,
have been improved in recent years and can provide thermodynamic details of the
binding process. Parallel to the continuous increase in computational power, several
classes of computational methods have been developed that can be used to gain
more detailed insight into the binding mode and affinity of compounds (drugs) for their
(off-)targets. Such methods are associated with a qualitative and/or quantitative assessment
of binding free energies, and trade off speed against physical accuracy in different ways.
With the current wealth of available three-dimensional structures of proteins and
their complexes with ligands, structure-based drug design studies can be used to
identify the key ligand interactions, and free energy calculations can quantify
the thermodynamics of binding between a ligand and the target of interest.

Keywords Ligand–receptor · Ligand–receptor binding · Thermodynamics ·
Enzyme association

6.1 Introduction

The aim of both qualitative and quantitative approaches is to determine or predict the
mode of binding, the selectivity and the binding free energy associated with
protein–ligand interactions. These computational methods can be efficiently employed
to assess the factors determining the binding process, such as specific interactions
contributing to protein–ligand recognition. Whether treated qualitatively or
quantitatively, binding free energies (and the current challenges associated with their
calculation) form the common underlying theme.
In this chapter, a general distinction is made that divides the various computational
methods into two categories: (1) a structure-based, qualitative assessment of the
protein–ligand interactions governing the binding process; and (2) a quantitative assessment
of the binding affinity of ligands for protein targets. Note that this is a
simplification and that overlap may occur between the two categories. Contributions of
representative computational methods (docking, quantitative structure–activity relationship

© The Author(s) 2018 37


A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_6
[Figure 6.1: workflow diagram with panels A (Receptor Based Virtual Screening), B (Validation: Blind Docking), C (Validation: Induced Fit Docking), D (Validation: Pharmacophore Modeling), E (Validation: Systems Biology) and F (MD Simulations), leading from screened compounds to new lead compounds]

Fig. 6.1 Overview of the workflow of computer-aided drug design

(QSAR) and free energy calculation) to publications in the field of computational
chemistry are shown in panel A of Fig. 6.1. Free energy methods are further subdivided in
panel B (Fig. 6.1) into free energy perturbation (FEP), linear interaction energy theory (LIE),
molecular-mechanics Poisson–Boltzmann/generalized Born solvent accessible surface
area methods (MM-PBSA), one-step perturbation (OSP) and thermodynamic
integration (TI). QSAR and automated docking studies are the most commonly used
virtual screening methods in computational drug design. The former relates the
physicochemical properties of compounds to their biological activity for datasets of
potential receptor or enzyme agonists/antagonists, whereas the latter predicts their binding
modes and scores their affinities.
Molecular dynamics (MD) simulations provide detailed insight into molecular
movements, accounting for a greater number of relevant molecular conformations
by more extensive sampling, and, if carried out with an explicit representation of the
solvent, allow for a better description of solvation effects. MD simulations commonly
form the basis of methods for free energy calculation, such as FEP, LIE, OSP or TI,
which quantify ligand–protein binding affinities.
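To illustrate how an MD trajectory feeds into one of these methods, the sketch below
applies the exponential-averaging (Zwanzig) estimator commonly associated with FEP,
ΔG = −kT ln⟨exp(−ΔU/kT)⟩. The energy differences ΔU would normally be collected from
snapshots of a simulation of the reference state; the values used here are hypothetical
numbers, and the function name is an assumption made for this example.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def fep_free_energy(delta_u_samples, temperature=298.15):
    """Exponential-averaging estimate of the free energy difference between two states.

    delta_u_samples holds U_target - U_reference (kcal/mol) evaluated on
    configurations sampled in the reference state.
    """
    kt = K_B * temperature
    boltzmann_factors = [math.exp(-du / kt) for du in delta_u_samples]
    average = sum(boltzmann_factors) / len(boltzmann_factors)
    return -kt * math.log(average)


# Hypothetical energy differences from a handful of snapshots.
samples = [0.5, 1.0, 1.5]
print(f"estimated free energy difference: {fep_free_energy(samples):.3f} kcal/mol")
```

Because the exponential average is dominated by low-ΔU configurations, the estimate lies
below the plain average of the samples; in practice many more snapshots and intermediate
states are needed for a converged result.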
The concept that therapeutic agents produce their selective action in modifying
disease symptoms by acting as 'magic bullets' at discrete molecular targets within the
body is generally attributed to Paul Ehrlich at the turn of the twentieth century,
as part of the seminal 'lock and key' hypothesis. This hypothesis describes drugs
as receptor ligands or enzyme substrates that selectively modulate the function
of unknown molecular targets to produce beneficial effects. Receptor theory
builds, to a very large extent, on the classical enzyme kinetic model based on the law
of mass action and derived by Michaelis and Menten in 1913 [1]. The interaction
between a receptor and a ligand can be represented as

\[ \mathrm{Receptor} + \mathrm{Ligand} \rightleftharpoons [\mathrm{RL}] \rightarrow \mathrm{R} + \text{Cellular Effect} \tag{6.1} \]

The ligand L binds to the receptor R and alters the nature of the receptor's
interaction with its associated membrane components, inducing a change in cellular
and, ultimately, tissue function. Ligands interacting with receptors have two intrinsic
properties: affinity and efficacy. Affinity is the ability to recognize and bind to the
receptor, while efficacy is the ability of the ligand to bring about a change in cellular
processes via activation of transmembrane transduction mechanisms involving G-protein
complexes or ion channels. In addition to the affinity of a receptor for its
ligand, the response to the ligand also depends on the number of receptors. An
additional ligand property is selectivity, defined as the degree to which the
ligand interacts with the target of interest in comparison with structurally related targets.
The degree of selectivity typically determines the side effect profile of a new
compound, provided that the targeted mechanism itself does not produce untoward effects
when stimulated beyond the therapeutic range. Ligands may be either agonists or
antagonists. Agonists have intrinsic efficacy, and their binding to the receptor leads to
activation of intracellular components involved in the physiological or pharmacological
responsiveness of the cell or tissue. This efficacy may be manifested by changes in
the activity of an enzyme such as adenylate cyclase or by an alteration in the contractile
response of an isolated, intact tissue preparation. Antagonists, in contrast, bind to the
receptor and block the interaction of the agonist while producing no effect on the tissue
on their own. Antagonism can be of several types: competitive, non-competitive and
inverse [2]. Competitive antagonism is usually associated with ligands that interact
directly with the agonist binding site, i.e. the recognition element of the receptor.
Non-competitive or uncompetitive antagonists interact at sites distinct from the
agonist recognition site and can modulate agonist binding. A third class of ligand is
the inverse agonist. Ligands of this class interact with a defined recognition site on a
receptor and are able not only to block the effects of an agonist at the receptor but
also to produce effects opposite to those of the agonist to varying degrees. Hence,
a biological response is produced by the interaction of a drug with its biological
receptor. This selective binding and its extent are governed by the phenomenon of
molecular recognition. In molecular modelling, this process of molecular recognition
is simulated to understand the drug–receptor interaction; the binding of a ligand
to its receptor can be written as
\[ \mathrm{Ligand} + \mathrm{Receptor} \underset{L_2}{\overset{L_1}{\rightleftharpoons}} \text{L-R Complex} \rightarrow \text{Response} \tag{6.2} \]

The rate constant for association of the complex is L1, the rate constant for the
dissociation of the complex is L2, and the affinity or association constant (Las) can be
expressed as

\[ L_{as} = L_1 / L_2 \]

The thermodynamic parameters of interest for the above reactions are the standard
free energy (ΔG⁰), enthalpy (ΔH⁰) and entropy (ΔS⁰) of association. These parameters
are related through the Gibbs free energy equations,

\[ \Delta G^0 = -RT \ln L_{as} \tag{6.3} \]
\[ \Delta G^0 = \Delta H^0 - T \Delta S^0 \tag{6.4} \]
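As a worked illustration of Eqs. (6.3) and (6.4), the sketch below converts a pair of
rate constants into an association constant and a standard free energy, and then
separates out the entropic contribution for a given enthalpy. All numerical values are
hypothetical, and the association constant is assumed to be expressed relative to a
1 M standard state so that its logarithm is well defined.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)


def standard_free_energy(l_as, temperature=298.15):
    """Eq. (6.3): standard free energy of association from the association constant."""
    return -R * temperature * math.log(l_as)


def standard_entropy(delta_g0, delta_h0, temperature=298.15):
    """Eq. (6.4) rearranged for the entropy: dS0 = (dH0 - dG0) / T."""
    return (delta_h0 - delta_g0) / temperature


# Hypothetical rate constants for association (L1) and dissociation (L2).
l1 = 1.0e6
l2 = 1.0e-2
l_as = l1 / l2

delta_g0 = standard_free_energy(l_as)
delta_h0 = -30.0  # hypothetical binding enthalpy in kJ/mol, e.g. from calorimetry
delta_s0 = standard_entropy(delta_g0, delta_h0)
print(f"dG0 = {delta_g0:.2f} kJ/mol, dS0 = {1000.0 * delta_s0:.1f} J/(mol K)")
```

A larger association constant gives a more negative ΔG⁰, and the split between ΔH⁰ and
TΔS⁰ shows whether binding is mainly enthalpy or entropy driven.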

The most fundamental forces involved in the interaction of ligand and receptor
are covalent, reinforced ionic, ionic, ion–dipole, dipole–dipole, van der Waals
and hydrophobic forces. In molecular modelling, every effort is made to measure
the free energy of association (ΔG). Various computational chemistry methods and
assumptions are adopted to arrive at a measure of association [3].

6.2 Database Searching

The pharmacophores obtained from similarity analysis and 3D-QSAR analysis can
be used to search a database for compounds that possess the features defined
in the pharmacophores. Whereas QSAR focuses on a set of descriptors such as
electrostatic and thermodynamic properties, pharmacophore mapping is a geometric
approach. Various programs, such as UNITY, CATALYST, MENTHOR,
MACCS-3D and CAVEAT, convert these pharmacophores into search queries.
Commercially available databases include Comprehensive Medicinal Chemistry-3D
(CMC-3D), Fine Chemicals Directory-3D (FCD-3D), National Cancer Institute
(NCI), Maybridge, Derwent World Drug Index, BioByte, etc. These search queries
can be combined with the ORACLE program to perform a rational database search and
identify potential molecules with drug-like properties.
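Because pharmacophore mapping is geometric, a search query can be thought of as a set
of feature types together with allowed distances between them. The sketch below is a
hypothetical, simplified illustration of that idea and does not reproduce the query
format of any of the programs named above: the feature names, distance ranges and
coordinates are invented, and each molecule is represented by a single rigid conformer.

```python
import math
from itertools import permutations

# Hypothetical three-point query: feature types plus allowed distance ranges
# (in Angstrom) between feature pairs, indexed by position in QUERY_FEATURES.
QUERY_FEATURES = ["donor", "acceptor", "hydrophobe"]
QUERY_DISTANCES = {(0, 1): (2.5, 3.5), (0, 2): (4.0, 6.0), (1, 2): (4.5, 6.5)}


def matches_query(molecule):
    """Return True if some assignment of the molecule's features satisfies the query.

    `molecule` is a list of (feature_type, (x, y, z)) tuples for one conformer.
    """
    for combo in permutations(range(len(molecule)), len(QUERY_FEATURES)):
        types_ok = all(
            molecule[idx][0] == QUERY_FEATURES[pos] for pos, idx in enumerate(combo)
        )
        if not types_ok:
            continue
        distances_ok = all(
            low <= math.dist(molecule[combo[a]][1], molecule[combo[b]][1]) <= high
            for (a, b), (low, high) in QUERY_DISTANCES.items()
        )
        if distances_ok:
            return True
    return False


# Toy "database" with made-up feature coordinates.
DATABASE = {
    "mol_a": [("donor", (0.0, 0.0, 0.0)), ("acceptor", (3.0, 0.0, 0.0)),
              ("hydrophobe", (0.0, 5.0, 0.0))],
    "mol_b": [("donor", (0.0, 0.0, 0.0)), ("acceptor", (8.0, 0.0, 0.0)),
              ("hydrophobe", (0.0, 5.0, 0.0))],
}

hits = [name for name, features in DATABASE.items() if matches_query(features)]
print(hits)
```

A production search would additionally enumerate conformers and use indexed keys to
avoid testing every molecule exhaustively.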
6.2.1 De Novo Drug Design

With the increased understanding of drug–receptor theory and of the thermodynamics
of binding, it is now possible to design new molecules from scratch. This
methodology allows new types of molecules to be designed and, when coupled with
docking algorithms, provides a powerful tool for the discovery of new molecules. There
are various methods available for de novo design; however, the basic principle
involved in these methods is quite similar. Some of the widely accepted methods in
de novo design are Group Build, SMOG, MCSS and LeapFrog.
Computer-aided drug design, involving quantitative structure–activity
relationship (QSAR), pharmacophore generation and molecular modelling
methods, can be applied to design and develop new chemical entities (NCEs) as
anti-inflammatory agents. Such work also involves the synthesis of NCEs and determination
of their activity in an in vivo pharmacological model such as the carrageenan-induced rat paw
oedema model.

6.3 State-of-the-Art Free Energy Calculations

Together with the continuous increase in computer power and advances in related
areas of statistical mechanics and enhanced sampling techniques, binding free energy
calculations have become useful tools in drug design and in the rationalization of
biophysical experiments. This is also reflected in the relative increase in the
number of scientific reports on this topic over the past years. In structure-based drug
design, free energy calculations are often applied in the context of a thermodynamic
cycle approach involving so-called alchemical transformations between structurally
related compounds. This has proven to be a successful tool to guide drug
development. Since it is virtually infeasible to run molecular dynamics simulations
long enough to thoroughly capture the ligand–protein association/dissociation
equilibrium, direct calculation of the absolute free energy difference associated with ligand
binding (ΔGbind) mostly remains outside the range of computational chemistry.
Alternatively, the absolute binding free energy may be calculated from alchemical
approaches, in which the ligand is made to vanish from the protein active site and from
aqueous solution. Note that the term absolute binding free energy, commonly used in the field,
still refers to a free energy difference along the binding process. However, in the drug
development process, the main interest is typically to determine the affinities of a
series of potential drug candidates relative to each other. Therefore, the focus usually
lies on the calculation of relative binding free energies (ΔΔGbind) between (series
of) compounds or ligands. Thermodynamic cycles involving alchemical
transformations between two ligands (L1 and L2) are used to calculate ΔΔGbind for
a given protein target (P) in aqueous solution. The free energy is a thermodynamic
state function and is therefore a path-independent quantity. This means that the order of
Fig. 6.2 Standard thermodynamic cycle for relative binding free energy calculations. To compute
the relative binding free energy of two ligands (L1 and L2) for a given protein (P), L1 is alchemically
mutated to L2 both in aqueous solution and in the protein environment. As expressed in
the text, ΔΔGbind is derived by relating the difference between ΔG1 and ΔG2 to the difference
between ΔG3 and ΔG4

binding events does not matter and the computed free energy depends only on the
representation of the initial (ligand in solution) and final (ligand bound to protein) states
of the binding process.
Therefore, the free energy changes along the cycle in Fig. 6.2 sum to zero, so
that ΔΔGbind can be expressed as ΔΔGbind = ΔG2 − ΔG1 = ΔG4 − ΔG3. This
relates the free energy difference of the two horizontal branches (ΔG1 and ΔG2),
which represent the individual affinities of the ligands for the protein, to the free energy
difference of the vertical branches (ΔG3 and ΔG4), which correspond to the non-physical
alchemical transformation of L1 into L2 in the bound and free state, respectively. The
use of thermodynamic cycles is a standard approach for calculating relative binding
free energies. However, note that the thermodynamic cycle approach (and the calculation
of alchemical free energy differences) can also be applied to calculate the free energy
changes of (bio)chemical events other than ligand binding, such as
protein folding, solvation or conformational changes.
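The bookkeeping of the cycle can be sketched as follows. The branch labels follow the
text (ΔG1 and ΔG2 for the binding of each ligand, ΔG3 and ΔG4 for the alchemical
transformations); the numerical free energies are hypothetical values chosen so that the
cycle closes, and the function names are assumptions made for this example.

```python
import math

R_KCAL = 1.9872041e-3  # gas constant in kcal/(mol*K)


def ddg_from_binding(dg1, dg2):
    """Relative binding free energy from the two physical binding branches."""
    return dg2 - dg1


def ddg_from_alchemical(dg3, dg4):
    """Same quantity from the two alchemical branches, using the labels of the text."""
    return dg4 - dg3


def affinity_ratio(ddg, temperature=298.15):
    """Ratio of association constants of L2 and L1 implied by ddg."""
    return math.exp(-ddg / (R_KCAL * temperature))


# Hypothetical free energies in kcal/mol.
dg1, dg2 = -8.0, -9.5
dg3, dg4 = 12.0, 10.5

ddg_physical = ddg_from_binding(dg1, dg2)
ddg_alchemical = ddg_from_alchemical(dg3, dg4)
print(ddg_physical, ddg_alchemical, round(affinity_ratio(ddg_alchemical), 1))
```

In practice only the two alchemical branches are simulated, and the agreement with the
physical branches is what the state-function property of the free energy guarantees.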
Ultimately, the challenge lies in the development of more robust and efficient free
energy calculations to reduce the computational cost and thus, makes this approach
more feasible for the large-scale industrial applications.

References

1. K.A. Johnson, R.S. Goody, The original Michaelis constant: translation of the 1913 Michaelis-
Menten paper. Biochemistry 50, 8264–8269 (2011)
2. T. Albers, tures?, in Protein Structure, Folding and Design: GENEX-UCLA Symposium, Vol.
39, ed. by D. L. Oxender (Allan R. Liss, New York, pp. 283–289) Alt, J, vol. 113, p. 125
3. J.K. Seydel, Sulfonamides, structure-activity relationship, and mode of action. Structural prob-
lems of the antibacterial action of 4-aminobenzoic acid (PABA) antagonists. J. Pharm. Sci. 57,
1455–1478 (1968)
Chapter 7
Thermodynamic Cycles and Their
Application in Protein Targets

Abstract A key part of drug design and development is the optimization of molecu-
lar interactions between an engineered drug candidate and its binding target. Thermo-
dynamic characterization provides information about the balance of energetic forces
driving binding interactions and is essential for understanding and optimizing molec-
ular interactions. Comprehensive thermodynamic evaluation is vital in the drug devel-
opment process to speed drug development towards an optimal energetic interaction
profile while retaining good pharmacological properties. Practical thermodynamic
approaches, such as enthalpic optimization, thermodynamic optimization plots and
the enthalpic efficiency index, have now been developed and have proven useful in the
design process. Improved throughput in calorimetric methods remains essential for
even greater integration of thermodynamics into drug design.

Keywords Thermodynamic characterization · Pharmacological properties ·
Enthalpic optimization · Enthalpic efficiency index

7.1 Introduction

Thermodynamics has found increasing adoption in the drug design and development
process in both academic and commercial endeavours and is increasingly prevalent
alongside longer standing structure- and molecular modelling-based approaches. The
integration of thermodynamic measurements has grown with a better understand-
ing of energetic data, the increasing demonstration of the utility and application of
these measurements, and the availability of ever-improving instrumentation. How-
ever, there is still much that is not understood about the basis of binding interactions
and how these can be interpreted from thermodynamic data. Advances in
instrumentation have increased throughput and reduced sample demands, but still offer only
moderate throughput for a drug discovery effort that demands much more. Despite
these limitations, useful practical approaches have been developed and advances
are being made that present a bright future for thermodynamics in drug design and
development.

© The Author(s) 2018 43


A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_7
44 7 Thermodynamic Cycles and Their Application in Protein Targets

Historically, rational drug design has been based upon seeking structural com-
plementarity and optimizing binding contacts between an engineered drug and a
target binding site to generate lead compounds [1]. Of course, drug design is part
of a bigger picture involving consideration and optimization of solubility, selectiv-
ity, ADMET (absorption, distribution, metabolism, excretion and toxicology) and
pharmacokinetic/pharmacodynamic properties, but rational design and engineering
of ligands for molecular recognition of a given target is the core of the process. In
the past, drug design relied on structural information about the target site
determined by X-ray crystallography and NMR, alongside molecular modelling
of drug–target interactions. Drug development was driven by the goal of optimizing
molecular recognition, seeking high affinity compounds that were considered to
possess optimal binding interactions. However, a purely structure-based approach is
incomplete, and it is essential to incorporate complementary approaches to understand
the driving forces underlying the molecular interactions of the binding process
[2]. Approaches based solely on structural data usually pursue binding affinity
optimization, which provides an oversimplified picture of molecular interactions:
isostructural complexes with similar binding affinities can hide disparate binding
thermodynamics, so affinity alone reveals only one part of the binding picture.
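The point that similar affinities can hide disparate thermodynamics can be made concrete with a short calculation. The sketch below uses the standard relations ΔG = RT ln(Kd) and ΔG = ΔH − TΔS. The two ligands, their enthalpies and their heavy-atom counts are invented for illustration. The enthalpic efficiency is assumed here to be ΔH divided by the number of non-hydrogen atoms, because the chapter names the index without defining it.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def free_energy_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy dG = RT ln(Kd) in kJ/mol (1 M standard state)."""
    return R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar)


def entropic_term(delta_g, delta_h):
    """Return -T*dS, obtained by rearranging dG = dH - T*dS."""
    return delta_g - delta_h


def enthalpic_efficiency(delta_h, heavy_atoms):
    """Enthalpy per non-hydrogen atom (assumed definition, see text above)."""
    return delta_h / heavy_atoms


# Hypothetical ligands: same Kd (10 nM), different measured enthalpies.
# Values are (dH in kJ/mol, number of heavy atoms) and are illustrative only.
ligands = {"ligand_A": (-60.0, 30), "ligand_B": (-15.0, 30)}
delta_g = free_energy_from_kd(10e-9)

for name, (delta_h, heavy_atoms) in ligands.items():
    minus_t_delta_s = entropic_term(delta_g, delta_h)
    efficiency = enthalpic_efficiency(delta_h, heavy_atoms)
    print(f"{name}: dG={delta_g:.1f} dH={delta_h:.1f} "
          f"-TdS={minus_t_delta_s:.1f} EE={efficiency:.2f} kJ/mol")
```

Both hypothetical ligands have the same ΔG, yet one is enthalpy driven and the other entropy driven. An affinity ranking alone cannot show this.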

7.2 Protein Targets and Applications

The fate of a drug entering the body is crucially determined by drug-metabolizing
enzymes. Apart from drug breakdown and facilitated excretion, metabolic transformations
sometimes give rise to toxic effects. Cytochrome P450
enzymes (CYP450s) constitute a family of heme-containing enzymes. They are
involved in endogenous processes, as well as the biotransformation of xenobiotics,
such as drugs. As such, they play an important role in the disposition of drugs and
their pharmacological and toxicological effects. Most members of the CYP450 family
carry out monooxygenation (Phase 1 drug metabolism) reactions, in which molecular
oxygen is reduced by NADPH-derived electrons to oxidize a substrate molecule,
via insertion of an oxygen atom into a substrate’s C–H bond while simultaneously
forming a water molecule.
Complementary to experimental studies, many computational efforts have been
performed to predict the mode and affinity of drug binding (i.e. activity and substrate
selectivity) to CYP450s, to predict the possible toxicity effects and to rationalize
the selective substrate binding phenomena. Typically, computational prediction of
enzyme activity and selectivity involves substrate recognition, the site and rate of
metabolism, the complete catalytic process and the ease of product release.

Cytochrome P450 BM3

Due to their high catalytic activity and broad substrate specificity,
CYP450s are interesting targets in biotechnological research. They can serve as
biocatalysts to produce, e.g., human metabolites. CYP450 enzymes are also used as
biocatalysts for industrial purposes, for instance in the synthesis of fine chemicals and
commercial products. Wild-type CYP450 BM3, also known as CYP102A1, is a fatty acid
hydroxylase that shows one of the highest hydroxylation activities
ever reported for a CYP450. Although the natural CYP450 BM3 substrates are
long-chain fatty acids, its substrate specificity has been broadened by site-directed
and random mutagenesis. Moreover, many of the previously designed and identified
biocatalytically active CYP450 BM3 mutants are potent candidates for use in biotech-
nology, because they convert a variety of substrates into therapeutically or diagnos-
tically useful products and display a broad range of substrate specificity as well as
stereoselectivity. By employing genetic engineering techniques, these enzymes can
be further improved by random or site-directed mutagenesis to increase activity, sta-
bility, substrate specificity as well as stereoselectivity. To rationalize the results of
such mutagenesis studies or to even guide them, in silico modelling has proven to
be a useful and synergetic tool, for instance to predict and structurally rationalize
the effect of mutations. Previous combined experimental and computational efforts
in our molecular toxicology laboratory have led to the design and elucidation of new
drug-metabolizing mutants of BM3 (mutants M01 and M11) that can convert a
variety of drug-like compounds such as 3,4-methylenedioxy-methylamphetamine
(MDMA) and dextromethorphan. The mutants M01 A82W and M11 L437N were postulated
on the basis of computational modelling, and their experimentally characterized
metabolism of testosterone and α-ionones was rationalized using docking and
molecular dynamics.

7.3 4-Hydroxyphenylpyruvate Dioxygenase (HPPD)

HPPD is an Fe(II)-dependent, non-heme oxygenase that catalyzes the conversion of
4-hydroxyphenylpyruvate to homogentisate, one of the first steps in the tyrosine
catabolic pathway. This reaction is a chemically complex transformation, with
many structural modifications that all occur in a single catalytic cycle. HPPD is
a relevant target protein for both therapeutic and agrochemical research. Because
HPPD is involved in tyrosine catabolism, blocking the formation and accumulation
of toxic catabolites by HPPD inhibition has proven a successful strategy to treat type
I tyrosinemia in mammals. Interestingly, an effective HPPD inhibitor used to treat
this inherited metabolic disorder was originally developed to serve as a
herbicidal agent. In plants, HPPD is a key enzyme in the pathway that produces
plastoquinone and tocopherol from homogentisate; both are essential cofactors in the
photosynthesis cascade. Suppressing the formation of these cofactors disrupts the
photosynthesis route. Inhibition of HPPD thus leads to bleaching, ultimately followed
by necrosis and death. For many years, HPPD has been
a target of interest in the agrochemical industry, and many efforts have been made
in the screening and synthesis of inhibitors, which have led to many commercially
available herbicides.

7.4 Oligopeptide-Binding Protein A (OppA)

Water molecules can be of considerable importance for the binding and selectivity of a
substrate to its receptor, for instance water-mediated hydrogen bonds between protein
and ligand. The bacterial oligopeptide-binding protein A of Salmonella typhimurium
(OppA) is a well-studied example for which water molecules have a profound effect
on ligand binding. OppA binds small peptides of 3–5 residues regardless of their
amino acid sequence. Whereas other proteins need water molecules to establish high
selectivity in ligand binding, OppA relies on water molecules to accommodate a
broad range of ligands with diverse physicochemical properties. This lack of specificity
arises because most of the interactions between OppA and peptide ligands are mediated
by water, which stabilizes the positive and negative charges or dipole moments of the
ligand side chains. For instance, the crystal structure of the charged tripeptide Lys-Glu-Lys
(KEK) in complex with OppA (PDB code 1JEU) showed that the ligand is buried in
the active site, and that most of the interactions between KEK and OppA are mediated
by nine water molecules. For different tripeptides, diverse water configurations have
been observed in the active site, as well as dissimilar numbers of water molecules.
Simulating highly flexible peptidic ligands in the presence of water molecule networks
in the active site pocket is challenging; these challenges have been addressed by
constructing thermodynamic cycles for three different peptides binding to OppA.
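The text does not specify how the thermodynamic cycles for OppA were set up, so the following is only a generic sketch of a relative binding free energy cycle. Ligand A is transformed into ligand B once in the bound state and once free in solution. Because free energy is a state function, the difference between these two legs equals the difference in binding free energies. All function names and numbers are hypothetical.

```python
def relative_binding_free_energy(dg_bound_a_to_b, dg_free_a_to_b):
    """ddG_bind = dG_bind(B) - dG_bind(A) from the two transformation legs."""
    return dg_bound_a_to_b - dg_free_a_to_b


def cycle_residual(dg_bind_a, dg_bound_a_to_b, dg_bind_b, dg_free_a_to_b):
    """Sum of free energies around the closed cycle.

    Path: bind A, transform A->B in the complex, unbind B,
    transform B->A in solution. A consistent cycle sums to zero.
    """
    return dg_bind_a + dg_bound_a_to_b - dg_bind_b - dg_free_a_to_b


# Illustrative values in kJ/mol, not taken from any OppA study.
dg_bound = -12.0   # A -> B while bound to the protein
dg_free = -7.5     # A -> B free in solution
dg_bind_a = -40.0  # absolute binding free energy of ligand A

ddg = relative_binding_free_energy(dg_bound, dg_free)
dg_bind_b = dg_bind_a + ddg

print("ddG_bind (B relative to A):", ddg)
print("dG_bind(B):", dg_bind_b)
print("cycle residual:", cycle_residual(dg_bind_a, dg_bound, dg_bind_b, dg_free))
```

A nonzero residual in a real calculation would indicate sampling or convergence problems in one of the legs.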

References

1. T.L. Blundell, Structure-based drug design. Nature 384, 23 (1996)
2. N.C. Garbett, J.B. Chaires, Thermodynamic studies for drug design and screening. Expert Opin.
Drug Discov. 7, 299–314 (2012)
Chapter 8
Genomics and Proteomics Using
Computational Biology

Abstract Current functional genomics relies on known and characterised genes, but
despite significant efforts in the field of genome annotation, accurate identification
and elucidation of protein coding gene structures remains challenging. Methods
are limited to computational predictions and transcript-level experimental evidence;
hence translation cannot be verified. Proteomic mass spectrometry is a method that
enables sequencing of gene product fragments, enabling the validation and refinement
of existing gene annotation as well as the elucidation of novel protein coding regions.
However, the application of proteomics data to genome annotation is hindered by the
lack of suitable tools and methods to achieve automatic data processing and genome
mapping at high accuracy and throughput.

Keywords Computational genomics · Computational proteomics · MS ·
Genome annotation · Functional genomics · Genome · Proteomics

8.1 Introduction

Mass spectrometry (MS) has become the method of choice for protein identification
and quantification [1, 2]. The main reasons for this success include the availabil-
ity of high-throughput technology coupled with high sensitivity, specificity and a
good dynamic range [3]. These advantages are achieved by various separation tech-
niques coupled with high performance MS instrumentation. In a modern bottom-up
LC-MS/MS proteomics experiment [4], a complex protein mixture is often sepa-
rated via gel electrophoresis first to simplify the sample [5]. Subsequently, proteins
are digested with a specific enzyme such as trypsin, generating peptides that are
amenable to subsequent MS analysis. To further reduce sample complexity, peptides
are separated by liquid chromatographic (LC) systems [6], allowing direct
analysis without the need for further fractionation: eluents are ionised, separated by
their mass-to-charge ratios and subsequently registered by the detector. In a tandem
MS experiment (MS/MS), low energy collision-induced dissociation is used to
fragment the precursor ions, usually along the peptide bonds. Product fragments are
measured as mass-to-charge ratios, which commonly reflect the primary structure
of the peptide ion [7]. This simplified process is illustrated in Fig. 8.1.

Fig. 8.1 Schematic of a generic bottom-up proteomics MS experiment. a Sample preparation and
fractionation, b protein separation via gel-electrophoresis, c protein extraction, d enzymatic protein
digestion, e separation of peptides in one or multiple steps of liquid chromatography, followed by
ionisation of eluents and f tandem mass spectrometry analysis

© The Author(s) 2018
A. C. Kaushik et al., Bioinformatics Techniques for Drug Discovery,
SpringerBriefs in Computer Science, https://doi.org/10.1007/978-3-319-75732-2_8

Today this technology allows researchers to identify complex protein mixtures
and enables them to build protein expression landscapes of any biological material
[8]. However, protein sequence coverage varies largely [3, 9] while protein inference
can be challenging if identified sequences are shared between different proteins
[10, 11]. The alternative top-down MS approach allows us to identify and sequence
intact proteins directly and does not limit the analysis to the fraction of detectable
enzyme digests [12, 14]. However, this method is currently not applicable to complex
protein samples in a high throughput fashion. Firstly, there is a lack of
efficient whole protein separation techniques and secondly, commercially available
MS instruments are limited either in fragmentation efficiency or by molecular weight
restrictions on the analytes [13].

8.2 Peptide Identification

Many computational tools have been developed to support high throughput peptide
and protein identification by automatically assigning sequences to tandem MS
spectra [15] shown in Fig. 8.1. Three types of approaches are used: (a) de novo
sequencing, (b) database searching and (c) hybrid approaches.

8.3 De Novo and Hybrid Algorithms

De novo algorithms infer the primary sequence directly from the MS/MS spectrum by
matching the mass differences between peaks to the masses of corresponding amino
acids [16]. These algorithms do not need a priori sequence information and hence
can potentially identify protein sequences that are not available in a protein database.
However, de novo implementations do not yet reach the overall performance of
database search algorithms and often only a part of the whole peptide sequence
is reliably identified [17–19]. High accuracy mass spectrometry circumvents many
sequence ambiguities, and de novo methods can reach new levels of performance
[20]. Moreover, hybrid algorithms are becoming more important; these build upon de
novo algorithms, but compare the generated lists of potential peptides [21] or short
sequence tags [22] with available protein sequence databases to limit and refine the
search results. With the constant advances in instrument technology and improved
algorithms, de novo and hybrid methods may have a more important role in the
future; however, database searching remains the most widely used method for peptide
identification.
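The core idea of de novo sequencing, matching mass differences between peaks to amino acid masses, can be sketched in a few lines. The example assumes an idealised, complete ladder of singly charged fragment ions from one ion series. The peak values and the 0.01 Da tolerance are hypothetical. Real spectra contain noise, gaps and mixed ion series, which is why practical de novo tools are far more elaborate.

```python
# Monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tolerance=0.01):
    """Translate gaps between consecutive peaks into residues.

    Ambiguous gaps (e.g. isobaric I and L) are reported in brackets,
    unexplained gaps as '?'. The reading direction depends on the ion
    series the ladder comes from, which this sketch does not determine.
    """
    ordered = sorted(peaks)
    sequence = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        gap = heavier - lighter
        matches = sorted(
            aa for aa, mass in RESIDUE_MASS.items()
            if abs(mass - gap) <= tolerance
        )
        if not matches:
            sequence.append("?")
        elif len(matches) == 1:
            sequence.append(matches[0])
        else:
            sequence.append("[" + "".join(matches) + "]")
    return "".join(sequence)


# Hypothetical ladder: arbitrary first peak followed by gaps of G, A, S and L/I.
example_peaks = [100.0, 157.02146, 228.05857, 315.0906, 428.17466]
print(read_ladder(example_peaks))
```

The bracketed output for the last gap illustrates one of the sequence ambiguities mentioned above: leucine and isoleucine have identical masses and cannot be separated by mass alone.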

8.4 Sequence Database Search Algorithms

Sequence database search algorithms mirror the experimental steps in silico
(Fig. 8.2): a protein sequence database is digested into peptides with the same enzyme
that is used in the actual experiment, most often trypsin, which cleaves very specifically
after arginine (R) and lysine (K) [23]. All peptide sequences (candidates) that match
the experimental peptide mass within an allowed maximum mass deviation (MMD)
are selected from this in silico digested protein sequence database. Each candidate
is then further investigated at the MS/MS level by correlating the experimental with
the theoretical peptide fragmentation patterns and scoring the correlation quality
[24, 25]. It should be noted that the sequence database is usually supplemented with
expected experimental contaminant proteins. This prevents spectra that originate from
contaminant proteins from being incorrectly matched to other proteins.
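The first two steps of a database search, in silico digestion and candidate selection within the maximum mass deviation, can be sketched as follows. This is a simplified illustration: it cleaves after every K and R without missed cleavages or modifications, expresses the MMD in ppm, and uses an invented two-entry database. It does not perform the subsequent MS/MS level scoring.

```python
# Monoisotopic amino acid residue masses and water, in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def tryptic_digest(protein):
    """Cleave after every lysine (K) and arginine (R); no missed cleavages."""
    peptides, start = [], 0
    for index, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:index + 1])
            start = index + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def find_candidates(observed_mass, proteins, mmd_ppm=10.0):
    """Return (protein, peptide) pairs whose mass lies within the MMD."""
    hits = []
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            error_ppm = (peptide_mass(peptide) - observed_mass) / observed_mass * 1e6
            if abs(error_ppm) <= mmd_ppm:
                hits.append((name, peptide))
    return hits


# Hypothetical database, including a stand-in contaminant entry.
database = {"protein_1": "MKAARGG", "contaminant_1": "GGKAR"}
print(tryptic_digest(database["protein_1"]))
print(find_candidates(316.1860, database))
```

Each candidate returned here would then be compared with the experimental fragment spectrum and scored, as described above.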

8.5 Scoring of Peptide Identifications

Most of these database search algorithms provide one or more peptide-spectrum
match (PSM) scores that correlate with the quality of the match, but are typically hard
to interpret and are not associated with any valid statistical meaning. Researchers
face the problem of computing identification error rates or PSM significance
measures and need to deal with post-processing software that converts search scores
into meaningful statistical measures. Therefore, the following sections are focussed
on scoring and assessment of database search results, providing a brief overview of
common methods, their advantages and disadvantages.

Fig. 8.2 Concept of sequence database searching resembles a generic bottom-up MS experiment,
as for each stage of the experiment, an in silico equivalent component is available

8.6 Peptide-Spectrum Match Scores and Common Thresholds

Sequest [24] was the first sequence database search algorithm for tandem MS data
and is today, together with Mascot [26], one of the most widely used tools for peptide
and protein identification. These are representative of the numerous database
search algorithms that report, for every PSM, a score that reflects the quality of the
cross correlation between the experimental and the computed theoretical peptide
spectrum. Although Sequest and Mascot scores are fundamentally different in their
calculation, they facilitate good relative PSM ranking: all peptide candidates that
were matched against an experimental spectrum are ranked according to the PSM
score and only the best matches are reported. Often only the top hit is considered
for further investigation and some search engines [27] exclusively report that very
best match. However, not all these identifications are correct. Sorting all top hit
PSMs (absolute ranking) according to their score enables the selective investigation
of the very best matched PSMs. This approach was initially used to aid manual
interpretation and validation. As the field of MS-based proteomics moved towards
high-throughput methods, researchers started to define empirical score thresholds.
PSMs scoring above these thresholds were accepted and assumed to be correct, while
anything else was classified as incorrect. Because the score distributions of correct
and incorrect PSMs overlap, to an extent that depends on how well the underlying PSM
score discriminates, thresholding is always a trade-off between sensitivity (fraction of true
positive identifications) and the acceptable error rate (fraction of incorrect
identifications). A low score threshold accepts more PSMs at the cost of a higher error
rate, whereas a high score threshold reduces the error rate at the cost of
sensitivity.
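The trade-off can be illustrated with a toy data set in which the truth of each PSM is known. This is an assumption that never holds for real experiments, where the labels are exactly what is unknown. The scores and labels below are invented. Sensitivity and error rate follow the definitions given in the text.

```python
def evaluate_threshold(psms, threshold):
    """Return (sensitivity, error_rate) for PSMs scoring at or above threshold.

    psms is a list of (score, is_correct) pairs.
    sensitivity = accepted correct PSMs / all correct PSMs
    error_rate  = accepted incorrect PSMs / all accepted PSMs
    """
    accepted = [is_correct for score, is_correct in psms if score >= threshold]
    total_correct = sum(1 for _, is_correct in psms if is_correct)
    true_positives = sum(accepted)
    sensitivity = true_positives / total_correct if total_correct else 0.0
    error_rate = (len(accepted) - true_positives) / len(accepted) if accepted else 0.0
    return sensitivity, error_rate


# Hypothetical top-hit PSMs with overlapping correct and incorrect scores.
toy_psms = [
    (55, True), (48, True), (41, True), (37, False),
    (30, True), (28, False), (22, False), (15, True),
]

for threshold in (40, 25):
    sensitivity, error_rate = evaluate_threshold(toy_psms, threshold)
    print(f"threshold {threshold}: sensitivity={sensitivity:.2f}, "
          f"error rate={error_rate:.2f}")
```

With these invented values, lowering the threshold from 40 to 25 recovers one more correct PSM but also admits two incorrect ones.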
Many groups also apply heuristic rules that combine the score threshold with
some other validation properties such as charge state, the difference in score to the
second-best hit, amongst others. The problem with these methods is that the actual
error rate remains unknown and the decision to accept assignments is based only
on the judgement of an expert. Moreover, results between laboratories or even between
experiments cannot be reliably compared, since different search algorithms, pro-
tein databases, search parameters, instrumentation and sample complexity require
adaptation of acceptance criteria. A recent HUPO study [28] investigated the repro-
ducibility between laboratories. Amongst the 18 laboratories, each had their own
criteria of what was considered a high and low confidence protein identification,
which were mostly based on simple heuristic rules and score thresholds [28]. It was
found that the number of high confidence assignments between two different laboratories
could vary by as much as 50%, despite being based on the same data. As
a result, many proteomic journals require the validation and assessment of score
thresholds, ideally with statistical significance measures.

8.7 Fundamentals of Gene Transcription and Translation

The genomic sequence encodes the blueprint of an organism. The instruction sets are
encoded in protein coding and non-coding genes, which are defined stretches of DNA
sequence that contain the information required to construct proteins and functional
RNA molecules respectively. The realisation of genes is initiated by transcription,
whereby genomic DNA is transcribed into RNA.

Fig. 8.3 Illustration of gene transcription and translation according to the standard model

This precursor RNA sequence comprises two different types of segments in
eukaryotes, exons and introns, the latter of which are removed during splicing. This
process enables the construction of alternative products (alternative splicing) by varying
the joining of exons: these can be extended at the 5′ donor or 3′ acceptor site, one
or multiple exons can be skipped or, rarely, introns can be retained. Products that are
derived from non-coding RNA genes code for RNA molecules and are not further
translated into proteins. These non-coding molecules have been studied extensively
in the last decade and are involved in many cellular processes, although the function
is unknown for some of these elements [29–31].
Spliced RNA sequence that was derived from protein coding genes is referred to as
messenger RNA (mRNA). Mature mRNA comprises the open reading frame (ORF)
that codes for the protein and the untranslated sequences (5′ UTR upstream and
3′ UTR downstream of the ORF). During protein translation, three nucleotides are
read at a time (codons) and specific transfer RNAs (tRNA) match these codons with
three unpaired complementary bases (anticodon). Each anticodon defines a specific
amino acid that is bound to the tRNA, which upon binding of mRNA and tRNA
is ligated to the growing polypeptide chain. The newly synthesised protein must
fold to its active three-dimensional structure before it can carry out its function.
This simplified standard model describing the unfolding of genomic sequence, also
known as the “central dogma of molecular biology” [32, 33], is further illustrated in
Fig. 8.3.
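The three steps described above (transcription, splicing and translation) can be followed on a toy gene. The sketch below assumes the input is the coding strand, uses the standard genetic code, and takes exon coordinates as given. The sequence and coordinates are invented for illustration, and real splice site recognition is not modelled.

```python
BASES = "UCAG"
# Standard genetic code, ordered by first, second and third base in BASES.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def transcribe(coding_strand_dna):
    """Write the coding strand as RNA (T becomes U)."""
    return coding_strand_dna.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Join exon segments given as (start, end) pairs, end exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first AUG up to, but excluding, the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for position in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[position:position + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical gene: 5' UTR "GG", exon part "ATGGCC", intron "GTAAGTTTCAG",
# exon part "AAGTGA", 3' UTR "CC".
toy_gene = "GGATGGCCGTAAGTTTCAGAAGTGACC"
toy_exons = [(0, 8), (19, 27)]

pre_mrna = transcribe(toy_gene)
mrna = splice(pre_mrna, toy_exons)
print(mrna, translate(mrna))
```

Changing `toy_exons` (for example, retaining the intron) changes the translated product, which is the essence of alternative splicing.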

8.8 Genome Sequencing

Sequencing efforts in the last decade have generated a large amount of raw genomic DNA
sequence data. To date there are 118 complete eukaryotic genomes sequenced [34]
and more sophisticated sequencing technologies will further speed up this data collection
process. A project to sequence 10,000 vertebrate species has just been proposed,
even though technology is not yet up to it [35]. Genomes can be large: for example,
the human genome comprises approximately 3.2 × 10^9 base pairs, yet only about
1–2% of its DNA codes for proteins [36].

8.9 Definition of Genome Annotation

Genome annotation can be defined as augmenting these raw DNA sequences with
additional layers of information [37, 38]. A distinction is made between structural
and functional annotation. The former is the process of identifying important genomic
elements such as genes, the precise localisation of genes within the genome and
the elucidation of exon/intron structures, while the latter deals with the biological
function, regulation and expression analysis of these elements. For clarification,
when the term “genome annotation” is used in the remainder of this work, it refers
to structural annotation only. The task of accurately annotating the complete set
of protein coding genes and their alternative splice forms is considered one of the
hardest and yet most important steps towards understanding a genome, since proteins
are central to virtually every biological process in a cell. However, the difficulty of
gene identification and gene structure elucidation is determined by the complexity
of the underlying genome: for example, identification of ORFs in bacteria, which are
not discussed in this work, is relatively easy due to the lack of alternative splicing
and a compact genome; simpler eukaryotes, such as yeast with limited splicing and
short intronic regions are much easier to annotate than vertebrates, since extensive
alternative splicing, long introns and intergenic regions further complicate sensitive
and specific annotation.
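To illustrate why ORF identification is comparatively easy in a compact genome without splicing, the following sketch scans the three forward reading frames for stretches that run from an ATG to the next in-frame stop codon. It is a deliberately naive illustration: it ignores the reverse strand, alternative start codons and any statistical gene model, and the `min_codons` cut-off is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of forward-strand ORFs, end exclusive.

    An ORF starts at the first ATG in a frame and ends with the next
    in-frame stop codon (included in the interval). min_codons counts
    codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for position in range(frame, len(dna) - 2, 3):
            codon = dna[position:position + 3]
            if start is None and codon == "ATG":
                start = position
            elif start is not None and codon in STOP_CODONS:
                if (position - start) // 3 >= min_codons:
                    orfs.append((start, position + 3))
                start = None
    return sorted(orfs)


toy_sequence = "CCATGAAATAGGG"  # hypothetical sequence
for start, end in find_orfs(toy_sequence):
    print(start, end, toy_sequence[start:end])
```

In a vertebrate genome the same scan would miss any gene whose coding sequence is interrupted by introns, which is one reason annotation there needs additional evidence.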

8.10 Genome Annotation Strategies

With the ever-increasing availability of sequenced genomes, automatic genome annotation
is an active area of research. Figure 8.4 provides an overview of the different
available annotation strategies, which will be briefly discussed.

Fig. 8.4 Overview of the different gene-finding strategies. Figure was adapted from Harrow et al.
2009

8.11 Proteogenomics

The automatic Ensembl pipeline and the HAVANA manual curation pipeline incorporate
protein data from the UniProtKB database [39], where more than 99% of
the protein sequences are derived from genomic translations and cDNA sequences,
but only 13% are supported by protein level evidence such as mass spectrometry
identification (UniProt release notes 15.11,
http://www.uniprot.org/news/2009/11/24/release). Yates et al. [40] demonstrated the
concept of searching MS/MS data directly against a six-frame translation of the genome,
and later studies [41–43] applied this approach to eukaryotic genomes with the purpose
of validating and refining gene annotation as well as identifying novel genes. In these
studies, a six-frame translation was used as a search database; however, in higher
eukaryotes this is problematic: only 1–2% of the human genome encodes proteins [30, 36],
therefore most of the six-frame translation is essentially random sequence. The Peptide
Atlas project [44, 45], the first large-scale proteogenomic pipeline and repository of
MS/MS peak lists and raw data, employs the standard International Protein Index
(IPI) database as an alternative to six-frame translation. IPI provides a minimally
redundant yet maximally complete set of protein sequences from Ensembl,
Vega, RefSeq and UniProtKB. Later versions of Peptide Atlas complement the IPI
database with protein isoforms from Ensembl. Peptide Atlas comprises an analysis
pipeline that processes MS data with Sequest and PeptideProphet and provides access
to these peptide identifications, which are stored in a comprehensive relational
database. As an additional feature, Peptide Atlas maps peptide identifications to the
genome using the sequence alignment tool BLAST [46]. These mappings are made
available with a distributed annotation server (DAS), allowing peptide identification
results to be integrated into various genome browsers, such as Ensembl. The currently
available DAS source (http://www.peptideatlas.org/setup_genome_browser.php)
does not provide information on the uniqueness of the peptide within
the genome, limiting its direct use for annotation, since the peptide could match
multiple different genomic loci. The system is not available for download, providing
little flexibility for required changes or extensions, such as support of Mascot and
Mascot Percolator or different search databases.
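The six-frame translation used as a search database in the studies above can be sketched as follows. This is a minimal illustration using the standard genetic code on an invented sequence. It does not split the translations at stop codons or digest them, as a real proteogenomic search database would. The helper `locate_peptide` is a hypothetical addition that shows how a peptide can be checked for uniqueness across frames.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second and third base in BASES.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(dna, offset):
    """Translate every complete codon starting at offset ('*' marks a stop)."""
    return "".join(
        CODON_TABLE[dna[position:position + 3]]
        for position in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate_frame(sequence, offset)
    return frames


def locate_peptide(peptide, frames):
    """List the frames whose translation contains the peptide."""
    return [name for name, protein in frames.items() if peptide in protein]


toy_frames = six_frame_translation("ATGGCCAAGTGA")  # hypothetical sequence
for name, protein in toy_frames.items():
    print(name, protein)
print(locate_peptide("MAK", toy_frames))
```

Only one of the six frames of this toy sequence carries the intended product, which mirrors the observation that most of a six-frame translation of a large genome does not correspond to real protein sequence.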

References

1. R. Aebersold, M. Mann, Mass spectrometry-based proteomics. Nature 422, 198–207 (2003)
2. S.D. Patterson, R.H. Aebersold, Proteomics: the first decade and beyond. Nat. Genet. 33,
311–323 (2003)
3. L.M. de Godoy, J.V. Olsen, G.A. de Souza, G. Li, P. Mortensen, M. Mann, Status of complete
proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome
Biol. 7, R50 (2006)
4. A.L. McCormack, D.M. Schieltz, B. Goode, S. Yang, G. Barnes, D. Drubin, J.R. Yates, Direct
analysis and identification of proteins in mixtures by LC/MS/MS and database searching at the
low-femtomole level. Anal. Chem. 69, 767–776 (1997)
5. A. Shevchenko, M. Wilm, O. Vorm, M. Mann, Mass spectrometric sequencing of proteins from
silver-stained polyacrylamide gels. Anal. Chem. 68, 850–858 (1996)
6. D.A. Wolters, M.P. Washburn, J.R. Yates, An automated multidimensional protein identification
technology for shotgun proteomics. Anal. Chem. 73, 5683–5690 (2001)
7. K. Biemann, Contributions of mass spectrometry to peptide and protein structure. Biol. Mass
Spectrom. 16, 99–111 (1988)
8. L.J. Foster, C.L. de Hoog, Y. Zhang, Y. Zhang, X. Xie, V.K. Mootha, M. Mann, A mammalian
organelle map by protein correlation profiling. Cell 125, 187–199 (2006)
9. R.J. Simpson, L.M. Connolly, J.S. Eddes, J.J. Pereira, R.L. Moritz, G.E. Reid, Proteomic
analysis of the human colon carcinoma cell line (LIM 1215): development of a membrane
protein database. Electrophoresis 21, 1707–1732 (2000)
10. A.I. Nesvizhskii, R. Aebersold, Analysis, statistical validation and dissemination of large-scale
proteomics datasets generated by tandem MS. Drug Discov. Today 9, 173–181 (2004)
11. A.I. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold, A statistical model for identifying proteins
by tandem mass spectrometry. Anal. Chem. 75, 4646–4658 (2003)
12. B.A. Parks, L. Jiang, P.M. Thomas, C.D. Wenger, M.J. Roth, M.T. Boyne, P.V. Burke, K.E.
Kwast, N.L. Kelleher, Top-down proteomics on a chromatographic time scale using linear ion
trap Fourier transform hybrid mass spectrometers. Anal. Chem. 79, 7984–7991 (2007)
13. X. Han, M. Jin, K. Breuker, F.W. McLafferty, Extending top-down mass spectrometry to pro-
teins with masses greater than 200 kilodaltons. Science 314, 109–112 (2006)
14. M.J. Roth, B.A. Parks, J.T. Ferguson, M.T. Boyne, N.L. Kelleher, “Proteotyping”: popula-
tion proteomics of human leukocytes using top down mass spectrometry. Anal. Chem. 80,
2857–2866 (2008)
15. A.I. Nesvizhskii, O. Vitek, R. Aebersold, Analysis and validation of proteomic data generated
by tandem mass spectrometry. Nat. Meth. 4 (2007)
16. J.A. Taylor, R.S. Johnson, Sequence database searches via de novo peptide sequencing by
tandem mass spectrometry. Rapid Commun. Mass Spectrom. 11, 1067–1075 (1997)
17. M. Mann, M. Wilm, Error-tolerant identification of peptides in sequence databases by peptide
sequence tags. Anal. Chem. 66, 4390–4399 (1994)
18. E. Pitzer, A. Masselot, J. Colinge, Assessing peptide de novo sequencing algorithms perfor-
mance on large and diverse data sets. Proteomics 7, 3051–3054 (2007)
19. D.L. Tabb, A. Saraf, J.R. Yates, GutenTag: high-throughput sequence tagging via an empirically
derived fragmentation model. Anal. Chem. 75, 6415–6421 (2003)
20. A.M. Frank, M.M. Savitski, M.L. Nielsen, R.A. Zubarev, P.A. Pevzner, De novo peptide
sequencing and identification with precision mass spectrometry. J. Proteome Res. 6, 114–123
(2007)
21. S. Kim, N. Gupta, N. Bandeira, P.A. Pevzner, Spectral dictionaries integrating de novo peptide
sequencing with database search of tandem mass spectra. Mol. Cell. Proteomics 8, 53–69
(2009)
22. S. Tanner, H. Shu, A. Frank, L.-C. Wang, E. Zandi, M. Mumby, P.A. Pevzner, V. Bafna, InsPecT:
identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem.
77, 4626–4639 (2005)
23. J.V. Olsen, S.-E. Ong, M. Mann, Trypsin cleaves exclusively C-terminal to arginine and lysine
residues. Mol. Cell. Proteomics 3, 608–614 (2004)
24. J.K. Eng, A.L. McCormack, J.R. Yates, An approach to correlate tandem mass spectral data
of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5,
976–989 (1994)
25. D.N. Perkins, D.J.C. Pappin, D.M. Creasy, J.S. Cottrell, Probability-based protein identification
by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999)
26. P. Carella, D.C. Wilson, R.K. Cameron, Some things get better with age: differences in salicylic
acid accumulation and defense signaling in young and mature Arabidopsis. Front. Plant Sci. 5
(2014)
27. R. Craig, R.C. Beavis, TANDEM: matching proteins with tandem mass spectra. Bioinformatics
20, 1466–1467 (2004)
28. G.S. Omenn, T.W. Blackwell, D. Fermin, J. Eng, D.W. Speicher, S.M. Hanash, Challenges
in deriving high-confidence protein identifications from data gathered by a HUPO plasma
proteome collaborative study. Nat. Biotechnol. 24, 333–338 (2006)
29. M. Clamp, B. Fry, M. Kamal, X. Xie, J. Cuff, M.F. Lin, M. Kellis, K. Lindblad-Toh, E.S.
Lander, Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl.
Acad. Sci. 104, 19428–19433 (2007)
30. J.-M. Claverie, Fewer genes, more noncoding RNA. Science 309, 1529–1530 (2005)
31. S. Washietl, J.S. Pedersen, J.O. Korbel, C. Stocsits, A.R. Gruber, J. Hackermüller, J. Hertel,
M. Lindemeyer, K. Reiche, A. Tanzer, Structured RNAs in the ENCODE selected regions of
the human genome. Genome Res. 17, 852–864 (2007)
32. F.H. Crick, The biological replication of macromolecules. Symp. Soc. Exp. Biol. 12, 138–163
(1958)
33. F. Crick, Central dogma of molecular biology. Nature 227, 561–563 (1970)
34. K. Liolios, I.-M.A. Chen, K. Mavromatis, N. Tavernarakis, P. Hugenholtz, V.M. Markowitz,
N.C. Kyrpides, The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic
projects and their associated metadata. Nucleic Acids Res. 38, D346–D354 (2009)
35. E. Pennisi, No genome left behind. Science 326, 794–795 (2009)
36. E. Birney, J.A. Stamatoyannopoulos, A. Dutta, R. Guigó, T.R. Gingeras, E.H. Margulies, Z.
Weng, M. Snyder, E.T. Dermitzakis, R.E. Thurman, Identification and analysis of functional
elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816
(2007)
37. M.R. Brent, Genome annotation past, present, and future: how to define an ORF at each locus.
Genome Res. 15, 1777–1786 (2005)
38. L. Stein, Genome annotation: from sequence to biology. Nat. Rev. Genet. 2, 493–503 (2001)
39. C.H. Wu, R. Apweiler, A. Bairoch, D.A. Natale, W.C. Barker, B. Boeckmann, S. Ferro, E.
Gasteiger, H. Huang, R. Lopez, The Universal Protein Resource (UniProt): an expanding
universe of protein information. Nucleic Acids Res. 34, D187–D191 (2006)
40. J.R. Yates III, J.K. Eng, A.L. McCormack, Mining genomes: correlating tandem mass spectra
of modified and unmodified peptides to sequences in nucleotide databases. Anal. Chem. 67,
3202–3210 (1995)
41. J.S. Andersen, M. Mann, Mass spectrometry allows direct identification of proteins in large
genomes. Proteomics 1, 641–650 (2001)
42. J.S. Choudhary, W.P. Blackstock, D.M. Creasy, J.S. Cottrell, Matching peptide mass spectra
to EST and genomic DNA databases. Trends Biotechnol. 19, 17–22 (2001)
43. J.S. Choudhary, W.P. Blackstock, D.M. Creasy, J.S. Cottrell, Interrogating the human genome
using uninterpreted mass spectrometry data. Proteomics 1, 651–667 (2001)
44. F. Desiere, E.W. Deutsch, A.I. Nesvizhskii, P. Mallick, N.L. King, J.K. Eng, A. Aderem, R.
Boyle, E. Brunner, S. Donohoe, Integration with the human genome of peptide sequences
obtained by high-throughput mass spectrometry. Genome Biol. 6, R9 (2004)
45. F. Desiere, E.W. Deutsch, N.L. King, A.I. Nesvizhskii, P. Mallick, J. Eng, S. Chen, J. Eddes,
S.N. Loevenich, R. Aebersold, The PeptideAtlas project. Nucleic Acids Res. 34, D655–D658
(2006)
46. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool.
J. Mol. Biol. 215, 403–410 (1990)
