
Supplement S1: RNA secondary structure

RNA structure is often expressed schematically by its base pairing: the Watson-Crick (WC) base
pairs A (Adenine) with U (Uracil), and G (Guanine) with C (Cytosine) and also the
non-Watson-Crick (non-WC) base pair G with U. RNA sequences are typically written from left to
right. The beginning of the sequence is usually on the left hand side and is called the 5’ end, and the
opposite end is called the 3’ end.
A number of simple structural motifs are generated by the way that the RNA molecule forms
these base pairs. In RNA secondary structure, the typical structural motifs are usually classified as
stem, hairpin loop, internal loop (or interior loop), and multibranch loop. This standard classification
neglects a large group of possible structures called pseudoknots (PK).

Stem:
When more than one base pair appears in the form of a group of contiguous base pairs, the resulting
structure motif is described as a stem (Fig. S1a). In a secondary structure diagram, the stem appears as a flat object.
However, the actual structure in three dimensions (3D) has a twist that makes a 360° rotation roughly
every 10 bps (for A-RNA, the most commonly found structure for RNA).

[Figure S1: (a) structure + sequence format, showing the strand 5' CUAGU 3' paired with 3' GGUCA 5'; (b) structure format.]
Supplement Figure S1. An example of a stem motif represented as secondary structure.
(a) A stem including both the secondary structure and the sequence labels. (b) A stem
that only includes the base pairing information.

Loops:
The other simple structural motifs, based on the types of loops, can also be found in RNA structures.

Hairpin loop (H-loop):


The simplest such motif is the H-loop (Fig. S2). The H-loop consists of two complementary
sequences joined by some non-pairing bases in a loop. A simple example would be the sequence
AAAAACCCCUUUUU (Fig. S2a). The figure also contains three additional representations. In
Figure S2b, only the base pairing information of the structure diagram is shown (in a similar way as
Fig. S1b). Figure S2c is the bracket notation developed by Hofacker et al. [1], where each opening bracket
on the 5’ side is matched by a closing bracket on the 3’ side and dots mark unpaired bases. Figure S2d is a
highly simplified diagram similar to Fig. S2b, but lacking specific details about the exact size of the loop region.
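
As an illustration of the bracket notation, the following minimal Python sketch converts a bracket string into a list of base pairs and applies it to the hairpin example above. The function name `parse_dot_bracket` and the use of 0-based indices are assumptions made for this example; they are not part of the notation itself.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of 0-based (i, j) base pairs from a bracket string."""
    stack = []   # positions of opening brackets still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hairpin example from the text: five A-U pairs closing a loop of four C's.
sequence = "AAAAACCCCUUUUU"
structure = "(((((....)))))"
pairs = parse_dot_bracket(structure)
paired_bases = [(sequence[i], sequence[j]) for i, j in pairs]
```

Each opening bracket is pushed on a stack and popped by the next unmatched closing bracket, which is why this simple notation can only describe nested base pairs.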

[Figure S2: panels (a), (b) and (d) are structure diagrams of the hairpin formed by AAAAACCCCUUUUU. Bracket notation:]
(c) 5' (((((....))))) 3'


Supplement Figure S2. An example of a hairpin loop (H-loop). (a) Secondary structure
with base index included. (b) Secondary structure with only the base pairing and base
position included. (c) A simplified representation of this secondary structure using
bracket notation. (d) A secondary structure that only specifies the stem base pairs
without specifying the exact size of the loop regions.

Internal loop (I-loop):


Another common structure is the internal loop (I-loop), Fig. S3a-h. An internal loop appears between
two stems and has n1 unpaired bases on the 5’ side and n2 unpaired bases on the 3’ side, where
n1 = 0, 1, 2, … nucleotides (nt) and likewise for n2. Fig. S3a shows a symmetric internal loop
(n1 = n2) where n1 = n2 = 1 nt; i.e., the number of unpaired bases on each side of the loop is equal to
one. The I-loop motif also includes bulges, which have the property that n1 > 0 and n2 = 0, or
n1 = 0 and n2 > 0. Large RNA structures typically have many loops, including bulges and internal loops.
There are many examples of symmetric I-loops in RNA structure databases. Asymmetric
I-loops (n1 ≠ n2) can also be found.
[Figure S3: panels (a), (b) and (d) show an internal loop with n1 unpaired bases on the 5' side and n2 on the 3' side; panels (e), (f) and (h) show a bulge of n unpaired bases. Bracket notation:]
(c) 5' ((((.(((((....))))).)))) 3'
(g) 5' ((((....(((((....))))))))) 3'

Supplement Figure S3. Examples of secondary structure for internal loops and bulges
(I-loops). (a) Secondary structure of internal loops with base index included. (b)
Secondary structure with only the base pairing and base position included. (c) A
simplified representation of this secondary structure using bracket notation. (d) A
secondary structure that only specifies the stem locations and is not specific about the
loop regions. (e-h) The same definitions apply for the example of a bulge.

Multibranch loop (M-loop):


A third common secondary structure motif is the multibranch loop (M-loop or MBL). These
are more complex structures in which a single loop joins several of the previously described stem-loop
structures. Like the other secondary structure motifs, they are quite common. An
example of a multibranch loop is shown in Fig. S4.
The stem, H-loop, I-loop and M-loop are the four fundamental motifs that make up secondary
structure. The fundamental feature of RNA secondary structure is that bases with indices i and j
(i < j) are allowed to pair with each other if they satisfy the following properties with respect to every other
base pair (i' and j', with i' < j'):

(1) if i and j are contained within i' and j', then i' < i < j < j',
(2) if i and j are not contained within i' and j', then either i and j
are both less than i', or i and j are both greater than j',
(3) if neither case is true, then i = i' and j = j'.
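
The three conditions can be checked mechanically. The sketch below is one possible reading of them in Python: any two base pairs must be nested, disjoint, or identical, and everything else (crossing pairs, or two pairs sharing a base) is reported as a conflict. The names `relation` and `is_secondary_structure` are hypothetical, and rule (1) is applied in both directions, since either pair may be the enclosing one.

```python
from itertools import combinations


def relation(pair_a, pair_b):
    """Classify two base pairs (i, j) and (i', j') according to rules (1)-(3)."""
    i, j = pair_a
    ip, jp = pair_b  # i' and j' in the text
    if (i, j) == (ip, jp):
        return "identical"          # rule (3)
    if ip < i < j < jp or i < ip < jp < j:
        return "nested"             # rule (1), in either direction
    if j < ip or jp < i:
        return "disjoint"           # rule (2)
    return "conflict"               # crossing pairs or a shared base


def is_secondary_structure(pairs):
    """True if every pair of base pairs satisfies rules (1)-(3)."""
    unique_pairs = set(pairs)
    return all(relation(a, b) != "conflict"
               for a, b in combinations(unique_pairs, 2))
```

For example, `is_secondary_structure([(0, 13), (1, 12)])` holds because the second pair is nested inside the first, whereas `[(0, 5), (3, 8)]` fails because the two pairs cross.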
[Figure S4: panels (a) and (b) are structure diagrams of a multibranch loop. Bracket notation:]
(c) 5' .((((.((((.....))))..(((((.....))))).)))). 3'

Supplement Figure S4. Secondary structure of multibranch loops (M-loop or MBL). (a)
Secondary structure of M-loop with base index included. (b) Secondary structure with
only the base pairing and base position included. (c) A simplified representation of this
secondary structure using bracket notation.

Pseudoknot (PK) and knots:


Whereas many RNA structures are known to satisfy these rules (for example, tRNA), this is not
always the case. The most common deviation from these base pairing rules is a class of structures
called pseudoknots (PKs). A pseudoknot violates the three pairing rules above in at least one
region of the structure.

(4) For pseudoknots: parts of the structure still satisfy cases 1 through 3.
However, in addition, in at least one region between k and l
such that k ≤ i, j, i', j' ≤ l, there exist some base pairs that satisfy
either i' < i < j' < j or i < i' < j < j'.

Hence, many more possibilities can be generated once we start allowing pseudoknots.
Pseudoknots differ from real knots in the sense that the strand does not pass completely through
the loop but only becomes potentially entangled with it. Anyone who has tried to untangle a pair of
earphones or a recently, neatly wound cord will recognize that knots occur naturally
on long flexible cords. Indeed, great effort seems to be required to avoid tangling a heap of cords.
Figure S5a shows a common pseudoknot known as an H-type pseudoknot. This is also known
as an ABAB pseudoknot. The structure is notated below with the standard parenthesis notation for
the basic secondary structure (here shown in blue) and square brackets for the pseudoknot linkage
(here shown in red). Green indicates the regions of free strand that are not forming base pairs (bps).
The color distinction of the stems in this example is not important because both stems are the same
length and both stems are less than 10 bps.
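
The extended notation, with parentheses for the secondary structure and square brackets for the linkage stem, can be parsed with one stack per bracket type. The following Python sketch does this and then lists every pair of base pairs that satisfies the crossing condition i < i' < j < j' of rule (4). The function names and the short test string are assumptions made for illustration; the string is an H-type pattern in the style of Fig. S5b, not a copy of it.

```python
from itertools import combinations


def parse_extended_brackets(structure):
    """Return sorted (i, j, bracket_type) tuples for '()' and '[]' pairs."""
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}   # one stack of open positions per bracket type
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in closers:
            opener = closers[symbol]
            if not stacks[opener]:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stacks[opener].pop(), pos, opener))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def crossing_pairs(pairs):
    """Return all combinations of base pairs that cross each other."""
    crossings = []
    for a, b in combinations(pairs, 2):
        (i, j, _), (ip, jp, _) = a, b
        if i < ip < j < jp or ip < i < jp < j:
            crossings.append((a, b))
    return crossings


h_type = "((((..[[[[.))))..]]]]"          # ABAB pattern: ( [ ) ]
h_type_crossings = crossing_pairs(parse_extended_brackets(h_type))
```

In this example every parenthesis pair crosses every square-bracket pair, whereas a purely nested string such as `((..))` yields no crossings at all.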

[Figure S5: panels (a), (c), (e) and (g) are structure diagrams; panels (b), (d), (f) and (h) give the corresponding bracket notation:]
(b) 5' ...(((((......[[[[[.)))))....]]]]]... 3'
(d) 5' ...((((((((((((((................[[[[[[[[[[[[[[.)))))))))))))).................]]]]]]]]]]]]]]... 3'
(f) 5' ...(((((......[[[[[.)))))....(((((.]]]]]......)))))... 3'
(h) 5' ...((((((((((((((................[[[[[.)))))))))))))).................]]]]]... 3'

Supplement Figure S5. Examples of pseudoknots and knots. (a) An H-type pseudoknot
where the linkage stem is denoted in red, standard secondary structure in blue and free
strand (regions of no base pairing) in green. (b) The same structure in (a) denoted in
bracket notation. (c) An example of a knot where the stem lengths of both the secondary
structure and the linkage stem are longer than 9 base pairs. (d) The same structure in
bracket notation. (e) An extended pseudoknot, sometimes referred to as a kissing-loop
and also known as an ABACBC pseudoknot. (f) The structure shown in bracket
notation. (g) A pseudoknot. The difference between (c) and (g) is that the linkage stem
is shorter than 9 contiguous bps. (h) The same structure in bracket notation.

Figure S5c shows a knot and the corresponding bracket structure is shown below in Fig. S5d.
Both stems in Fig. S5c contain 14 base pairs (bp). Since the helical axis makes a rotation of 360°
every 10 bps, this means that the structure in Fig. S5c is tangled in a knot. There is no reason why
such a structure cannot form. Indeed, knots are known to form in some rare proteins [2,3]. However,
it is not a pseudoknot and the current approach is not designed to estimate its existence or the
likelihood of its formation. One important feature of a pseudoknot is, therefore, that the linkage stem
(here shown in red) must be shorter than 10 contiguous bps.
Figure S5e shows two stem-hairpin loops (blue and purple) that are joined by a linkage stem
(red). The stems are all short, as in Fig. S5a. This is a pseudoknot, sometimes referred to as a kissing
loop, and also known as an ABACBC pseudoknot. It is observed in a number of RNA structures,
although less frequently than H-type pseudoknots. The corresponding bracket notation for this
structure is shown in Fig. S5f.
Figure S5g shows a structure intermediate between Figs. S5a and c. This structure is also an
H-type pseudoknot. Here we see a necessary condition for defining a segment as a linkage stem
and the most important distinction between a knot and a pseudoknot: a linkage stem cannot contain
more than 9 contiguous bps. When an internal loop breaks the continuity, this rule may not
necessarily apply. Therefore, a longer overall structure could form, but only if the stem is not
contiguous. Moreover, this would have to be considered on a case-by-case basis.
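
The contiguity condition lends itself to a simple check on the bracket notation. The Python sketch below measures the longest run of contiguous base pairs in the square-bracket (linkage) stem and compares it with the limit of 9 bps stated above. The helper names and the constant `MAX_LINKAGE_BPS` are assumptions introduced for this example, the input is assumed to be a balanced bracket string, and the test only captures this necessary condition, not the full distinction between knots and pseudoknots.

```python
MAX_LINKAGE_BPS = 9  # limit on contiguous linkage-stem base pairs, from the text


def bracket_pairs(structure, open_symbol, close_symbol):
    """Collect (i, j) pairs for one bracket type; input is assumed balanced."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == open_symbol:
            stack.append(pos)
        elif symbol == close_symbol:
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def longest_contiguous_stem(pairs):
    """Length of the longest run (i, j), (i+1, j-1), (i+2, j-2), ..."""
    pair_set = set(pairs)
    longest = 0
    for i, j in pairs:
        if (i - 1, j + 1) in pair_set:
            continue                      # not the outermost pair of its stem
        length = 0
        while (i + length, j - length) in pair_set:
            length += 1
        longest = max(longest, length)
    return longest


def linkage_is_pseudoknot_compatible(structure):
    """True if the '[...]' linkage stem has at most MAX_LINKAGE_BPS contiguous bps."""
    linkage = bracket_pairs(structure, "[", "]")
    return longest_contiguous_stem(linkage) <= MAX_LINKAGE_BPS


short_linkage = "(" * 5 + "." * 6 + "[" * 5 + "." + ")" * 5 + "." * 4 + "]" * 5
long_linkage = "(" * 14 + "." * 16 + "[" * 14 + "." + ")" * 14 + "." * 17 + "]" * 14
interrupted = "[[[[[.[[[[[....]]]]].]]]]]"   # 10 pairs, split by an internal loop
```

The third string shows why contiguity matters: it contains ten linkage base pairs in total, but the internal loop splits them into two runs of five.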
As a final curiosity, in Fig. S6, a secondary structure (left hand side) and its corresponding knot
(right hand side) are shown in equilibrium. The knot renders a peculiar 2D illusion perhaps
reminiscent of artist M.C. Escher’s “Belvedere” (1958). The knotted structure could conceivably
exist in chemical equilibrium with a standard secondary structure, though electrostatic effects may
render it less favorable thermodynamically. It could result from a simple misfolding.

[Figure S6: left panel, secondary structure; right panel, Belvedere knot.]

Supplement Figure S6. A special type of knot (right hand side) that becomes entangled
due to the way the structure folds up. The two-dimensional nature of RNA
schematics tends to hide curious possibilities such as this. The knot is shown in equilibrium
between the unknotted structure (left hand side) and the “Belvedere knot” (right hand
side).

Functional RNA structures that contain knots of the form shown in Figs. S5c or S6 are currently
unknown or have not been reported. However, pseudoknots are often observed in functional RNA
structures, particularly H-type pseudoknots (Fig. S5a). Single-stranded RNA sequences such as
messenger RNA with introns can reach lengths in the tens of thousands
of nucleotides. Given the propensity of even a few simple cords to become tangled, and given that the cell
can have many thousands of protein and RNA strands present within the cellular environment, this
suggests that a fair amount of effort is made within the cell to prevent or get rid of knots [2].
In this work, we are concerned with the prediction of pseudoknots. These structures have the
property that they can be evaluated as structures resulting from reversible folding.

References
1. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, et al. (1994) Fast folding and
comparison of RNA secondary structures. Monatshefte f Chemie 125: 167-188.
2. Lua RC, Grosberg Y (2006) Statistics of knots, geometry of conformations, and evolution of
proteins. PLoS Comp Biol 2: e45.
3. Virnau P, Mirny LA, Kardar M (2006) Intricate knots in proteins: function and evolution.
PLoS Comp Biol 2: e122.
P1: JZP
0521840988c16 CB1022/Xiong 0 521 84098 8 January 10, 2006 14:16

CHAPTER SIXTEEN

RNA Structure Prediction

RNA is one of the three major types of biological macromolecules. Understanding
the structures of RNA provides insights into the functions of this class of molecules.
Detailed structural information about RNA has a significant impact on understanding
the mechanisms of a vast array of cellular processes such as gene expression,
viral infection, and immunity. RNA structures can be experimentally determined
using X-ray crystallography or NMR techniques (see Chapter 10). However,
these approaches are extremely time consuming and expensive. As a result, computational
prediction has become an attractive alternative. This chapter presents
the basics of RNA structures and current algorithms for RNA structure prediction,
with an emphasis on secondary structure prediction.

INTRODUCTION

It is known that RNA is a carrier of genetic information and exists in three main forms.
They are messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA).
Their main roles are as follows: mRNA is responsible for directing protein synthesis;
rRNA provides structural scaffolding within ribosomes; and tRNA serves as a carrier
of amino acids for polypeptide synthesis.
Recent advances in biochemistry and molecular biology have allowed the discovery
of new functions of RNA molecules. For example, RNA has been shown to possess
catalytic activity and is important for RNA splicing, processing, and editing. A class of
small, noncoding RNA molecules, termed microRNA or miRNA, has recently been
identified to regulate gene expression through interaction with mRNA molecules.
Unlike DNA, which is mainly double stranded, RNA is single stranded, although an
RNA molecule can self-hybridize at certain regions to form partial double-stranded
structures. Generally, mRNA is more or less linear and nonstructured, whereas rRNA
and tRNA can only function by forming particular secondary and tertiary structures.
Therefore, knowledge of the structures of these molecules is particularly important
for understanding their functions. Difficulties in experimental determination
of RNA structures make theoretical prediction a very desirable approach. In fact,
computation-based analysis is a main tool in RNA-based drug design in the pharmaceutical
industry. In addition, knowledge of the secondary structures of rRNA is key
for RNA-based phylogenetic analysis.


Downloaded from https://www.cambridge.org/core. University of Birmingham, on 18 Nov 2019 at 22:20:05, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511806087.017
Figure 16.1: The primary, secondary, and tertiary structures of a tRNA molecule illustrating the three levels of RNA structural organization.

Figure 16.2: Schematic diagram of a hypothetical RNA molecule containing four basic types of RNA
loops: a hairpin loop, bulge loop, interior loop, and multibranch loop. Dashed lines indicate base pairings
in the helical regions of the molecule.

TYPES OF RNA STRUCTURES

RNA structures can be described at three levels as in proteins: primary, secondary,
and tertiary. The primary structure is the linear sequence of RNA, consisting of four
bases, adenine (A), cytosine (C), guanine (G), and uracil (U). The secondary structure
refers to the planar representation that contains base-paired regions among single-stranded
regions. The base pairing is mainly composed of traditional Watson–Crick
base pairing, which is A–U and G–C. In addition to the canonical base pairing, there
often exists noncanonical base pairing such as G and U base pairing. The G–U base
pair is less stable and normally occurs within a double-strand helix surrounded
by Watson–Crick base pairs. Finally, the tertiary structure is the three-dimensional
arrangement of bases of the RNA molecule. Examples of the three levels of RNA structural
organization are illustrated in Figure 16.1.
Because the RNA tertiary structure is very difficult to predict, attention has been
mainly focused on secondary structure prediction. It is therefore important to learn
in more detail about RNA secondary structures. Based on the arrangement of helical
base pairing in secondary structures, four main subtypes of secondary structures can
be identified. They are hairpin loops, bulge loops, interior loops, and multibranch
loops (Fig. 16.2).
The hairpin loop refers to a structure in which the two ends of a single-stranded region
(loop) are connected to a base-paired region (stem). The bulge loop refers to a single-stranded
region connecting two adjacent base-paired segments so that it “bubbles”
out in the middle of a double helix on one side. The interior loop refers to two single-stranded
regions on opposite strands connecting two adjacent base-paired segments.
It can be said to “bubble” out on both sides in the middle of a double helical segment.
The multibranch loop, also called a helical junction, refers to a loop that brings three
or more base-paired segments into close vicinity, forming a multifurcated structure.
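
These definitions can be turned into a small classifier. The Python sketch below is a minimal illustration that assumes the structure is given as a balanced dot-bracket string (a notation not introduced in this chapter); the names `pair_table` and `classify_loops` are hypothetical. For each base pair it collects the base pairs directly enclosed by it and labels the loop it closes accordingly.

```python
from collections import Counter


def pair_table(structure):
    """table[k] is the partner of position k, or None if k is unpaired."""
    table = [None] * len(structure)
    stack = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            partner = stack.pop()          # assumes a balanced string
            table[partner], table[pos] = pos, partner
    return table


def classify_loops(structure):
    """Return (i, j, kind) for the loop closed by each base pair (i, j)."""
    table = pair_table(structure)
    loops = []
    for i, j in enumerate(table):
        if j is None or j < i:
            continue                       # unpaired, or the 3' side of a pair
        enclosed = []                      # pairs directly inside (i, j)
        k = i + 1
        while k < j:
            if table[k] is not None:
                enclosed.append((k, table[k]))
                k = table[k] + 1           # skip over the enclosed helix
            else:
                k += 1
        if not enclosed:
            kind = "hairpin"
        elif len(enclosed) >= 2:
            kind = "multibranch"           # closing pair plus two or more branches
        else:
            inner_i, inner_j = enclosed[0]
            unpaired_5 = inner_i - i - 1
            unpaired_3 = j - inner_j - 1
            if unpaired_5 == 0 and unpaired_3 == 0:
                kind = "stack"             # adjacent base pairs within a stem
            elif unpaired_5 == 0 or unpaired_3 == 0:
                kind = "bulge"
            else:
                kind = "interior"
        loops.append((i, j, kind))
    return loops


example = ".((((.((((.....))))..(((((.....))))).))))."
kinds = Counter(kind for _, _, kind in classify_loops(example))
```

In the example string, one base pair closes a loop from which two further helices branch off, so together with the closing helix three base-paired segments meet there, matching the definition of a multibranch loop.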


Figure 16.3: A hypothetical RNA structure containing a pseudoknot, kissing hairpin, and hairpin–bulge
contact.

In addition to the traditional secondary structural elements, base pairing between
loops of different secondary structural elements can result in a higher level of structures
such as pseudoknots, kissing hairpins, and hairpin–bulge contacts (Fig. 16.3). A
pseudoknot loop refers to base pairing formed between loop residues within a hairpin
loop and residues outside the hairpin loop. A kissing hairpin refers to a hydrogen-bonded
interaction formed between loop residues of two hairpin structures. The
hairpin–bulge contact refers to interactions between loop residues of a hairpin loop
and a bulge loop. These types of interaction form supersecondary structures, which are
relatively rare in real structures and thus are ignored by most conventional prediction
algorithms.

RNA SECONDARY STRUCTURE PREDICTION METHODS

At present, there are essentially two types of methods for RNA structure prediction.
One is based on the calculation of the minimum free energy of the stable structure
derived from a single RNA sequence. This can be considered an ab initio approach. The
second is a comparative approach, which infers structures based on an evolutionary
comparison of multiple related RNA sequences.

AB INITIO APPROACH

This approach makes structural predictions based on a single RNA sequence. The
rationale behind this method is that the structure of an RNA molecule is solely determined
by its sequence. Thus, algorithms can be designed to search for a stable RNA
structure with the lowest free energy. Generally, when a base pair is formed, the
energy of the molecule is lowered because of attractive interactions between the two
strands. Thus, to search for the most stable structure, ab initio programs are designed
to search for a structure with the maximum number of base pairs.
Free energy can be calculated based on parameters empirically derived for small
molecules. G–C base pairs are more stable than A–U base pairs, which are more stable
than G–U base pairs. It is also known that base-pair formation is not an independent
event. The energy necessary to form individual base pairs is influenced by adjacent
base pairs through helical stacking forces. This is known as cooperativity in helix
formation. If a base pair is next to other base pairs, the base pairs tend to stabilize
each other through attractive stacking interactions between aromatic rings of the base
pairs. The attractive interactions lead to even lower energy. Parameters for calculating
the cooperativity of the base-pair formation have been determined and can be used
for structure prediction.
However, if the base pair is adjacent to loops or bulges, the neighboring loops
and bulges tend to destabilize the base-pair formation. This is because there is a
loss of entropy when the ends of the helical structure are constrained by unpaired
loop residues. The destabilizing force to a helical structure also depends on the
types of loops nearby. Parameters for calculating different destabilizing energies
have also been determined and can be used as penalties for secondary structure
calculations.
The scoring scheme based on the combined stabilizing and destabilizing interactions
forms the foundation of the ab initio RNA secondary structure prediction
method. This method works by first finding all possible base-pairing patterns from a
sequence and then calculating the total energy of a potential secondary structure by
taking into account all the adjacent stabilizing and destabilizing forces. If there are
multiple alternative secondary structures, the method finds the conformation with
the lowest energy, meaning that it is energetically most favorable.

Dot Matrices
In searching for the lowest energy form, all possible base-pair patterns have to be
examined. There are several methods for finding all the possible base-paired regions
from a given nucleic acid sequence. The dot matrix method and the dynamic programming
method introduced in Chapter 3 can be used in detecting self-complementary
regions of a sequence. A simple dot matrix can find all possible base-pairing patterns of
an RNA sequence when one sequence is compared with itself (Fig. 16.4). In this case,
dots are placed in the matrix to represent matching complementary bases instead of
identical ones.
The diagonals perpendicular to the main diagonal represent regions that can self-hybridize
to form double-stranded structure with traditional A–U and G–C base pairs.
In reality, the pattern detection in a dot matrix is often obscured by high noise levels.
As discussed in Chapter 3, one way to reduce the noise in the matrix is to select
an appropriate window size of a minimum number of contiguous base matches.
Normally, a window size of four consecutive base matches is used. If the dot plot
reveals more than one feasible structure, the lowest energy one is chosen.
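
The dot matrix idea can be illustrated with a short Python sketch. It compares a sequence with itself and records a dot at (i, j) whenever a window of consecutive bases starting at i and running 5' to 3' is complementary to the window ending at j and running 3' to 5'. The function name, the 0-based coordinates, and the requirement that the two windows do not overlap are assumptions of this example; only A–U and G–C pairs are counted, as in the text.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def self_complementarity_dots(sequence, window=4):
    """Return (i, j) positions where `window` consecutive bases can pair.

    Position i is the 5' end of the first window and position j is the
    3' end of the second window, so base i+k pairs with base j-k.
    """
    n = len(sequence)
    dots = []
    for i in range(n):
        for j in range(n):
            # Require the two windows to be separate stretches of the strand.
            if i + window - 1 >= j - window + 1:
                continue
            if all((sequence[i + k], sequence[j - k]) in WATSON_CRICK
                   for k in range(window)):
                dots.append((i, j))
    return dots


dots = self_complementarity_dots("AAAAACCCCUUUUU", window=4)
```

Dots that share the same value of i + j lie on one diagonal perpendicular to the main diagonal and therefore belong to the same potential helix.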

Dynamic Programming
The use of a dot plot can be effective in finding a single secondary structure in a small
molecule (see Fig. 16.4). However, if a large molecule contains multiple secondary
structure segments, choosing the combination that is energetically most stable among
a large number of possibilities can be a daunting task. To overcome the problem,
a quantitative approach such as dynamic programming can be used to assemble a
final structure with optimal base-paired regions. In this approach, an RNA sequence
is compared with itself. A scoring scheme is applied to fill the matrix with match
scores based on Watson–Crick base complementarity. Often, G–U base pairing and
energy terms of the base pairing are also incorporated into the scoring process. A path
with the maximal score within a scoring matrix after taking into account the entire
sequence information represents the most probable secondary structure form.
The dynamic programming method produces one structure with a single best score.
However, this is potentially a drawback of this approach because in reality an RNA
may exist in multiple alternative forms with near-minimum energy, which are not necessarily
the ones with the maximum number of base pairs.

Figure 16.4: Example of a dot plot used for RNA secondary structure prediction. In this plot, an RNA
sequence is compared with itself. Dots are placed for matching complementary bases when a window
size of four nucleotides is used. A main diagonal, which is perpendicular to the short diagonals, is
placed for self-matching. Based on the dot plot, the predicted secondary structure for this sequence is
shown on the right.
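
To make the dynamic programming idea concrete, the following Python sketch implements a Nussinov-style recursion that simply maximizes the number of base pairs. This is a deliberately simplified stand-in and not the energy-based scoring used by the programs discussed in this chapter: every allowed pair (Watson–Crick or G–U) scores 1, and the minimum hairpin loop size of three bases is an assumption of the example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(sequence, min_loop=3):
    """Return (maximum number of base pairs, one optimal dot-bracket string)."""
    n = len(sequence)
    if n == 0:
        return 0, ""
    # best[i][j]: maximum number of pairs in the subsequence i..j (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                     # case: j is unpaired
            for k in range(i, j - min_loop):           # case: j pairs with k
                if (sequence[k], sequence[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one structure that achieves the optimal score.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (sequence[k], sequence[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return best[0][n - 1], "".join(structure)


pair_count, predicted = nussinov("AAAAACCCCUUUUU")
```

The sketch also shows the limitation noted above: the traceback returns a single optimal structure even when other structures score equally well or nearly as well.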

Partition Function
The limitation of dynamic programming in selecting one single structure can be addressed
by adding a probability distribution function, known as the partition function,
which calculates a mathematical distribution of probable base pairs in a thermodynamic
equilibrium. This function helps to select a number of suboptimal structures
within a certain energy range. The following are two well-known programs using the
ab initio prediction method.
Mfold (www.bioinfo.rpi.edu/applications/mfold/) is a web-based program for RNA
secondary structure prediction. It combines dynamic programming and thermodynamic
calculations for identifying the most stable secondary structures with the lowest
energy. It also produces dot plots coupled with energy terms. This method is reliable
for short sequences, but becomes less accurate as the sequence length increases.
RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi) is one of the web programs
in the Vienna package. Unlike Mfold, which only examines the energy terms of
the optimal alignment in a dot plot, RNAfold extends the sequence alignment to the
vicinity of the optimal diagonals to calculate thermodynamic stability of alternative
structures. It further incorporates a partition function to select a number of statistically
most probable structures. Based on both thermodynamic calculations and the
partition function, a number of alternative structures that may be suboptimal are
provided. The collection of the predicted structures may provide a better estimate
of plausible foldings of an RNA molecule than the predictions by Mfold. Because of
the much larger number of secondary structures to be computed, a more simplified
energy rule has to be used to increase computational speed. Thus, the prediction
results are not always guaranteed to be better than those predicted by Mfold.

Figure 16.5: Example of covariation of residues among three homologous RNA sequences to maintain
the stability of an existing secondary structure.
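
The partition function described in this section can be sketched for a toy energy model. The Python example below sums the Boltzmann weights of all nested structures of a sequence, assuming that every base pair contributes the same energy. The value of -1.0 kcal/mol per pair is a made-up illustrative number, RT is taken as roughly 0.6 kcal/mol (approximately its value near 37 °C), and the minimum loop size of three is likewise an assumption; none of this reproduces the energy rules of any program named above.

```python
import math

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(sequence, pair_energy=-1.0, rt=0.6, min_loop=3):
    """Sum of Boltzmann weights over all nested structures (toy model)."""
    n = len(sequence)
    if n == 0:
        return 1.0
    weight = math.exp(-pair_energy / rt)      # Boltzmann factor of one pair
    # z[i][j]: partition function of subsequence i..j; short spans stay at 1.0
    # because only the unpaired structure is possible there.
    z = [[1.0] * n for _ in range(n)]

    def z_of(i, j):
        return z[i][j] if i <= j else 1.0     # empty interval: one structure

    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            total = z_of(i, j - 1)                         # j unpaired
            for k in range(i, j - min_loop):               # j paired with k
                if (sequence[k], sequence[j]) in PAIRS:
                    total += z_of(i, k - 1) * weight * z_of(k + 1, j - 1)
            z[i][j] = total
    return z[0][n - 1]


z_total = partition_function("GAAAC")
# Equilibrium probability of the single hairpin in which G pairs with C.
p_hairpin = math.exp(1.0 / 0.6) / z_total
```

For this five-base example only two structures exist, the open chain and the single G–C hairpin, so the partition function is one plus the Boltzmann factor of one pair, and the probability of each structure is its weight divided by that sum.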

COMPARATIVE APPROACH

The comparative approach uses multiple evolutionarily related RNA sequences to
infer a consensus structure. This approach is based on the assumption that RNA
sequences that are deemed homologous fold into the same secondary structure. By
comparing related RNA sequences, an evolutionarily conserved secondary structure
can be derived.
To distinguish the conserved secondary structure among multiple related RNA
sequences, a concept of “covariation” is used. It is known that RNA functional motifs
are structurally conserved. To maintain the secondary structures while the homologous
sequences evolve, a mutation occurring in one position that is responsible
for base pairing should be compensated for by a mutation in the corresponding
base-pairing position so as to maintain base pairing and the stability of the secondary
structure (Fig. 16.5). This is the concept of covariation. Any lack of covariation can
be deleterious to the RNA structure and functions. Based on this rule, algorithms
can be written to search for the covariation patterns after a set of homologous RNA
sequences are properly aligned. The detected correlated substitutions help to determine
conserved base pairing in a secondary structure.
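
A minimal covariation search can be sketched as follows. The Python example scans all pairs of columns in a gap-free alignment and reports those in which every sequence can form a base pair and both columns vary, which is the signature of compensating mutations. The function names, the minimum column separation, and the three toy sequences are hypothetical choices for illustration; real programs use statistical measures and also handle gaps.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def shows_covariation(alignment, col_i, col_j):
    """True if columns col_i and col_j pair in every sequence and both vary."""
    observed = [(seq[col_i], seq[col_j]) for seq in alignment]
    all_can_pair = all(pair in PAIRS for pair in observed)
    left_varies = len({a for a, _ in observed}) > 1
    right_varies = len({b for _, b in observed}) > 1
    return all_can_pair and left_varies and right_varies


def covarying_columns(alignment, min_separation=4):
    """List all column pairs (i, j) that show covariation."""
    length = len(alignment[0])
    return [(i, j)
            for i in range(length)
            for j in range(i + min_separation, length)
            if shows_covariation(alignment, i, j)]


# Toy alignment (hypothetical sequences): columns 0 and 7 change together
# (G-C, A-U, C-G), while columns 1 and 6 are an invariant C-G pair.
alignment = ["GCAAAAGC",
             "ACAAAAGU",
             "CCAAAAGG"]
supported = covarying_columns(alignment)
```

Note that the invariant C–G pair at columns 1 and 6 is not reported: it is consistent with base pairing, but without variation it provides no covariation signal, which is why the input sequences must be sufficiently divergent.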
Another aspect of the comparative method is to select a common structure through
consensus drawing. Because predicting secondary structures for each individual
sequence may produce errors, by comparing all predicted structures of a group of
aligned RNA sequences and drawing a consensus, the commonly adopted structure
can be selected; many other possible structures can be eliminated in the process. The
comparative-based algorithms can be further divided into two categories based on
the type of input data. One requires predefined alignment and the other does not.

Algorithms That Use Prealignment


This type of algorithm requires the user to provide a pairwise or multiple alignment as
input. The sequence alignment can be obtained using standard alignment programs
such as T-Coffee, PRRN, or Clustal (see Chapter 5). Based on the alignment input,
the prediction programs compute structurally consistent mutational patterns such
as covariation and derive a consensus structure common for all the sequences. In
practice, the consensus structure prediction is often combined with thermodynamic
calculations to improve accuracy.
This type of program is relatively successful for reasonably conserved sequences.
The requirement for using this type of program is an appropriate set of homologous
sequences that have to be similar enough to allow accurate alignment, but divergent
enough to allow covariations to be detected. If this condition is not met, correct
structures cannot be inferred. The method also depends on the quality of the input
alignment. If there are errors in the alignment, covariation signals will not be detected.
The selection of one single consensus structure is also a drawback because alternative
and evolutionarily unconserved structures are not predicted. The following is an
example of this type of program based on predefined aligned sequences.
RNAalifold (http://rna.tbi.univie.ac.at/cgi-bin/alifold.cgi) is a program in the
Vienna package. It uses a multiple sequence alignment as input to analyze covariation
patterns on the sequences. A scoring matrix is created that combines minimum
free energy and covariation information. Dynamic programming is used to select the
structure that has the minimum energy for the whole set of aligned RNA sequences.

Algorithms That Do Not Use Prealignment


This type of algorithm simultaneously aligns multiple input sequences and infers a
consensus structure. The alignment is produced using dynamic programming with
a scoring scheme that incorporates sequence similarity as well as energy terms.
Because the full dynamic programming for multiple alignment is computationally
too demanding, currently available programs limit the input to two sequences.
Foldalign (http://foldalign.kvl.dk/server/index.html) is a web-based program for
RNA alignment and structure prediction. The user provides a pair of unaligned
sequences. The program uses a combination of Clustal and dynamic programming
with a scoring scheme that includes covariation information to construct the align-
ment. A commonly conserved structure for both sequences is subsequently derived
based on the alignment. To reduce computational complexity, the program ignores
multibranch loops and is only suitable for handling short RNA sequences.
Dynalign (http://rna.urmc.rochester.edu/) is a UNIX program with a free source
code for downloading. The user again provides two input sequences. The program
calculates the possible secondary structures of each using a method similar to Mfold.

Downloaded from https://www.cambridge.org/core. University of Birmingham, on 18 Nov 2019 at 22:20:05, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511806087.017

By comparing multiple alternative structures from each sequence, a lowest energy
structure common to both sequences is selected that serves as the basis for sequence
alignment. The unique feature of this program is that it does not require sequence
similarity and therefore can handle very divergent sequences. However, because of
the computational complexity, the program predicts only small RNA sequences such as
tRNA with reasonable accuracy.

PERFORMANCE EVALUATION

Rigorously evaluating the performance of RNA prediction programs has traditionally
been hindered by the dearth of three-dimensional structural information for RNA.
The availability of recently solved crystal structures of the entire ribosome provides
a wealth of structural details relating to diverse types of RNA molecules. The high-
resolution structural information can then be used as a benchmark for evaluating
state-of-the-art RNA structure prediction programs in all categories.
If prediction accuracy can be represented using a single parameter such as the cor-
relation coefficient, which takes into account both sensitivity and selectivity informa-
tion (see Chapter 8), the ab initio–based programs score roughly 20% to 60% depend-
ing on the length of the sequences. Generally speaking, the programs perform better
for shorter RNA sequences than for longer ones. For small RNA sequences, such as
tRNA, some programs may be able to produce 70% accuracy. The major limitation for
performance gains of this category appears to be dependence on energy parameters
alone, which may not be sufficient to distinguish different structural possibilities of
the same molecule.
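
As a rough illustration of how such a single accuracy figure can be computed, the sketch below compares predicted base pairs with a reference structure. It assumes that selectivity is taken as the fraction of predicted pairs that are correct and that the correlation coefficient is approximated by the geometric mean of sensitivity and selectivity; the exact definition used in Chapter 8 may differ. The base-pair lists and function name are made up for the example.

```python
import math

def pair_metrics(predicted, reference):
    """Return (sensitivity, selectivity, approximate correlation coefficient).

    Both arguments are collections of (i, j) base-pair index tuples.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)    # pairs predicted and present
    false_pos = len(predicted - reference)   # pairs predicted but absent
    false_neg = len(reference - predicted)   # pairs present but missed
    sensitivity = true_pos / (true_pos + false_neg) if reference else 0.0
    selectivity = true_pos / (true_pos + false_pos) if predicted else 0.0
    # Assumption: geometric-mean approximation of the correlation coefficient.
    correlation = math.sqrt(sensitivity * selectivity)
    return sensitivity, selectivity, correlation

if __name__ == "__main__":
    reference = [(1, 14), (2, 13), (3, 12), (4, 11)]   # hypothetical stem
    predicted = [(1, 14), (2, 13), (3, 12), (5, 10)]   # one pair shifted
    print(pair_metrics(predicted, reference))
```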
Based on recent benchmark comparisons, the comparative-type algorithms can
reach an accuracy range of 20% to 80%. The results depend on whether a pro-
gram is prealignment dependent or not. Most of the superior performance comes
from prealignment-dependent programs such as RNAalifold. The prealignment-
independent programs fare much worse for predicting long sequences. For small
RNA sequences such as tRNA, both subtypes can achieve very high accuracy (up to
100%). This suggests that the comparative approach is generally more accurate
than the ab initio one.

SUMMARY

Detailed understanding of RNA structures is important for understanding the functional
role of RNA in the cell. The demand for structural information about RNA has
motivated the development of a large number of prediction algorithms. Current RNA
structure prediction is predominantly focused on secondary structures owing to the
difficulty in predicting tertiary structures. The secondary structure prediction meth-
ods can be classified as either ab initio or comparative. The ab initio method is based
on energetic calculations from a single query sequence. However, the accuracy of
the ab initio method is limited. The comparative approach, which requires multiple
sequences, is able to achieve better accuracy. However, the obvious drawback of the
consensus approach is the requirement for a unique set of homologous sequences.
Neither type of prediction method currently considers pseudoknots in the RNA
structure because of the much greater computational complexity involved. To fur-
ther increase prediction performance, research and development should focus
on alleviating some of the current drawbacks.

Protein structure prediction

Protein structure prediction (PSP) is the prediction of the three-dimensional structure of a
protein from its amino acid sequence, i.e., the prediction of its tertiary structure from its
primary structure. Protein structure prediction is one of the most important goals pursued
by bioinformatics and theoretical chemistry because of its importance to
biology and biotechnology.

Primary Structure
The primary structure of a protein is its linear sequence of amino acids and the location of
any disulfide (-S-S-) bridges.

Fig: Primary structure representation of protein

Amino acids are the building blocks (monomers) of proteins. Twenty different amino acids are
used to synthesize proteins. The shape and other properties of each protein are dictated by
the precise sequence of amino acids in it.
Each amino acid consists of an alpha carbon atom to which is attached

• a hydrogen atom
• an amino group (hence "amino" acid)
• a carboxyl group (-COOH). This gives up a proton and is thus an acid (hence
amino "acid")
• one of 20 different "R" groups. It is the structure of the R group that determines
which of the 20 it is and its special properties. The amino acid shown here is
Alanine.

Secondary Structure
Most proteins contain one or more stretches of amino acids that take on a characteristic
structure in 3-D space. The most common of these are the alpha helix and the beta
conformation.

Alpha Helix

• The R groups of the amino acids all extend to the outside.
• The helix makes a complete turn every 3.6 amino acids.
• The helix is right-handed; it twists in a clockwise direction.
• The carbonyl group (-C=O) of each peptide bond extends parallel to the axis of
the helix and points directly at the -N-H group of the peptide bond 4 amino acids
below it in the helix. A hydrogen bond forms between them.
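
The two numbers in this list (3.6 residues per turn and a hydrogen bond to the residue 4 positions along) translate directly into simple arithmetic. The sketch below, with hypothetical function names, computes the angular position of each residue around the helix axis, as used in a helical wheel, and lists the residue pairs whose peptide groups would be hydrogen bonded in an ideal helix.

```python
RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees

def wheel_angle(index):
    """Angle (degrees) of residue `index` around the helix axis, residue 0 at 0."""
    return (index * DEGREES_PER_RESIDUE) % 360.0

def hbond_partners(n_residues):
    """Pairs (i, i + 4): C=O of residue i bonded to N-H of residue i + 4."""
    return [(i, i + 4) for i in range(n_residues - 4)]

if __name__ == "__main__":
    for i in range(8):
        print(i, round(wheel_angle(i), 1))
    print(hbond_partners(8))
```

Residues i and i + 4 end up only about 40 degrees apart on the wheel, which is why they sit on nearly the same face of the helix.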

Beta Conformation

• It consists of pairs of chains lying side-by-side.
• It is stabilized by hydrogen bonds between the carbonyl oxygen atom on one chain
and the -NH group on the adjacent chain.
• The chains are often "anti-parallel"; the N-terminal to C-terminal direction of one
being the reverse of the other.

Tertiary Structure
The above diagram represents the tertiary structure of the antigen-binding portion of an
antibody molecule. Each circle represents an alpha carbon in one of the two polypeptide
chains that make up this protein. (The filled circles at the top are amino acids that bind to
the antigen.) Most of the secondary structure of this protein consists of beta conformation,
labelled as beta sheet.

Fig: Tertiary structure representation of protein

Why tertiary structure is important:

The function of a protein (except as food) depends on its tertiary structure. If this is
disrupted, the protein is said to be denatured and it loses its activity. For example,
denatured enzymes lose their catalytic power and denatured antibodies can no longer
bind antigen.

A mutation in the gene encoding a protein is a frequent cause of altered tertiary
structure.

Protein Domains

The tertiary structure of many proteins is built from several domains. Often each domain
has a separate function to perform for the protein, such as:

• binding a small ligand
• spanning the plasma membrane (transmembrane proteins)
• containing the catalytic site (enzymes)
• DNA-binding (in transcription factors)
• providing a surface to bind specifically to another protein

Quaternary Structure

Quaternary structure refers to complexes of 2 or more polypeptide chains held together
(usually) by non-covalent forces, but in precise ratios and with a precise 3-D
configuration. The noncovalent
association of a molecule of beta-2 microglobulin with the heavy chain of each class I
histocompatibility molecule is an example.

Fig:-A Quaternary representation of proteins

Protein Identification and Characterization


Many tools for protein identification and characterization are available at ExPASy
(http://www.expasy.org/). Some of these tools can be used to identify an unknown protein
isolated through 2-D gel electrophoresis. Another set of tools can help in predicting
the physical properties of unknown proteins.

Some of the ExPASy and other tools are discussed as follows:

(a) AACompIdent: AACompIdent (http://us.expasy.org/tools/aacomp/) is a tool
that identifies a protein by its amino acid composition. It uses the amino acid composition of
an unknown protein to find known proteins with a matching composition.
AACompIdent requires the following input:
1. Amino acid composition of the protein to identify.
2. A name for this protein, so that we can recognize it later in the results.
3. The pI and Mw of that protein (if known).
4. The species or group of species for which we would like to perform the search
(example: HOMO SAPIENS or MAMMALIA). This will produce the list of proteins
from this species, as well as a list of proteins independent of species. We may also
just specify ALL for all Swiss-Prot / TrEMBL entries. If in doubt about the search
term to use, we can consult the Swiss-Prot list of species.
5. For a scan in Swiss-Prot only: the keyword for which we would like to perform the
search (example: ZINC-FINGER). This will produce the list of proteins matching this
keyword. We may also just specify ALL for all Swiss-Prot entries. If in doubt about
the exact keyword to use, we can consult the list of keywords used in Swiss-Prot.
6. Amino acid composition of a known protein, obtained in the same run as the amino
acid composition of the unknown protein. This is for calibration; if we do not have a
calibration protein, we leave this as NULL.
7. The Swiss-Prot identifier (ID) of the calibration protein (example: ALBU_HUMAN).
The search results are mailed back to the user automatically, usually within about 15
minutes.
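
To make the notion of an amino acid composition concrete, the sketch below computes the percentage of each of the 20 amino acids in a sequence and compares two compositions. The comparison function is an illustrative sum of squared differences, not the scoring actually used by the ExPASy tool, and all names are hypothetical. The example sequence is the bovine AHSP sequence used later in this document.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Percentage of each standard amino acid in `sequence`."""
    counts = Counter(aa for aa in sequence.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

def composition_distance(comp_a, comp_b):
    """Illustrative score: sum of squared percentage differences (0 = identical)."""
    return sum((comp_a[aa] - comp_b[aa]) ** 2 for aa in AMINO_ACIDS)

if __name__ == "__main__":
    target = ("MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYK"
              "KQLSGEQDEQDKALQEFRQELNTLSASFLDKYRNFLKSS")
    comp = composition(target)
    for aa in AMINO_ACIDS:
        print(aa, round(comp[aa], 1))
```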

(b) TagIdent (http://us.expasy.org/tools/tagident.html): a tool that supports

1. generating a list of proteins close to a given isoelectric point (pI) and molecular weight (Mw),
2. identifying proteins by matching a short sequence tag of up to 6 amino
acids against proteins in the UniProt Knowledgebase (Swiss-Prot and TrEMBL)
databases close to a given pI and Mw,
3. identifying proteins by their mass, if this mass has been determined by
mass spectrometric techniques.

(c) PeptIdent (http://www.expasy.org/tools/aldente/) is used to identify proteins with
peptide mass fingerprinting data, pI and Mw. Experimentally measured, user-specified
peptide masses are compared with the theoretical peptides calculated for all proteins in
SWISS-PROT, making extensive use of database annotations.

(d) MultiIdent (http://us.expasy.org/tools/multiident/multiident-doc.html) is a tool that
allows the identification of proteins using pI, Mw, amino acid composition, sequence tag
and peptide mass fingerprinting data. One or more species and a SWISS-PROT keyword
can also be specified for the search.

(e) PROPSEARCH (http://abcis.cbs.cnrs.fr/propsearch/Presentation.html) is a tool for finding
the putative protein family of a new sequence when querying with alignment
methods has failed. PROPSEARCH uses the amino acid composition as input. In addition, other
properties like molecular weight, content of bulky residues, content of small residues,
average hydrophobicity, average charge and the content of selected dipeptide-groups are
calculated from the sequence as well.

(f) PepSea (http://vsites.unb.br/cbsp/paginiciais/pepseaseqtag.htm) is a tool for protein
identification by peptide mapping or peptide sequencing. We can search the non-redundant
protein sequence database by: (1) a list of peptide masses,
(2) a peptide sequence tag, or (3) sequence only.

(g) PepMAPPER (http://www.nwsr.manchester.ac.uk/mapper/) takes peptide mass as the
key input.

(h) Mascot Search (http://www.matrixscience.com/search_form_select.html) can take the
following inputs:

• Peptide Mass Fingerprint: The experimental data are a list of peptide mass values
from an enzymatic digest of a protein.

• Sequence Query: One or more peptide mass values associated with information
such as partial or ambiguous sequence strings, amino acid composition
information, MS/MS fragment ion masses, etc. A super-set of a sequence tag
query.

• MS/MS Ion Search: Identification based on raw MS/MS data from one or more
peptides.

(i) FindPept: FindPept (http://www.expasy.ch/tools/findpept.html) is an ExPASy tool. It
can be used to identify peptides that result from unspecific cleavage of proteins from
their experimental masses, taking into account artefactual chemical modifications, post-
translational modifications (PTM) and protease autolytic cleavage. The experimentally
measured peptide masses are compared with the theoretical peptides calculated from a
specified Swiss-Prot entry or from a user-entered sequence.
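
Several of the tools above rest on the same comparison of measured peptide masses with theoretical ones. The sketch below shows that idea in miniature, under stated assumptions: cleavage follows a simple trypsin rule (after K or R, but not before P), masses are neutral monoisotopic masses computed from standard residue masses, and no modifications or missed cleavages are considered. The function names and the tolerance are hypothetical, and real tools apply considerably more elaborate rules.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER

def match_masses(measured, peptides, tolerance=0.5):
    """For each measured mass, list the peptides within `tolerance` Da."""
    theoretical = {p: peptide_mass(p) for p in peptides}
    return {m: [p for p, mass in theoretical.items() if abs(mass - m) <= tolerance]
            for m in measured}

if __name__ == "__main__":
    peptides = tryptic_digest("AKRPGK")  # hypothetical short sequence
    print(peptides)
    print(match_masses([217.14], peptides))
```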

Primary Structure Analysis and prediction


There are various tools for predicting the physical properties using the sequence
information. Some of the major ones are discussed below:

(a) Compute pI/Mw: Compute pI/Mw (http://ca.expasy.org/tools/pi_tool.html) is a tool
that calculates the isoelectric point and molecular weight of an input sequence. The
sequence can be input in FASTA format; the output is the pI and molecular weight for
the entire length of the sequence.

(b) PeptideMass (http://ca.expasy.org/tools/peptide-mass.html): It cleaves a protein
sequence from the UniProt Knowledgebase (Swiss-Prot and TrEMBL) or a user-entered
protein sequence with a chosen enzyme, and computes the masses of the generated
peptides. The tool also returns theoretical isoelectric point and mass values for the
protein of interest. If desired, PeptideMass can return the mass of peptides known to carry
post-translational modifications, and can highlight peptides whose masses may be
affected by database conflicts, polymorphisms or splice variants.

(c) SAPS (Statistical Analysis of Protein Sequences) (http://www.ebi.ac.uk/Tools/saps/) evaluates by statistical criteria a wide
variety of protein sequence properties. Properties considered include compositional biases,
clusters and runs of charge and other amino acid types, different kinds and extents of repetitive
structures, locally periodic motifs, and anomalous spacing between identical residue types. The
statistics are computed for any single (or appropriately concatenated) protein sequence input.

(d) ProtParam (http://www.expasy.ch/tools/protparam.html) is a tool which allows the
computation of various physical and chemical parameters for a given protein stored in
Swiss-Prot or TrEMBL or for a user entered sequence. The computed parameters include
the molecular weight, theoretical pI, amino acid composition, atomic composition,
extinction coefficient, estimated half-life, instability index, aliphatic index and grand
average of hydropathicity.
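
Three of the parameters listed above are simple sums over the sequence and can be sketched in a few lines. The code below is an illustration rather than the ExPASy implementation: it assumes average residue masses for the molecular weight, the Kyte-Doolittle hydropathy scale for the grand average of hydropathicity (GRAVY), and the usual aliphatic index formula based on the mole percentages of Ala, Val, Ile and Leu. Function names are hypothetical; pI is left out because it depends on the chosen set of pKa values.

```python
# Average residue masses (Da) and Kyte-Doolittle hydropathy values.
AVERAGE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVERAGE = 18.01528
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def molecular_weight(sequence):
    """Average molecular weight of an unmodified polypeptide chain."""
    return sum(AVERAGE_MASS[aa] for aa in sequence) + WATER_AVERAGE

def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy value per residue."""
    return sum(HYDROPATHY[aa] for aa in sequence) / len(sequence)

def aliphatic_index(sequence):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    def percent(aa):
        return 100.0 * sequence.count(aa) / len(sequence)
    return percent("A") + 2.9 * percent("V") + 3.9 * (percent("I") + percent("L"))

if __name__ == "__main__":
    target = ("MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYK"
              "KQLSGEQDEQDKALQEFRQELNTLSASFLDKYRNFLKSS")
    print(round(molecular_weight(target), 1))
    print(round(gravy(target), 3))
    print(round(aliphatic_index(target), 2))
```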

Secondary Structure Analysis and prediction

There are several protein secondary structure prediction methods and the most important
of these methods are:

(a) Chou-Fasman method: The Chou-Fasman method was among the first secondary
structure prediction algorithms developed and relies predominantly on probability
parameters determined from relative frequencies of each amino acid's appearance in
each type of secondary structure. The original Chou-Fasman parameters, determined
from the small sample of structures solved in the mid-1970s, produce poor results
compared to modern methods, though the parameterization has been updated since it
was first published. The Chou-Fasman method is roughly 50-60% accurate in
predicting secondary structures.

(b) GOR method: The GOR method, named for the three scientists who developed it -
Garnier, Osguthorpe, and Robson - is an information theory-based method developed
not long after Chou-Fasman that uses more powerful probabilistic techniques of
Bayesian inference. The GOR method takes into account not only the probability of
each amino acid having a particular secondary structure, but also the conditional
probability of the amino acid assuming each structure given that its neighbors assume
the same structure. This method is both more sensitive and more accurate because
amino acid structural propensities are only strong for a small number of amino acids
such as proline and glycine. The original GOR method is roughly 65% accurate and is
dramatically more successful in predicting alpha helices than beta sheets, which it
frequently mispredicts as loops or disorganized regions.

(c) GOR IV (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html)
uses all possible pair frequencies within a window of 17 amino acid residues. One
output gives the sequence and the predicted secondary structure in rows: H = helix,
E = extended or beta strand, and C = coil. The other output gives the probability values for
each secondary structure at each amino acid position.

(d) Hidden Markov Models (HMMs): HMMs are used to predict the secondary
structure of a protein of a given structural class (e.g. α+β) as used in the structural
classification databases. Each HMM is trained with the sequences of the proteins in that
structural class. The models are used with a query sequence to predict both the class and
the secondary structure of the protein.

(i) Pfam (http://www.sanger.ac.uk/Software/Pfam/search.shtml) uses the HMM approach.

(f) Neural Networks: Most of the effective structure prediction models extract patterns
from databases of known protein structures. Neural networks comprise a particular tool
for protein recognition and classification.

(i) HNN (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_nn.html) is a
Hierarchical Neural Network based program that gives a secondary structure prediction.

(ii) nnPredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html) predicts the
secondary structure type for each residue in an amino acid sequence. The basis of the
prediction is a two-layer, feed-forward neural network. The predicted type will be either
'H' (a helix element), 'E' (a beta strand element) or '-' (a turn element). nnPredict uses the
tertiary class of a protein for prediction.

(iii) PSA (http://bmerc-www.bu.edu/psa/request.htm): It is also a secondary structure
prediction tool. It has 3 options for analysis: (i) Monomeric-Soluble Type-I analysis, (ii)
Minimal Type-2 analysis, and (iii) WD-repeat analysis.

(iv) PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/): The PSIPRED Protein Structure Prediction
Server aggregates several structure prediction methods into one location. Users can
submit a protein sequence, perform the prediction of their choice and receive the results
of the prediction via e-mail. PSIPRED is a highly accurate method for protein secondary structure
prediction.

MEMSAT and MEMSAT-SVM: Both are widely used transmembrane topology
prediction methods.
GenTHREADER, pGenTHREADER and pDomTHREADER: These are all sequence
profile based fold recognition methods.

Multiple Alignment Based Self-Optimization Method

(a) SOPMA: SOPMA (Self-Optimized Prediction Method) is a secondary structure prediction
program that uses multiple alignments. SOPMA correctly predicts 69.5% of amino acids
for a three-state description of the secondary structure (alpha-helix, beta-sheets and coil)
in a whole database containing 126 chains of non-homologous proteins. The server is
available at
(http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html).

Joint prediction with SOPMA and PHD correctly predicts 82.2% of residues for 74% of
co-predicted amino acids.
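
Percentages of this kind are three-state per-residue accuracies (often called Q3). The sketch below shows how such a figure can be computed from a predicted and an observed state string. It also shows one possible reading of the joint figure, namely the accuracy over only those positions where two methods agree, together with the fraction of positions covered; that reading is an assumption, and the state strings are invented for the example.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, E or C) is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def joint_q3(pred_a, pred_b, observed):
    """Accuracy over positions where both methods agree, and their coverage."""
    agreed = [(a, o) for a, b, o in zip(pred_a, pred_b, observed) if a == b]
    coverage = 100.0 * len(agreed) / len(observed)
    accuracy = 100.0 * sum(a == o for a, o in agreed) / len(agreed) if agreed else 0.0
    return accuracy, coverage

if __name__ == "__main__":
    observed = "HHHHEEEECC"   # hypothetical observed states
    method_a = "HHHHEEECCC"   # hypothetical prediction from one method
    method_b = "HHHCEEECCH"   # hypothetical prediction from another method
    print(q3(method_a, observed))
    print(joint_q3(method_a, method_b, observed))
```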

Tertiary Structure Analysis and prediction


There are three methods for tertiary structure prediction:
(a) Ab-initio approach
(b) Fold Recognition
(c) Homology Modelling

(a) Ab-initio approach: Ab initio (or de novo) protein modelling methods seek to
build three-dimensional protein models "from scratch", i.e., based on physical principles
rather than (directly) on previously solved structures. The goal of ab-initio prediction
is to build a model for a given sequence without using a template. Ab-initio prediction
relies upon the thermodynamic hypothesis of protein folding (Anfinsen's hypothesis). The
ab-initio prediction methods are based on the premise that the native structure of a protein
sequence corresponds to its global free energy minimum state.
The methods for ab-initio prediction include the following:
(i) Molecular Dynamics (MD) Simulations: MD simulations are run on proteins and protein-
substrate complexes. MD methods provide a detailed and dynamic picture of the nature
of inter-atomic interactions with regard to protein structure and function.
(ii) Monte Carlo (MC) Simulations: These are methods that do not use forces but rather
compare energies via the use of Boltzmann probabilities.

(iii) Genetic Algorithms (GA) Simulations: GA methods try to improve on the sampling
and the convergence of MC approaches.

(iv) Lattice Models: Lattice methods are based on using a crude/approximate fold
representation (such as two residues per lattice) and then exploring all or large amounts
of conformational space, given the crude representation.
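
The phrase "compare energies via the use of Boltzmann probabilities" in item (ii) usually refers to the Metropolis acceptance rule, which the sketch below illustrates. It is a toy example under stated assumptions: the "conformation" is a single coordinate, the energy is a made-up double-well function rather than a protein force field, and the temperature is expressed in energy units. All names and parameter values are hypothetical.

```python
import math
import random

def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def monte_carlo(steps, temperature, step_size=0.5, start=2.0, seed=0):
    """Run a Metropolis walk and return the lowest-energy state visited."""
    rng = random.Random(seed)
    x, e = start, energy(start)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)  # random trial move
        trial_e = energy(trial)
        if metropolis_accept(trial_e - e, temperature, rng):
            x, e = trial, trial_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

if __name__ == "__main__":
    print(monte_carlo(steps=5000, temperature=0.1))
```

No forces are evaluated anywhere; only energy differences between the current and the trial state enter the decision, which is the point made in item (ii).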

In silico tools for ab initio protein structure prediction

(1) QUARK and I-TASSER: Both servers were developed at the Zhang Lab of the University of
Michigan.

I-TASSER is an internet service for protein structure and function predictions. 3D models
are built based on multiple-threading alignments by LOMETS and iterative TASSER
simulations; function insights are then derived by matching the predicted models with
protein function databases.

QUARK is a computer algorithm for ab initio protein folding and protein structure
prediction. It aims to construct protein 3D structures from scratch using replica-
exchange Monte Carlo simulations under the guide of a knowledge-based atomic force
field. QUARK movements include free atom relocation as well as rigid-body replacement
of small fragments (1 to 20 residues long) excised from solved experimental structures.
The server is therefore suitable for proteins that have no homologous
templates. QUARK is not well suited to protein sequences longer than 200 residues; for
these, I-TASSER can be used instead.

Fig-14: QUARK ab-initio Protein Structure Prediction Tool

Fig-15: I-TASSER ab-initio Protein Structure Prediction Tool

(2) BHAGEERATH: It is an energy based computer software suite developed at the
Supercomputing Facility for Bioinformatics & Computational Biology, IIT Delhi, for
narrowing down the search space of tertiary structures of small globular proteins. The
protocol comprises eight different computational modules that form an automated
pipeline. It combines physics based potentials with biophysical filters to arrive at 5
plausible candidate structures starting from sequence and secondary structure information.
The methodology has been validated on 50 small globular proteins consisting of 2–3
helices and strands with known tertiary structures. For each of these proteins, a structure
within 3–6 Å RMSD (root mean square deviation) of the native has been obtained in the
10 lowest energy structures.
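
The RMSD quoted above measures how far a model lies from the native structure. The sketch below computes it for two equally long lists of atomic coordinates (for example the alpha carbons), assuming the two structures have already been superimposed; the optimal superposition step that structure comparison programs perform first is deliberately left out. The function name and the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superimposed coordinate sets.

    Each argument is a list of (x, y, z) tuples in angstroms, in the same order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]   # made-up atoms
    model = [(0.0, 1.0, 0.0), (3.8, 0.0, 1.0), (7.6, -1.0, 0.0)]   # made-up atoms
    print(rmsd(native, model))
```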

Fig-15: BHAGEERATH: an energy based computer software suite for ab-initio Protein Structure
Prediction Tool

(3) DTMM- Desktop Molecular Modeller: It is a simple-to-use molecular modelling
program that enables you to perform powerful molecular synthesis, editing, energy
minimizations, and display. The package, substantially enhanced from previous versions
of DTMM, will run on any PC with Windows 95, 98, Me, 2000, NT, or XP. The
webserver is available at the website (http://www.polyhedron.co.uk/MFSQMC33264)

(b) Fold Recognition: Fold recognition and threading methods can be used to assign
tertiary structures to protein sequences, even in the absence of clear homology. The
ongoing development of such methods has had a significant impact on structural biology,
providing us with an increasing ability to accurately model 3D protein structures using
evolutionarily very distant fold templates. Although fold recognition and threading
techniques will not yield results equivalent to those from X-ray crystallography, they are
a comparatively fast and inexpensive way to build a close approximation of a structure
from a sequence, without the time and costs of experimental procedures. Using fold
recognition, proteins with known structures that share common folds with the target
sequences can be identified. The identified structures can be used as templates from
which the folds of the target sequences are modeled.
(1) PHYRE: Protein Homology/analogY Recognition Engine. The webserver is available
at (http://www.sbg.bio.ic.ac.uk/~phyre/)

(2) 3D-PSSM: A fast, web-based method for protein fold recognition using 1D and 3D
sequence profiles coupled with secondary structure and solvation potential information.
(http://www.sbg.bio.ic.ac.uk/~3dpssm/index2.html)

(c) Homology Modelling: It is based on the reasonable assumption that two homologous
proteins will share very similar structures. Because a protein's fold is more evolutionarily
conserved than its amino acid sequence, a target sequence can be modeled with
reasonable accuracy on a very distantly related template, provided that the relationship
between target and template can be discerned through sequence alignment. It has been
suggested that the primary bottleneck in comparative modelling arises from difficulties in
alignment rather than from errors in structure prediction given a known-good
alignment. Unsurprisingly, homology modelling is most accurate when the target and
template have similar sequences.

(1) 3D Structure Prediction of Target Protein by MODELLER

MODELLER is used for homology or comparative modeling of protein three-dimensional
structures. The user provides an alignment of a sequence to be modeled with
known related structures and MODELLER automatically calculates a model containing
all non-hydrogen atoms. MODELLER implements comparative protein structure
modeling by satisfaction of spatial restraints and can perform many additional tasks,
including de novo modeling of loops in protein structures, optimization of various models
of protein structure with respect to a flexibly defined objective function, multiple
alignment of protein sequences and/or structures, clustering, searching of sequence
databases, comparison of protein structures, etc. MODELLER is available for download
for most Unix/Linux systems, Windows, and Mac.
(http://www.salilab.org/modeller/download_installation.html)

Method for comparative protein structure modeling by Modeller

Modeller implements an automated approach to comparative protein structure modeling
by satisfaction of spatial restraints [Šali & Blundell, 1993]. Briefly, the core modeling
procedure begins with an alignment of the sequence to be modeled (target) with related
known 3D structures (templates). This alignment is usually the input to the program. The
output is a 3D model for the target sequence containing all main chain and side chain
non-hydrogen atoms. Given an alignment, the model is obtained without any user
intervention. First, many distance and dihedral angle restraints on the target sequence are
calculated from its alignment with template 3D structures. The form of these restraints
was obtained from a statistical analysis of the relationships between many pairs of
homologous structures. This analysis relied on a database of 105 family alignments that
included 416 proteins with known 3D structure [Šali & Overington, 1994]. By scanning
the database, tables quantifying various correlations were obtained, such as the
correlations between two equivalent Cα–Cα distances, or between equivalent main
chain dihedral angles from two related proteins. These relationships were expressed as
conditional probability density functions (pdf's) and can be used directly as spatial
restraints.
For example, probabilities for different values of the main chain dihedral angles are
calculated from the type of a residue considered, from main chain conformation of an
equivalent residue, and from sequence similarity between the two proteins. Finally, the
model is obtained by optimizing an objective function that combines these restraints.
The optimization is carried out by the use of the variable target
function method [Braun & Go, 1985] employing methods of conjugate gradients and
molecular dynamics with simulated annealing (Figure 1.3). Several slightly different
models can be calculated by varying the initial structure. The variability among these
models can be used to estimate the errors in the corresponding regions of the fold.

First, the known template 3D structures are aligned with the target sequence to be
modelled.
Second, spatial features, such as Cα–Cα distances, hydrogen bonds, and main chain and
side chain dihedral angles, are transferred from the templates to the target. Thus, a
number of spatial restraints on its structure are obtained.
Third, the 3D model is obtained by satisfying all the restraints as well as possible.

Steps to perform task:


Modeller requires three different input files. Preparing these three files without errors
typically takes 3 to 4 hours before the model of your protein can be built.
The 3 files are:
1. Alignment file (*.ali)
2. Atom file (*.atm)
3. Script file (*.py)

1. Preparing alignment file: Following are the steps to prepare an alignment file:

(a) Take the protein sequence of interest, search for it in NCBI and save its FASTA
sequence, as in the following example:

Target
>Q865F8|AHSP_BOVIN Alpha-hemoglobin-stabilizing protein - Bos taurus
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYK
KQLSGEQDEQDKALQEFRQELNTLSASFLDKYRNFLKSS

(b) Copy your target protein sequence and paste in NCBI blast-p page by choosing
PDB database in BLASTP option.

(c) Choose the structure based on the degree of similarity and note down its PDB ID,
using the following criteria: identity of more than 40%, an X-ray crystallographic structure,
resolution < 3 Å, and R-value < 0.5. In this case the template is 1y01.

(d) Copy the sequence that shows the highest similarity to your target protein
sequence and paste it into the same Notepad file in which you pasted your target
sequence in step (a). The Notepad file should look like the one below.

>1y01
LLKANKDLISAGLKEFSVLLNQQVFNDALVSEEDMVTVVEDWMNFYINYY
RQQVTGEPQERDKALQELRQELNTLANPFLAKYRDFLKS

>target
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYK
KQLSGEQDEQDKALQEFRQELNTLSASFLDKYRNFLKSS

(e) Save the file in a folder of your choice, then go to ClustalX.

(f) Open the ClustalX software, choose the option to load sequences, and load the file
that you saved from Notepad in step (e).

(g) After loading the file, click the Alignment button and then click Output Format
Options. A sub-window is displayed in which you adjust the parameters:
Parameter output: ON
PIR format: ON
Clustal W format: OFF (deselect it)

Then close the window.

(h) Click the Alignment button again and click "Do Complete Alignment".

(i) The output files are now created in the folder from which you loaded your
sequence file into ClustalX.

(j) Open the output file that has the *.pir extension in WordPad and save it with the
ali extension (*.ali, where * indicates the file name). When saving, enclose the file
name in double quotes ("*.ali") so that the editor does not append another extension.

(k) Then edit the *.ali file so that it follows the standard file given below.
.ali file:
>P1;1y01
structureX:1y01:3 :A:91 : :ALPHA-HEMOGLOBIN STABILIZING PROTEIN:HOMO SAPIENS:2.80:0.273
--LLKANKDLISAGLKEFSVLLNQQVFNDALVSEEDMVTVVEDWMNFYINYYRQQVTGEP
QERDKALQELRQELNTLANPFLAKYRDFLKS-
*

>P1;target
sequence:target:1 : :102 : :Alpha-hemoglobin-stabilizing protein:Bos taurus: 2.80:0.273
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYKKQLSGEQ
DEQDKALQEFRQELNTLSASFLDKYRNFLKSS
*
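Because most failures at the modelling step come from a malformed alignment file, it can help to check the *.ali file before running Modeller. The sketch below is an illustrative check and is not part of Modeller; the function names `parse_pir` and `check_alignment` are hypothetical. It assumes the layout shown above: a `>P1;code` line, one header line with ten colon-separated fields, then the aligned sequence terminated by `*`.

```python
def parse_pir(text):
    """Split PIR (.ali) text into (code, header_fields, aligned_sequence) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">P1;"):
            raise ValueError("expected a '>P1;code' line, got: " + lines[i])
        if i + 1 >= len(lines):
            raise ValueError("missing header line after " + lines[i])
        code = lines[i][4:]
        fields = lines[i + 1].split(":")
        i += 2
        seq = ""
        # Collect sequence lines until the terminating '*'
        while i < len(lines) and not seq.endswith("*"):
            seq += lines[i]
            i += 1
        if not seq.endswith("*"):
            raise ValueError("sequence for %s does not end with '*'" % code)
        entries.append((code, fields, seq[:-1]))
    return entries

def check_alignment(entries):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for code, fields, _ in entries:
        if len(fields) != 10:
            problems.append("%s: header has %d fields, expected 10" % (code, len(fields)))
    if len({len(seq) for _, _, seq in entries}) > 1:
        problems.append("aligned sequences differ in length")
    return problems

ALI_TEXT = """
>P1;1y01
structureX:1y01:3 :A:91 : :ALPHA-HEMOGLOBIN STABILIZING PROTEIN:HOMO SAPIENS:2.80:0.273
--LLKANKDLISAGLKEFSVLLNQQVFNDALVSEEDMVTVVEDWMNFYINYYRQQVTGEP
QERDKALQELRQELNTLANPFLAKYRDFLKS-
*
>P1;target
sequence:target:1 : :102 : :Alpha-hemoglobin-stabilizing protein:Bos taurus: 2.80:0.273
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYKKQLSGEQ
DEQDKALQEFRQELNTLSASFLDKYRNFLKSS
*
"""

entries = parse_pir(ALI_TEXT)
problems = check_alignment(entries)
```

The check only covers the layout (field count, terminating `*`, equal aligned lengths); it does not verify that the residue numbers in the header match the atom file.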

2. Preparing Atom file:

(a) Take the PDB ID that matched your target protein and submit it to the PDB
database (access at www.rcsb.org).
(b) The structure entry opens. Choose the Explore button, then the Download and
Display option, and select PDB format in the download options.
(c) Save the *.pdb file as *.atm.
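The atom file step can also be scripted. The sketch below is an assumption-laden illustration, not a Modeller utility: `pdb_to_atm`, `chain_residue_range` and `atom_line` are hypothetical helper names. It copies a downloaded *.pdb file to *.atm and reads the first and last residue numbers of a chain from the fixed-column ATOM records, which is the information needed for the residue range fields in the *.ali header.

```python
import shutil

def pdb_to_atm(pdb_path, atm_path):
    """Save a copy of the downloaded PDB file under the .atm name."""
    shutil.copyfile(pdb_path, atm_path)

def chain_residue_range(pdb_lines, chain_id):
    """Return (first, last) residue number found in ATOM records of one chain.

    Uses the fixed PDB columns: record name in columns 1-6, chain ID in
    column 22, residue number in columns 23-26.
    """
    numbers = [
        int(line[22:26])
        for line in pdb_lines
        if len(line) >= 26 and line[0:6] == "ATOM  " and line[21] == chain_id
    ]
    if not numbers:
        raise ValueError("no ATOM records found for chain " + chain_id)
    return min(numbers), max(numbers)

def atom_line(serial, name, resname, chain, resseq):
    """Build the first 26 columns of an ATOM record (illustration only)."""
    return "ATOM  %5d %-4s %3s %s%4d" % (serial, name, resname, chain, resseq)

# Made-up records for demonstration; a real file would be read with
# open("1y01.pdb").read().splitlines()
example_lines = [
    atom_line(1, "CA", "LEU", "A", 3),
    atom_line(2, "CA", "LEU", "A", 4),
    atom_line(3, "CA", "SER", "A", 91),
    atom_line(4, "CA", "VAL", "B", 1),
]
first, last = chain_residue_range(example_lines, "A")
```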

3. Preparing Script files:

(a) Write the standard format of the script file, which is given below, in Notepad.

(b) Then fill in the fields required by Modeller.

(c) Then save the file, giving the file name the extension *.py.

(d) Copy all three files (*.ali, *.py and *.atm) and paste them into the Modeller bin
directory (given here as C:\mod9v7\bin).

(e) Then open the DOS prompt and go to the Modeller bin directory. The path is given
here as:

(C:\Program Files\Modeller9v7\bin)

Use whichever path matches your installation.

(f) Type the command "mod9v7 *.py" (* indicates the file name).

(g) The output file is created automatically by Modeller in the bin directory after
approximately 3 minutes of analysis.
(h) Copy that output file into your folder and save it as *.pdb.
.py file (script file):
# Homology modeling by the automodel class
from modeller import *              # Load standard Modeller classes
from modeller.automodel import *    # Load the automodel class

log.verbose()                       # request verbose output
env = environ()                     # create a new MODELLER environment to build this model in

# directories for input atom files
env.io.atom_files_directory = './:../atom_files'

a = automodel(env,
              alnfile  = 'align.ali',   # alignment filename
              knowns   = '1y01',        # codes of the templates
              sequence = 'target')      # code of the target
a.starting_model = 1                # index of the first model
a.ending_model   = 1                # index of the last model
                                    # (determines how many models to calculate)
a.make()                            # do the actual homology modeling

(i) Visualize the 3D model you created in RasMol or Swiss-PdbViewer.
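When more than one model is built (by raising the index of the last model in the script), one of them has to be chosen for visualization. The sketch below shows one possible selection rule: take the model with the lowest assessment score among those that were built without failure. It is a plain-Python illustration; `best_model` is a hypothetical helper, and the layout of the input (a list of dictionaries with 'name', 'failure' and 'DOPE score' keys) is an assumption modelled on the per-model summary that Modeller keeps after a.make() when an assessment method is requested. The numbers are made up.

```python
def best_model(outputs, key="DOPE score"):
    """Return the record with the lowest score among successfully built models.

    `outputs` is assumed to be a list of dicts, one per model, each with a
    'failure' entry (None when the model was built) and a numeric score
    stored under `key`.
    """
    ok = [m for m in outputs if m.get("failure") is None and key in m]
    if not ok:
        raise ValueError("no successfully built model has a '%s' entry" % key)
    return min(ok, key=lambda m: m[key])

# Made-up records for illustration only; real values come from the run.
example_outputs = [
    {"name": "target.B99990001.pdb", "failure": None, "DOPE score": -9000.0},
    {"name": "target.B99990002.pdb", "failure": None, "DOPE score": -9500.0},
    {"name": "target.B99990003.pdb", "failure": "optimization failed"},
]
chosen = best_model(example_outputs)
```

A lower (more negative) score is treated as better here; which score to rank by is a choice, and visual inspection of the chosen model remains necessary.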

References and suggested readings:

Rastogi, S.C., Mendiratta N., and Rastogi, P. (2009). Bioinformatics Methods and
Applications Genomics, Proteomics and Drug Discovery. PHI Learning Private
Limited. India.

Fiser, A., Do, R.K.G., and Sali, A. (2000). Prot Sci, 9, 1753-1773.

Fiser, A., and Sali, A. (2003). Bioinformatics, 18, 2500-01.

http://zhanglab.ccmb.med.umich.edu/QUARK
http://modbase.compbio.ucsf.edu/modloop/


16.8. PROTEIN FUNCTION PREDICTION


Protein sequence determines protein structure, which determines protein function. We
first try to predict protein structure. Then we use what we learned, both on the way
to structure prediction and from the predicted structure itself, to predict function.
Predicting protein function from sequence and from the predicted structure raises
additional problems in comparison to the unsolved task of structure prediction:
1. Function is not entirely determined by sequence; the environment is crucially
important.
2. 'Protein function' is a rather intuitive but ill-defined term. Function is complex
and is associated with many mutually overlapping levels: chemical, biochemical,
cellular, physiological, organism mediated, and developmental.
These levels are related in complex ways, e.g., protein kinases can be related to
different cellular functions (such as cell cycle), and to a chemical function
(transferase) plus a complex control mechanism by interaction with other proteins.
Labeling for most protein sequences has been done in, e.g., the yeast genome.
However, it is possible to label only proteins for which we find similarities in other
proteins of experimentally known function. Also, even for those we label, our
knowledge is mostly quite restricted to rough details. Predictions of active sites,
binding sites, etc. are usually not successful. Labeling also does not imply
understanding the detailed mechanism. Often we need knowledge about protein
structure to manipulate functions.

[Table: Summary of different approaches in protein structure prediction]

Molecular docking

Molecular docking is a key tool in structural molecular biology and computer-assisted drug
design. The goal of ligand-protein docking is to predict the predominant binding mode(s)
of a ligand with a protein of known three-dimensional structure. Docking predicts the
preferred orientation of a ligand in the active site of a receptor when the two bind to
form a stable complex.

From hit discovery through lead optimization and beyond, computational methods have
become an essential part of many drug development processes. The docking process typically
involves several steps, each adding a new level of complexity. Docking methods are used to
place small molecules in the active site of the enzyme, and scoring functions are then used
to estimate a compound's biological activity from how it interacts with prospective targets.
Molecular docking is considered to be the most widely used computational technique in
computer-aided drug design (CADD). It is used in academia as well as in pharmaceutical
companies for lead discovery. Two terms are central to molecular docking: ligand and
protein. The protein is the target to which the ligand may bind to produce a specific
activity. Molecular docking provides information on the ability of the ligand to bind to
the protein, which is known as binding affinity. Applications of molecular docking in drug
development have evolved significantly since it was first developed to aid the study of
molecular recognition between small and large molecules. This review emphasizes the basic
features of molecular docking along with its types, approaches and applications.

Docking is widely used to predict the orientation of small-molecule drug candidates
relative to their protein targets, and from this to predict the small molecule's affinity
and activity. Docking therefore plays a critical role in rational drug design. Considering
the biological and pharmacological importance of docking studies, much effort has been made
to improve the algorithms for docking prediction. Docking is a computational technique that
predicts the preferred orientation of one molecule relative to another when they bind to
form a stable complex. Using scoring functions, the strength of association, or binding
affinity, between the two molecules can then be estimated from this preferred orientation.
Signal transduction depends on the interactions of biologically relevant molecules such as
proteins, nucleic acids, carbohydrates, and lipids. As a result, docking may be used to
predict both the strength and the type of signal produced. The goal of docking studies is
to optimize the conformation of both the ligand and the protein, as well as their relative
orientation, so that the free energy of the overall system is minimized.

There are two distinct forms of docking.


1. Rigid docking
2. Flexible docking

Rigid docking
If the molecules are assumed to be rigid, we look for a transformation (rotation and
translation) of one of the molecules in three-dimensional space that brings it to the
best fit with the other molecule in terms of a scoring function. The conformation of the
ligand may be generated in the absence of the receptor or in the presence of receptor
binding activity.
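The search described above can be illustrated with a deliberately small sketch. It is not the algorithm of any particular docking program: the scoring function is a toy contact count with a clash penalty, rotations are restricted to the z axis, and all names (`transform`, `score`, `rigid_dock`) and cutoffs are assumptions chosen for illustration.

```python
import math
from itertools import product

def transform(coords, angle, shift):
    """Rotate points about the z axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy, dz = shift
    return [(c * x - s * y + dx, s * x + c * y + dy, z + dz) for x, y, z in coords]

def score(receptor, ligand, clash=2.0, contact=4.0):
    """Toy score: +1 per atom pair in contact range, -10 per clashing pair."""
    total = 0.0
    for a in receptor:
        for b in ligand:
            d = math.dist(a, b)
            if d < clash:
                total -= 10.0
            elif d <= contact:
                total += 1.0
    return total

def rigid_dock(receptor, ligand, angles, shifts):
    """Exhaustively try every rotation/translation; keep the best-scoring pose."""
    best = None
    for angle, shift in product(angles, shifts):
        pose = transform(ligand, angle, shift)
        s = score(receptor, pose)
        if best is None or s > best[0]:
            best = (s, angle, shift)
    return best

# Made-up coordinates in Angstrom; neither molecule changes shape during the search.
receptor = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
angles = [k * math.pi / 6 for k in range(12)]
shifts = [(float(x), float(y), 0.0) for x in range(-4, 5) for y in range(-4, 5)]
best_score, best_angle, best_shift = rigid_dock(receptor, ligand, angles, shifts)
```

Only the position and orientation of the ligand are varied; its internal geometry is never changed, which is what distinguishes this from flexible docking.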

Flexible docking
In addition to the transformation, we take molecular flexibility into account in order to
find the conformations of the receptor and ligand molecules as they exist in the complex.

Models of molecular docking


The lock and key theory
Emil Fischer proposed the "lock-and-key model" in 1894, as seen in figure 4, to describe
how biological systems function. A substrate fits into the active site of a macromolecule
in the same way as a key fits into a lock.

The induced-fit theory
Daniel Koshland proposed the "induced fit theory" in 1958. The fundamental concept is that
during the recognition process, both the ligand and the target, as seen in figure 5, adapt
to one another through small conformational changes until an optimal fit is reached.

The conformation ensemble model


Apart from small induced-fit adaptations, proteins have been observed to undergo much
larger conformational changes. According to this model, a protein exists as a pre-existing
ensemble of conformational states, and its flexibility allows it to switch from one state
to another.

Knowledge of the preferred orientation may in turn be used to predict the strength of
association or binding affinity between two molecules using, for example, scoring functions.
The associations between biologically relevant molecules such as proteins, peptides, nucleic
acids, carbohydrates, and lipids play a central role in signal transduction. Docking is useful for
predicting both the strength and type of signal produced. Molecular docking is one of the most
frequently used methods in structure-based drug design, due to its ability to predict the
binding conformation of small molecule ligands to the appropriate target binding site.
Characterisation of the binding behaviour plays an important role in the rational design of
drugs as well as in elucidating fundamental biochemical processes.

Applications

A binding interaction between a small molecule ligand and an enzyme protein may result in
activation or inhibition of the enzyme. If the protein is a receptor, ligand binding may result in
agonism or antagonism. Docking is most commonly used in the field of drug design — most
drugs are small organic molecules, and docking may be applied to:

hit identification – docking combined with a scoring function can be used to quickly screen
large databases of potential drugs in silico to identify molecules that are likely to bind to
the protein target of interest. Reverse pharmacology routinely uses docking for target
identification.

lead optimization – docking can be used to predict where and in which relative orientation a
ligand binds to a protein (also referred to as the binding mode or pose). This information may
in turn be used to design more potent and selective analogs.

Bioremediation – Protein-ligand docking can also be used to predict pollutants that can be
degraded by enzymes.

Software available for docking

GOLD

GOLD (Genetic Optimisation for Ligand Docking) uses a genetic algorithm that works on
multiple subpopulations of the ligand. Its force-field-based scoring function comprises
three terms: a hydrogen-bonding term, an intermolecular dispersion potential, and an
intramolecular potential. It has a reported 71% success rate in identifying the
experimental binding mode for 100 protein complexes.

AutoDock

AutoDock uses a three-dimensional lattice of regularly spaced points surrounding and
centred on the region of interest of the macromolecule.
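Such a lattice is simple to construct. The sketch below only illustrates the geometry and does not reproduce AutoDock's own grid code; the function name `make_grid`, the convention of giving the number of points per axis, and the example values for the centre and spacing are all assumptions.

```python
def make_grid(center, n_points, spacing):
    """Return the points of a regular 3D lattice centred on `center`.

    center   -- (x, y, z) of the region of interest, in Angstrom
    n_points -- (nx, ny, nz) number of grid points along each axis
    spacing  -- distance between neighbouring points, in Angstrom
    """
    def axis(c, n):
        half = (n - 1) / 2.0  # offset so that the points are centred on c
        return [c + (i - half) * spacing for i in range(n)]

    cx, cy, cz = center
    nx, ny, nz = n_points
    return [(x, y, z)
            for x in axis(cx, nx)
            for y in axis(cy, ny)
            for z in axis(cz, nz)]

# Example: a 3 x 3 x 3 lattice with 1 Angstrom spacing around a made-up
# binding-site centre. An odd point count puts one point on the centre itself.
grid = make_grid((10.0, 20.0, 30.0), (3, 3, 3), 1.0)
```

In a grid-based docking program, an interaction energy would be precomputed at each of these points so that ligand poses can be scored by lookup rather than by recomputing every atom pair.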

FlexX

A base fragment of the ligand is picked and docked using the "pose clustering" algorithm. A
clustering algorithm is used to merge similar ligand transformations into the active site.
Flexible fragments are then added incrementally using MIMUMBA and evaluated using the
overlap function, followed by energy calculations to complete the construction of the
ligand. The final evaluation uses Böhm's scoring function, which includes hydrogen bond,
ionic, aromatic, and lipophilic terms.
