Unit V
RNA structure is often expressed schematically by its base pairing: the Watson-Crick (WC) base
pairs A (Adenine) with U (Uracil), and G (Guanine) with C (Cytosine) and also the
non-Watson-Crick (non-WC) base pair G with U. RNA sequences are typically written from left to
right. The beginning of the sequence is usually on the left hand side and is called the 5’ end, and the
opposite end is called the 3’ end.
A number of simple structural motifs are generated by the way that the RNA molecule forms
these base pairs. In secondary structure of RNA, the typical structural motifs are usually classified as
stem, internal loop (or interior loop), and multibranch loop. This standard neglects a large group of
possible structures called pseudoknots (PK).
Stem:
When more than one base pair appears in the form of a group of contiguous base pairs, the resulting
structure motif is described as a stem (Fig. S1a). For RNA, this stem motif appears as a flat object.
However, the actual structure in three dimensions (3D) has a twist that makes a 360° rotation roughly
every 10 bps (for A-RNA: the most commonly found structure for RNA).
[Figure S1. Panel (a), structure + sequence format: strand 5' CUAGU 3' paired with strand 3' GGUCA 5'. Panel (b), structure format.]
Supplement Figure S1. An example of a stem motif represented as secondary structure.
(a) A stem including both the secondary structure and the sequence labels. (b) A stem
that only includes the base pairing information.
Loops:
The other simple structural motifs, based on the types of loops, can also be found in RNA structures.
[Figure S2, panels (a)-(d). Panel (c), bracket notation: 5' (((((....))))) 3']
Supplement Figure S2. An example of a hairpin loop (H-loop). (a) Secondary structure
with base index included. (b) Secondary structure with only the base pairing and base
position included. (c) A simplified representation of this secondary structure using
bracket notation. (d) A secondary structure that only specifies the stem base pairs
without specifying the exact size of the loop regions.
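The bracket notation of panel (c) can be turned back into an explicit list of base pairs with a small stack-based parser. The following is only an illustrative sketch: the function name `parse_dot_bracket` and the 0-based indexing are assumptions made here, not part of the original supplement.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) base pairs (0-based) from bracket notation.

    '(' opens a pair, ')' closes the most recently opened pair,
    and '.' marks an unpaired base.
    """
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# The hairpin written in bracket notation in Figure S2c.
hairpin_pairs = parse_dot_bracket("(((((....)))))")
print(hairpin_pairs)
```

For the hairpin string this yields five nested pairs, the innermost of which encloses the four unpaired loop bases.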
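[Figure S3, panels (a)-(h). Panel (c), bracket notation: 5' ((((.(((((....))))).)))) 3'. Panel (g), bracket notation: 5' ((((....(((((....))))))))) 3'. Labels n1, n2 and n mark the loop regions in panels (d) and (h).]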
Supplement Figure S3. Examples of secondary structure for internal loops and bulges
(I-loops). (a) Secondary structure of internal loops with base index included. (b)
Secondary structure with only the base pairing and base position included. (c) A
simplified representation of this secondary structure using bracket notation. (d) A
secondary structure that only specifies the stem locations and is not specific about the
loop regions. (e-h) The same definitions apply for the example of a bulge.
(1) if i and j are contained within i' and j', then i' < i < j < j'
(2) if i and j are not contained within i' and j', then either i and j
are less than i', or i and j are greater than j',
(3) if neither case is true, then i = i' and j = j'.
[Figure S4, panels (a)-(c). Panel (c), bracket notation:]
5' .((((.((((.....))))..(((((.....))))).)))). 3'
Supplement Figure S4. Secondary structure of multibranch loops (M-loop or MBL). (a)
Secondary structure of M-loop with base index included. (b) Secondary structure with
only the base pairing and base position included. (c) A simplified representation of this
secondary structure using bracket notation.
(4) For pseudoknots: parts of the structure still satisfy cases 1 through 3.
However, in addition, in at least one part of a region between k and l
such that k ≤ i, j, i', j' ≤ l, there exist some base pairs that satisfy
either i' < i < j' < j or i < i' < j < j'.
Hence, many more possibilities can be generated once we start allowing pseudoknots.
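Cases (1) through (4) amount to classifying how two base pairs sit relative to each other, and a structure contains a pseudoknot when at least one pair of base pairs crosses. The sketch below illustrates that test. The function names, the 0-based indexing, the assumption that the input is well-formed, and the short H-type example string are all introduced here for illustration.

```python
from itertools import combinations

# Round brackets hold the ordinary secondary structure,
# square brackets hold the pseudoknot linkage stem.
OPEN_TO_CLOSE = {"(": ")", "[": "]"}
CLOSE_TO_OPEN = {close: open_ for open_, close in OPEN_TO_CLOSE.items()}


def parse_pairs(structure):
    """Collect (i, j) pairs, keeping one stack per bracket type."""
    stacks = {open_: [] for open_ in OPEN_TO_CLOSE}
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol in OPEN_TO_CLOSE:
            stacks[symbol].append(pos)
        elif symbol in CLOSE_TO_OPEN:
            pairs.append((stacks[CLOSE_TO_OPEN[symbol]].pop(), pos))
    return sorted(pairs)


def relation(pair_a, pair_b):
    """Classify two base pairs following cases (1)-(4) of the text."""
    (i, j), (k, l) = sorted([pair_a, pair_b])  # guarantees i <= k
    if (i, j) == (k, l):
        return "identical"   # case 3
    if j < k:
        return "disjoint"    # case 2: one pair lies entirely before the other
    if l < j:
        return "nested"      # case 1: i < k < l < j
    return "crossing"        # case 4: i < k < j < l


def has_pseudoknot(structure):
    pairs = parse_pairs(structure)
    return any(relation(a, b) == "crossing" for a, b in combinations(pairs, 2))


# Illustrative H-type (ABAB) layout, not taken from the figure.
print(has_pseudoknot("((((...[[[[...))))...]]]]"))
# The multibranch loop of Figure S4c has only nested and disjoint pairs.
print(has_pseudoknot(".((((.((((.....))))..(((((.....))))).))))."))
```

Comparing every pair of base pairs is quadratic in the number of pairs, which is acceptable for a demonstration but not for long sequences.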
Pseudoknots differ from real knots in the sense that the strand does not pass completely through
the loop but only becomes potentially entangled with it. Anyone who has tried to untangle a pair of
earphones, or a cord that was neatly wound up only recently, will recognize that knots occur naturally
on long flexible cords. Indeed, great effort seems to be required to avoid tangling a heap of cords.
Figure S5a shows a common pseudoknot known as an H-type pseudoknot. This is also known
as an ABAB pseudoknot. The structure is notated below with the standard parenthesis notation for
the basic secondary structure (here shown in blue) and square brackets for the pseudoknot linkage
(here shown in red). Green indicates the regions of free strand that are not forming base pairs (bps).
The color distinction of the stems in this example is not important because both stems are the same
length and both stems are less than 10 bps.
[Figure S5, panels (a)-(h). Bracket-notation strings recovered from the figure:]
5' ...(((((......[[[[[.)))))....(((((.]]]]]......)))))... 3'
5' ...((((((((((((((................[[[[[.)))))))))))))).................]]]]]... 3'
Supplement Figure S5. Examples of pseudoknots and knots. (a) An H-type pseudoknot
where the linkage stem is denoted in red, standard secondary structure in blue and free
strand (regions of no base pairing) in green. (b) The same structure in (a) denoted in
bracket notation. (c) An example of a knot where the stem length of both the secondary
structure and the linkage stem are longer than 9 base pairs. (d) The same structure in
bracket notation. (e) An extended pseudoknot, sometimes referred to as a kissing-loop
and also known as an ABACBC pseudoknot. (f) The structure shown in bracket
notation. (g) A pseudoknot. The difference between (c) and (g) is that the linkage stem
is shorter than 9 contiguous bps. (h) The same structure in bracket notation.
Figure S5c shows a knot and the corresponding bracket structure is shown below in Fig. S5d.
Both stems in Fig. S5c contain 14 base pairs (bp). Since the helical axis makes a rotation of 360°
every 10 bps, this means that the structure in Fig. S5c is tangled in a knot. There is no reason why
such a structure cannot form. Indeed, knots are known to form in some rare proteins [2,3]. However,
it is not a pseudoknot and the current approach is not designed to estimate its existence or the
likelihood of its formation. One important feature of a pseudoknot is, therefore, that the linkage stem
(here shown in red) must be shorter than 10 contiguous bps.
Figure S5e shows two hairpin stem-loops (blue and purple) that are joined by a linkage stem
(red). The stems are all short as in Fig. S5a. This is a pseudoknot, sometimes referred to as a kissing
loop. It is also known as an ABACBC pseudoknot. It is also observed in a number of places,
although less frequently than H-type pseudoknots. The corresponding bracket notation for this
structure is shown below (Fig. S5f).
Figure S5g shows a structure intermediate between Figs. S5a and c. This structure is also an
H-type pseudoknot. Here we see a necessary condition for defining a segment as a linkage stem
and the most important distinction between a knot and a pseudoknot: a linkage stem cannot contain
more than 9 contiguous bps. When an internal loop breaks the continuity, this rule may not
necessarily apply. Therefore, a longer overall structure could form, but only if the stem is not
contiguous. Moreover, this would have to be considered on a case by case basis.
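The 9-bp limit on a linkage stem can be checked mechanically from bracket notation. The sketch below measures the longest run of directly stacked square-bracket pairs and compares it with the limit stated above. Treating "contiguous" as strict stacking of (i, j) on (i+1, j-1), the function names, and the test strings (built to resemble the layouts discussed for Fig. S5) are assumptions of this example.

```python
MAX_LINKAGE_BP = 9  # limit stated in the text for contiguous linkage-stem pairs


def linkage_pairs(structure):
    """Return the (i, j) pairs written with square brackets (0-based)."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "[":
            stack.append(pos)
        elif symbol == "]":
            pairs.append((stack.pop(), pos))
    return pairs


def longest_contiguous_stem(pairs):
    """Length of the longest run of stacked pairs (i, j), (i+1, j-1), ..."""
    pair_set = set(pairs)
    longest = 0
    for i, j in pair_set:
        if (i - 1, j + 1) in pair_set:
            continue  # not the outermost pair of its run
        length = 0
        while (i + length, j - length) in pair_set:
            length += 1
        longest = max(longest, length)
    return longest


def linkage_stem_allowed(structure):
    """True when no contiguous linkage stem exceeds MAX_LINKAGE_BP."""
    return longest_contiguous_stem(linkage_pairs(structure)) <= MAX_LINKAGE_BP


# 14-bp ordinary stem with a 5-bp linkage stem: acceptable as a pseudoknot.
short_linkage = "(" * 14 + "." * 16 + "[" * 5 + "." + ")" * 14 + "." * 17 + "]" * 5
# Same layout with a 14-bp linkage stem: exceeds the limit.
long_linkage = "(" * 14 + "." * 16 + "[" * 14 + "." + ")" * 14 + "." * 17 + "]" * 14

print(linkage_stem_allowed(short_linkage))
print(linkage_stem_allowed(long_linkage))
```

A linkage stem interrupted by an internal loop is counted as separate shorter runs, which matches the remark that such cases may escape the rule and need case-by-case consideration.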
As a final curiosity, in Fig. S6, a secondary structure (left hand side) and its corresponding knot
(right hand side) are shown in equilibrium. The knot renders a peculiar 2D illusion perhaps
reminiscent of artist M.C. Escher’s “Belvedere” (1958). The knotted structure could conceivably
exist in chemical equilibrium with a standard secondary structure, though electrostatic effects may
render it less favorable thermodynamically. It could result from a simple misfolding.
Supplement Figure S6. A special type of knot (right hand side) that becomes entangled
due to the way the structure folds up. The two dimensional nature of the RNA
schematics tends to hide curious possibilities such as this. The knot is seen in equilibrium
between the unknotted structure (left hand side) and the “Belvedere knot” (right hand
side).
Functional RNA structures that contain knots of the form shown in Figs. S5c or S6 are currently
unknown or have not been reported. However, pseudoknots are often observed in functional RNA
structures, particularly H-type pseudoknots (Fig. S5a). Single-stranded RNA sequences such as
messenger RNA with introns can reach lengths in the tens of thousands of nucleotides. Given how
readily even a few simple cords become tangled, and given that many thousands of protein and RNA
strands are present within the cellular environment, the cell probably expends a fair amount of
effort to prevent or get rid of knots [2].
In this work, we are concerned with the prediction of pseudoknots. These structures have the
property that they can be evaluated as structures resulting from reversible folding.
References
1. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, et al. (1994) Fast folding and
comparison of RNA secondary structures. Monatshefte f Chemie 125: 167-188.
2. Lua RC, Grosberg Y (2006) Statistics of knots, geometry of conformations, and evolution of
proteins. PLoS Comp Biol 2: e45.
3. Virnau P, Mirny LA, and Kardar M (2006) Intricate knots in proteins: function and evolution.
PLoS Comp Biol 2: e122.
P1: JZP
0521840988c16 CB1022/Xiong 0 521 84098 8 January 10, 2006 14:16
CHAPTER SIXTEEN
INTRODUCTION
It is known that RNA is a carrier of genetic information and exists in three main forms.
They are messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA).
Their main roles are as follows: mRNA is responsible for directing protein synthesis;
rRNA provides structural scaffolding within ribosomes; and tRNA serves as a carrier
of amino acids for polypeptide synthesis.
Recent advances in biochemistry and molecular biology have allowed the discovery
of new functions of RNA molecules. For example, RNA has been shown to possess
catalytic activity and is important for RNA splicing, processing, and editing. A class of
small, noncoding RNA molecules, termed microRNA or miRNA, has recently been
identified to regulate gene expression through interaction with mRNA molecules.
Unlike DNA, which is mainly double stranded, RNA is single stranded, although an
RNA molecule can self-hybridize at certain regions to form partial double-stranded
structures. Generally, mRNA is more or less linear and nonstructured, whereas rRNA
and tRNA can only function by forming particular secondary and tertiary structures.
Therefore, knowledge of the structures of these molecules is particularly important
for understanding their functions. Difficulties in experimental determination
of RNA structures make theoretical prediction a very desirable approach. In fact,
computational analysis is a main tool in RNA-based drug design in the pharmaceutical
industry. In addition, knowledge of the secondary structures of rRNA is key
for RNA-based phylogenetic analysis.
Downloaded from https://www.cambridge.org/core. University of Birmingham, on 18 Nov 2019 at 22:20:05, subject to the Cambridge Core terms of use, available at
https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511806087.017
Figure 16.1: The primary, secondary, and tertiary structures of a tRNA molecule illustrating the three levels of RNA structural organization.
Figure 16.2: Schematic diagram of a hypothetical RNA molecule containing four basic types of RNA
loops: a hairpin loop, bulge loop, interior loop, and multibranch loop. Dashed lines indicate base pairings
in the helical regions of the molecule.
Figure 16.3: A hypothetical RNA structure containing a pseudoknot, kissing hairpin, and hairpin–bulge
contact.
At present, there are essentially two types of methods for RNA structure prediction.
One is based on the calculation of the minimum free energy of the stable structure
derived from a single RNA sequence. This can be considered an ab initio approach. The
second is a comparative approach which infers structures based on an evolutionary
comparison of multiple related RNA sequences.
AB INITIO APPROACH
This approach makes structural predictions based on a single RNA sequence. The
rationale behind this method is that the structure of an RNA molecule is solely
determined by its sequence. Thus, algorithms can be designed to search for a stable RNA
structure with the lowest free energy. Generally, when a base pairing is formed, the
energy of the molecule is lowered because of attractive interactions between the two
strands. Thus, to search for a most stable structure, ab initio programs are designed
to search for a structure with the maximum number of base pairs.
Free energy can be calculated based on parameters empirically derived for small
molecules. G–C base pairs are more stable than A–U base pairs, which are more stable
than G–U base pairs. It is also known that base-pair formation is not an independent
event. The energy necessary to form individual base pairs is influenced by adjacent
base pairs through helical stacking forces. This is known as cooperativity in helix
formation. If a base pair is next to other base pairs, the base pairs tend to stabilize
each other through attractive stacking interactions between aromatic rings of the base
pairs. The attractive interactions lead to even lower energy. Parameters for calculating
the cooperativity of the base-pair formation have been determined and can be used
for structure prediction.
However, if the base pair is adjacent to loops or bulges, the neighboring loops
and bulges tend to destabilize the base-pair formation. This is because there is a
loss of entropy when the ends of the helical structure are constrained by unpaired
loop residues. The destabilizing force to a helical structure also depends on the
types of loops nearby. Parameters for calculating different destabilizing energies
have also been determined and can be used as penalties for secondary structure
calculations.
The scoring scheme based on the combined stabilizing and destabilizing
interactions forms the foundation of the ab initio RNA secondary structure prediction
method. This method works by first finding all possible base-pairing patterns from a
sequence and then calculating the total energy of a potential secondary structure by
taking into account all the adjacent stabilizing and destabilizing forces. If there are
multiple alternative secondary structures, the method finds the conformation with
the lowest energy, meaning that it is energetically most favorable.
Dot Matrices
In searching for the lowest energy form, all possible base-pair patterns have to be
examined. There are several methods for finding all the possible base-paired regions
from a given nucleic acid sequence. The dot matrix method and the dynamic
programming method introduced in Chapter 3 can be used in detecting self-complementary
regions of a sequence. A simple dot matrix can find all possible base-pairing patterns of
an RNA sequence when one sequence is compared with itself (Fig. 16.4). In this case,
dots are placed in the matrix to represent matching complementary bases instead of
identical ones.
The diagonals perpendicular to the main diagonal represent regions that can self-
hybridize to form double-stranded structure with traditional A–U and G–C base pairs.
In reality, the pattern detection in a dot matrix is often obscured by high noise levels.
As discussed in Chapter 3, one way to reduce the noise in the matrix is to select
an appropriate window size of a minimum number of contiguous base matches.
Normally, a window size of four consecutive base matches is used. If the dot plot
reveals more than one feasible structure, the lowest energy one is chosen.
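The windowed self-comparison described above can be sketched in a few lines. In this illustration a dot is recorded only when four consecutive bases can pair antiparallel with four bases further along the same sequence, using A-U and G-C pairs. The function names, the example sequence, and the choice to ignore minimum hairpin-loop length are assumptions of the sketch, not the procedure of any particular program.

```python
# Traditional Watson-Crick pairs only, as in the dot matrix described in the text.
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def complementary_windows(sequence, window=4):
    """Return (i, j) positions where bases i..i+window-1 can pair with
    bases j..j-window+1, i.e. base i+k pairs with base j-k."""
    n = len(sequence)
    hits = []
    for i in range(n):
        # Start j far enough along that the two windows do not overlap.
        for j in range(i + 2 * window - 1, n):
            if all((sequence[i + k], sequence[j - k]) in COMPLEMENT
                   for k in range(window)):
                hits.append((i, j))
    return hits


def dot_plot(sequence, window=4):
    """Render the hits as text rows; '*' marks a base pair inside a window."""
    n = len(sequence)
    grid = [["." for _ in range(n)] for _ in range(n)]
    for i, j in complementary_windows(sequence, window):
        for k in range(window):
            grid[i + k][j - k] = "*"
    return ["".join(row) for row in grid]


example = "GGCCAAAAGGCC"  # made-up sequence with one 4-bp self-complementary region
for row in dot_plot(example):
    print(row)
```

The marked cells run along a diagonal perpendicular to the main diagonal, which is the pattern the text associates with a self-hybridizing region.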
Dynamic Programming
The use of a dot plot can be effective in finding a single secondary structure in a small
molecule (see Fig. 16.4). However, if a large molecule contains multiple secondary
structure segments, choosing a combination that is energetically most stable among
a large number of possibilities can be a daunting task. To overcome the problem,
a quantitative approach such as dynamic programming can be used to assemble a
final structure with optimal base-paired regions. In this approach, an RNA sequence
is compared with itself. A scoring scheme is applied to fill the matrix with match
scores based on Watson–Crick base complementarity. Often, G–U base pairing and
energy terms of the base pairing are also incorporated into the scoring process. A path
with the maximal score within a scoring matrix after taking into account the entire
sequence information represents the most probable secondary structure form.
Figure 16.4: Example of a dot plot used for RNA secondary structure prediction. In this plot, an RNA
sequence is compared with itself. Dots are placed for matching complementary bases when a window
size of four nucleotide matches is used. A main diagonal, which is perpendicular to the short diagonals, is
placed for self-matching. Based on the dot plot, the predicted secondary structure for this sequence is
shown on the right.
The dynamic programming method produces one structure with a single best score.
However, this is potentially a drawback of this approach because in reality an RNA
may exist in multiple alternative forms with near minimum energy but not necessarily
the one with maximum base pairs.
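A compact way to see the dynamic programming idea is the classic Nussinov-style recursion, which maximizes the number of base pairs rather than minimizing free energy. It is shown here only as a simplified stand-in for the energy-based scoring described above. The function names, the minimum loop size of three unpaired bases, and the example sequence are assumptions of this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair mentioned in the text.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill_matrix(sequence, min_loop=3):
    """best[i][j] = maximum number of nested base pairs within sequence[i..j]."""
    n = len(sequence)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                      # base j left unpaired
            for k in range(i, j - min_loop):            # base j paired with base k
                if (sequence[k], sequence[j]) in CAN_PAIR:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best


def traceback(best, sequence, i, j, min_loop, pairs):
    """Recover one optimal set of pairs from the filled matrix."""
    if i >= j:
        return
    if best[i][j] == best[i][j - 1]:
        traceback(best, sequence, i, j - 1, min_loop, pairs)
        return
    for k in range(i, j - min_loop):
        if (sequence[k], sequence[j]) in CAN_PAIR:
            left = best[i][k - 1] if k > i else 0
            if best[i][j] == left + 1 + best[k + 1][j - 1]:
                pairs.append((k, j))
                if k > i:
                    traceback(best, sequence, i, k - 1, min_loop, pairs)
                traceback(best, sequence, k + 1, j - 1, min_loop, pairs)
                return


def fold(sequence, min_loop=3):
    """Return (number of pairs, bracket notation) for one optimal structure."""
    sequence = sequence.upper()
    n = len(sequence)
    if n == 0:
        return 0, ""
    best = fill_matrix(sequence, min_loop)
    pairs = []
    traceback(best, sequence, 0, n - 1, min_loop, pairs)
    structure = ["."] * n
    for i, j in pairs:
        structure[i], structure[j] = "(", ")"
    return best[0][n - 1], "".join(structure)


print(fold("GGGAAAUCC"))
```

As the paragraph above notes, this kind of procedure reports a single best-scoring structure even when other structures score nearly as well.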
Partition Function
A limitation of dynamic programming is that it selects one single structure. This can be
complemented by adding a probability distribution function, known as the partition function,
which calculates a mathematical distribution of probable base pairs in a
thermodynamic equilibrium. This function helps to select a number of suboptimal structures
within a certain energy range. The following lists two well-known programs using the
ab initio prediction method.
Mfold (www.bioinfo.rpi.edu/applications/mfold/) is a web-based program for RNA
secondary structure prediction. It combines dynamic programming and
thermodynamic calculations for identifying the most stable secondary structures with the lowest
energy. It also produces dot plots coupled with energy terms. This method is reliable
for short sequences, but becomes less accurate as the sequence length increases.
Figure 16.5: Example of covariation of residues among three homologous RNA sequences to maintain
the stability of an existing secondary structure.
RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi) is one of the web
programs in the Vienna package. Unlike Mfold, which only examines the energy terms of
the optimal alignment in a dot plot, RNAfold extends the sequence alignment to the
vicinity of the optimal diagonals to calculate thermodynamic stability of alternative
structures. It further incorporates a partition function to select a number of
statistically most probable structures. Based on both thermodynamic calculations and the
partition function, a number of alternative structures that may be suboptimal are
provided. The collection of the predicted structures may provide a better estimate
of plausible foldings of an RNA molecule than the predictions by Mfold. Because of
the much larger number of secondary structures to be computed, a more simplified
energy rule has to be used to increase computational speed. Thus, the prediction
results are not always guaranteed to be better than those predicted by Mfold.
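The role of the partition function mentioned above can be illustrated with a toy calculation: each candidate structure receives a Boltzmann weight exp(-E/RT), the partition function Z is the sum of those weights, and the probability of a structure is its weight divided by Z. The three structures and their free energies below are invented purely for illustration. Real programs obtain Z by dynamic programming over all structures rather than from an explicit list.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def boltzmann_probabilities(energies, temperature=310.15):
    """Map each structure to its equilibrium probability.

    energies: dict of structure -> free energy in kcal/mol (hypothetical values).
    temperature: absolute temperature in kelvin (310.15 K is 37 degrees C).
    """
    rt = GAS_CONSTANT * temperature
    weights = {name: math.exp(-energy / rt) for name, energy in energies.items()}
    partition_function = sum(weights.values())
    return {name: weight / partition_function for name, weight in weights.items()}


# Hypothetical candidate structures with made-up free energies.
candidates = {
    "(((...)))": -3.0,
    "((.....))": -2.4,
    ".........": 0.0,
}

for structure, probability in boltzmann_probabilities(candidates).items():
    print(structure, round(probability, 3))
```

Structures within a small energy gap of the minimum keep an appreciable share of the probability, which is why a set of suboptimal structures can be more informative than the single best one.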
COMPARATIVE APPROACH
PERFORMANCE EVALUATION
SUMMARY
sequences, is able to achieve better accuracy. However, the obvious drawback of the
consensus approach is the requirement for a unique set of homologous sequences.
Neither type of the prediction methods currently considers pseudoknots in the RNA
structure because of the much greater computational complexity involved. To
further increase prediction performance, research and development should focus
on alleviating some of the current drawbacks.
Protein structure prediction
Primary Structure
The primary structure of a protein is its linear sequence of amino acids and the location of
any disulfide (-S-S-) bridges.
Amino acids are the building blocks (monomers) of proteins. 20 different amino acids are
used to synthesize proteins. The shape and other properties of each protein are dictated by
the precise sequence of amino acids in it.
Each amino acid consists of an alpha carbon atom to which is attached
a hydrogen atom
an amino group (hence "amino" acid)
a carboxyl group (-COOH). This gives up a proton and is thus an acid (hence
amino "acid")
one of 20 different "R" groups. It is the structure of the R group that determines
which of the 20 it is and its special properties. The amino acid shown here is
Alanine.
Secondary Structure
Most proteins contain one or more stretches of amino acids that take on a characteristic
structure in 3-D space. The most common of these are the alpha helix and the beta
conformation.
Alpha Helix
The carbonyl group (-C=O) of each peptide bond extends parallel to the axis of
the helix and points directly at the -N-H group of the peptide bond 4 amino acids
below it in the helix. A hydrogen bond forms between them.
Beta Conformation
Tertiary Structure
The above diagram represents the tertiary structure of the antigen-binding portion of an
antibody molecule. Each circle represents an alpha carbon in one of the two polypeptide
chains that make up this protein. (The filled circles at the top are amino acids that bind to
the antigen.) Most of the secondary structure of this protein consists of beta conformation
labelled as beta sheet
Why tertiary structure is important:
The function of a protein (except as food) depends on its tertiary structure. If this is
disrupted, the protein is said to be denatured and it loses its activity. For example:
denatured enzymes lose their catalytic power, denatured antibodies can no longer
bind antigen
Protein Domains
The tertiary structure of many proteins is built from several domains. Often each domain
has a separate function to perform for the protein, such as:
Quaternary Structure
Some of the ExPASy and other tools are discussed as follows:
1. a list of proteins close to a given isoelectric point(pI) and molecular weight (Mw),
2. the identification of proteins by matching a short sequence tag of up to 6 amino
acids against proteins in the UniProt Knowledgebase (Swiss-Prot and TrEMBL)
databases close to a given pI and Mw,
3. the identification of proteins by their mass, if this mass has been determined by
mass spectrometric techniques
methods. PROPSEARCH uses the amino acid composition as input. In addition, other
properties like molecular weight, content of bulky residues, content of small residues,
average hydrophobicity, average charge and the content of selected dipeptide-groups are
calculated from the sequence as well.
Peptide Mass Fingerprint: The experimental data are a list of peptide mass values
from an enzymatic digest of a protein.
Sequence Query: One or more peptide mass values associated with information
such as partial or ambiguous sequence strings, amino acid composition
information, MS/MS fragment ion masses, etc. A super-set of a sequence tag
query.
MS/MS Ion Search: Identification based on raw MS/MS data from one or more
peptides.
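The peptide list behind a peptide mass fingerprint comes from cutting the protein at enzyme-specific sites. As an illustration, the sketch below performs an in-silico digest using the commonly quoted trypsin rule (cleave after K or R unless the next residue is P). The choice of trypsin, the function name, and the example sequence are assumptions of this example; the text above only speaks of an enzymatic digest in general.

```python
def digest(sequence, cleave_after="KR", blocked_by="P"):
    """Split a protein sequence into peptides.

    A cut is made after any residue in `cleave_after`, except when the
    following residue is in `blocked_by` or when it is the last residue.
    """
    peptides = []
    start = 0
    for pos, residue in enumerate(sequence):
        is_last = pos == len(sequence) - 1
        if residue in cleave_after and not is_last and sequence[pos + 1] not in blocked_by:
            peptides.append(sequence[start:pos + 1])
            start = pos + 1
    peptides.append(sequence[start:])  # whatever remains after the final cut
    return peptides


# Made-up sequence containing one blocked site (R followed by P).
print(digest("MKWVTFRPLLKAG"))
```

Each resulting peptide would then be assigned a mass and compared with the experimental mass list.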
(a) Compute pI/Mw: Compute pI/Mw (http://ca.expasy.org/tools/pi_tool.html) is a tool
that calculates the isoelectric point and molecular weight of an input sequence. The
sequence can be input in the FASTA format; the output is the pI and molecular weight for
the entire length of the sequence.
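The molecular weight part of such a calculation is simple enough to sketch: sum the average masses of the residues and add one water molecule for the free termini. The residue masses below are commonly tabulated average values and should be checked against a reference table before serious use. The function names are hypothetical, and the pI calculation is left out because it depends on the chosen set of pKa values.

```python
# Average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01524  # added once for the terminal H and OH


def read_fasta_sequence(text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">")).upper()


def molecular_weight(sequence):
    """Average molecular weight of an unmodified peptide, in daltons."""
    return sum(AVERAGE_RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS


fasta = """>example_peptide (made-up record)
MKWVTF
ISLLFL
"""
sequence = read_fasta_sequence(fasta)
print(sequence, round(molecular_weight(sequence), 2))
```

A residue outside the 20 standard letters raises a `KeyError`, which is a deliberate choice here so that ambiguous codes are not silently ignored.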
(d) ProtParam (http://www.expasy.ch/tools/protparam.html) is a tool which allows the
computation of various physical and chemical parameters for a given protein stored in
Swiss-Prot or TrEMBL or for a user entered sequence. The computed parameters include
the molecular weight, theoretical pI, amino acid composition, atomic composition,
extinction coefficient, estimated half-life, instability index, aliphatic index and grand
average of hydropathicity.
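Two of the listed parameters are easy to reproduce by hand. The sketch below computes the amino acid composition as percentages and the grand average of hydropathicity (GRAVY) as the mean Kyte-Doolittle hydropathy value over all residues. The function names and the example sequence are assumptions, and this is not the ProtParam implementation itself.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(sequence):
    """Percentage of each residue type present in the sequence."""
    counts = Counter(sequence)
    return {residue: 100.0 * n / len(sequence) for residue, n in counts.items()}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


example = "MKWVTFISLLFLFSSAYS"  # made-up sequence for illustration
print(composition(example))
print(round(gravy(example), 3))
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value an overall hydrophilic one.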
There are several protein secondary structure prediction methods and the most important
of these methods are:
(a) Chou-Fasman method: The Chou-Fasman method was among the first secondary
structure prediction algorithms developed and relies predominantly on probability
parameters determined from relative frequencies of each amino acid's appearance in
each type of secondary structure. The original Chou-Fasman parameters, determined
from the small sample of structures solved in the mid-1970s, produce poor results
compared to modern methods, though the parameterization has been updated since it
was first published. The Chou-Fasman method is roughly 50-60% accurate in
predicting secondary structures.
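The probability parameters mentioned above are usually expressed as propensities: the frequency of a residue within one structure type divided by its frequency overall, so that values above 1 indicate a preference for that structure. The sketch below computes such propensities from a labelled sequence. The tiny data set is invented, the function name is hypothetical, and only the parameter-estimation step is shown, not the full prediction rules of the method.

```python
from collections import Counter


def propensities(sequence, states):
    """Return {(residue, state): propensity} from aligned residue and state strings.

    propensity = (fraction of state positions occupied by the residue)
                 / (fraction of all positions occupied by the residue)
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    total = len(sequence)
    residue_counts = Counter(sequence)
    state_counts = Counter(states)
    joint_counts = Counter(zip(sequence, states))
    table = {}
    for (residue, state), n in joint_counts.items():
        freq_in_state = n / state_counts[state]
        freq_overall = residue_counts[residue] / total
        table[(residue, state)] = freq_in_state / freq_overall
    return table


# Invented toy data: H = helix, E = strand, C = coil.
toy_sequence = "AAGAVVGG"
toy_states = "HHHHEECC"
for key, value in sorted(propensities(toy_sequence, toy_states).items()):
    print(key, round(value, 2))
```

With only a handful of residues the numbers are meaningless statistically, which mirrors the remark that the original parameters suffered from the small sample of structures available at the time.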
(b) GOR method: The GOR method, named for the three scientists who developed it -
Garnier, Osguthorpe, and Robson - is an information theory-based method developed
not long after Chou-Fasman that uses more powerful probabilistic techniques of
Bayesian inference. The GOR method takes into account not only the probability of
each amino acid having a particular secondary structure, but also the conditional
probability of the amino acid assuming each structure given that its neighbors assume
the same structure. This method is both more sensitive and more accurate because
amino acid structural propensities are only strong for a small number of amino acids
such as proline and glycine. The original GOR method is roughly 65% accurate and is
dramatically more successful in predicting alpha helices than beta sheets, which it
frequently mispredicts as loops or disorganized regions.
(c) GOR IV (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html)
uses all possible pair frequencies within a window of 17 amino acid residues. One
output gives the sequence and the predicted secondary structure in rows. H=helix,
E=extended or beta strand and C=coil. The other output gives the probability values for
each secondary structure at each amino acid position.
(d) Hidden Markov Models (HMMs): The HMM method is used to predict the secondary
structure of a protein of a given structural class (e.g. α+β) as used in the structural
classification databases. Each HMM is trained with the sequences of the proteins in that
structural class. The models are used with a query sequence to predict both the class and
the secondary structure of the protein.
(f) Neural Networks: Most of the effective structure prediction models extract patterns
from databases of known protein structures. Neural networks comprise a particular tool
for protein recognition and classification.
(i) HNN (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_nn.html) is a
Hierarchical Neural Network based program that gives a secondary structure prediction.
(iv) PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/): The PSIPRED Protein Structure Prediction
Server aggregates several structure prediction methods into one location. Users can
submit a protein sequence, perform the prediction of their choice and receive the results
of the prediction via e-mail. It is a highly accurate method for protein secondary structure
prediction.
(a) SOPMA: It is a secondary structure prediction program (Self-Optimized Prediction
Method) that uses multiple alignments. SOPMA correctly predicts 69.5% of amino acids
for a three-state description of the secondary structure (alpha-helix, beta-sheets and coil)
in a whole database containing 126 chains of non-homologous proteins. The server is
available at
(http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html).
Joint prediction with SOPMA and PHD correctly predicts 82.2% of residues for 74% of
co-predicted amino acids.
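Percentages such as the 69.5% quoted above are three-state per-residue accuracies: the share of residues whose predicted state (helix, strand or coil) matches the observed one. A minimal sketch of that measure follows. The function name, the H/E/C letter codes, and the example strings are assumptions of this illustration.

```python
def q3_accuracy(predicted, observed):
    """Percentage of positions where the predicted state equals the observed state.

    Both arguments are strings over a three-letter alphabet,
    here H (helix), E (strand) and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Made-up prediction compared with a made-up reference assignment.
print(round(q3_accuracy("HHHECC", "HHHCCC"), 1))
```

When scoring a whole database, the usual practice is to pool matches over all chains before dividing, so that long chains are not underweighted.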
(iii) Genetic Algorithms (GA) Simulations: GA methods try to improve on the sampling
and the convergence of Monte Carlo (MC) approaches.
I-TASSER is an internet service for protein structure and function predictions. 3D models
are built based on multiple-threading alignments by LOMETS and iterative TASSER
simulations; function insights are then derived by matching the predicted models with
protein function databases.
QUARK is a computer algorithm for ab initio protein folding and protein structure
prediction. It aims to construct protein 3D structures from scratch using
replica-exchange Monte Carlo simulations guided by a knowledge-based atomic force
field. QUARK movements include free atom relocation as well as rigid-body replacement
of small fragments (1 to 20 residues long) excised from solved experimental structures.
The server is therefore suitable for proteins that have no homologous templates.
QUARK is not well suited to protein sequences longer than 200 residues; for these,
I-TASSER can be used.
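The two acceptance rules that underlie a replica-exchange Monte Carlo simulation can be sketched as follows. This is a generic illustration of the technique, not QUARK's implementation: the function names, the reduced units (k_B = 1) and the example energies are assumptions, and QUARK's own move set and force field are not reproduced here.

```python
import math
import random


def metropolis_accept(delta_e, temperature, k_b=1.0, rng=random):
    """Metropolis rule: always accept downhill moves, accept uphill moves
    with probability exp(-delta_e / (k_b * T))."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / (k_b * temperature))


def swap_probability(e_i, t_i, e_j, t_j, k_b=1.0):
    """Probability of exchanging the conformations of two replicas.

    Replica i runs at temperature t_i with current energy e_i, replica j at
    t_j with energy e_j. The standard acceptance probability is
    min(1, exp((beta_i - beta_j) * (e_i - e_j))) with beta = 1 / (k_b * T).
    """
    beta_i = 1.0 / (k_b * t_i)
    beta_j = 1.0 / (k_b * t_j)
    exponent = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)
```

A swap that hands the lower-energy conformation to the colder replica is always accepted; the opposite swap is accepted only occasionally, which lets conformations trapped at low temperature escape through the hotter replicas.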
Fig-15: I-TASSER ab-initio Protein Structure Prediction Tool
Fig-15: BHAGEERATH: an energy based computer software suite for ab-initio Protein Structure
Prediction Tool
(3) DTMM- Desktop Molecular Modeller: It is a simple-to-use molecular modelling
program that enables you to perform powerful molecular synthesis, editing, energy
minimizations, and display. The package, substantially enhanced from previous versions
of DTMM, will run on any PC with Windows 95, 98, Me, 2000, NT, or XP. The
program is available at (http://www.polyhedron.co.uk/MFSQMC33264).
(b) Fold Recognition: Fold recognition and threading methods can be used to assign
tertiary structures to protein sequences, even in the absence of clear homology. The
ongoing development of such methods has had a significant impact on structural biology,
providing us with an increasing ability to accurately model 3D protein structures using
evolutionarily very distant fold templates. Although fold recognition and threading
techniques will not yield results equivalent to those from X-ray crystallography, they are
a comparatively fast and inexpensive way to build a close approximation of a structure
from a sequence, without the time and costs of experimental procedures. Using fold
recognition, proteins with known structures that share common folds with the target
sequences can be identified. The identified structures can be used as templates from
which the folds of the target sequences are modeled.
(1) PHYRE- Protein Homology/analogY Recognition Engine. The webserver is available
at (http://www.sbg.bio.ic.ac.uk/~phyre/)
(2) 3DPSSM- A fast, web-based method for protein fold recognition using 1D and 3D
sequence profiles coupled with secondary structure and solvation potential information.
(http://www.sbg.bio.ic.ac.uk/~3dpssm/index2.html)
(c) Homology Modelling: It is based on the reasonable assumption that two homologous
proteins will share very similar structures. Because a protein's fold is more evolutionarily
conserved than its amino acid sequence, a target sequence can be modeled with
reasonable accuracy on a very distantly related template, provided that the relationship
between target and template can be discerned through sequence alignment. It has been
suggested that the primary bottleneck in comparative modelling arises from difficulties in
alignment rather than from errors in structure prediction given a known-good
alignment. Unsurprisingly, homology modelling is most accurate when the target and
template have similar sequences.
The procedure begins with an alignment of the sequence to be modeled (target) with related
known 3D structures (templates). This alignment is usually the input to the program. The
output is a 3D model for the target sequence containing all main chain and side chain
non-hydrogen atoms. Given an alignment, the model is obtained without any user
intervention. First, many distance and dihedral angle restraints on the target sequence are
calculated from its alignment with template 3D structures. The form of these restraints
was obtained from a statistical analysis of the relationships between many pairs of
homologous structures. This analysis relied on a database of 105 family alignments that
included 416 proteins with known 3D structure [Šali & Overington, 1994]. By scanning
the database, tables quantifying various correlations were obtained, such as the
correlations between two equivalent Cα–Cα distances, or between equivalent main
chain dihedral angles from two related proteins. These relationships were expressed as
conditional probability density functions (pdf’s) and can be used directly as spatial
restraints.
For example, probabilities for different values of the main chain dihedral angles are
calculated from the type of a residue considered, from main chain conformation of an
equivalent residue, and from sequence similarity between the two proteins. Finally, the
model is obtained by satisfying all the restraints as well as possible. The optimization
is carried out by the use of the variable target
function method [Braun & Go, 1985] employing methods of conjugate gradients and
molecular dynamics with simulated annealing (Figure 1.3). Several slightly different
models can be calculated by varying the initial structure. The variability among these
models can be used to estimate the errors in the corresponding regions of the fold.
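One simple way to quantify the variability among several models is the spread of each residue's position around its mean across the models. The sketch below assumes that the models have already been superimposed and that each model is given as a list of Cα coordinates; the function name and the input layout are hypothetical and are not taken from MODELLER.

```python
import math


def per_residue_spread(models):
    """Root-mean-square deviation of each residue from its mean position.

    models: list of models, each a list of (x, y, z) tuples, one per residue.
    All models must have the same number of residues and share one frame.
    """
    n_models = len(models)
    if n_models < 2:
        raise ValueError("at least two models are needed")
    n_res = len(models[0])
    if any(len(m) != n_res for m in models):
        raise ValueError("all models must have the same number of residues")

    spread = []
    for i in range(n_res):
        # Mean position of residue i over all models.
        mean = [sum(m[i][k] for m in models) / n_models for k in range(3)]
        # Mean squared distance of residue i from that mean position.
        msd = sum(
            sum((m[i][k] - mean[k]) ** 2 for k in range(3)) for m in models
        ) / n_models
        spread.append(math.sqrt(msd))
    return spread


# Hypothetical two-residue example: residue 0 is identical in both models,
# residue 1 differs by 2 Å along x.
toy_models = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
              [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]]
```

Residues with a large spread are those on which the models disagree, so they mark the regions of the fold where the error is likely to be largest.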
First, the known template 3D structures are aligned with the target sequence to be
modelled.
Second, spatial features, such as Cα–Cα distances, hydrogen bonds, and main chain and
side chain dihedral angles, are transferred from the templates to the target. Thus, a
number of spatial restraints on its structure are obtained.
Third, the 3D model is obtained by satisfying all the restraints as well as possible.
1. Preparing alignment file: Following are the steps to prepare an alignment file:
(a) Take the protein sequence of interest, search for it in NCBI and save its FASTA
sequence, as in the following example.
Target
>Q865F8|AHSP_BOVIN Alpha-hemoglobin-stabilizing protein - Bos taurus
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYK
KQLSGEQDEQDKALQEFRQELNTLSASFLDKYRNFLKSS
(b) Copy your target protein sequence and paste it into the NCBI blast-p page, choosing
the PDB database in the BLASTP options.
(c) Get the structural sequence based on the degree of similarity and note down the PDB ID
based on the following criteria: identity greater than 40%, X-ray crystallographic
structure, resolution < 3 Å and R-value < 0.5. In this case the template is 1y01.
(d) Copy the sequence that shows the highest similarity to your target protein
sequence and paste it into the same Notepad file into which your target sequence was
pasted in the first step. The Notepad file should look like the example below.
>1y01
LLKANKDLISAGLKEFSVLLNQQVFNDALVSEEDMVTVVEDWMNFYINYY
RQQVTGEPQERDKALQELRQELNTLANPFLAKYRDFLKS
>target
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYK
KQLSGEQDEQDKALQEFRQELNTLSASFLDKYRNFLKSS
(e) Save the file in a location of your choice, then go to ClustalX.
(f) Open the ClustalX software, choose the option to load sequences and load the file
that you saved from Notepad in step (e).
(g) After loading, click the Alignment button, then click Output Format Options; a
sub-window is displayed in which you adjust the parameters.
Select parameter output – ON
Select PIR format – ON
Deselect the Clustal W format.
(h) Click the Alignment button again and click “Do complete alignment”.
(j) Open the output file, which has the *.pir extension. Open that file in WordPad
and save it with the ali extension (*.ali, where * indicates the file name). The file
name should be enclosed in double quotes (“*.ali”) when saving.
(k) Then modify the *.ali file to match the standard file, which is given below.
.Ali File:
>P1;1y01
structureX:1y01:3 :A:91 : :ALPHA-HEMOGLOBIN STABILIZING PROTEIN:HOMO SAPIENS:2.80:0.273
--LLKANKDLISAGLKEFSVLLNQQVFNDALVSEEDMVTVVEDWMNFYINYYRQQVTGEP
QERDKALQELRQELNTLANPFLAKYRDFLKS-
*
>P1;target
sequence:target:1 : :102 : :Alpha-hemoglobin-stabilizing protein:Bos taurus: 2.80:0.273
MALIQTNKDLISKGIKEFNILLNQQVFSDPAISEEAMVTVVNDWVSFYINYYKKQLSGEQ
DEQDKALQEFRQELNTLSASFLDKYRNFLKSS
*
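Before running the modelling step it can help to check the alignment file programmatically, for example to confirm that the aligned template and target have the same length and to see how similar they are. The sketch below reads a PIR-format alignment of the kind shown above and computes the percentage identity over columns where neither sequence has a gap. The function names, the toy alignment and this particular definition of identity are assumptions for illustration; other definitions of percentage identity are in use.

```python
def parse_pir(text):
    """Return {code: aligned_sequence} from a PIR-format alignment.

    Each entry is expected to consist of a '>P1;code' line, one
    colon-separated description line, and sequence lines ending with '*'.
    """
    entries = {}
    code = None
    skip_header = False
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">P1;"):
            code, chunks, skip_header = line[4:], [], True
        elif code is None:
            continue                      # stray text outside an entry
        elif skip_header:
            skip_header = False           # description line, not sequence
        else:
            done = line.endswith("*")     # '*' terminates the sequence
            chunks.append(line.rstrip("*"))
            if done:
                entries[code] = "".join(chunks)
                code = None
    return entries


def percent_identity(seq_a, seq_b, gap="-"):
    """Identity (%) over aligned columns in which neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical toy alignment in the same layout as the .ali file.
EXAMPLE = """\
>P1;tmpl
structureX:tmpl:1:A:4:A::::
-ACDE*
>P1;targ
sequence:targ:1::5:::::
MACDF*
"""
```

For the toy alignment, `parse_pir(EXAMPLE)` yields two sequences of equal length, and `percent_identity` compares the four columns in which both sequences have a residue.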
(a) Take the PDB ID that matched your target protein and submit it to the PDB
database (accessible at www.rcsb.org).
(b) The structural data will open; choose the explore button, then the download and
display option, and choose the PDB format in the download options for downloading.
(d) Save the *.pdb file as *.atm.
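The download and renaming steps above can also be scripted. The sketch below assumes that PDB-format entries can be fetched from the RCSB file server at `https://files.rcsb.org/download/<ID>.pdb`; this URL pattern and the helper names are not given in these notes and should be checked against the current RCSB documentation before use.

```python
import urllib.request
from pathlib import Path

# Assumed URL pattern of the RCSB file download service.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def rcsb_pdb_url(pdb_id):
    """Build the download URL for a classic four-character PDB ID."""
    pdb_id = pdb_id.strip()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("expected a four-character PDB ID such as '1y01'")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id.upper())


def fetch_atom_file(pdb_id, directory="."):
    """Download the entry and save it as <id>.atm (requires network access)."""
    target = Path(directory) / (pdb_id.strip().lower() + ".atm")
    urllib.request.urlretrieve(rcsb_pdb_url(pdb_id), str(target))
    return target
```

Calling `fetch_atom_file("1y01")` would then write `1y01.atm` to the current directory, matching the template code used in the alignment file.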
(a) Write the script file in Notepad, following the standard format given below.
(c) Then save the file, giving the file name the extension *.py.
(d) Copy all three files (*.ali, *.py and *.atm) and paste them into C:\mod9v7\bin
(C:\Program Files\Modeller9v7\bin).
(g) The output file will be created automatically by Modeller in the bin directory
after approximately 3 min of analysis.
(h) Copy that output file, paste it into your folder and save it as *.pdb.
.Py File (script file):
# Homology modeling by the automodel class
from modeller import *              # Load standard Modeller classes
from modeller.automodel import *    # Load the automodel class
log.verbose()                       # request verbose output
env = environ()                     # create a new MODELLER environment to build this model in
# directories for input atom files
env.io.atom_files_directory = './:../atom_files'
a = automodel(env,
              alnfile = 'align.ali',   # alignment filename
              knowns = '1y01',         # codes of the templates
              sequence = 'target')     # code of the target
a.starting_model = 1                # index of the first model
a.ending_model = 1                  # index of the last model
                                    # (determines how many models to calculate)
a.make()                            # do the actual homology modeling
(i) Visualize the 3D model you have created in RasMol or Swiss PDB Viewer.
Rastogi, S.C., Mendiratta N., and Rastogi, P. (2009). Bioinformatics Methods and
Applications Genomics, Proteomics and Drug Discovery. PHI Learning Private
Limited. India.
Molecular docking is a key tool in structural molecular biology and computer-assisted drug
design. The goal of ligand-protein docking is to predict the predominant binding mode(s)
of a ligand with a protein of known three-dimensional structure. Docking is a technique
that predicts the ideal orientation of a ligand in the active site of the receptor when the
two are bound to each other to form a stable complex.
From hit discovery through lead optimization and beyond, computational methods have
become an essential part of many drug development processes. There are typically several
steps in the docking process, and each one adds a new level of complexity. Docking
methods are used to place small molecules in the active site of the enzyme. In addition to
these methods, scoring functions are used to estimate a compound's biological activity by
evaluating how it interacts with prospective targets. Molecular docking is considered to be
the most widely used computational method in the field of computer-aided drug design
(CADD). It is used in academia as well as in pharmaceutical companies for
the lead discovery process. Molecular docking mainly involves two components: the ligand and
the protein. The protein is the target to which the ligand may bind to produce a specific
activity. Molecular docking provides information on the ability of the ligand to bind to the
protein, which is known as binding affinity. Applications of molecular docking in drug
development have evolved significantly since it was first developed to aid in the study of
molecular recognition processes between small and large molecules. This section covers the
basic features of molecular docking along with its types, approaches and applications.
Docking is widely used to predict the orientation of small-molecule drug candidates
relative to their protein targets and thereby to predict the small molecule's affinity and
activity. Docking therefore plays a critical role in rational drug design. Considering the
biological and pharmacological importance of docking studies, much effort has been made to
improve the algorithms for docking prediction. Docking is a computational technique that
predicts the preferred orientation of one molecule relative to another when they are bound
to each other to form a stable complex. Using scoring functions, it is possible to estimate
the strength of association, or binding affinity, between two molecules based on their
preferred orientation. Signal transduction depends on the interactions of biologically
relevant molecules such as proteins, nucleic acids, carbohydrates, and lipids. As a result,
docking may be used to predict both the strength and the type of signal produced. The goal
of docking studies is to optimize the conformation of both the ligand and the protein, as
well as the relative orientation of the protein and ligand, so that the free energy of the
overall system is minimized.
Rigid docking
Assuming the molecules are rigid, we seek a transformation (rotation and translation) of
one of the molecules in three-dimensional space that gives the best match to the other
molecule in terms of a scoring function. The conformation of the ligand may be generated in
the absence of the receptor or in the presence of receptor binding activity.
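The idea of rigid docking can be illustrated with a deliberately simplified search that moves a rigid ligand over a set of candidate translations and keeps the one with the best score. The scoring function used here (a reward for contacts, a penalty for clashes), the distance cut-offs and all names are assumptions for illustration; rotations are omitted for brevity, and real docking programs use far more elaborate search and scoring schemes.

```python
import itertools
import math


def toy_score(ligand, receptor, clash=2.0, contact=4.0):
    """Toy score, lower is better: +10 per atom pair closer than `clash`,
    -1 per atom pair between `clash` and `contact` (distances in Å)."""
    total = 0.0
    for lig_atom in ligand:
        for rec_atom in receptor:
            d = math.dist(lig_atom, rec_atom)
            if d < clash:
                total += 10.0
            elif d <= contact:
                total -= 1.0
    return total


def translate(coords, shift):
    """Move every atom by the same vector; the ligand shape is unchanged."""
    return [tuple(c + s for c, s in zip(atom, shift)) for atom in coords]


def rigid_translation_search(ligand, receptor, offsets):
    """Try every (dx, dy, dz) built from `offsets`; return (best_shift, best_score)."""
    best_shift, best_score = None, math.inf
    for shift in itertools.product(offsets, repeat=3):
        s = toy_score(translate(ligand, shift), receptor)
        if s < best_score:
            best_shift, best_score = shift, s
    return best_shift, best_score


# Hypothetical one-atom receptor and one-atom ligand.
receptor = [(0.0, 0.0, 0.0)]
ligand = [(10.0, 0.0, 0.0)]
```

With the offsets `[-7.0, 0.0]`, the search moves the ligand atom to within contact distance of the receptor atom without creating a clash.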
Flexible docking
In addition to the transformation, we consider molecular flexibility in order to identify
the conformations of the receptor and ligand molecules as they exist in the complex.
Knowledge of the preferred orientation in turn may be used to predict the strength of
association or binding affinity between two molecules using, for example, scoring functions.
The associations between biologically relevant molecules such as proteins, peptides, nucleic
acids, carbohydrates, and lipids play a central role in signal transduction. Docking is useful for
predicting both the strength and type of signal produced. Molecular docking is one of the most
frequently used methods in structure-based drug design, due to its ability to predict the binding-
conformation of small molecule ligands to the appropriate target binding site. Characterisation
of the binding behaviour plays an important role in the rational design of drugs as well as
in elucidating fundamental biochemical processes.
Applications
A binding interaction between a small molecule ligand and an enzyme protein may result in
activation or inhibition of the enzyme. If the protein is a receptor, ligand binding may result in
agonism or antagonism. Docking is most commonly used in the field of drug design — most
drugs are small organic molecules, and docking may be applied to:
hit identification – docking combined with a scoring function can be used to quickly screen
large databases of potential drugs in silico to identify molecules that are likely to bind to
the protein target of interest. Reverse pharmacology routinely uses docking for target
identification.
lead optimization – docking can be used to predict where and in which relative orientation a
ligand binds to a protein (also referred to as the binding mode or pose). This information may
in turn be used to design more potent and selective analogs.
bioremediation – protein-ligand docking can also be used to predict pollutants that can be
degraded by enzymes.
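For hit identification, the output of a docking run over a compound library is typically a score per compound, and the screening step reduces to ranking and filtering those scores. The sketch below assumes that lower (more negative) scores indicate stronger predicted binding, which is a common but not universal convention; the compound names and values are made up.

```python
def rank_hits(scores, top_n=3, threshold=None):
    """Return the best-scoring compounds as (name, score) pairs.

    scores: dict mapping compound name to docking score (lower = better,
    by assumption). If `threshold` is given, only compounds scoring at or
    below it are kept before the top `top_n` are returned.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if threshold is not None:
        ranked = [(name, s) for name, s in ranked if s <= threshold]
    return ranked[:top_n]


# Hypothetical docking scores for three library compounds.
library_scores = {"cmpd_a": -5.0, "cmpd_b": -9.1, "cmpd_c": -7.2}
```

The ranked list gives the candidates to carry forward for experimental testing; it is a prediction, not a measurement of binding.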
Software available for docking
GOLD
GOLD (Genetic Optimisation for Ligand Docking) uses multiple subpopulations of the ligand.
Its force-field-based scoring function comprises three terms: a hydrogen-bonding term, an
intermolecular dispersion potential [8] and an intramolecular potential. It has a 71%
success rate in identifying the experimental binding mode for 100 protein complexes.
AutoDock
Uses a three-dimensional grid of regularly spaced points surrounding and centred on
the region of interest of the macromolecule.
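A grid of this kind is straightforward to generate. The sketch below builds a regular three-dimensional lattice of points centred on a chosen position; the function name, the number of points and the spacing are assumptions for illustration and do not correspond to AutoDock's own parameters or file formats.

```python
def grid_points(center, n_points, spacing):
    """Regular 3D lattice of points centred on `center`.

    center:   (x, y, z) of the region of interest
    n_points: (nx, ny, nz) number of grid points along each axis
    spacing:  distance between neighbouring points (same units as center)
    """
    axes = []
    for c, n in zip(center, n_points):
        half = (n - 1) / 2.0
        # Points are placed symmetrically around the centre coordinate.
        axes.append([c + (i - half) * spacing for i in range(n)])
    return [(x, y, z) for x in axes[0] for y in axes[1] for z in axes[2]]


# Hypothetical 3 x 3 x 3 grid with 1 Å spacing around the origin.
points = grid_points((0.0, 0.0, 0.0), (3, 3, 3), 1.0)
```

In a grid-based docking program, interaction energies can be precomputed at each grid point so that candidate ligand positions are scored by look-up rather than by recalculating every atom pair.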
FlexX
The base fragment is selected and docked using the "pose clustering" technique. A
clustering algorithm is used to merge similar ligand transformations in the active site. [8]
Flexible fragments are then added sequentially using MIMUMBA and assessed using the overlap
function, followed by energy calculations to complete the ligand construction. [8] The final
assessment uses Böhm's scoring function, which incorporates hydrogen-bond, ionic, aromatic
and lipophilic terms. [8]