
PROTEINS: Structure, Function, and Genetics, Suppl. 1:38-42 (1997)

Protein Modeling by Multiple Sequence Threading and Distance Geometry
András Aszódi, Robin E.J. Munro, and William R. Taylor*
Division of Mathematical Biology, National Institute for Medical Research, London, United Kingdom

ABSTRACT
The application of homology
modeling is often limited by the lack of known
structures with sufficiently high sequence similarity to the target protein. The recent development of threading methods now enables the
identification of likely folding patterns in a
number of cases where the structural relatedness between target and template(s) is not
detectable at the sequence level. We devised a
hybrid method in which fold recognition was
performed using the Multiple Sequence
Threading (MST) method. The structural
equivalences deduced from the threading output were used to guide the distance geometry
program DRAGON in the construction of low-resolution Cα/Cβ models. The initial structures
were converted to full-atom representation and
refined using the general-purpose molecular
modeling package QUANTA. The performance
of the approach is illustrated on the CASP2
target T0004 (polyribonucleotide nucleotidyltransferase S1 motif (PNS1) from Escherichia
coli, PDB code: 1SRO) for which no obvious
homologues with known structure were available. The correct fold of PNS1 was successfully
identified, and the model was found to be more
similar to the experimental PNS1 structure than
the scaffold (Cα RMSD of 6.2 Å compared with 6.4 Å). Our results indicate that a sensitive fold recognition algorithm coupled with a distance geometry program capable of rapidly generating initial structures can successfully complement high-resolution homology modeling methods in cases
where sequential similarity is low. Proteins,
Suppl. 1:38-42, 1997. © 1998 Wiley-Liss, Inc.
Key words: distance geometry; homology modeling; fold recognition; protein
structure prediction
INTRODUCTION
Homology modeling is a technique whereby the
three-dimensional (3D) conformation of a target
protein is deduced from the known structures of
other proteins (the templates) by using sequential
similarities between the target and the templates to
establish structural equivalences. The approach is
based on the observation that structural features are
conserved during evolution to a larger extent than
sequences and therefore two proteins with similar
sequences can be expected to adopt the same fold.
Due to the large amount of structural information
used in the construction process, homology-based
models are usually of very high quality and can
greatly aid in the understanding of detailed protein
structure.1-4
The applicability of homology modeling is limited,
however, by the availability of template structures
with a sufficient sequence similarity to the target.
Without clear sequence similarity, fold recognition
(threading) methods can be used to identify structural similarities in cases when these are not apparent in the sequences.5-7 The equivalences between
target and template residues, established from the
results of the threading program, provide structural
information for the construction of the target fold
much in the same way as the multiple sequence
alignment in homology modeling. Indeed, threading
and comparative modeling can be regarded as related approaches, applicable to different levels of
sequential similarity.
Based on these ideas, we have designed a hybrid
protein modeling approach consisting of three steps.
In the first step, candidate templates were identified
by using the novel fold recognition algorithm MST,8
which is capable of performing simultaneous threading of multiple aligned sequences onto one or more
3D structures, thus increasing sensitivity and alignment quality. In the second step, the structural
equivalences obtained from the MST output were
converted into interresidue distance restraints and
fed into the distance geometry program DRAGON,9
together with auxiliary information obtained from
secondary structure predictions. The program combined the restraints in an unbiased manner and
rapidly generated a large number of low-resolution
model conformations. In the last step, these were
converted into full-atom models and subjected to
energy minimization.
For the CASP2 (http://PredictionCenter.llnl.gov/casp2/Casp2.html) experiment we used this combination of methods on a number of targets, ranging from close to distant similarity. In this article we present our results obtained with the PNS1 sequence (Target T0004), originally classified as an ab initio target. This example illustrates the capabilities of the method on a difficult modeling problem of the kind for which the approach was developed. Some other modeling results have been described elsewhere.10

Dr. Aszódi is currently at Novartis Forschungsinstitut GmbH, Brunnerstrasse 59, A-1235 Vienna, Austria.
*Correspondence to: Dr. William R. Taylor, Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, U.K.
Received 29 April 1997; Accepted 1 August 1997

METHODS
Multiple Alignment
The 84-amino acid-long target sequence polyribonucleotide nucleotidyltransferase S1 motif (PNS1)
from Escherichia coli was aligned with the following
homologous sequences (SwissProt ID code and accession number): PNP_PHOLUP (P41121), PNP_HAEIN
(P44584), YABR_BACSU (P37560), RS1_MYCLE
(P46836), RS1H_BACSU (P38494), YHGF_ECOLI
(P46837) by using the multiple sequence alignment
program MULTAL11 and a fixed (opening) gap penalty of between 15 and 20 and with an amino acid
relatedness matrix composed of 30% PAM120 values
augmented by adding 7 to the diagonal.12
The resulting alignment was then examined to
find patterns of conserved positions that could provide the basis for a motif search. The OWL sequence
database was scanned with these motifs by using the
UNIX pattern-matching utility regex. This tool,
implemented in a simple application program (D.T.
Jones, W. Taylor) uses a regular expression search to
scan the database and identify any sequences that
match. These related, but relatively distant, proteins
were later threaded along with the target sequence
in the MST program.
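The motif scan described above can be illustrated with a short sketch. It is not the original application program: the motif, the sequence identifiers, and the sequences below are all invented for illustration, and Python's standard `re` module stands in for the UNIX regular expression routines.

```python
import re

# Hypothetical motif (an assumption, not taken from the PNS1 alignment):
# Gly, any two residues, one hydrophobic residue, 3 to 5 arbitrary
# residues, then Lys or Arg.
MOTIF = re.compile(r"G.{2}[ILVM].{3,5}[KR]")


def scan_database(sequences, motif):
    """Return (identifier, start, matched_segment) for each matching sequence.

    Only the first match in each sequence is reported.
    """
    hits = []
    for seq_id, seq in sequences.items():
        match = motif.search(seq)
        if match:
            hits.append((seq_id, match.start(), match.group()))
    return hits


# Toy "database"; a real scan would read the entries of a sequence database.
TOY_DB = {
    "seqA": "MSTGAEVKDGSRLT",
    "seqB": "MPLNNQQWTTAAEE",
    "seqC": "AAGKKLTTTTTRPP",
}

hits = scan_database(TOY_DB, MOTIF)
for seq_id, start, segment in hits:
    print(seq_id, start, segment)
```

Conserved alignment positions map onto fixed characters or character classes in the pattern, while variable stretches between them become bounded wildcards such as `.{3,5}`.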
Secondary Structure Prediction
Secondary structure prediction was carried out by
the PredictProtein server13 and DSC.14 The results
were evaluated in the context of the MULTAL alignments and corrected after visual inspection to
avoid any unnecessary breaks in structure.
Fold Recognition
First, we scanned the extended UCLA benchmark
set of 319 structures (http://www.mbi.ucla.edu) with
the PNS1 sequence by using the THREADER program.6 Then, the multiple sequence alignment of the
target sequence and its homologues was compared to
the top 100 THREADER hits by using MST. The
MST prediction uses a simple pairwise potential that
favors the packing of conserved hydrophobics into
the core along with the matching of predicted and
observed secondary structure and solvent exposure.
The MST program automatically generated a model
of the alignment (but with no attempt to model
inserted regions), allowing the basic threaded structure to be visualized. The hydrophobic core packing
in each model was assessed, especially in areas
where insertion and deletion had occurred. To prevent a dislocated threading across two protein domains, a modified version of MST was used to thread
just one domain of given size. Three hits with the
best packing scores and secondary structure alignments were selected from the MST results as follows
(PDB codes in parentheses): porcine E. coli heat-labile enterotoxin chain D (1LTSD), RNase H domain from HIV-1 reverse transcriptase chain A
(1HRHA), and staphylococcal nuclease (2SNS). All
these candidate templates were small β + α structures: 1LTSD is a classic OB-fold,16 2SNS is a more
elaborate OB-fold, while 1HRHA is a ribonuclease
RT domain. Figure 1A illustrates the threading on
one of these structures (1LTSD).
Fold Generation
Model building was performed by the distance
geometry program DRAGON. Structural equivalences between the unknown target structure and
the scaffold proteins with known structures were
described by mapping distance restraints between
Cα atoms onto the model through alignments constructed from the MST threading output. The alignments contained the target sequence and one of the
three candidate template sequences only. The threading-based structural information was complemented
by additional restraints derived from secondary structure prediction. For each of the three scaffold structures, 50 models were generated by using Cα:Cα
distances shorter than 10 Å to guide the folding
process. A representative model based on 1LTS chain
D is shown in Figure 1B.
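The mapping of template distances onto the target can be sketched as follows. This is a minimal illustration under stated assumptions, not DRAGON's input procedure: the function name, the data layout (a dictionary of template Cα coordinates and a list of aligned index pairs with `None` for gaps), and the toy coordinates are all hypothetical. Only the 10 Å cutoff comes from the text.

```python
import math
from itertools import combinations


def restraints_from_template(template_ca, alignment, cutoff=10.0):
    """Map template Ca-Ca distances below `cutoff` onto aligned target residues.

    template_ca: dict {template_residue_index: (x, y, z)}
    alignment:   list of (target_index, template_index); None marks a gap
    Returns {(target_i, target_j): distance} with target_i < target_j.
    """
    # Keep only columns where both target and template residues are present.
    equivalences = [(t, s) for t, s in alignment
                    if t is not None and s is not None and s in template_ca]
    restraints = {}
    for (t_i, s_i), (t_j, s_j) in combinations(equivalences, 2):
        d = math.dist(template_ca[s_i], template_ca[s_j])
        if d < cutoff:
            restraints[(min(t_i, t_j), max(t_i, t_j))] = d
    return restraints


# Toy template: four Ca atoms on a line, 3.8 A apart (invented coordinates).
TEMPLATE_CA = {0: (0.0, 0.0, 0.0), 1: (3.8, 0.0, 0.0),
               2: (7.6, 0.0, 0.0), 3: (11.4, 0.0, 0.0)}
# Target residue 1 is an insertion; template residue 2 is deleted in the target.
ALIGNMENT = [(0, 0), (1, None), (2, 1), (3, 3), (None, 2)]

restraints = restraints_from_template(TEMPLATE_CA, ALIGNMENT)
for pair, dist in sorted(restraints.items()):
    print(pair, round(dist, 2))
```

Residues in inserted regions receive no template-derived restraints in this sketch, so their placement would have to rely on other information such as the secondary structure restraints mentioned above.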
Model Building and Refinement
The 10 best-scoring DRAGON output structures
were averaged for each scaffold. The missing atoms
were added to the Cα average structures and the
resulting full-atom structures were minimized by
QUANTA version 4.1/CHARMM 23.1.17 Initial geometry regularization was followed by an in vacuo
simulated annealing at T = 1000 K and a second MD
run with an 8-Å-thick solvation layer at T = 300 K.
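The averaging step can be sketched as below. The sketch assumes that the models have already been superposed onto a common frame and that each model carries a single numerical score. Whether a lower or a higher DRAGON score is better is not stated here, so the direction is left as a parameter; the function name and the toy data are hypothetical.

```python
def average_best_models(models, scores, k=10, lower_is_better=True):
    """Average the coordinates of the k best-scoring models.

    models: list of models, each a list of (x, y, z) tuples of equal length,
            assumed to be superposed onto a common frame beforehand
    scores: one score per model
    """
    if not models or len(models) != len(scores) or k < 1:
        raise ValueError("need at least one model, one score per model, k >= 1")
    order = sorted(range(len(models)), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    best = [models[i] for i in order[:k]]
    n_atoms = len(best[0])
    if any(len(model) != n_atoms for model in best):
        raise ValueError("all models must contain the same number of atoms")
    averaged = []
    for atom in range(n_atoms):
        xs, ys, zs = zip(*(model[atom] for model in best))
        averaged.append((sum(xs) / len(best),
                         sum(ys) / len(best),
                         sum(zs) / len(best)))
    return averaged


# Toy data: three two-atom "models"; the third has the worst score.
toy_models = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
              [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
              [(100.0, 0.0, 0.0), (100.0, 0.0, 0.0)]]
toy_scores = [1.0, 2.0, 9.0]
average = average_best_models(toy_models, toy_scores, k=2)
print(average)
```

Averaging Cartesian coordinates can distort bond lengths and angles, which is consistent with beginning the refinement with a geometry regularization step.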
Comparison
Once the NMR coordinates were obtained from the
CASP2 organizers we superposed our models onto
the experimental structure to assess how well the
folds matched (Fig. 1D). The template structures
were also compared to the experimental structure,
although this could not be done by a straightforward
rigid-body superposition as the template sequences
were different from that of PNS1. A modification of
the SSAP algorithm18 was used to generate optimal
correspondences between atoms for the superposition.
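For reference, the RMSD after optimal rigid-body superposition of two equivalenced coordinate sets can be computed as sketched below. The superposition algorithm used in this work is not specified; the Kabsch procedure shown here is a standard choice and is our assumption. The sketch requires NumPy and presupposes that the atom correspondences (for example, from the modified SSAP comparison) are already known. The toy coordinates are invented.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between N x 3 coordinate sets P and Q after optimal superposition.

    Row i of P is assumed to be equivalent to row i of Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove the translational component by centering both sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T
    return float(np.sqrt(((P_rot - Qc) ** 2).sum() / len(P)))


# Toy check: a rotated and translated copy superposes with RMSD of about 0.
model = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
native = model @ rot_z.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(model, native))
```

The determinant check matters for proteins: without it, the best fit could be an improper rotation that superposes a structure onto its mirror image.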
RESULTS AND DISCUSSION
We identified three potential folds for the CASP2 target T0004 (polyribonucleotide nucleotidyltransferase S1 motif (PNS1) from E. coli) by fold recognition: 1HRH chain A, 1LTS chain D, and 2SNS. The success of the approach can be judged by the following criteria:
1. Is the correct fold among the candidates?
2. Can we establish a ranking among the candidates so that the correct fold comes out on top?

Fig. 1. Modeling target T0004 on the 1LTSD chain. A: Threaded structure of T0004 on 1LTSD where blue = deletions, white = inserts, and red = hydrophobic. B: Raw DRAGON model of T0004 based on the structure of 1LTSD. C: Experimental NMR structure of PNS1. D: Superposition of the DRAGON model and the NMR structure (the N terminus is color-coded blue).

The Correct Fold
The answer to the first question is yes: the native fold of PNS1 is the same as that of 1LTSD and also that of 2SNS. Comparison of the known structure of PNS1 (now PDB entry 1SRO) to both these proteins (by optimal structure comparison) revealed that the core of 2SNS is in fact a better fit to PNS1 but contains larger insertions and a long terminal addition. (The best fit to 2SNS was 3.9 Å over 67 atoms compared to 4.1 Å over 65 atoms for 1LTS chain D.) It should be noted that these values come from structure comparison, while those quoted below are based on the alignment from the threading program. Since this alignment was essentially correct for 1LTS, the corresponding model will be referred to below as "correct."
Superposing the Cα backbone of the model built from the template 1LTSD onto the experimental structure gave a root-mean-square deviation (RMSD) of 6.2 Å over 76 residues compared to 5.1 Å against the template. For completeness, the equivalent value for the template against the known structure was 6.4 Å (using the same threading-derived register). The RMSD of the template 1LTSD from the experimental structure is thus higher than that of the model, indicating that the DRAGON simulation successfully moved away from the scaffold toward the correct structure. While this level of RMSD is quite high (for proteins of this size), it was clear upon visual inspection that the overall features of the PNS1 structure were modeled correctly. Major differences include a helix in the model, between Arg-42 and Met-52, which is only partially present in the nuclear magnetic resonance (NMR) structure. In addition, a β turn between Ile-25 and Lys-29 is out of position, as is the loop between His-34 and Ala-40.
It should be noted that the Cα backbone RMSD was not improved by any of the minimizations after the full-atom model was generated: the RMSD remained 6.2 Å after in vacuo simulated annealing and further solvated molecular dynamics (MD) changed this figure to 6.3 Å.

TABLE I. Model Quality Judged by Various Scores*

Template   MST score   MatchMaker score (kT)   CHARMM energy (kcal/mol)   Cα RMSD (Å)
1LTSD      -743        -0.12                   -4417                       6.2
1HRHA      -707        -0.06                   -4547                      10.8
2SNS       -966        -0.14                   -4403                      11.0

*The correct model based on the template structure 1LTSD comes second according to the MST, MatchMaker and CHARMM energy rankings. The RMSDs of the models from the experimental structure (T0004) are shown in the last column for comparison.

Model Ranking
It was hoped that detailed modeling of each candidate target would enable a more positive identification of the correct fold (or at least eliminate the
ribonuclease). However, the search for a method that
would rank the models so that the correct fold would
come out on top proved to be difficult and after
full-atom refinement, all three models looked equally
plausible. The rankings compared were: the MST
threading scores, the pseudoenergy scores calculated
by the MatchMaker program (supplied as part of the
Sybyl molecular modeling package, version 6.3, Tripos Associates)19 and the CHARMM potential energy
values of the refined structures. The correct model
based on 1LTSD ranked second by both the MatchMaker and CHARMM energies; it also ranked second according to the MST threading score (Table I).
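The ranking comparison can be reproduced with a few lines of code. The values are transcribed from Table I, and the three scores are entered as negative numbers, which is our reading of the table. The assumption that a lower score is better is also ours; because the 1LTSD value lies between the other two in every score column, it ranks second under either sign convention.

```python
# Values transcribed from Table I; negative signs are an assumed reading.
TABLE_I = {
    "1LTSD": {"MST": -743, "MatchMaker": -0.12, "CHARMM": -4417, "RMSD": 6.2},
    "1HRHA": {"MST": -707, "MatchMaker": -0.06, "CHARMM": -4547, "RMSD": 10.8},
    "2SNS":  {"MST": -966, "MatchMaker": -0.14, "CHARMM": -4403, "RMSD": 11.0},
}


def rank_by(table, column, lower_is_better=True):
    """Return template names ordered from best to worst for one column."""
    return sorted(table, key=lambda name: table[name][column],
                  reverse=not lower_is_better)


rankings = {column: rank_by(TABLE_I, column)
            for column in ("MST", "MatchMaker", "CHARMM", "RMSD")}
position_of_correct = {column: order.index("1LTSD") + 1
                       for column, order in rankings.items()}

for column, order in rankings.items():
    print(column, order, "1LTSD at position", position_of_correct[column])
```

None of the three scores places the model with the lowest RMSD first, which is the difficulty described in this section.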
Resolution and Accuracy
While we succeeded in locating the correct overall
fold for PNS1, the atomic details of the model
structure are not accurate enough. This is due to
several factors. First, threading-based methods obviously cannot provide the large amount of high-quality structural information available in comparative modeling, where the target and the templates
are closely related both sequentially and structurally. Most participants at the CASP2 meeting agreed
that model quality depends very much on the quality
and quantity of external structural information supplied to the prediction algorithms. Second, it seems
to be difficult to choose the appropriate level of
resolution. While in our case the low-resolution
Cα:Cβ model building by distance geometry appeared to be justified on grounds of efficiency and
lack of detailed experimental information, perhaps
the method would have performed better if another
refinement at intermediate resolution had been carried out before the full-atom modeling to improve the
main-chain geometry. Finally, although the choice of
detailed potential functions and sophisticated energy minimization/refinement methods is important for the last stage of full-atom refinement, these
cannot compensate for gross errors (such as misaligned residues in homology modeling) made earlier
in the modeling process. Possible improvements to
our approach should therefore include a careful
choice of low-resolution interaction potentials and
improved gap modeling,20 followed by refinement to
facilitate the full exploitation of available structural
information.
CONCLUSION
Combining threading with distance geometry can
be a useful way to construct a model for a protein. If a
sequence has no known structural homologues, then
it can be threaded to predict a likely scaffold on
which to base the model. This approach has several
advantages over a pure ab initio prediction, where a
fold is constructed by using just secondary structure
information, as threading will provide hints to the
possible tertiary structure of the target as well.
Although we only describe one example in detail,
several points can be taken from the CASP2 experiment with respect to our methods. The distance
geometry program DRAGON performed best on problems like T0004, where possible template structures
could be identified from fold recognition performed
with the high-sensitivity MST method and a low-resolution model chain representation was adequate. DRAGON cannot be expected to replicate
structures with high accuracy based on close sequence similarity, as it uses only Cα atoms and
therefore discards main-chain geometry details that
may be important at the higher level of detail. We
plan to develop the tandem MST/DRAGON approach into a protein modeling system that performs
well under conditions where accurate structural
information is not available, thereby complementing
high-resolution comparative modeling methods.

ACKNOWLEDGMENTS

We thank all CASP2 organizers for their efforts in organizing this unique event, and the experimentalists, without whom this work would not have been possible.
REFERENCES
1. Pearl, L.H., Taylor, W.R. A structural model for the retroviral proteases. Nature 329:351-354, 1987.
2. Sali, A., Overington, J.P., Johnson, M.S., Blundell, T.L. From comparisons of protein sequences and structures to protein modeling and design. TIBS 15:235-240, 1990.
3. Havel, T.F. Predicting the structure of the flavodoxin from Escherichia coli by homology modeling, distance geometry and molecular dynamics. Mol. Simul. 10:175-210, 1993.
4. Taylor, W.R. Protein structure modelling from remote sequence similarity. J. Biotechnol. 12:281-291, 1994.
5. Bowie, J.U., Clarke, N.D., Pabo, C.O., Sauer, R.T. Identification of protein folds: matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures. Proteins 7:257-264, 1990.
6. Jones, D.T., Taylor, W.R., Thornton, J.M. A new approach to protein fold recognition. Nature 358:86-89, 1992.
7. Bryant, S.H., Lawrence, C.E. An empirical energy function for threading protein-sequence through the folding motif. Proteins 16:92-112, 1993.
8. Taylor, W.R. Multiple sequence threading: An analysis of alignment quality and stability. J. Mol. Biol. 269:902-943, 1997.
9. Aszodi, A., Taylor, W.R. Homology modelling by distance geometry. Fold. Design 1:325-334, 1996.
10. Aszodi, A., Munro, R.E.J., Taylor, W.R. Distance based comparative modelling. Fold. Design 2(Suppl.):S3-S6, 1997.
11. Taylor, W.R. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28:161-169, 1988.
12. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. A model of evolutionary change in proteins. Atlas Protein Seq. Struct. 5(Suppl. 3):345-352, 1978.
13. Rost, B., Sander, C. Prediction of protein secondary structure at better than 70-percent accuracy. J. Mol. Biol. 232:584-599, 1993.
14. King, R.D., Sternberg, M.J.E. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Prot. Sci. 5:2298-2310, 1996.
16. Murzin, A.G. New protein folds. Curr. Opin. Struct. Biol. 4:441-449, 1994.
17. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M. CHARMM: A program for macromolecular energy, minimisation, and dynamics calculations. J. Comp. Chem. 4:187-217, 1983.
18. Orengo, C.A., Taylor, W.R. SSAP: Sequential structure alignment program for protein structure comparison. Methods Enzymol. 266:617-635, 1996.
19. Godzik, A., Skolnick, J. Sequence structure matching in globular-proteins: Application to supersecondary and tertiary structure determination. Proc. Natl. Acad. Sci. U.S.A. 89:12098-12102, 1992.
20. Taylor, W.R., Munro, R.E.J. Multiple sequence threading: Conditional gap placement. Fold. Design 2(Suppl.):S33-S39, 1997.
