
A Solvation Potential with Improved Contact Definitions and Optimized by Extensive Threading


Alan A. Dombkowski¹ and Gordon M. Crippen²

¹Biophysics Research Division, University of Michigan, Ann Arbor, MI 48109-1055, USA.
²College of Pharmacy, University of Michigan, Ann Arbor, MI 48109-1065, USA.

Protein structural knowledge is essential to understanding the molecular basis of disease. Drug design and discovery are facilitated by understanding the three-dimensional shape of relevant proteins, as will be future modes of disease intervention such as gene therapy. While identification of disease related protein sequences has dramatically increased through genome mapping efforts, the rate of determination of the associated structures remains relatively slow. Despite recent advancements, experimental methods of protein structure determination remain time consuming and labor intensive. Therefore, computational methods which can predict the correct protein structure for a given amino acid sequence would be a tremendous aid in biomedical research. Homology modeling is precluded if the sequence of interest has low similarity to proteins of known structure. For these sequences, structure prediction requires an energy potential which can identify the native structure out of any number of alternative conformations. We have developed a solvation energy potential which is optimized to identify the native structure as the lowest in energy. The potential has few adjustable parameters which are trained so the Protein Data Bank (PDB) structure for each protein in a training set is the lowest in energy when compared to a very large set of alternatives - over 10^5 alternatives on average. Intramolecular contact definitions are obtained for all possible amino acid pairs from a survey of protein structures and are continuous as a function of interresidue distance. Training with only 25 native proteins produces an energy potential which is quite successful in structure identification tests. Small proteins and those with prosthetic groups pose the greatest challenge.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
RECOMB '99, Lyon, France
Copyright ACM 1999 1-58113-069-4/99/04...$5.00

Introduction

Development of a potential function requires consideration of the method used to produce the alternative structures. All-atom representations of proteins provide the greatest detail; however, they are computationally demanding to generate, resulting in reduced coverage of conformation space. A method which provides a wider sample of possible conformations with less atomic detail is threading (Hendlich et al., 1990; Crippen, 1991). Threading uses backbone segments from known PDB structures to produce alternatives which have protein-like characteristics such as proper bond lengths, angles and commonly observed secondary structure. For a query sequence of length N, any contiguous structural segment of N residues taken from a PDB structure of length N or greater can be evaluated as a possible structure for the query sequence. However, since the amino acid side chains of the query sequence differ from those of the structural templates being evaluated, compatibility is obtained by representing each side chain as a point in space such as the β carbon or a centroid (Bryant & Lawrence, 1993), which then takes on the identity of the respective side chain in the query sequence. Threading provides a computationally efficient method to sample a large region of conformation space, but the trade-off is a loss of atomic detail in the side chains.

Most potentials are a function of inter-side chain and side chain-backbone interactions. The potential is constructed such that each defined interaction has an associated energy parameter which may be obtained by statistical analysis of observed interactions in known PDB structures (Sippl, 1990; Miyazawa & Jernigan, 1985) or optimization methods (Maiorov & Crippen, 1992; Goldstein et al., 1992; Huber & Torda, 1997; Chiu & Goldstein, 1998). Many functional forms have been tried, most with large sets of energy parameters. However, in an assessment of statistical potentials Thomas and Dill concluded that most of the information about protein energetics contained in complex statistical potentials is simply hydrophobic
clustering propensity (Thomas & Dill, 1996). Hydrophobic burial is inversely related to solvent exposure, and as noted by Jones, "By far the largest component of any useful fold-recognition potential are going to be the solvation components" (Jones, 1997).

It is evident that solvation potentials which can evaluate simplified protein structures, such as those generated via threading methods, would be of great advantage. The challenge in constructing this type of potential lies in the loss of side chain atomic detail. The extent of solvent exposure, or conversely burial, of a residue in a given protein structure is a function of atomic contact with other side chains. In current threading methods, gapped or ungapped, the simplified representation of each side chain as a point in space introduces error when assessing inter-residue contacts and the extent of solvation.

Huang et al. (1995) demonstrated that a simple potential consisting of only a hydrophobic fitness score (HF) reflecting the extent of burial of apolar amino acids could successfully identify the correct structure for most sequences in their test set. Their HF potential achieved a remarkable 85% accuracy of native structure identification in an ungapped threading test. The results clearly demonstrate the strength of constructing a potential based on the hydrophobic effect; however, a problem in simplified potentials is also revealed. In the HF potential a contact is defined as any two side chain representations which are within 7.3 Å of each other. The simple on/off form of a contact definition is computationally appealing and is commonly used; however, discontinuous contacts (hard cutoffs) may diminish the accuracy of a potential and preclude the use of energy minimizers which require smooth and differentiable functions. In a test of six potential functions where decoy structures were produced by a four state off-lattice model, Park and Levitt (1996) concluded that functions which are distance dependent (in more than an on/off sense) are more effective than those which are not. In other studies it was found that the hard cutoff used in the HF potential contributed to incorrect structure identifications when decoys were generated by molecular dynamics simulations (Huang et al., 1996; Park et al., 1997). One problem noted with sharp cutoffs is that slight changes in atomic position may result in large changes in contacts. Here, we seek to continue the focus on solvation while improving on the treatment of contacts in simplified models of proteins.

Contact Definitions

A contact is typically defined as two side chains (Cβ or centroid) being within some specified cutoff distance from each other. There is much variation in the cutoff distances used in energy potentials, with a typical value around 9 Å. However, a single defined contact applied to all residue types is a poor representation in a solvation potential. A contact in a solvation potential should be defined as an interaction which excludes solvent from the space between two residues. The difficulty in choosing a contact distance is due to the variation in size among the twenty amino acids. The Cβ atoms of two large aromatic residues may be 12 Å apart, yet water is excluded from the space between them due to the volume occupied by the side chains. This same distance is not an appropriate contact definition for two alanines because a water molecule could easily pass between the two side chains. A solvation potential also requires a better estimation of the extent of contact since a single definition would give the same contact value to either glycine or tyrosine that is within the cutoff distance of a central residue. We address this problem by using contact definitions which are a function of the residue types involved and their distance apart.

For each of the 210 possible amino acid pairs we derive contact curves which represent the average number of atomic contacts (C_ij) observed in PDB structures as a function of the Cβ distance between the specified pair of residues. Thus, for structures with simplified side chain representations we can provide an improved estimate of interresidue interactions and the extent of solvation. The contact curves are obtained by tabulating all atomic (heavy atom) contacts in over 1300 PDB structures for each residue pair. Our survey structures are drawn from the PDB Select structures which represent protein sequences with no greater than 95% sequence homology (Hobohm, 1992). The proteins from the PDB Select list of June 1997 were screened to eliminate structures with chain breaks and alternative coordinates; the remaining 1347 structures became our survey set. To tabulate atomic contacts we use the model of Colonna-Cesari and Sander (1990) which considers two atoms from different side chains to be in contact if the interatomic distance is less than or equal to 6.4 Å. This distance ensures exclusion of a water molecule from between the two atoms and is derived by summing the van der Waals radii (2 x 1.8 Å) with the diameter of a water molecule (2.8 Å). We use a smooth sigmoidal curve with a maximum atomic contact value of one when the atoms are less than 3.6 Å apart, and a minimum value of zero when the distance is greater than 6.4 Å:

$$
\sigma(d_{lm}, U, L) =
\begin{cases}
\dfrac{(d_{lm} - U)^2 \, (2 d_{lm} - 3L + U)}{(U - L)^3} & \text{if } L \le d_{lm} \le U \\
1 & \text{if } d_{lm} < L \\
0 & \text{if } d_{lm} > U
\end{cases}
$$

where σ_lm is the atomic contact between atoms l and m as a
function of the inter-atomic distance d_lm, U is 6.4 Å, and L is 3.6 Å. For each of the 210 amino acid pairs all atomic contacts are tabulated from our survey structures. For each pair we record the average number of atomic contacts as a function of the distance between the Cβ carbons of the residues involved, and then fit the data to a sigmoidal curve. The contact curves for all 210 pairwise interactions are then incorporated into the solvation potential. Figure 1 shows the contact curve fit to the atomic contact data for the proline-threonine pair. C_ij is the average number of atomic contacts observed when the Cβ atoms of the two residues are the distance d_ij apart. This is one of the better fits with a χ² of 0.65. In general, the χ² increases with residue size.

The Solvation Potential

In our solvation potential we utilize the approach of Colonna-Cesari and Sander (1990) where a given amino acid type is assumed to have a fixed number of total contacts which are divided among solvent and protein interactions. In this manner, solvation can be determined implicitly by counting interresidue contacts. To assess the extent of solvation for a given residue i, of amino acid type k, we simply measure the inter-Cβ distances d_ij from the specified residue to all others in the protein. For each pairwise distance we obtain the contact value C_ij from the appropriate contact curve. All pairwise contact values are then summed to give the total contact value for residue i. The energy associated with residue i is the product of the contact value and an energy parameter P_k(i) for amino acid type k. The potential has 21 adjustable parameters, 20 of which represent the solvation preference of the amino acids and one parameter, P_s, is used in an overpacking term. Equation (1) represents the total energy of a given sequence in a given conformation and is the sum of the energy contribution from each residue in the protein plus an overpacking term tabulated for each residue.

$$
E = \sum_{i}^{N} \sum_{j}^{N} C_{ij} P_{k(i)} + \sum_{i}^{N} P_s S_i \qquad (1)
$$

$$
S_i =
\begin{cases}
0 & \text{if } \sum_{j}^{N} C_{ij} \le C_{k(i),\max} \\
\sum_{j}^{N} C_{ij} - C_{k(i),\max} & \text{if } \sum_{j}^{N} C_{ij} > C_{k(i),\max}
\end{cases}
$$
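As a minimal sketch of equation (1), the Python below computes the total energy from a precomputed matrix of residue contact values. The function and argument names are illustrative assumptions, not the authors' code. The contact values C_ij are taken as given (in practice they would be read off the fitted contact curves), and the self term j = i is skipped, which the text does not state explicitly.

```python
from typing import Dict, Sequence


def total_energy(
    contacts: Sequence[Sequence[float]],  # contacts[i][j] = C_ij from the contact curves
    residue_types: Sequence[str],         # residue_types[i] = amino acid type k(i)
    params: Dict[str, float],             # P_k, one solvation parameter per type
    max_contact: Dict[str, float],        # C_k,max, one ceiling per type
    p_steric: float,                      # P_s, the single overpacking parameter
) -> float:
    """Sum of per-residue contact energies plus the overpacking penalty."""
    energy = 0.0
    for i, k in enumerate(residue_types):
        # Total contact value of residue i (self term skipped: an assumption).
        c_i = sum(c for j, c in enumerate(contacts[i]) if j != i)
        # Excess contact S_i stays zero until the type-specific ceiling is exceeded.
        s_i = max(0.0, c_i - max_contact[k])
        energy += c_i * params[k] + p_steric * s_i
    return energy


# Toy two-residue example with made-up numbers (not values from the paper).
toy_contacts = [[0.0, 3.0], [3.0, 0.0]]
toy_energy = total_energy(
    toy_contacts,
    ["A", "B"],
    params={"A": -2.0, "B": 1.0},
    max_contact={"A": 2.0, "B": 5.0},
    p_steric=10.0,
)
```

In the toy case residue "A" exceeds its ceiling by one contact unit, so its favorable contact energy is partly offset by the overpacking penalty, while residue "B" contributes only its contact energy.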

Figure 1. Average number of atomic contacts C_ij observed between proline and threonine as a function of the distance d_ij (Å) between Cβ carbons. The data points were obtained from all proline-threonine interactions in 1347 PDB structures and were binned in 0.25 Å increments. The dashed line represents the sigmoid contact curve fit to the data.
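The atomic contact function σ from the Contact Definitions section, which underlies the tabulated data points in Figure 1, can be sketched as follows. This is an illustrative Python rendering under stated assumptions: the names `atomic_contact` and `residue_atomic_contacts` are hypothetical, atoms are plain (x, y, z) tuples in Å, and the defaults U = 6.4 and L = 3.6 are the values given in the text.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def atomic_contact(d: float, upper: float = 6.4, lower: float = 3.6) -> float:
    """Smooth atomic contact: 1 below `lower`, 0 above `upper`, cubic switch between."""
    if d < lower:
        return 1.0
    if d > upper:
        return 0.0
    return (d - upper) ** 2 * (2.0 * d - 3.0 * lower + upper) / (upper - lower) ** 3


def residue_atomic_contacts(atoms_a: Iterable[Point], atoms_b: Iterable[Point]) -> float:
    """Sum the smooth contact over every heavy-atom pair of two side chains."""
    atoms_b = list(atoms_b)  # materialize so the inner loop can be repeated
    return sum(atomic_contact(math.dist(p, q)) for p in atoms_a for q in atoms_b)


# Hypothetical coordinates chosen so the three distances are 3, 5 and 10 Angstroms.
example = residue_atomic_contacts(
    [(0.0, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
)
```

The cubic form equals one at d = L and zero at d = U with zero slope at both ends, which is what makes the contact definition continuous and differentiable, unlike a hard cutoff.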

Figure 2. Histogram of the observed number of tryptophans with a given total number of atomic contacts (C_trp,total). The total atomic contacts were tabulated for 6303 tryptophans found in 1347 PDB structures and were separated into 100 bins. The maximum number of atomic contacts observed for any tryptophan was 126.15. We use one standard deviation below this value as C_trp,max, which is 106.69. During threading a steric penalty is incurred when C_trp,max is exceeded for any given tryptophan residue.
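The cutoff described in the caption, the maximum observed total contact value minus one standard deviation, can be sketched as below. The function name is hypothetical, and the choice of the population standard deviation is an assumption, since the text does not say which estimator was used. For tryptophan, the caption's numbers imply a standard deviation of about 126.15 - 106.69 = 19.46.

```python
import statistics
from typing import Sequence


def max_contact_cutoff(total_contacts: Sequence[float]) -> float:
    """Largest observed total contact value minus one standard deviation."""
    if len(total_contacts) < 2:
        raise ValueError("need at least two observations")
    # Population standard deviation: an assumption, not stated in the source.
    return max(total_contacts) - statistics.pstdev(total_contacts)


# Made-up totals with mean 5, population standard deviation 2 and maximum 9.
toy_cutoff = max_contact_cutoff([2, 4, 4, 4, 5, 5, 7, 9])
```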

The summations are performed over all N residues. The overpacking term is the product of P_s and S_i, where P_s is the overpacking parameter and S_i is the excess contact value for residue i. S_i is the amount by which the total calculated inter-residue contact value for residue i exceeds a specified maximum value. Each amino acid type is assigned a maximum contact value C_k(i),max which is obtained from the distribution of total atomic contacts observed in our survey structures. For each amino acid of type k found in our survey structures we use our contact definition curves to calculate the total number of interresidue atomic contacts. The maximum contact values for each amino acid type are calculated as follows: for a given residue in a given structure the distance to every other residue in the structure is measured as d_ij, and for each d_ij the atomic contact value is obtained from the appropriate contact curve. Atomic contact values are summed for the given residue, providing the total number of atomic contacts per residue. A histogram for each residue type provides a distribution of the number of occurrences of each total contact value observed in the survey. The contact value that is one standard deviation lower than the maximum observed is used as C_k(i),max. Figure 2 is the atomic contact distribution for tryptophan obtained from the 6303 tryptophans found in 1347 survey structures. Taking one standard deviation from the maximum, we assign C_trp,max as 106.69.

Optimizing the Parameters

The adjustable parameters P_k and P_s are obtained through an optimization procedure which ensures that for each protein sequence in a training set, the PDB (native) structure is the lowest in energy compared to a very large set of alternative conformations generated by threading.

$$
E_{alt} - E_{nat} \ge m \qquad (2)
$$

where E_nat is the energy of the native sequence in the native conformation, and E_alt is the energy of the native sequence in an alternative conformation. The energies are calculated per equation (1). The constant m is an energy margin set to 10. Each threading contributes an inequality, and the parameters are obtained by solving the system of linear inequalities (Crippen, 1991; Maiorov & Crippen,
1992). The training procedure begins with arbitrarily set parameters, then a native sequence is threaded through a set of template structures to produce the alternatives. As each alternative is generated, we calculate the energy, check compliance with equation (2) and continue to the next threading if equation (2) is satisfied. Whenever a violation of equation (2) is encountered, we add the current inequality (constraint) to our problem, solve for the parameters, then continue threading. The procedure is repeated for each native and continues until equation (2) is satisfied for every alternative structure produced for every native sequence. Structures classified as near-native are excluded. Since we are satisfying inequalities, there can be multiple solutions to a given problem. In this case we optimize the parameters by choosing the values which minimize the sum of parameters.

It is possible to encounter an infeasible problem where there is no solution to the set of linear inequalities. Care must be taken when choosing the training set of proteins so that we are not requiring the potential to satisfy a constraint which is unreasonable. An example is cytochrome c which has covalently bound heme. Without heme attached, apocytochrome c is noncompact and does not fold to its native structure (Dumont et al., 1994). We found that including c-type cytochromes in the training set results in an infeasible problem. However, since our energy potential does not consider interactions with prosthetic groups, metals, or ligands, we should not demand that the native structure of a protein dependent on such interactions for native state stability be the lowest in energy. In effect, we are calculating the energy of the apo form of such proteins. Others have reported poor results when testing potentials with proteins containing prosthetic groups (Huang et al., 1995; Miyazawa and Jernigan, 1996). Here, we exclude these proteins from our training set.

Definition of Near-native, the Burial RMSD

Our goal is to construct a potential which can consistently identify the native or a near-native structure for a given protein sequence out of a large set of alternatives. The native structure is generally accepted to be the crystal or NMR structure in the PDB. The definition of near-native is much more ambiguous. Similarity of two structures is conventionally measured by the coordinate root-mean-square deviation (RMSD) or the distance RMSD (DME). These measures of similarity are visually appealing since they classify structures that look the same as similar. However, they are not satisfying when we consider structures which are energetically near-native. Motion of solvent exposed loops or hinge-like motion between large protein domains do not necessarily imply changes in free energy yet may result in RMSD differences. It is desirable to have a measure of structural similarity that is consistent with our measure of energy. Since our energy potential is a function of solvation, we wish to define near-native as those structures which have a similar solvation profile as the native. This definition allows for energetically similar structures to be classified as near-native despite coordinate differences. We introduce the burial root-mean-square deviation (BRMSD) as:

$$
\mathrm{BRMSD} = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( C_{i,a} - C_{i,b} \right)^2 \right]^{1/2}
$$

For a given sequence in two different conformations, N is the number of residues in the protein, and C_i,a and C_i,b are the total number of contacts tabulated for residue i in structures a and b respectively using our contact definitions.

To assess the relationship of BRMSD to sequence length we selected 18 PDB structures which ranged from 43 to 333 residues in length. We threaded the 18 sequences through 121 randomly selected PDB structures and measured the BRMSD and DME for each alternative in reference to the native structure. The average BRMSD shows little dependence on sequence length (N) while the average DME has a clear dependence. Reva et al. (1998) reported the average coordinate RMSD is proportional to N^(1/3) for 142 threaded proteins. The apparent independence of BRMSD on sequence length is appealing because one cutoff value may be suitable to define near-native for proteins of various sequence lengths. To determine whether conventionally defined structural homologues fall within a given value of BRMSD we compared the coordinate RMSD and BRMSD for 12 structural homologues (RMSD between 0.4 and 2.4) of protein 135l, turkey egg white lysozyme. Indeed, all homologues had a BRMSD below 10. Thus, we chose a BRMSD of 10 as our near-native cutoff, and structures with a BRMSD less than 10 relative to the native are exempt from conformance to equation (2) during training.

Initial Training and Evaluation

As a method of testing the solvation potential during development and training we have initially used simple, ungapped threading for two reasons: 1) It remains the minimal test that should be satisfied, i.e. if the native structure cannot be consistently identified from a set of ungapped decoys there is little hope that it will succeed
when faced with gapped structures; 2) Numerous ungapped threading tests have appeared in the literature and allow for comparison of results obtained with other potential functions. The initial question which we sought to address is: Can the solvation potential, with a reasonable amount of training, achieve comparable results to those previously established?

Initial training of the potential function using the training set of Maiorov & Crippen (1992) followed by testing with the test set of Miyazawa & Jernigan (1996) indicated that small proteins and those with prosthetic groups posed the greatest problem. As mentioned previously, when c-type cytochromes were included in the training set the problem became infeasible. In other words, no set of parameters allowed every native structure to be the lowest in energy for the corresponding native sequence in the training set. We found that by removing the cytochromes and two other proteins (1hoe and 1hvp.A) from the RTS training set of Maiorov & Crippen we were able to train the solvation potential to correctly identify the training set native structures as the lowest in energy out of all alternatives. We will refer to this set of natives as the initial training set (ITS). The ability to achieve comparable training to previous work is significant since we are now using only 21 adjustable parameters where Maiorov & Crippen had used 112. It is desirable to use the minimum number of adjustable parameters to avoid overfitting, which results in a function well trained to identify a small number of natives, but possibly incapable of accommodating the general population of proteins.

The solvation potential and the parameters obtained from training with the ITS were applied to the test set of Miyazawa & Jernigan (1996). This test consists of threading 88 test sequences through 189 template structures which include the native structure for each sequence. We did not use protein 2trx.A because several residues in the PDB structure have multiple conformers; our test included the remaining 87 proteins. Excluding proteins common to the training and test sets, our potential function was successful in identifying the native structure for 73% of the test sequences. While not as successful as the 86% accuracy obtained by Miyazawa & Jernigan, it should be noted that our training set consisted of only 27 proteins, and additional training promises improved results. The results are less encouraging for proteins shorter than 110 residues. We were only 50% accurate in the Miyazawa & Jernigan test for these proteins, while our potential was able to correctly identify 83% of the proteins greater than 110 residues in length. It could be argued that the poor performance with small proteins is due to the greater number of alternatives available from threading. Smaller sequences have more decoy structures produced by threading, resulting in a greater challenge. In the M&J test a 54 residue sequence has about 30,000 alternatives while a 401 residue protein has fewer than 1700 alternative structures available. Huber & Torda (1998) speculated that their optimized potential may have provided better Z-scores for larger proteins because of the smaller number of alternatives available as compared to smaller proteins. They also noted that the common occurrence of disulfide bonds and post-translational modification in small proteins may make their native structures more difficult to identify. The former explanation suggests that the notable performance of contact potentials in ungapped threading tests of larger proteins may be simply due to the smaller number of alternative structures used in the test. If we use a very large number of alternative structures, can our potential function still be trained to identify the correct conformation for the native sequences?

Training with Extensive Threading

With the goal of training the solvation potential with a much larger set of alternatives, we used our entire set of 1347 survey structures as the threading template set. This is over 10 times the number of template structures used with the ITS. Also, to focus on proteins greater than 110 amino acids in length, we created a new training set of natives. To keep our training set within a computationally manageable size, we followed a few guidelines in selecting the native proteins out of the nearly 8000 PDB structures available. We started with our survey set which represents a diverse set of proteins. Then, we limited candidates to those proteins between 110 and 500 residues in length. It is also desirable to use typical proteins, so we selected those with average amino acid composition. This was done by classifying each amino acid type as hydrophobic (A,V,L,I,M,Y,W,F), polar (S,T,C,N,Q), charged (K,R,E,D,H), or turn (G,P), then assessing the composition of each protein in our survey set and selecting those within one standard deviation of the mean. The last selection criterion is the exclusion of proteins with prosthetic groups, ligands, or a large number of metals. This reduced the set of candidates down to 69 proteins. The potential was trained by first using a subset of the candidates as native sequences, training by threading through our entire survey set, then retaining those natives which contributed constraints to the problem. If an inequality (equation 2) produced by threading a given native sequence through a given structure is a solution boundary in the set of inequalities, we consider it a constraint. We iteratively add natives from the 69 candidates until all are used in at least one round of training, while removing those which provide no constraints to the problem. Using this procedure, only 25 out of the 69 candidates actually contributed constraints to the problem. We will refer to this set of natives as the
extensive threading set (ETS).

The ETS along with the number of alternative structures used for each one are listed in Table 1. Note that the solvation potential is able to identify the native structure as the lowest in energy when compared to nearly 135,000 alternatives on average. Compared with the previous training set, where approximately 5000 alternatives were used for a 170 residue sequence, we have provided a much greater challenge. The optimized parameters are shown in Table 2. A negative value indicates that interresidue contacts for this amino acid type are favorable, while amino acids with positive parameters tend to be exposed to solvent. It is reassuring that the parameters are consistent with generally accepted hydrophobic classifications. We imposed an arbitrary boundary of -1000 ≤ P_k ≤ 1000 for each parameter, and several parameters are at the lower boundary. Our optimization calls for the solution which minimizes the sum of parameters; therefore, some parameters have a value of -1000 because they are not constrained by existing inequalities of the problem. This condition suggests that the current solution space is large and further training can be accommodated.

Table 1. ETS Training Set.

PDB      length  class  alternatives
1paz     120            191478
1dro     122            188740
1bw4     125            184709
1eal     127            182080
135l     129            179483
1arIl    129            179483
1rsy     135            171958
1lcl     141            164647
1lba     146            158727
8ilb     146            158727
2uce     148            156416
1aak     150            154136
189l     164            139067
1sfe     165            138039
1lki     172            130967
1hbq     177            127015
1nfa     178            125082
1wbc     183            120308
1dsb.A   188            115663
1ak2     220            88481
1jud     220            88481
1mml     251            68562
1apa     261            63048
1btl     263            61982
1agx     331            32779
AVG      <176>          <134802>

Testing the Parameters Obtained by Extensive Threading

The parameters shown in Table 2 were applied to the Miyazawa & Jernigan test, and the results compared to those obtained in the initial training. The ETS results exclude one protein that was common to both the ETS training set and the M&J test set. Table 3 is a summary of results for the tests of the ITS and ETS parameters. The extensive threading parameters provided slightly less overall accuracy due mostly to a significant drop in accuracy with the smaller proteins. Proteins greater than 110 residues were correctly identified in about 82% of the cases with both sets of parameters. Excluding small proteins from our extensive threading training set apparently reduced the ability of the resulting potential to correctly identify small proteins. It is interesting to note that despite using two completely different training sets, of the 28 proteins which failed with ETS parameters, 23 also failed with the ITS parameters. This may reflect an intrinsic weakness of the solvation potential in its current form, namely difficulty with small proteins.

For the proteins which were incorrectly identified we examined those structures which ranked higher than the native to find if they had native-like solvation profiles as measured by the BRMSD. For those natives which ranked lower than 50 we only checked the BRMSD of the top 50 alternatives. Out of 28 failures only seven had a higher ranking (lower energy) alternative with a BRMSD of less than 10. Apparently, for most proteins which fail the structure identification test, the problem is not due to decoys with native-like folds. A representative example is 1tpk.A, plasminogen activator, with 88 residues. The native structure ranked 22 out of over 23,000 alternatives. The average BRMSD of the 21 higher ranking alternatives is 16.51 which is about one standard deviation from the mean BRMSD of the entire population of alternatives for 1tpk.A. The higher ranking decoys are not much more native-like than the general population of alternatives. Further analysis of common characteristics among the higher ranking structures may reveal why they are preferred over the native by the potential function. Conversely, analysis of the native structures which fail identification tests may reveal common features among these proteins which provide stability and are not currently accounted for. In the M&J test 27 proteins are shorter than 110 residues in length (again excluding 2trx.A). Over half of these contain disulfides. Of the 15 small, disulfide-bearing proteins 80% failed. Consideration of disulfides and their contribution to native state stability may be required.
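The failure analysis above (rank the native by energy, then check whether any lower-energy alternative has a BRMSD below the near-native cutoff of 10) can be sketched in Python. This is a hedged illustration, not the authors' implementation: all function names are hypothetical, each structure is represented only by its per-residue total contact profile, and the `top=50` limit mirrors the text's restriction to the 50 best alternatives.

```python
import math
from typing import List, Sequence, Tuple


def brmsd(profile_a: Sequence[float], profile_b: Sequence[float]) -> float:
    """Root-mean-square difference between two per-residue contact profiles."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    return math.sqrt(squared / len(profile_a))


def native_rank(e_native: float, e_alternatives: Sequence[float]) -> int:
    """Rank 1 means no alternative has a lower energy than the native."""
    return 1 + sum(1 for e in e_alternatives if e < e_native)


def near_native_above(
    native_profile: Sequence[float],
    e_native: float,
    alternatives: Sequence[Tuple[float, Sequence[float]]],  # (energy, contact profile)
    cutoff: float = 10.0,
    top: int = 50,
) -> List[int]:
    """Indices of lower-energy alternatives, among the `top` best, with BRMSD < cutoff."""
    better = sorted(
        (i for i, (e, _) in enumerate(alternatives) if e < e_native),
        key=lambda i: alternatives[i][0],
    )[:top]
    return [i for i in better if brmsd(native_profile, alternatives[i][1]) < cutoff]


# Made-up energies and two-residue contact profiles, for illustration only.
toy_native = [10.0, 20.0]
toy_alts = [(-7.0, [13.0, 24.0]), (-6.0, [40.0, 60.0]), (-1.0, [10.0, 20.0])]
toy_rank = native_rank(-5.0, [e for e, _ in toy_alts])
toy_hits = near_native_above(toy_native, -5.0, toy_alts)
```

In the toy data two alternatives outrank the native, but only the first has a native-like solvation profile, so a failure of this kind would count among the cases attributable to a near-native decoy.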

Table 2. ETS parameter values for the 20 amino acids and the steric penalty P_s. Negative values reflect a favorable contact energy while amino acids with positive values tend to be exposed.

Parameter   Value
ASN          923.87
P_s          843.32
GLU          767.63
GLN          492.88
ASP          411.67
LYS          377.33
ARG          182.73
PRO           84.72
SER           41.50
HIS           29.54
GLY           15.33
THR          -70.69
TRP         -379.85
PHE         -386.14
MET         -407.36
TYR         -412.53
LEU         -580.94
VAL         -893.91
ILE        -1000.00
CYS        -1000.00
ALA        -1000.00

(Positive values: unfavorable contact. Negative values: favorable contact.)

Table 3. Summary of test results using the Miyazawa & Jernigan threading test. Figures represent the percentage of structures correctly identified for the test sequences. Out of a total of 87 proteins, 60 have a chain length greater than 110 residues. Proteins common to training and test sets were not included in the statistics.

               ITS    ETS
overall        73.4   67.4
length > 110   82.5   81.4
length < 110   50.0   37.0

Conclusions

We have constructed a potential function modeled on the predominant contribution to protein folding, namely solvation, while attempting to provide an improved treatment of intramolecular interactions. Predefined contact definitions for all possible pairs of amino acids provide a method to assess the extent of solvation using simplified side chain representations. The contact definitions are smooth and continuous as a function of interresidue distance. The potential function can be optimized so that the native structure for each protein sequence in a training set is the lowest in energy compared to an immense set of alternative conformations. Testing with the Miyazawa & Jernigan test shows the potential function to be greater than 80% accurate in identifying the native structure for sequences over 110 residues in length, but shorter proteins are problematic. Results with two sets of parameters obtained from independent training sets indicate numerous common failures. Small, disulfide-bearing proteins are particularly difficult to identify, and consideration of disulfides may be required in the potential.

Given that fewer than 30 proteins were used in each training set, the results indicate the approach is viable and promising. It is expected that an expanded training set of native proteins will provide additional constraints to the set of inequalities and result in an improved potential. Current parameters indicate that the existing solution space is large and additional constraints can be tolerated. Further training should be followed by an expanded repertoire of tests to measure the capability of the potential to identify the native among decoys generated by non-threading methods such as off-lattice models and molecular dynamics. Additional template structures may not be beneficial to training. Further testing is required to determine if training with our extensive set of alternative structures results in a better potential compared with more limited training. Based on the test results it is tempting to conclude that our extensive threading produced a potential no smarter than that obtained with our limited initial training. However, differences in the prediction ability of the two sets of parameters may not be apparent with the current test. Despite the differences between the two training sets, one would expect that parameters obtained by training with a much larger set of decoy structures would result in a more accurate potential. Could it be that by increasing the number of alternatives through threading we are just adding more structures similar to those already used? Consideration of the quality of alternatives is warranted.

Acknowledgements

A.A.D. is a University of Michigan Rackham Regents Fellow. This work is also supported by the Michigan Molecular Biophysics Training Program, National Institutes of Health (GM08270).

References

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through folding motif. Proteins: Struct. Funct. Genet. 16, 92-112.

Chiu, T. & Goldstein, R. A. (1998). Optimizing energy potentials for success in protein tertiary structure prediction. Folding & Design 3, 223-228.

Colonna-Cesari, F. & Sander, C. (1990). Excluded volume approximation to protein-solvent interaction. The solvent contact model. Biophys. J. 57, 1103-1107.

Crippen, G. M. (1991). Prediction of protein folding from amino acid sequence over discrete conformation space. Biochemistry 30, 4232-4237.

Dumont, M. E., Corin, A. F. & Campbell, G. A. (1994). Noncovalent binding of heme induces a compact apocytochrome c structure. Biochemistry 33, 7368-7378.

Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes, P. G. (1992). Protein tertiary structure recognition using optimized hamiltonians with local interactions. Proc. Natl. Acad. Sci. USA 89, 9029-9033.

Hendlich, M., Lackner, P., Weitckus, S., Floeckner, H., Froschauer, R., Gottsbacher, Casari, G. & Sippl, M. J. (1990). Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. 216, 167-180.

Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Science 1, 409-417.

Huang, E. S., Subbiah, S. & Levitt, M. (1995). Recognizing native folds by the arrangement of hydrophobic and polar residues. J. Mol. Biol. 252, 709-720.

Huang, E. S., Subbiah, S., Tsai, J. & Levitt, M. (1996). Using a Hydrophobic Contact Potential to Evaluate Native and Near-native Folds Generated by Molecular Dynamics Simulations. J. Mol. Biol. 257, 716-725.

Huber, T. & Torda, A. E. (1998). Protein fold recognition without Boltzmann statistics or explicit physical basis. Protein Science 7, 142-149.

Jones, D. T. (1997). Progress in protein structure prediction. Curr. Op. in Struct. Biol. 7, 377-387.

Maiorov, V. N. & Crippen, G. M. (1992). Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227, 876-888.

Miyazawa, S. & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18, 534-552.

Miyazawa, S. & Jernigan, R. L. (1996). Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623-644.

Park, B. & Levitt, M. (1996). Energy Functions that Discriminate X-ray and Near-native Folds from Well-constructed Decoys. J. Mol. Biol. 258, 367-392.

Park, B. H., Huang, E. S. & Levitt, M. (1997). Factors Affecting the Ability of Energy Functions to Discriminate Correct from Incorrect Folds. J. Mol. Biol. 266, 831-846.

Reva, B. A., Finkelstein, A. V. & Skolnick, J. (1998). What is the probability of a chance prediction of a protein structure with an rmsd of 6 Å? Folding & Design 3, 141-147.

Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213, 859-883.

Thomas, P. D. & Dill, K. A. (1996). Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 257, 457-469.
