Computational Methods For Protein Structure Prediction

Ying Xu
2010/1/19
Outline
 introduction to protein structures
 the problem of protein structure prediction
 protein secondary structure prediction
 protein tertiary structure prediction

 Ab initio folding
 homology modeling
 protein threading
Protein Sequence, Structure and Function
Protein sequence
>1MBN:_ MYOGLOBIN (154 AA)
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG
Protein structure (figure)
Protein function: oxygen storage
Protein Structure
 A protein sequence folds into a “unique” shape (“structure”) that minimizes its free energy
Protein Structures
 Primary sequence
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
 Secondary structure
 α-helix
 β-sheet: anti-parallel or parallel
Protein Structures
 Tertiary structure
 Quaternary structure
Protein Structures
 Backbone versus all-atom structures
Backbone + sidechain = all-atom structure
Backbone structure == structural fold

Protein Structures
 Protein structure
 generally compact
 Soluble protein structure

 individual domains are generally globular
 they share various common characteristics, e.g. hydrophobic
moment profile
 Membrane protein structure
 most of the amino acid sidechains of transmembrane segments are non-polar
 polar groups of the polypeptide backbone of transmembrane segments generally participate in hydrogen bonds
Protein Structure Prediction
Problem: Given the amino acid sequence of a protein, computationally predict its 3-dimensional shape.
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG
Secondary Structure Prediction
Eight Secondary Structure Types
H: -helix (i, i+4 form hydrogen bond)

G: 310 helix (i, i+3 form hydrogen bond) H
I: -helix
helix (i,i i+5 form hydrogen bond)
E: -strand E
B: bridge
T: -turn
 turn
S: bend C
C: coil
Rough categories: H, E, C
2010/1/19 10
Given a protein sequence (primary structure), predict its secondary structure categories
GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECCCEEEEECCCHHHHHHCCCCCC
E: β-strand
H: α-helix
C: coil
Assumption: short stretches of residues have a propensity to adopt a certain structural conformation
Secondary structure propensities:
Calculate the propensity for a given amino acid to adopt a certain ss-type
$$ P(\alpha \mid aa_i) = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)} $$
i: amino acid
α: secondary structure type
 Example: a data set with 300 proteins containing #residues=20,000, #Ala=2,000, #helix=4,000, #Ala in helix=500
 p(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
 P = 500 / (4,000/10) = 1.25
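The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name and argument names are illustrative and not part of the slides.

```python
def propensity(n_joint, n_ss, n_aa, n_total):
    """Propensity = p(ss, aa) / (p(ss) * p(aa)), computed from residue counts."""
    p_joint = n_joint / n_total   # fraction of residues that are this aa in this ss type
    p_ss = n_ss / n_total         # fraction of residues in this ss type
    p_aa = n_aa / n_total         # fraction of residues that are this aa
    return p_joint / (p_ss * p_aa)

# Counts from the slide: 20,000 residues, 2,000 Ala, 4,000 helix, 500 Ala in helix.
ala_helix = propensity(n_joint=500, n_ss=4000, n_aa=2000, n_total=20000)
print(round(ala_helix, 2))
```

A value above 1 means the amino acid occurs in that secondary structure type more often than expected by chance; a value below 1 means it occurs less often.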
Predict helix (and predict strand)
1. Scan each window of 6 residues; if score > 4 predict helix (if score > 3 predict strand)
2. Propagate in both directions until a 4-residue window with mean propensity < 1 is reached
3. Move forward and repeat
Resolving Conflicts
For overlapping regions, decide according to propensity parameters
Chou-Fasman designed the first prediction program
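The scan, propagate, and repeat steps above can be sketched as follows for the helix case. This is a simplified illustration, not the published Chou-Fasman program: the propensity values in `HELIX_P` are made-up placeholders, and the window "score" is assumed here to be the number of residues in the window with propensity above 1, with at least 4 of 6 required to start a helix.

```python
# Hypothetical helix propensities for illustration only (not the published table).
HELIX_P = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.1,
           "G": 0.6, "P": 0.6, "S": 0.8, "N": 0.7, "T": 0.8}

def predict_helix(seq, table=HELIX_P, window=6, min_formers=4, ext=4):
    """Return a string with 'H' for predicted helix residues and 'C' elsewhere."""
    p = [table.get(aa, 1.0) for aa in seq]   # unknown residues are treated as neutral
    state = ["C"] * len(seq)
    i = 0
    while i + window <= len(seq):
        # Step 1: look for a nucleation window with enough helix formers.
        formers = sum(1 for x in p[i:i + window] if x > 1.0)
        if formers < min_formers:
            i += 1
            continue
        start, end = i, i + window           # helix currently covers [start, end)
        # Step 2: extend right while the 4-residue window ending at the new residue has mean >= 1.
        while end < len(seq) and sum(p[end - ext + 1:end + 1]) / ext >= 1.0:
            end += 1
        # Step 2: extend left while the 4-residue window starting at the new residue has mean >= 1.
        while start > 0 and sum(p[start - 1:start - 1 + ext]) / ext >= 1.0:
            start -= 1
        for j in range(start, end):
            state[j] = "H"
        i = end                              # Step 3: move forward and repeat
    return "".join(state)

print(predict_helix("GSGSAELMKAGSGS"))
```

A strand predictor would follow the same pattern with a strand propensity table and its own threshold; overlapping helix and strand calls would then be resolved by comparing the propensities over the overlapping region.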

 This simple algorithm had ~60% prediction accuracy and was the best prediction method for a long while
 The next breakthrough came in 1999 when the PSI-PRED program was developed
 Key ideas
 using sequence profiles, generated by PSI-BLAST, rather than individual sequences for secondary structure prediction
 combining multiple predictors using a neural network
 accuracy reaches ~76%
 Using larger training sets, the prediction accuracy can reach ~80%. So how much further can we push this?
 Non-locality: secondary structure is influenced by long-range interactions
 Some segments can have multiple structure types (chameleon sequences)
There is some room for improvement but not much
(Figure: Long Piccolo C2A and Short Piccolo C2A)

Methods for Tertiary Structure Prediction
 ab initio
 use first principles to fold proteins
 does not require templates
 high computational complexity
 homology modeling
 similar sequences, similar structures
 practically very useful, needs homologues
 protein threading
 many proteins share the same structural fold
 a folding problem becomes a fold recognition problem
Ab initio Structure Prediction
An energy function to describe the protein
o bond energy
o bond angle energy
o dihedral angle energy
o van der Waals energy
o electrostatic energy
Need an algorithm to search the conformational space to find the structural conformation that minimizes the function.
Not practical in general
o computationally too expensive
o accuracy is poor
Homology Modeling
Observation: proteins with similar sequences tend to fold into
similar structures.
1. Target sequence is aligned with the sequence of a known structure (typically requiring that they share 30% or higher sequence identity)
2. Superimpose target sequence onto the template, replacing equivalent side-chain atoms where necessary
3. Refine the model by minimizing an energy function.
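The 30% threshold in step 1 refers to sequence identity computed from an alignment. The sketch below shows one way to compute it, assuming identity is counted over aligned columns in which neither sequence has a gap; other definitions use a different denominator, and the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free aligned columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

# Two short aligned fragments used purely as an illustration.
identity = percent_identity("FDSK---THRGHR", "FESYWTCTH-GHR")
print(f"{identity:.1f}% identity")
```

A target-template pair would be considered a reasonable candidate for homology modeling when this value is at or above the threshold quoted in step 1.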
Programs:
Modeller http://salilab.org/modeller/
Swiss-Model http://swissmodel.expasy.org//SWISS-MODEL.html
Protein Threading
 Basic premise
 The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
 Statistics from Protein Data Bank (~61,000 structures)
 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
 Chances for a protein to have a native-like structural fold in PDB are quite good (estimated to be 60-70%)
 Proteins with similar structural folds could be homologues or analogues
Protein Threading
 The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
MTYKLILN …. NGVDGEWTYTE
 Energy function – knowledge (or statistics) based rather than physics based
 Should be able to distinguish correct structural folds from incorrect structural folds
 Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments
Protein Threading – four basic components
 Structure database
 Energy function
 Sequence-structure alignment algorithm
 Prediction reliability assessment
Protein Threading – structure database
 Build a template database
• Non-redundant representatives through structure-structure and/or sequence-sequence comparison
FSSP (Families of Structurally Similar Proteins) (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)
Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)
Protein Threading – energy function
E_p: how preferable it is to put two particular residues nearby
E_s: how well a residue fits a structural environment
E_g: alignment gap penalty
total energy: E_p + E_s + E_g
Find a sequence-structure alignment to minimize the energy function
Protein Threading – energy function: singleton energy term
 A singleton energy measures each residue’s preference for a specific structural environment
 secondary structure
 solvent accessibility
 Compare actual occurrence against its “expected value” by chance
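The comparison of observed against expected occurrence is commonly turned into an energy with a negative log-odds ratio. The sketch below assumes that form, with the expected count taken as what independence between amino acid type and structural environment would give; the exact formula used on the slide is not reproduced in the text, so the function and its arguments are illustrative assumptions.

```python
import math

def singleton_energy(n_obs, n_aa, n_env, n_total):
    """Assumed log-odds form: -ln(observed count / count expected by chance)."""
    # Expected count if amino acid type and environment were independent.
    n_exp = n_aa * n_env / n_total
    return -math.log(n_obs / n_exp)

# Hypothetical counts: an amino acid seen twice as often as chance in an environment.
e_favored = singleton_energy(n_obs=200, n_aa=1000, n_env=2000, n_total=20000)
print(round(e_favored, 3))
```

Under this convention a negative value marks an environment the residue prefers and a positive value marks one it avoids.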
Protein Threading – energy function: singleton energy term
 A simple definition of structural environment
 secondary structure: alpha-helix, beta-strand, loop
 solvent accessibility: 0, 10, 20, …, 100% of accessibility
 each combination of secondary structure and solvent accessibility level defines a structural environment
• E.g., (alpha-helix, 30%), (loop, 80%), …
 E_s: a scoring matrix of 30 structural environments by 20 amino acids
 E.g., E_s((loop, 30%), A)
Protein Threading – energy function: singleton energy term
     Helix                   Sheet                   Loop
     Buried   Inter Exposed  Buried   Inter Exposed  Buried   Inter Exposed
ALA  -0.578  -0.119  -0.160   0.010   0.583   0.921   0.023   0.218   0.368
ARG   0.997  -0.507  -0.488   1.267  -0.345  -0.580   0.930  -0.005  -0.032
ASN   0.819   0.090  -0.007   0.844   0.221   0.046   0.030  -0.322  -0.487
ASP   1.050   0.172  -0.426   1.145   0.322   0.061   0.308  -0.224  -0.541
CYS  -0.360   0.333   1.831  -0.671   0.003   1.216  -0.690  -0.225   1.216
GLN   1.047  -0.294  -0.939   1.452   0.139  -0.555   1.326   0.486  -0.244
GLU   0.670  -0.313  -0.721   0.999   0.031  -0.494   0.845   0.248  -0.144
GLY   0.414   0.932   0.969   0.177   0.565   0.989  -0.562  -0.299  -0.601
HIS   0.479  -0.223   0.136   0.306  -0.343  -0.014   0.019  -0.285   0.051
ILE  -0.551   0.087   1.248  -0.875  -0.182   0.500  -0.166   0.384   1.336
LEU  -0.744  -0.218   0.940  -0.411   0.179   0.900  -0.205   0.169   1.217
LYS   1.863  -0.045  -0.865   2.109  -0.017  -0.901   1.925   0.474  -0.498
MET  -0.641  -0.183   0.779  -0.269   0.197   0.658  -0.228   0.113   0.714
PHE  -0.491   0.057   1.364  -0.649  -0.200   0.776  -0.375  -0.001   1.251
PRO   1.090   0.705   0.236   1.249   0.695   0.145  -0.412  -0.491  -0.641
SER   0.350   0.260  -0.020   0.303   0.058  -0.075  -0.173  -0.210  -0.228
THR   0.291   0.215   0.304   0.156  -0.382  -0.584  -0.012  -0.103  -0.125
TRP  -0.379  -0.363   1.178  -0.270  -0.477   0.682  -0.220  -0.099   1.267
TYR  -0.111  -0.292   0.942  -0.267  -0.691   0.292  -0.015  -0.176   0.946
VAL  -0.374   0.236   1.144  -0.912  -0.334   0.089  -0.030   0.309   0.998
Protein Threading – energy function: pair-wise interaction energy term
 It measures the preference of a pair of amino acids to be close in 3D space.
 Observed occurrence of a pair compared with its “expected” occurrence (uniform state model)
Protein Threading – energy function: pair-wise interaction energy term
ALA -140
ARG 268 -18
ASN 105 -85 -435
ASP 217 -616 -417 17
CYS 330 67 106 278 -1923
GLN 27 -60 -200 67 191 -115
GLU 122 -564 -136 140 122 10 68
GLY 11 -80 -103 -267 88 -72 -31 -288
HIS 58 -263 61 -454 190 272 -368 74 -448
ILE -114 110 351 318 154 243 294 179 294 -326
LEU -182 263 358 370 238 25 255 237 200 -160 -278
LYS 123 310 -201 -564 246 -184 -667 95 54 194 178 122
MET -74 304 314 211 50 32 141 13 -7 -12 -106 301 -494
PHE -65 62 201 284 34 72 235 114 158 -96 -195 -17 -272 -206
PRO 174 -33 -212 -28 105 -81 -102 -73 -65 369 218 -46 35 -21 -210
SER 169 -80 -223 -299 7 -163 -212 -186 -133 206 272 -58 193 114 -162 -177
THR 58 60 -231 -203 372 -151 -211 -73 -239 109 225 -16 158 283 -98 -215 -210
TRP 51 -150 -18 104 52 -12 157 -69 -212 -18 81 29 -5 31 -432 129 95 -20
TYR 53 -132 53 268 62 -90 269 58 34 -163 -93 -312 -173 -5 -81 104 163 -95 -6
VAL -105 171 298 431 196 180 235 202 204 -232 -218 269 -50 -42 46 267 73 101 107 -324
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
Protein Threading – energy function: gap penalty term
w(k) = h + gk, k ≥ 1, w(0) = 0;
h: gap opening penalty
g: gap extension penalty
FDSK---THRGHR    FDSK-T--HRGHR
:.:    :: :::    :.:  :  : :::
FESYWTCTH-GHR    FESYWTCTH-GHR
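The affine gap penalty w(k) = h + gk can be applied to each run of gap characters in an aligned sequence. The sketch below does this for the top rows of two alternative alignments of the same pair; the values h = 10 and g = 1 are arbitrary placeholders, not parameters from the slides.

```python
import re

def affine_gap(k, h, g):
    """w(k) = h + g*k for a gap of length k >= 1, and w(0) = 0."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if k == 0 else h + g * k

def total_gap_penalty(aligned, h, g):
    """Sum w(k) over every maximal run of '-' in one aligned sequence."""
    return sum(affine_gap(len(run), h, g) for run in re.findall(r"-+", aligned))

H, G = 10, 1  # hypothetical opening and extension penalties
one_gap = total_gap_penalty("FDSK---THRGHR", H, G)    # a single gap of length 3
two_gaps = total_gap_penalty("FDSK-T--HRGHR", H, G)   # gaps of length 1 and 2
print(one_gap, two_gaps)
```

With an opening penalty larger than the extension penalty, one long gap costs less than several short gaps covering the same number of positions, which is the behavior an affine penalty is meant to produce.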
Threading Parameter Optimization
 How to determine the weights of the different energy terms?
E_total = ω_s E_singleton + ω_p E_pairwise + ω_g E_gap
 Select the weights to give the “best” threading performance on a training set (fold recognition and alignment accuracy)
 Different weights for different classes? (superfamily, fold)
 pair-wise terms may contribute more for fold-level threading
 mutation/profile terms dominate in superfamily-level threading
Protein Threading -- algorithm
 Dynamic programming
 Heuristic algorithms for pair-wise interactions
 Frozen approximation algorithm (A. Godzik et al.)
 Double dynamic programming (D. Jones et al.)
 Monte Carlo sampling (S.H. Bryant et al.)
 Rigorous algorithms for pair-wise interactions
 Branch-and-bound (R.H. Lathrop and T.F. Smith)
 Divide-and-conquer (Y. Xu et al.) -- PROSPECT
 Linear programming (J. Xu et al.) -- RAPTOR
 Tree decomposition (L. Cai et al.)
 Rigorous algorithm for treating backbone and side-chain simultaneously (Li et al.)
Fold Recognition
Score = -1500    Score = -720    Score = -1120    Score = -900
Which one is the correct structural fold for the target sequence, if any?
The one with the lowest score?
Fold Recognition
Query sequence: AAAA
Template #1: AATTAATACATTAATATAATAAAATTACTGA
Better template?
Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
Which of these two sequences will have a better chance of matching the query sequence well after being randomly reshuffled?
Fold Recognition
 Different template structures may have different background score distributions, making direct comparison of threading scores against different templates invalid
 Comparison of threading results should be based on how much a score stands out from its background score distribution rather than on the threading scores directly
Fold Recognition
Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template
By locating where the threading score of a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is!
(Figure: background score distribution with “not significant” and “significant” regions)
Fold Recognition
$$ \text{Z-score} = \frac{\text{score} - \text{average}}{\text{standard deviation}} $$
-- randomly shuffle the query sequence and calculate the alignment score
The goal is to pick the predicted structure with high statistical significance
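The shuffling procedure and the Z-score can be sketched as follows. The `score_fn` argument stands for whatever threading program scores a sequence against one fixed template; it is a hypothetical callable, and the background scores in the usage line are invented numbers used only to show the arithmetic.

```python
import random
import statistics

def shuffled_scores(query, score_fn, n_shuffles=1000, seed=0):
    """Score n_shuffles random permutations of the query against one template."""
    rng = random.Random(seed)
    residues = list(query)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)                       # in-place random permutation
        scores.append(score_fn("".join(residues)))  # score_fn is a hypothetical scorer
    return scores

def z_score(score, background):
    """(score - average) / standard deviation of the background scores."""
    return (score - statistics.mean(background)) / statistics.stdev(background)

# Invented background scores; a real background would come from shuffled_scores().
z = z_score(-1500, [-700, -800, -900])
print(z)
```

Assuming lower threading scores are better, as with an energy, a strongly negative Z-score indicates that the query fits the template far better than shuffled sequences of the same composition do. Because each template gets its own background, Z-scores can be compared across templates where raw scores cannot.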
State of the Art
 ~60% of the proteins encoded in any genome can probably have their structural folds predicted
 ~60% of these proteins can have their structures predicted accurately enough to be useful in guiding experimental designs
