Computational Methods For Protein Structure Prediction


Computational Methods for
Protein Structure Prediction

Ying Xu

2010/1/19 1
Outline
• introduction to protein structures

• the problem of protein structure prediction

• protein secondary structure prediction

• protein tertiary structure prediction
  • ab initio folding
  • homology modeling
  • protein threading

Protein Sequence, Structure and Function

Protein sequence
>1MBN:_ MYOGLOBIN (154 AA)
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG

Protein structure

Protein function
Oxygen storage

Protein Structure
• A protein sequence folds into a “unique” shape (“structure”) that minimizes its free energy

Protein Structures
• Primary sequence

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

• Secondary structure

α-helix

β-sheet: anti-parallel, parallel

Protein Structures
• Tertiary structure

• Quaternary structure

Protein Structures
• Backbone versus all-atom structures

Backbone + sidechain = all-atom structure

Backbone structure = structural fold

Protein Structures
• Protein structure
  • generally compact

• Soluble protein structure
  • individual domains are generally globular
  • they share various common characteristics, e.g. hydrophobic moment profile

• Membrane protein structure
  • most of the amino acid sidechains of transmembrane segments are non-polar
  • polar groups of the polypeptide backbone of transmembrane segments generally participate in hydrogen bonds

Protein Structure Prediction
Problem: Given the amino acid sequence of a protein,
computationally predict its 3-dimensional shape.

MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG

Secondary Structure Prediction

Eight Secondary Structure Types

H: α-helix (i, i+4 form hydrogen bond)
G: 3₁₀ helix (i, i+3 form hydrogen bond)
I: π-helix (i, i+5 form hydrogen bond)
E: β-strand
B: bridge
T: β-turn
S: bend
C: coil

Rough categories: H, E, C

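The reduction from eight types to the three rough categories can be sketched as a lookup table. The slide does not say which of the eight types go into which category, so the grouping below (helices to H, strand and bridge to E, everything else to C) is an assumption based on one common convention, and the names `EIGHT_TO_THREE` and `reduce_states` are made up for this sketch.

```python
# Assumed grouping of the eight types into H, E, C (one common convention;
# the slide itself only names the three rough categories).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # all helix types -> H
    "E": "E", "B": "E",             # strand and bridge -> E
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil -> C
}

def reduce_states(ss8: str) -> str:
    """Map an eight-state secondary structure string to three states."""
    return "".join(EIGHT_TO_THREE[s] for s in ss8)

print(reduce_states("CGGGHHHHTTEEEEBSC"))  # CHHHHHHHCCEEEEECC
```

Other groupings are in use (for example, keeping G or B as coil), and reported three-state accuracies depend on that choice.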
Secondary Structure Prediction

Given a protein sequence (primary structure), predict its secondary structure categories

GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECCCEEEEECCCHHHHHHCCCCCC

E: β-strand
H: α-helix
C: coil

Assumption: short stretches of residues have a propensity to adopt certain structural conformations

Secondary Structure Prediction
Secondary structure propensities:
Calculate the propensity for a given amino acid to adopt a certain ss-type

$$P_\alpha^i = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

i: amino acid; α: secondary structure type

• Example: a data set with 300 proteins containing #residues=20,000, #Ala=2,000, #helix=4,000, #Ala in helix=500

p(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000

P = 500 / (4,000/10) = 1.25

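The worked example can be checked with a few lines of Python. This is only a sketch of the ratio defined on the slide; the function name `propensity` and its argument names are chosen here for illustration.

```python
def propensity(n_joint: int, n_aa: int, n_ss: int, n_total: int) -> float:
    """Propensity p(ss, aa) / (p(ss) * p(aa)) estimated from residue counts."""
    p_joint = n_joint / n_total   # p(alpha, aa): residues of this type in this ss-type
    p_aa = n_aa / n_total         # p(aa): frequency of the amino acid
    p_ss = n_ss / n_total         # p(alpha): frequency of the ss-type
    return p_joint / (p_ss * p_aa)

# Counts from the slide: 20,000 residues, 2,000 Ala, 4,000 helix, 500 Ala in helix.
p_ala_helix = propensity(n_joint=500, n_aa=2000, n_ss=4000, n_total=20000)
print(round(p_ala_helix, 2))  # 1.25
```

A value above 1 means the amino acid occurs in that secondary structure type more often than expected by chance; a value below 1 means it is under-represented.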
Secondary Structure Prediction
Predict helix (and predict strand)
1. Scan each window of 6 residues; if score > 4 predict helix (if score > 3 predict strand)

2. Propagate in both directions until a 4-residue window with mean propensity < 1

3. Move forward and repeat

Resolving Conflicts
For overlapping regions, decide according to propensity parameters

Chou and Fasman designed the first prediction program

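The three steps for helix prediction can be sketched as follows. Several things here are assumptions rather than content from the slide: the "score" of a window is taken to be the number of residues with helix propensity above 1, the propensity values in `HELIX_PROPENSITY` are made-up illustrative numbers (not a published table), and the function name and parameters are hypothetical. Strand prediction and conflict resolution are left out.

```python
# Hypothetical helix propensities for a few residues (illustrative values only).
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "K": 1.2, "G": 0.6, "P": 0.6}

def predict_helix(seq, prop, window=6, threshold=4, ext_window=4):
    """Return a string of 'H'/'C' labels from a Chou-Fasman-style scan."""
    n = len(seq)
    p = [prop.get(aa, 1.0) for aa in seq]  # residues not in the table count as neutral
    labels = ["C"] * n
    i = 0
    while i + window <= n:
        # Step 1 (assumed score): count helix-favoring residues in the window.
        score = sum(1 for x in p[i:i + window] if x > 1.0)
        if score <= threshold:
            i += 1
            continue
        start, end = i, i + window  # nucleus is the half-open range [start, end)
        # Step 2: extend right while the 4-residue window ending at the
        # newly added residue still has mean propensity >= 1.
        while end < n and sum(p[end - ext_window + 1:end + 1]) / ext_window >= 1.0:
            end += 1
        # Extend left under the same rule.
        while start > 0 and sum(p[start - 1:start - 1 + ext_window]) / ext_window >= 1.0:
            start -= 1
        labels[start:end] = ["H"] * (end - start)
        i = end  # Step 3: move forward and repeat
    return "".join(labels)

print(predict_helix("GGGGAEELLKAEGGGG", HELIX_PROPENSITY))
```

Because extension stops only when a whole 4-residue window drops below a mean of 1, the predicted segment can include a few unfavorable residues at its ends, which is one reason the simple method has limited accuracy.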
Secondary Structure Prediction
• This simple algorithm had ~60% prediction accuracy and was the best predictor for a long while

• The next breakthrough came in 1999 when the PSIPRED program was developed

• Key ideas
  • using sequence profiles, generated by PSI-BLAST, rather than individual sequences for secondary structure prediction
  • combining multiple predictors using a neural network
  • accuracy reaches ~76%

Secondary Structure Prediction
• Using larger training sets, the prediction accuracy can reach ~80%; so how much further can we push this?
  • Non-locality: secondary structure is influenced by long-range interactions
  • Some segments can have multiple structure types (chameleon sequences)

There is some room for improvement but not much

Figure: Long Piccolo C2A; Short Piccolo C2A


Methods for Tertiary Structure Prediction
• ab initio
  • use first principles to fold proteins
  • does not require templates
  • high computational complexity

• homology modeling
  • similar sequences, similar structures
  • practically very useful, but needs homologues

• protein threading
  • many proteins share the same structural fold
  • a folding problem becomes a fold recognition problem

Ab initio Structure Prediction
An energy function to describe the protein
o bond energy
o bond angle energy
o dihedral angle energy
o van der Waals energy
o electrostatic energy

Need an algorithm to search the conformational space to find the structural conformation that minimizes the function.

Not practical in general
o computationally too expensive
o accuracy is poor

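For orientation, the five terms listed above are often written in a form like the following. This is a generic molecular-mechanics expression given as an assumption, not the specific function used in the lecture; the force constants, reference values, and the Lennard-Jones and Coulomb forms are the usual textbook choices.

```latex
E = \sum_{\text{bonds}} k_b\,(b - b_0)^2
  + \sum_{\text{angles}} k_\theta\,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} k_\phi\,\bigl[1 + \cos(n\phi - \delta)\bigr]
  + \sum_{i<j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
  + \sum_{i<j} \frac{q_i q_j}{\varepsilon\, r_{ij}}
```

The first three sums run over bonded geometry (bond lengths b, bond angles θ, dihedral angles φ); the last two run over non-bonded atom pairs at distance r_ij and account for most of the computational cost.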
Homology Modeling
Observation: proteins with similar sequences tend to fold into
similar structures.

1. Target sequence is aligned with the sequence of a known structure (typically requiring that they share sequence identity of 30% or higher)

2. Superimpose target sequence onto the template, replacing equivalent side-chain atoms where necessary

3. Refine the model by minimizing an energy function.

Programs:
Modeller http://salilab.org/modeller/
Swiss-Model http://swissmodel.expasy.org//SWISS-MODEL.html

Protein Threading
• Basic premise
  The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)

• Statistics from Protein Data Bank (~61,000 structures)
  90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

• Chances for a protein to have a native-like structural fold in PDB are quite good (estimated to be 60-70%)
• Proteins with similar structural folds could be homologues or analogues

Protein Threading
• The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB

MTYKLILN …. NGVDGEWTYTE

• Energy function – knowledge (or statistics) based rather than physics based
  • Should be able to distinguish correct structural folds from incorrect structural folds
  • Should be able to distinguish the correct sequence-fold alignment from incorrect sequence-fold alignments

Protein Threading – four basic components

• Structure database

• Energy function

• Sequence-structure alignment algorithm

• Prediction reliability assessment

Protein Threading – structure database

• Build a template database

• Non-redundant representatives through structure-structure and/or sequence-sequence comparison

FSSP (Families of Structurally Similar Proteins) (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)
Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)

Protein Threading – energy function

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

how preferable to put two particular residues nearby: E_p
how well a residue fits a structural environment: E_s
alignment gap penalty: E_g

total energy: E_p + E_s + E_g

find a sequence-structure alignment to minimize the energy function

Protein Threading – energy function

• A singleton energy measures each residue’s preference for a specific structural environment
  • secondary structure
  • solvent accessibility
• Compare actual occurrence against its “expected value” by chance

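The formula on this slide did not survive extraction. A common way to turn "observed versus expected occurrence" into an energy is a negative log-odds ratio, and the sketch below assumes that form; the function name, argument names, and the use of the natural logarithm are choices made here, not taken from the lecture.

```python
import math

def singleton_energy(n_joint: int, n_aa: int, n_env: int, n_total: int) -> float:
    """Assumed log-odds form: E = -ln(observed / expected).

    n_joint: times the amino acid was observed in the environment
    n_aa:    total occurrences of the amino acid
    n_env:   total residues in the environment
    n_total: total residues in the data set
    """
    expected = n_aa * n_env / n_total  # count expected if aa and environment were independent
    return -math.log(n_joint / expected)

# Same counts as the Ala-in-helix propensity example: observed 500, expected 400.
print(round(singleton_energy(500, 2000, 4000, 20000), 3))  # -0.223
```

Under this convention a residue seen more often than expected in an environment gets a negative (favorable) energy, which matches the sign pattern of hydrophobic residues in buried environments in the singleton table.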
Protein Threading – energy function

• A simple definition of structural environment
  • secondary structure: alpha-helix, beta-strand, loop
  • solvent accessibility: 0, 10, 20, …, 100% of accessibility
  • each combination of secondary structure and solvent accessibility level defines a structural environment
    • E.g., (alpha-helix, 30%), (loop, 80%), …
  • E_s: a scoring matrix of 30 structural environments by 20 amino acids
    • E.g., E_s((loop, 30%), A)

singleton energy term

Protein Threading – energy function

             Helix                   Sheet                   Loop
     Buried   Inter Exposed  Buried   Inter Exposed  Buried   Inter Exposed
ALA  -0.578  -0.119  -0.160   0.010   0.583   0.921   0.023   0.218   0.368
ARG   0.997  -0.507  -0.488   1.267  -0.345  -0.580   0.930  -0.005  -0.032
ASN   0.819   0.090  -0.007   0.844   0.221   0.046   0.030  -0.322  -0.487
ASP   1.050   0.172  -0.426   1.145   0.322   0.061   0.308  -0.224  -0.541
CYS  -0.360   0.333   1.831  -0.671   0.003   1.216  -0.690  -0.225   1.216
GLN   1.047  -0.294  -0.939   1.452   0.139  -0.555   1.326   0.486  -0.244
GLU   0.670  -0.313  -0.721   0.999   0.031  -0.494   0.845   0.248  -0.144
GLY   0.414   0.932   0.969   0.177   0.565   0.989  -0.562  -0.299  -0.601
HIS   0.479  -0.223   0.136   0.306  -0.343  -0.014   0.019  -0.285   0.051
ILE  -0.551   0.087   1.248  -0.875  -0.182   0.500  -0.166   0.384   1.336
LEU  -0.744  -0.218   0.940  -0.411   0.179   0.900  -0.205   0.169   1.217
LYS   1.863  -0.045  -0.865   2.109  -0.017  -0.901   1.925   0.474  -0.498
MET  -0.641  -0.183   0.779  -0.269   0.197   0.658  -0.228   0.113   0.714
PHE  -0.491   0.057   1.364  -0.649  -0.200   0.776  -0.375  -0.001   1.251
PRO   1.090   0.705   0.236   1.249   0.695   0.145  -0.412  -0.491  -0.641
SER   0.350   0.260  -0.020   0.303   0.058  -0.075  -0.173  -0.210  -0.228
THR   0.291   0.215   0.304   0.156  -0.382  -0.584  -0.012  -0.103  -0.125
TRP  -0.379  -0.363   1.178  -0.270  -0.477   0.682  -0.220  -0.099   1.267
TYR  -0.111  -0.292   0.942  -0.267  -0.691   0.292  -0.015  -0.176   0.946
VAL  -0.374   0.236   1.144  -0.912  -0.334   0.089  -0.030   0.309   0.998

Protein Threading – energy function

• The pair-wise energy measures the preference of a pair of amino acids to be close in 3D space.

• Observed occurrence of a pair compared with its “expected” occurrence

uniform state model

pair-wise interaction energy term

Protein Threading – energy function

ALA -140
ARG 268 -18
ASN 105 -85 -435
ASP 217 -616 -417 17
CYS 330 67 106 278 -1923
GLN 27 -60 -200 67 191 -115
GLU 122 -564 -136 140 122 10 68
GLY 11 -80 -103 -267 88 -72 -31 -288
HIS 58 -263 61 -454 190 272 -368 74 -448
ILE -114 110 351 318 154 243 294 179 294 -326
LEU -182 263 358 370 238 25 255 237 200 -160 -278
LYS 123 310 -201 -564 246 -184 -667 95 54 194 178 122
MET -74 304 314 211 50 32 141 13 -7 -12 -106 301 -494
PHE -65 62 201 284 34 72 235 114 158 -96 -195 -17 -272 -206
PRO 174 -33 -212 -28 105 -81 -102 -73 -65 369 218 -46 35 -21 -210
SER 169 -80 -223 -299 7 -163 -212 -186 -133 206 272 -58 193 114 -162 -177
THR 58 60 -231 -203 372 -151 -211 -73 -239 109 225 -16 158 283 -98 -215 -210
TRP 51 -150 -18 104 52 -12 157 -69 -212 -18 81 29 -5 31 -432 129 95 -20
TYR 53 -132 53 268 62 -90 269 58 34 -163 -93 -312 -173 -5 -81 104 163 -95 -6
VAL -105 171 298 431 196 180 235 202 204 -232 -218 269 -50 -42 46 267 73 101 107 -324
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

Protein Threading – energy function

w(k) = h + gk, k ≥ 1, w(0) = 0;

h: opening gap penalty
g: extension gap penalty

FDSK---THRGHR    FDSK-T--HRGHR
:.:    :: :::    :.:  :  : :::
FESYWTCTH-GHR    FESYWTCTH-GHR

gap penalty term

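The effect of the opening penalty h can be seen by scoring the gaps in the two alignments on this slide. The sketch below implements w(k) as defined above; the helper name `total_gap_penalty` and the values h = 10, g = 1 are assumptions chosen only for illustration.

```python
import re

def gap_penalty(k: int, h: float, g: float) -> float:
    """w(k) = h + g*k for a gap of length k >= 1, and w(0) = 0."""
    return 0.0 if k == 0 else h + g * k

def total_gap_penalty(top: str, bottom: str, h: float, g: float) -> float:
    """Sum w(k) over every run of '-' in either row of an alignment."""
    runs = re.findall(r"-+", top) + re.findall(r"-+", bottom)
    return sum(gap_penalty(len(run), h, g) for run in runs)

h, g = 10.0, 1.0  # hypothetical penalty values
left = total_gap_penalty("FDSK---THRGHR", "FESYWTCTH-GHR", h, g)
right = total_gap_penalty("FDSK-T--HRGHR", "FESYWTCTH-GHR", h, g)
print(left, right)  # 24.0 34.0
```

Both alignments contain four gap positions, but the left one groups them into two gaps and the right one into three, so the right alignment pays one extra opening penalty. This is how an affine penalty favors fewer, longer gaps.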
Threading Parameter Optimization
• How to determine the weights of the different energy terms?

$$E_{total} = \omega_s E_{singleton} + \omega_p E_{pairwise} + \omega_g E_{gap}$$

• Select the weights to give the “best” threading performance on a training set (fold recognition and alignment accuracy)

• Different weights for different classes? (superfamily, fold)
  pair-wise may contribute more for fold level threading
  mutation/profile terms dominate in superfamily level threading

Protein Threading -- algorithm
• Dynamic programming
• Heuristic algorithms for pair-wise interactions
  • Frozen approximation algorithm (A. Godzik et al.)
  • Double dynamic programming (D. Jones et al.)
  • Monte Carlo sampling (S.H. Bryant et al.)

• Rigorous algorithms for pair-wise interactions
  • Branch-and-bound (R.H. Lathrop and T.F. Smith)
  • Divide-and-conquer (Y. Xu et al.) -- PROSPECT
  • Linear programming (J. Xu et al.) -- RAPTOR
  • Tree decomposition (L. Cai et al.)
• Rigorous algorithm for treating backbone and side-chain simultaneously (Li et al.)

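When pair-wise interactions are ignored, the alignment that minimizes singleton energy plus gap penalty can be found by ordinary dynamic programming. The sketch below is a simplified illustration under stated assumptions: it uses a linear gap cost (a single `gap` value per gap position, i.e. h = 0 in w(k) = h + gk), penalizes end gaps, and returns only the minimum energy. The function name `thread_dp`, the environment labels, and the gap value are hypothetical; the four energies are copied from the singleton table for ALA and LYS.

```python
def thread_dp(seq, envs, e_s, gap):
    """Minimum singleton + gap energy of aligning seq to template environments."""
    n, m = len(seq), len(envs)
    # dp[i][j]: best energy for aligning seq[:i] with envs[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + e_s[(envs[j - 1], seq[i - 1])],  # residue i on position j
                dp[i - 1][j] + gap,                                 # residue i unaligned
                dp[i][j - 1] + gap,                                 # template position j skipped
            )
    return dp[n][m]

# Singleton energies taken from the table (ALA and LYS, two environments).
E_S = {
    ("helix_buried", "A"): -0.578, ("loop_exposed", "A"): 0.368,
    ("helix_buried", "K"): 1.863,  ("loop_exposed", "K"): -0.498,
}
template = ["helix_buried", "loop_exposed"]
print(round(thread_dp("AK", template, E_S, gap=1.0), 3))  # -1.076
print(round(thread_dp("KA", template, E_S, gap=1.0), 3))  # 1.422
```

Adding pair-wise terms breaks this simple recurrence, because the energy of placing one residue then depends on where other residues are placed; that is what the heuristic and rigorous algorithms listed above address.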
Fold Recognition

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Score = -1500    Score = -720    Score = -1120    Score = -900

Which one is the correct structural fold for the target sequence, if any?

The one with the lowest score?

Fold Recognition
Query sequence: AAAA

Template #1: AATTAATACATTAATATAATAAAATTACTGA

Better template?

Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA

Which of these two sequences will have a better chance of a good match with the query sequence after randomly reshuffling them?

Fold Recognition

• Different template structures may have different background score distributions, making direct comparison of threading scores against different templates invalid

• Comparison of threading results should be based on how much the score stands out in its background score distribution rather than on the threading scores directly

Fold Recognition
Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template

By locating where the threading score with a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is!

Figure: background score distribution, with regions labeled “Not significant” and “significant”

Fold Recognition

$$\text{Z-score} = \frac{\text{score} - \text{average}}{\text{standard deviation}}$$

-- randomly shuffle the query sequence and calculate the alignment score

The goal is to pick the predicted structure with high statistical significance

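The shuffling procedure can be sketched as below. The `z_score` function follows the definition above; everything else is an assumption for illustration: `toy_score` is a stand-in for a real threading score (it just counts positions that match a fixed string, negated so that lower is better), and the number of shuffles and the random seed are arbitrary.

```python
import random
import statistics

def z_score(query, score_fn, n_shuffles=1000, seed=0):
    """Z-score of the query's score against scores of its shuffled versions."""
    rng = random.Random(seed)
    residues = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, random order
        background.append(score_fn("".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (score_fn(query) - mean) / sd

# Toy stand-in for threading a sequence against one template (hypothetical).
TEMPLATE = "MTYKLILNGKTKGETTTEAV"

def toy_score(seq):
    return -sum(a == b for a, b in zip(seq, TEMPLATE))

print(z_score(TEMPLATE, toy_score) < -3)  # True
```

Because lower scores are better here, a significant hit has a strongly negative Z-score. Shuffling keeps the amino acid composition fixed, so the Z-score measures how much the residue order, rather than the composition, contributes to the score.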
State of the Art

• ~60% of the proteins encoded in any genome can probably have their structural folds predicted

• ~60% of these proteins can have their structures predicted accurately enough to be useful to guide experimental designs

