PF General Talk
Ming Li
Canada Research Chair in Bioinformatics
Cheriton School of Computer Science
University of Waterloo
Human: 3 billion bases, 30k genes.
E. coli: 5 million bases, 4k genes.
[Figure: DNA double helix with base pairs T-A, A-T, C-G, G-C]
DNA -> (transcription) -> mRNA (A, C, G, U) -> (translation) -> Protein (20 amino acids)
mRNA -> (reverse transcription) -> cDNA
Examples:
hormones – regulate metabolism
structures – hair, wool, muscle,…
antibodies – immune response
enzymes – chemical reactions
Sickle-cell anemia: the hemoglobin protein is made of 4
chains, 2 alphas and 2 betas. A single mutation from
Glu to Val happens at residue 6 of the beta chain.
The mutation is recessive. Homozygotes die, but
heterozygotes have resistance to malaria, hence it
had some evolutionary advantage in Africa. 1 in 12
African Americans are carriers.
What happened in sickle-cell anemia
Glu: glutamic acid (E). Codons: GAA, GAG.
Mutating to valine (codons: GTT, GTA, GTC, GTG) creates a hydrophobic patch on the surface.
[Figure: hemoglobin]
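The codons listed on this slide show why a single base change is enough: replacing the middle base A with T turns either Glu codon into a Val codon. The following is a minimal sketch of that substitution. The table contains only the six codons named on the slide, and the helper names `point_mutate` and `describe_mutation` are hypothetical, introduced here for illustration.

```python
# Partial codon table: only the codons named on the slide (DNA coding-strand spelling).
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GTT": "Val", "GTA": "Val", "GTC": "Val", "GTG": "Val",
}


def point_mutate(codon, position, base):
    """Return the codon with one base replaced (position is 0-based)."""
    return codon[:position] + base + codon[position + 1:]


def describe_mutation(codon, position, base):
    """Describe the amino-acid change caused by a single-base substitution."""
    mutant = point_mutate(codon, position, base)
    return f"{codon} ({CODON_TABLE[codon]}) -> {mutant} ({CODON_TABLE[mutant]})"


if __name__ == "__main__":
    # Change the middle base A to T in both Glu codons.
    for glu_codon in ("GAA", "GAG"):
        print(describe_mutation(glu_codon, 1, "T"))
```

Both Glu codons end up as Val codons after the same middle-base change, so the sketch does not depend on which of the two codons a given gene uses.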
Amino acids
There are about 500 amino acids in nature. Only 20
(22) are used in proteins.
The first amino acid was discovered in asparagus
in 1806, hence the name asparagine.
All 20 amino acids in proteins were discovered
by 1935.
Traces of glycine, alanine, etc. were found in a
meteorite in Australia in 1969. That prompted the
conjecture that life has an extraterrestrial
origin.
20 Amino acids
Hydrophobic amino acids: Alanine, Valine, Phenylalanine, Proline, Methionine, Isoleucine, Leucine
Polar amino acids: Serine, Threonine, Tyrosine, Histidine, Cysteine, Asparagine, Glutamine, Tryptophan
Charged amino acids: Aspartic acid, Glutamic acid, Lysine, Arginine
Simplest amino acid: Glycine
Polar: neutral, with one positively and one negatively charged end; e.g. H2O is polar, oil is non-polar.
The Φ and Ψ angles
The angle at N-Cα is the Φ angle.
The angle at Cα-C’ is the Ψ angle.
No side chain is involved (the side chain is attached at Cα).
These angles determine the backbone structure.
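Φ and Ψ are dihedral (torsion) angles, each defined by four consecutive backbone atoms: Φ by C’(i-1), N, Cα, C’ and Ψ by N, Cα, C’, N(i+1). The sketch below computes a dihedral angle from four 3D points in pure Python. The function name `dihedral` and the sign convention (positive for a clockwise rotation when looking along the central bond) are assumptions of this example, and the coordinates in the usage lines are made-up test points, not real protein atoms.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)          # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # For phi, pass C'(i-1), N, CA, C'; for psi, pass N, CA, C', N(i+1).
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Because only backbone atoms enter the calculation, the side chain plays no role, which matches the remark on the slide.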
Homologous proteins have similar
structure and functions
Being homologous means that they have
evolved from a common ancestral gene.
Hence at least in the past they had the same
structure and function.
Caution: old genes can be recruited for new
functions. Example: a structural protein in eye
lens is homologous to an ancient glycolytic
enzyme.
Conserving core regions
Homologous proteins usually have conserved
core regions.
When we model one protein after a similar
protein with known structure, the main
problem becomes modeling loop regions.
Loop modeling can also rely on databases
to some degree.
Side chains: only a few side-chain
conformations occur frequently. They are
called rotamers, and there is a database of them.
There are not too many candidates!
There are only about 1000 topologically
different domain structures. There is no
reason whatsoever that we cannot compute
their structures accurately.
Protein data bank
http://www.rcsb.org/pdb/Welcome.do
Note: natural α helices are right-handed.
Height: 5.4 Å per turn.
Each residue gives a 1.5 Å rise.
[Figure: α helix with hydrogen bonds marked]
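The two numbers on this slide fix the familiar geometry of the α helix: dividing the height per turn by the rise per residue gives the number of residues per turn. A minimal sketch of that arithmetic follows; the constant and function names are hypothetical, and the helix is assumed to be ideal.

```python
PITCH_ANGSTROM = 5.4             # height per turn, from the slide
RISE_PER_RESIDUE_ANGSTROM = 1.5  # rise per residue, from the slide


def residues_per_turn():
    """Residues in one full turn of an ideal alpha helix."""
    return PITCH_ANGSTROM / RISE_PER_RESIDUE_ANGSTROM


def helix_length(n_residues):
    """Length along the helix axis, in angstroms, for an ideal helix."""
    return n_residues * RISE_PER_RESIDUE_ANGSTROM


if __name__ == "__main__":
    print(f"residues per turn: {residues_per_turn():.1f}")
    print(f"length of a 10-residue helix: {helix_length(10):.1f} A")
```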
Water molecule, H2O
Hydrogen bond (you know ionic bond
and covalent bond from high school)
A hydrogen bond results from the attraction between the partial positive
charge on the hydrogen atom of water and the partial negative charge on
the nitrogen atom of ammonia.
[Figure: water (H2O) and ammonia (NH3) with partial charges marked + and –]
Walking on water
Antiparallel β strands
Side chains in purple.
Hydrogen bonds: note their unevenness.
Core question
Looking at the protein sequences of globular proteins, one finds
that hydrophobic side chains are usually scattered along the
entire sequence, seemingly randomly.
In the native state of folded protein, ½ of these side chains are
buried, and the rest are scattered on the surface of the protein,
surrounded by hydrophilic side chains.
The buried hydrophobic side chains are not clustered in the
sequence.
Central Question: what causes these residues to
be selectively buried during the early and rapid
formation of the molten globule?
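The scattering of hydrophobic residues can be seen directly in a sequence. The sketch below marks the hydrophobic positions in the target sequence shown on the threading slide of this talk, using the hydrophobic class from the amino acid slide (Ala, Val, Phe, Pro, Met, Ile, Leu). The one-letter set and the function names are assumptions of this example, and nothing here says which of the marked residues end up buried.

```python
# One-letter codes for the hydrophobic class listed on the amino acid slide.
HYDROPHOBIC = set("AVFPMIL")

# Target sequence as printed on the threading slide.
TARGET = "MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE"


def hydrophobic_positions(sequence):
    """1-based positions of hydrophobic residues."""
    return [i for i, aa in enumerate(sequence, start=1) if aa in HYDROPHOBIC]


def hydrophobic_mask(sequence):
    """'H' for hydrophobic residues, '.' for everything else."""
    return "".join("H" if aa in HYDROPHOBIC else "." for aa in sequence)


if __name__ == "__main__":
    print(TARGET)
    print(hydrophobic_mask(TARGET))
    print(hydrophobic_positions(TARGET))
```

The mask shows short runs and isolated positions spread over most of the chain rather than one contiguous hydrophobic block.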
Folding pathways
Ui's: unfolded states, many of them.
Mi's: molten globule states; i can be 1. They have most
secondary structures, but are less compact.
Converging to F: during this relatively slower process
the protein passes a high-energy transition state T.
These facts have been verified by NMR, hydrogen
exchange, spectroscopy, and thermochemistry.
Wet Lab Protein Structure Determination
Wet lab methods:
X-ray crystallography
NMR
The wet lab technologies are not only slow
and expensive, they also simply fail for:
Protein design
Alternative splicing
Insoluble proteins
Not to mention the millions of proteins they could
handle in principle but will never get through.
Computational Approaches
RAPTOR: Protein Threading by
Linear Programming
Make a structure prediction through finding an optimal
placement (threading) of a protein sequence onto each known
structure (structural template)
“placement” quality is measured by some statistics-based
energy function
best overall “placement” among all templates may give a
structure prediction
target sequence
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
template library
Threading
Threading Example
Introduction to Linear Program
Energy terms:
E_m: sequence similarity between query and template proteins (mutation score)
E_g: alignment gap penalty (gap score)
E_ss: consistency with the secondary structures
$E = E_p + E_s + E_m + E_g + E_{ss}$
E_s, E_ss, E_m: encode the scoring system.
Constraints:
$y_{(i,l)(j,k)} = x_{i,l}\, x_{j,k}$
$\sum_{l \in D[i]} x_{i,l} = 1$
$x_{i,l},\ y_{(i,l)(j,k)} \in \{0,1\}$
The constraints encode the interaction structures: one makes sure there are no crossings;
the product constraint is quadratic, but can be converted to linear, since for 0/1 variables
$a = bc$ is equivalent to $a \le b$, $a \le c$, $a \ge b + c - 1$.
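The linearization claim is easy to check by enumeration: for every 0/1 choice of b and c, the three inequalities should leave exactly one feasible 0/1 value of a, namely the product bc. The following sketch does that check. The function names are hypothetical, and the check covers binary variables only, which is the case the formulation needs.

```python
from itertools import product


def feasible_a(b, c):
    """All a in {0, 1} satisfying a <= b, a <= c, a >= b + c - 1."""
    return [a for a in (0, 1) if a <= b and a <= c and a >= b + c - 1]


def linearization_is_exact():
    """True if, for every binary b and c, the only feasible a equals b * c."""
    return all(feasible_a(b, c) == [b * c] for b, c in product((0, 1), repeat=2))


if __name__ == "__main__":
    for b, c in product((0, 1), repeat=2):
        print(f"b={b} c={c} feasible a={feasible_a(b, c)} product={b * c}")
    print("linearization exact:", linearization_is_exact())
```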
Formulation used in RAPTOR
Minimize
$E = \sum a_{i,l}\, x_{i,l} + \sum b_{(i,l)(j,k)}\, y_{(i,l)(j,k)}$
(the objective encodes the scoring system: $E_g, E_p$ and $E_s, E_{ss}, E_m$)
s.t.
$x_{i,l} = \sum_{k \in R[i,j,l]} y_{(i,l)(j,k)}, \quad l \in D[i]$
$x_{j,k} = \sum_{l \in R[j,k,i]} y_{(i,l)(j,k)}, \quad k \in D[j]$
$\sum_{l \in D[i]} x_{i,l} = 1$
$x_{i,l},\ y_{(i,l)(j,k)} \in \{0,1\}$
(the constraints encode the interaction structures)
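To make the objective concrete, the sketch below solves a toy instance by brute-force enumeration rather than by linear programming, so it illustrates what is being minimized, not how RAPTOR solves it. It assumes that x_{i,l} = 1 means core i is placed at position l, that D[i] is the set of positions allowed for core i, and that cores must keep their order. The scores, the core length of 1 and all names are made up for this example. Choosing one position per core fixes every x, and each y is then the product of two x values, so the pairwise term is evaluated directly on the chosen positions.

```python
from itertools import product


def solve_toy_threading(domains, singleton, pairwise, core_len=1):
    """Enumerate every placement and return (best_energy, best_placement).

    domains[i]        allowed positions for core i (plays the role of D[i])
    singleton[i][l]   cost a_{i,l} of placing core i at position l
    pairwise[(i, j)]  function (l, k) -> cost b_{(i,l)(j,k)} for interacting cores
    """
    best = None
    for placement in product(*domains):
        # Ordering (no crossing): each core starts after the previous one ends.
        if any(placement[i + 1] < placement[i] + core_len
               for i in range(len(placement) - 1)):
            continue
        energy = sum(singleton[i][l] for i, l in enumerate(placement))
        energy += sum(score(placement[i], placement[j])
                      for (i, j), score in pairwise.items())
        if best is None or energy < best[0]:
            best = (energy, placement)
    return best


# Hypothetical toy data: two cores and one interacting pair.
DOMAINS = [(0, 1, 2), (2, 3, 4)]
SINGLETON = [{0: 1.0, 1: 0.0, 2: 2.0}, {2: 0.5, 3: 1.0, 4: 0.0}]
PAIRWISE = {(0, 1): lambda l, k: -2.0 if k - l == 2 else 0.0}

if __name__ == "__main__":
    print(solve_toy_threading(DOMAINS, SINGLETON, PAIRWISE))
```

With the pairwise term switched off, the best placement in this toy instance is (1, 4); with it, the optimum moves to (1, 3). That shift is why the y variables are needed in the formulation. Enumeration grows exponentially with the number of cores, which is the motivation for the linear programming relaxation.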
Solving the Problem Practically
976*975 threading pairs are tested, the results of other servers are taken from
Shi et al.’s paper.
CASP5, CASP6, CASP7
Held every 2 years.
RAPTOR has consistently ranked high since
CASP5. It was voted by CASP5 attendees as
the most novel approach, at http://forcasp.org
62 to 100 targets each time. 48 hours allowed for
each target.
No manual intervention.
Evaluated by computer programs.
Example, CASP5
[Figure: results by target category and prediction difficulty (easy to hard)]
RAPTOR's first model ranks 5th.
[Figure panels: predicted, experimental]
CASP6, T0262-2, ACE buffalo rank: 4th
From Fugue3 6th model. TM=0.4306, MaxSub=0.3459.
Good parts: 162-203
Fugue's top model ranks low.
[Figure panels: predicted, experimental]
CASP6, T0242, NF, ACE buffalo rank: 1
From RAPTOR rank 5 model.
TM score=0.2784, MaxSub score=0.1645
However, RAPTOR's top model ranks 44th! Trivial error?
[Figure panels: predicted, experimental]
CASP6, T0238, NF, ACE buffalo rank: 1st
From RAPTOR 8th model. TM=0.2748, MaxSub=0.1633
Good part: 188-237. High TM score, low MaxSub.
RAPTOR's top model ranks 4th.
[Figure panels: predicted, experimental]
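The TM values quoted in these examples are TM-scores. As commonly defined (Zhang and Skolnick), the score averages 1 / (1 + (d_i / d0)^2) over the target length, where d_i is the distance between the i-th pair of aligned residues and d0 = 1.24 (L - 15)^(1/3) - 1.8 depends on the target length L. The sketch below evaluates that sum for one fixed superposition. The full TM-score maximizes over superpositions, which this sketch does not attempt. The function names and the length guard are assumptions of this example, and the distances in the usage lines are made up, not CASP data.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 used in the TM-score formula."""
    if target_length <= 21:
        # Guard chosen for this sketch: d0 gets very small or negative for short targets.
        raise ValueError("this sketch assumes target_length > 21")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score_fixed_superposition(distances, target_length):
    """TM-score sum for one given superposition.

    distances: distance (in angstroms) for each aligned residue pair;
    unaligned target residues contribute nothing.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    # Hypothetical case: 100-residue target, 60 residues aligned at 2 angstroms each.
    print(round(tm_score_fixed_superposition([2.0] * 60, 100), 4))
```

Because the sum is divided by the target length rather than by the number of aligned pairs, a model that matches only part of the target well still gets a modest score, which fits the "good part" remarks on these slides.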
About RAPTOR
Jinbo Xu’s Ph.D. thesis work.
The RAPTOR system has benefited
significantly from PROSPECT (Ying Xu, Dong
Xu, et al).
References: J. Xu, M. Li, D. Kim, Y. Xu, Journal of
Bioinformatics and Computational Biology, 1:1(2003), 95-118.
J. Xu, M. Li, PROTEINS: Structure, Function, and Genetics,
CASP5 special issue.
Old Paradigm
New RAPTOR, New Paradigm
[Diagram components:]
Contact prediction
Local threading / large fragments
Short fragment selection
Loop / side chain modeling
Assembly by molecular dynamics
Super motif/domain modeling
Refinement
Global threading / old RAPTOR
Hydrophobic side-chain burying information
NMR constraints