A Probabilistic Learning Approach To Whole-Genome Operon Prediction
[Figure: genes g1, g2, g3, g4 and g5 annotated on a double-stranded DNA sequence]
Finding Operons in E. coli:
Two Key Steps
• Given a candidate operon (i.e. sequence of genes),
score it
Estimating Conditional Probabilities
• construct histogram for each feature from training data
• 150 data points / bin
[Figure: histogram of a feature for positive (Pos) and negative (Neg) examples; y-axis ticks 40 to 140]
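The histogram estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "150 data points / bin" means equal-frequency bins, and the add-one smoothing and all function names (`make_bin_edges`, `histogram_probs`, `cond_prob`) are assumptions introduced here.

```python
import bisect

def make_bin_edges(values, points_per_bin=150):
    """Pick interior bin edges so each bin holds about points_per_bin values.

    Assumption: bins are equal-frequency, chosen from the pooled training data.
    """
    ordered = sorted(values)
    edges = [ordered[k] for k in range(points_per_bin, len(ordered), points_per_bin)]
    return sorted(set(edges))

def histogram_probs(values, edges, pseudocount=1.0):
    """Estimate Pr(bin | class) from the feature values of one class."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    total = sum(counts) + pseudocount * len(counts)
    # The pseudocount (an assumption) keeps empty bins from getting probability 0.
    return [(c + pseudocount) / total for c in counts]

def cond_prob(value, edges, probs):
    """Look up the estimated conditional probability of the bin holding value."""
    return probs[bisect.bisect_right(edges, value)]
```

In use, the edges would be computed once from the pooled positive and negative examples, and `histogram_probs` would be called separately on each class to give Pr(feature | Pos) and Pr(feature | Neg).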
Length and Spacing Features
[Figure: genes g1, g2, g3, g4 and g5 annotated on a double-stranded DNA sequence]
• number of genes in candidate
• mean and maximum within-operon space
• distance to neighboring genes
• strands of neighboring genes
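The features listed above could be computed from gene coordinates roughly as follows. This is a sketch under stated assumptions: the `Gene` record, the coordinate convention (gap = next start minus previous end), and the feature names are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gene:
    name: str
    start: int   # leftmost coordinate (assumed convention)
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"

def gap(left: Gene, right: Gene) -> int:
    """Intergenic distance; negative if the two genes overlap."""
    return right.start - left.end

def length_spacing_features(genes: List[Gene], first: int, last: int) -> dict:
    """Features for the candidate genes[first..last], genes sorted by position."""
    cand = genes[first:last + 1]
    gaps = [gap(a, b) for a, b in zip(cand, cand[1:])]
    before = genes[first - 1] if first > 0 else None
    after = genes[last + 1] if last + 1 < len(genes) else None
    return {
        "n_genes": len(cand),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0,
        "dist_before": gap(before, cand[0]) if before is not None else None,
        "dist_after": gap(cand[-1], after) if after is not None else None,
        "before_same_strand": before.strand == cand[0].strand if before is not None else None,
        "after_same_strand": after.strand == cand[-1].strand if after is not None else None,
    }
```

Each returned value would then be binned and fed to the histogram-based probability estimates.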
[Figure text: carbon energy metabolism]
Functional Annotation Features
Transcription Signal Features
• use position-specific Interpolated Markov Models to
predict promoters and terminators
\[ \Pr(S \mid \text{model}) = \prod_{i=1}^{n} \mathrm{IMM}(S_i) \]
...A C G T C G A G A...
\[ \mathrm{IMM}(S_i = G) \propto \lambda_1 \Pr_{i,1}(S_i = G \mid S_{i-1} = C) + \lambda_0 \Pr_{i,0}(S_i = G) \]
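A first-order version of the interpolation above can be sketched as follows. This is a simplified illustration, not the model used in the talk: it assumes aligned training sequences of equal length, fixed weights `lam1` and `lam0` that sum to 1 (a real IMM would set them from the data), and add-one smoothing. All names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_position_models(seqs, pseudocount=1.0):
    """Collect per-position 0th- and 1st-order counts from aligned sequences."""
    n = len(seqs[0])
    order0 = [Counter() for _ in range(n)]
    order1 = [Counter() for _ in range(n)]  # keyed by (previous base, base)
    for s in seqs:
        for i, base in enumerate(s):
            order0[i][base] += 1
            if i > 0:
                order1[i][(s[i - 1], base)] += 1
    return order0, order1, pseudocount

def imm_prob(model, seq, i, lam1=0.7, lam0=0.3):
    """Interpolated probability of seq[i] at position i given seq[i-1]."""
    order0, order1, pc = model
    base = seq[i]
    p0 = (order0[i][base] + pc) / (sum(order0[i].values()) + pc * len(ALPHABET))
    if i == 0:
        return p0  # no context is available at the first position
    prev = seq[i - 1]
    ctx_total = sum(order1[i][(prev, b)] for b in ALPHABET)
    p1 = (order1[i][(prev, base)] + pc) / (ctx_total + pc * len(ALPHABET))
    return lam1 * p1 + lam0 * p0

def log_prob(model, seq):
    """log Pr(S | model) as the sum of per-position log probabilities."""
    return sum(math.log(imm_prob(model, seq, i)) for i in range(len(seq)))
```

Because both component distributions are normalized and the weights sum to 1, the interpolated value is itself a probability, so no extra normalization step is needed in this sketch.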
Gene Expression Features
• evaluate candidate operon c by
\[ L(c) = \frac{\prod_{e \in \text{expts}} \Pr(\vec{a}_{ce} \mid \text{Op})}{\prod_{e \in \text{expts}} \Pr(\vec{a}_{ce})} \]
where \vec{a}_{ce} represents the expression measurements for c for 1 channel, 1 experiment
• compute similar features for
– gene before candidate & first gene in candidate
– gene after candidate & last gene in candidate
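The likelihood ratio L(c) can be computed in log space as sketched below. The slides do not say how Pr(a | Op) and Pr(a) are modelled, so the two densities are passed in as functions; the `toy_density` helper is a purely illustrative stand-in (an exponential density on the range of the measurements) and is not the authors' model.

```python
import math
from typing import Callable, Sequence

Density = Callable[[Sequence[float]], float]

def log_likelihood_ratio(measurements: Sequence[Sequence[float]],
                         pr_given_op: Density,
                         pr_background: Density) -> float:
    """log L(c) = sum over experiments of log Pr(a_ce | Op) - log Pr(a_ce)."""
    return sum(math.log(pr_given_op(a)) - math.log(pr_background(a))
               for a in measurements)

def toy_density(rate: float) -> Density:
    """Hypothetical stand-in: exponential density on max(a) - min(a).

    A high rate favours vectors whose genes have similar expression.
    """
    return lambda a: rate * math.exp(-rate * (max(a) - min(a)))

# One vector of per-gene measurements for each experiment (made-up numbers).
coherent = [[1.0, 1.1, 0.9], [2.0, 2.1, 2.0]]
score = log_likelihood_ratio(coherent, toy_density(2.0), toy_density(0.5))
```

Summing logs rather than multiplying probabilities avoids underflow when the product runs over many experiments.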
Generating Putative Negatives
[Figure: a known positive operon (g2 g3 g4) on a double-stranded DNA sequence, flanked by g1 and g5, with putative negatives drawn from g2, g3 and g4]
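One way to generate putative negatives is sketched below. The slide does not state the rule, so this assumes that a putative negative is any sequence of consecutive same-strand genes that overlaps a known operon without coinciding with it. The function name and the list-of-names representation are hypothetical.

```python
def putative_negatives(run, known):
    """Enumerate candidate non-operons for one known operon.

    run:   gene names in genomic order, all on the same strand.
    known: the genes of a known operon, a consecutive slice of run.
    Assumption: a negative overlaps the known operon but is not identical to it.
    """
    known = tuple(known)
    known_set = set(known)
    negatives = []
    for i in range(len(run)):
        for j in range(i + 1, len(run) + 1):
            cand = tuple(run[i:j])
            if cand != known and known_set.intersection(cand):
                negatives.append(cand)
    return negatives

negs = putative_negatives(["g2", "g3", "g4"], ("g2", "g3", "g4"))
```

For the three-gene operon in the figure this yields five candidates: g2, g2 g3, g3, g3 g4 and g4.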
Possible Operon Maps
for a Run of Genes
[Figure: a run of genes (g2 g3 g4) on a double-stranded DNA sequence, flanked by g1 and g5, with four possible maps of g2 g3 g4 drawn beneath it]
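The set of possible maps for a run can be enumerated by deciding, at each boundary between adjacent genes, whether an operon ends there. The sketch below assumes a map is a partition of the run into contiguous groups; the function name is hypothetical.

```python
from itertools import product

def possible_maps(run):
    """List every partition of a run of genes into contiguous operons."""
    if not run:
        return [[]]
    maps = []
    # One boolean per boundary between adjacent genes: True means "cut here".
    for cuts in product([False, True], repeat=len(run) - 1):
        operon_map = []
        current = [run[0]]
        for gene, cut in zip(run[1:], cuts):
            if cut:
                operon_map.append(tuple(current))
                current = [gene]
            else:
                current.append(gene)
        operon_map.append(tuple(current))
        maps.append(operon_map)
    return maps

maps = possible_maps(["g2", "g3", "g4"])
```

A run of n genes has 2^(n-1) such maps (four for g2 g3 g4), which is why exhaustive enumeration stops being practical for long runs.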
Dynamic Programming to
Construct An Operon Map
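\[ \mathrm{Map}(i) = \max \begin{cases} \mathrm{Score}(i,1) + \mathrm{Map}(i-1), \\ \mathrm{Score}(i,2) + \mathrm{Map}(i-2), \\ \quad \vdots \\ \mathrm{Score}(i,i) + \mathrm{Map}(i-i). \end{cases} \]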
[Figure: genes g_{i-2}, g_{i-1}, g_i, with Map(i − 2) and Score(i,2) marked beneath them]
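The recurrence can be implemented bottom-up as sketched below. It assumes that Score(i, k) is the score of the operon formed by the k genes ending at gene i (1-based), that Map(0) = 0, and that scores are additive (log probabilities, for example). The `score` callback and the function name are hypothetical.

```python
def best_operon_map(n, score):
    """Return (best total score, operons) for a run of n genes.

    score(i, k): score of the operon made of the k genes ending at gene i.
    Operons are reported as (first gene, last gene) pairs, 1-based inclusive.
    """
    best = [0.0] + [float("-inf")] * n   # best[i] plays the role of Map(i)
    back = [0] * (n + 1)                 # size of the last operon in Map(i)
    for i in range(1, n + 1):
        for k in range(1, i + 1):
            total = score(i, k) + best[i - k]
            if total > best[i]:
                best[i], back[i] = total, k
    # Walk the back-pointers to recover the operons themselves.
    operons = []
    i = n
    while i > 0:
        k = back[i]
        operons.append((i - k + 1, i))
        i -= k
    operons.reverse()
    return best[n], operons

# Made-up scores for a run of three genes.
toy = {(1, 1): 1, (2, 1): 1, (2, 2): 5, (3, 1): 1, (3, 2): 1, (3, 3): 2}
total, operons = best_operon_map(3, lambda i, k: toy[(i, k)])
```

The double loop evaluates O(n^2) candidate operons, compared with the 2^(n-1) maps an exhaustive search would score.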
Experiments
• compare predictive accuracy of
– classifying candidate operons with naïve Bayes
– constructing operon map with DP + naïve Bayes
– randomly selected operon maps
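Classifying a candidate operon with naïve Bayes, as in the first comparison above, can be sketched as follows. The sketch assumes features have already been discretized into bins and that the conditional probabilities were estimated beforehand; the data layout, the `prior_op` default and the numbers are all made up for illustration.

```python
import math

def naive_bayes_posterior(features, cond_probs, prior_op=0.5):
    """Pr(Op | features), assuming features are independent given the class.

    features:   {feature name: observed bin}
    cond_probs: {feature name: {"op": {bin: p}, "non": {bin: p}}}
    """
    log_op = math.log(prior_op)
    log_non = math.log(1.0 - prior_op)
    for name, value in features.items():
        log_op += math.log(cond_probs[name]["op"][value])
        log_non += math.log(cond_probs[name]["non"][value])
    # Normalize in log space for numerical stability.
    m = max(log_op, log_non)
    z = math.exp(log_op - m) + math.exp(log_non - m)
    return math.exp(log_op - m) / z

# Hypothetical conditional probability tables for two features.
cond_probs = {
    "spacing": {"op": {"short": 0.8, "long": 0.2}, "non": {"short": 0.3, "long": 0.7}},
    "size": {"op": {"small": 0.6, "large": 0.4}, "non": {"small": 0.5, "large": 0.5}},
}
p = naive_bayes_posterior({"spacing": "short", "size": "small"}, cond_probs)
```

The resulting posterior, or its log, is one possible choice for the per-operon score used when constructing an operon map.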
Operon Map Accuracy
[Figure legend: naïve Bayes classifier, operon map, random map]
Operon Maps Made Leaving Out
Individual Feature Groups
[Figure legend: leaving out none, annotation, spacing, expression data, promoter, terminator, operon size, neighboring genes]
Conclusions
• new method for predicting operons in prokaryotes
– score candidate operons using learned models
– construct operon map using dynamic program
• learned models combine evidence from diverse
sources
• approach is complementary to those that predict
functionally coupled genes via genome
comparisons
[Overbeek et al. 1999; Tamames et al. 1997]