
A Probabilistic Learning Approach to Whole-Genome Operon Prediction


Mark Craven David Page Jude Shavlik
Joseph Bockhorst Jeremy Glasner

Department of Biostatistics & Medical Informatics


Department of Computer Sciences
Department of Genetics
University of Wisconsin

Finding Operons in E. coli


[Figure: double-stranded DNA sequence annotated with a promoter, a terminator, genes g1 to g4 above the sequence, and gene g5 below it]

Given: known operons and associated E. coli data


Do: predict all operons in E. coli

operon: sequence of one or more genes transcribed as a unit under some conditions

Finding Operons in E. coli: Two Key Steps
• Given a candidate operon (i.e. a sequence of genes), score it
• Given scored candidates, partition the genome into operons

Scoring Operons with Naïve Bayes

$$\Pr(Op \mid D) \approx \frac{\Pr(Op) \prod_i \Pr(D_i \mid Op)}{\Pr(D)}$$
• where Di is the ith feature describing the candidate
operon
• histograms used to represent conditional distributions
• we’ve also used C5.0 and non-naïve Bayes nets
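The naïve Bayes score above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary layout of the histograms, and the choice to compute Pr(D) by summing over the two classes Op and ¬Op are all assumptions made for the example.

```python
import math

def naive_bayes_posterior(prior_op, feature_bins, lik_op, lik_not_op):
    """Return Pr(Op | D) for one candidate operon.

    feature_bins: {feature name: index of the histogram bin the value fell in}
    lik_op / lik_not_op: {feature name: [Pr(bin | Op)] / [Pr(bin | not Op)]}
    All names and the data layout are hypothetical.
    """
    # Work in log space so that many small factors do not underflow.
    log_op = math.log(prior_op)
    log_not = math.log(1.0 - prior_op)
    for name, b in feature_bins.items():
        log_op += math.log(lik_op[name][b])
        log_not += math.log(lik_not_op[name][b])
    # Pr(D) is taken as the sum of the two class-conditional joint terms.
    m = max(log_op, log_not)
    num = math.exp(log_op - m)
    return num / (num + math.exp(log_not - m))

# Toy histograms with two bins per feature (made-up numbers).
lik_op = {"spacing": [0.8, 0.2], "n_genes": [0.5, 0.5]}
lik_not_op = {"spacing": [0.2, 0.8], "n_genes": [0.5, 0.5]}
posterior = naive_bayes_posterior(0.5, {"spacing": 0, "n_genes": 1},
                                  lik_op, lik_not_op)
```

Here only the spacing feature discriminates between the classes, so the posterior is driven entirely by its 0.8 versus 0.2 likelihoods.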

Estimating Conditional Probabilities
• construct histogram for each feature from training data
• 150 data points / bin
• hence, bin width varies
• from histogram, compute Pr( Di | Op), Pr( Di | ¬Op)

[Figure: histogram of positive (Pos) and negative (Neg) example counts in Bin1 to Bin4; vertical axis 0 to 140]
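One way to build such variable-width histograms is sketched below. It assumes bin edges are chosen from the pooled positive and negative values so that each bin holds a fixed number of points, and it adds a pseudocount so that no bin gets probability zero. The pseudocount and all function names are assumptions for illustration; the slides do not say how empty bins are handled.

```python
from bisect import bisect_right

def equal_frequency_edges(values, per_bin):
    """Pick bin edges so each bin holds about per_bin of the pooled values."""
    ordered = sorted(values)
    return [ordered[i] for i in range(per_bin, len(ordered), per_bin)]

def bin_index(edges, x):
    """Index of the bin containing x (bin 0 lies below the first edge)."""
    return bisect_right(edges, x)

def conditional_histogram(edges, values, pseudocount=1.0):
    """Estimate Pr(bin | class) from the feature values of one class."""
    counts = [pseudocount] * (len(edges) + 1)
    for x in values:
        counts[bin_index(edges, x)] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Toy data: the slides use 150 points per bin; 2 keeps the example small.
pos = [1, 2, 3, 4]
neg = [5, 6, 7, 8]
edges = equal_frequency_edges(pos + neg, per_bin=2)
pr_given_op = conditional_histogram(edges, pos)
pr_given_not_op = conditional_histogram(edges, neg)
```

Because the edges follow the data, dense regions of a feature get narrow bins and sparse regions get wide ones, which is why the bin width varies.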

Features Used in Learned Models


• length and spacing features
• functional annotation features
• predicted promoters
• predicted terminators
• expression data features

Length and Spacing Features
[Figure: double-stranded DNA sequence with genes g1 to g4 above the sequence and gene g5 below it]
• number of genes in candidate
• mean and maximum within-operon space
• distance to neighboring genes
• strands of neighboring genes

Functional Annotation Features


• 1,668 genes have been assigned a functional
annotation code from a 3-level, 123-leaf hierarchy
• we expect the genes in an operon to have closely
related functions
[Figure: excerpt of the annotation hierarchy: metabolism of small molecules > carbon energy metabolism > electron transport, fermentation, aerobic respiration]

Functional Annotation Features

• annotation distance between a pair of genes ∝ distance to common ancestor in hierarchy
• compute mean pairwise distances between:
– all genes in candidate operon
– gene before candidate and genes in candidate
– gene after candidate and genes in candidate
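The annotation features above can be sketched as follows, assuming each annotated gene carries a 3-part code (one entry per hierarchy level) and that the distance is the number of levels from the leaves up to the deepest common ancestor. The tuple encoding, the function names, and the treatment of single-gene candidates are assumptions; the slides only state that the distance is proportional to the distance to the common ancestor.

```python
from itertools import combinations

def annotation_distance(code_a, code_b):
    """Levels from the leaf up to the deepest common ancestor of two codes."""
    shared = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        shared += 1
    return len(code_a) - shared

def mean_pairwise_distance(codes):
    """Mean distance over all pairs of genes in a candidate operon."""
    pairs = list(combinations(codes, 2))
    if not pairs:
        return 0.0  # assumption: a single-gene candidate has distance 0
    return sum(annotation_distance(a, b) for a, b in pairs) / len(pairs)

def mean_distance_to_neighbor(neighbor_code, codes):
    """Mean distance between a flanking gene and the genes in the candidate."""
    return sum(annotation_distance(neighbor_code, c) for c in codes) / len(codes)

# Hypothetical codes: (top level, middle level, leaf).
candidate = [(1, 3, 2), (1, 3, 5), (1, 4, 1)]
within = mean_pairwise_distance(candidate)
before = mean_distance_to_neighbor((2, 1, 1), candidate)
```

Genes without an annotation code would need to be skipped before calling these functions, since only 1,668 genes are annotated.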

Transcription Signal Features


[Figure: a promoter model applied upstream and a terminator model applied downstream of a candidate operon g2 g3 g4 on the DNA sequence]

• scan upstream of candidates looking for promoters
• scan downstream looking for terminators
• features represent highest scoring subsequence in each scan

Transcription Signal Features
• use position-specific Interpolated Markov Models to
predict promoters and terminators
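$$\Pr(S \mid \text{model}) = \prod_{i=1}^{n} \mathrm{IMM}(S_i)$$

Example, for the G that follows C in the sequence ...A C G T C G A G A...:

$$\mathrm{IMM}(S_i = G) \propto \lambda_1 \Pr\nolimits_{i,1}(S_i = G \mid S_{i-1} = C) + \lambda_0 \Pr\nolimits_{i,0}(S_i = G)$$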
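The interpolation above can be sketched in code as below. The sketch assumes a single pair of fixed weights λ0 and λ1 that sum to one, per-position lookup tables for the order-0 and order-1 probabilities, and an order-0 fallback at the first position. How the weights are actually set is not stated on the slide, and the function and argument names are hypothetical.

```python
import math

def imm_log_prob(seq, order0, order1, lam0, lam1):
    """Log Pr(S | model) under a position-specific, order-1 interpolated model.

    order0[i][base]         : Pr_{i,0}(S_i = base)
    order1[i][(prev, base)] : Pr_{i,1}(S_i = base | S_{i-1} = prev)
    lam0 + lam1 is assumed to equal 1 so each factor is a probability.
    """
    log_prob = 0.0
    for i, base in enumerate(seq):
        if i == 0:
            # No preceding base exists, so only the order-0 term applies.
            p = order0[i][base]
        else:
            p = lam1 * order1[i][(seq[i - 1], base)] + lam0 * order0[i][base]
        log_prob += math.log(p)
    return log_prob

# Toy two-position model with made-up probabilities.
uniform = {b: 0.25 for b in "ACGT"}
order0 = [uniform, uniform]
order1 = [{}, {("C", "G"): 0.5}]
score = imm_log_prob("CG", order0, order1, lam0=0.4, lam1=0.6)
```

Scanning a region then amounts to evaluating this log probability for each window and keeping the highest-scoring subsequence as the feature value.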

Gene Expression Features


• microarray data from 39 experiments

• given a candidate operon we can ask: does it look like all expression measurements for each experiment come from some true underlying signal?

Gene Expression Features
• evaluate candidate operon c by

$$L(c) = \frac{\prod_{e \in \text{expts}} \Pr(\vec{a}_{ce} \mid Op)}{\prod_{e \in \text{expts}} \Pr(\vec{a}_{ce})}$$

where $\vec{a}_{ce}$ represents expression measurements for c for 1 channel, 1 experiment
• compute similar features for
– gene before candidate & first gene in candidate
– gene after candidate & last gene in candidate
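The ratio L(c) can be illustrated in log space as below. The two densities are stand-ins chosen only to make the example runnable: under Op, the measurements in one experiment are treated as Gaussian noise around their shared mean, and the background model is a zero-mean Gaussian with a larger spread. These distributions, their parameters, and the function names are assumptions, not the models used in the paper.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def log_likelihood_ratio(measurements_by_expt, noise_sd, background_sd):
    """log L(c): sum over experiments of log Pr(a | Op) - log Pr(a)."""
    total = 0.0
    for values in measurements_by_expt:
        # Assumed Op model: one underlying signal per experiment plus noise.
        shared = sum(values) / len(values)
        log_op = sum(gaussian_logpdf(v, shared, noise_sd) for v in values)
        # Assumed background model: independent zero-mean measurements.
        log_bg = sum(gaussian_logpdf(v, 0.0, background_sd) for v in values)
        total += log_op - log_bg
    return total

# Two experiments, three genes each (made-up log-ratio values).
coherent = [[1.0, 1.1, 0.9], [-2.0, -2.1, -1.9]]
incoherent = [[1.0, -1.0, 0.0], [-2.0, 2.0, 0.5]]
coherent_score = log_likelihood_ratio(coherent, noise_sd=0.3, background_sd=1.5)
incoherent_score = log_likelihood_ratio(incoherent, noise_sd=0.3, background_sd=1.5)
```

Genes whose measurements move together across experiments score higher than genes whose measurements disagree, which is the behavior the feature is meant to capture.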

Positive and Negative Examples


• 365 known operons
• no real negative examples
• putative negative examples generated by
exploiting regularity in domain
– operons rarely overlap
– any sequence of genes overlapping a known
operon is unlikely to be an operon itself

Generating Putative Negatives
[Figure: a known positive operon g2 g3 g4 on the DNA sequence, with g1 and g5 drawn below it; the putative negatives shown are gene sequences that overlap the known operon, such as g3 g4, g2 alone, and g3 alone]
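The negative-generation rule can be sketched as an enumeration over contiguous gene ranges. The sketch assumes genes are numbered 1 to n within a run of consecutive genes that may be grouped together, and that every contiguous range overlapping the known operon, other than the operon itself, counts as a putative negative. Any further filtering the authors may apply is not modeled, and the function name is hypothetical.

```python
def putative_negatives(run_length, known_first, known_last):
    """List (first, last) gene ranges that overlap a known operon.

    Ranges are inclusive and 1-based; the known operon itself is excluded.
    """
    negatives = []
    for first in range(1, run_length + 1):
        for last in range(first, run_length + 1):
            overlaps = first <= known_last and last >= known_first
            if overlaps and (first, last) != (known_first, known_last):
                negatives.append((first, last))
    return negatives

# Known operon spans genes 2 to 4 in a hypothetical run of five genes.
negatives = putative_negatives(5, 2, 4)
```

For this toy run, ranges such as (3, 4), (2, 2), and (1, 5) are generated as negatives, while (1, 1) and (5, 5) are not, because they do not overlap the known operon.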

Constructing An Operon “Map” for Entire Genome
• assign every gene in given genome to its most
likely operon (i.e. partition the genome into
operons)
• given a method for scoring candidate operons, can
do this optimally using dynamic programming

Possible Operon Maps for a Run of Genes

[Figure: a run of genes g2 g3 g4 on the DNA sequence, with g1 and g5 drawn below it, followed by several possible operon maps that partition g2 g3 g4]

Scoring An Operon: Revisited


• the score for a candidate operon is given by:

$$Score(Op) = |Op| \times \log \Pr(Op \mid D)$$

where |Op| is the length of the operon (# genes) and Pr(Op | D) is the probability estimated by naïve Bayes

• see paper for justification

Dynamic Programming to Construct An Operon Map

$$Map(i) = \max \begin{cases} Score(i,1) + Map(i-1), \\ Score(i,2) + Map(i-2), \\ \quad\vdots \\ Score(i,i) + Map(i-i). \end{cases}$$

[Figure: genes g_{i-2}, g_{i-1}, g_i, illustrating the option Map(i - 2) + Score(i,2)]
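The recurrence can be sketched as a short dynamic program. It assumes genes in a run are numbered 1 to n, that Score(i, k) is the score of the candidate made of the k genes ending at gene i, and that Map(0) = 0. The function names and the toy posterior table are hypothetical; the score follows the |Op| × log Pr(Op | D) form given earlier.

```python
import math

def best_operon_map(n_genes, score):
    """Partition genes 1..n_genes into operons that maximize the total score.

    score(i, k) scores the candidate of k genes ending at gene i.
    Returns (best total score, list of inclusive (first, last) ranges).
    """
    best = [0.0] + [-math.inf] * n_genes  # best[i] corresponds to Map(i)
    back = [0] * (n_genes + 1)            # size of the last operon in Map(i)
    for i in range(1, n_genes + 1):
        for k in range(1, i + 1):
            total = score(i, k) + best[i - k]
            if k == 1 or total > best[i]:
                best[i] = total
                back[i] = k
    # Walk the back-pointers to recover the chosen partition.
    operons = []
    i = n_genes
    while i > 0:
        k = back[i]
        operons.append((i - k + 1, i))
        i -= k
    operons.reverse()
    return best[n_genes], operons

# Made-up posteriors Pr(Op | D) for every candidate (first, last) in 4 genes.
posterior = {(1, 1): 0.2, (2, 2): 0.3, (3, 3): 0.3, (4, 4): 0.9,
             (1, 2): 0.1, (2, 3): 0.8, (3, 4): 0.1,
             (1, 3): 0.7, (2, 4): 0.05, (1, 4): 0.05}

def toy_score(i, k):
    # Score(Op) = |Op| * log Pr(Op | D)
    return k * math.log(posterior[(i - k + 1, i)])

total, operon_map = best_operon_map(4, toy_score)
```

The two nested loops evaluate every candidate once, so the work grows quadratically with the length of the run rather than with the exponential number of possible maps.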

Experiments
• compare predictive accuracy of
– classifying candidate operons with naïve Bayes
– constructing operon map with DP + naïve Bayes
– randomly selected operon maps

• evaluate predictive accuracy of methods when
– using only a single feature group
– leaving out a single feature group

Operon Map Accuracy

[Figure: bar chart of accuracy, false positive rate, and true positive rate (axis 0% to 100%) for the naïve Bayes classifier, the operon map, and a random map]

Operon Maps Made with Individual Feature Groups

[Figure: bar chart of false positive rate and true positive rate (axis 0% to 80%) for operon maps built from all features and from each feature group alone: annotation, spacing, expression data, promoter, terminator, operon size, neighboring genes]

Operon Maps Made Leaving Out Individual Feature Groups

[Figure: bar chart of false positive rate and true positive rate (axis 0% to 80%) for operon maps built leaving out none of the features and leaving out each feature group in turn: annotation, spacing, expression data, promoter, terminator, operon size, neighboring genes]

Conclusions
• new method for predicting operons in prokaryotes
– score candidate operons using learned models
– construct operon map using dynamic programming
• learned models combine evidence from diverse sources
• approach is complementary to those that predict functionally coupled genes via genome comparisons [Overbeek et al. 1999; Tamames et al. 1997]
