A Probabilistic Learning Approach To Whole-Genome Operon Prediction


Mark Craven, David Page, Jude Shavlik, Joseph Bockhorst, Jeremy Glasner
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
Department of Genetics
University of Wisconsin
Finding Operons in E. coli

[Figure: a stretch of double-stranded DNA sequence with genes g1 to g5 marked, together with a promoter and a terminator]
Given: known operons and associated E. coli data
Do: predict all operons in E. coli
operon: sequence of one or more genes transcribed as a unit under some conditions
Finding Operons in E. coli:
Two Key Steps
• Given a candidate operon (i.e. a sequence of genes), score it
• Given scored candidates, partition the genome into operons
Scoring Operons with Naïve Bayes
$$\Pr(Op \mid D) \approx \frac{\Pr(Op) \prod_i \Pr(D_i \mid Op)}{\Pr(D)}$$
• where D_i is the i-th feature describing the candidate operon
• histograms used to represent conditional distributions
• we’ve also used C5.0 and non-naïve Bayes nets
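The naïve Bayes score above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and arguments are hypothetical, the per-feature likelihoods are assumed to be already available, and Pr(D) is assumed to be obtained by normalizing over the two classes (operon, not operon).

```python
import math

def naive_bayes_posterior(prior_op, lik_op, lik_not_op):
    """Return an estimate of Pr(Op | D) from per-feature likelihoods.

    prior_op   -- Pr(Op), prior probability that a candidate is an operon
    lik_op     -- list of Pr(D_i | Op), one entry per feature
    lik_not_op -- list of Pr(D_i | not Op), in the same feature order
    All probabilities must be strictly positive.
    """
    # Sum logs instead of multiplying so many small factors do not underflow.
    log_pos = math.log(prior_op) + sum(math.log(p) for p in lik_op)
    log_neg = math.log(1.0 - prior_op) + sum(math.log(p) for p in lik_not_op)
    # Assumption: Pr(D) is the sum of the two unnormalized class terms.
    m = max(log_pos, log_neg)
    pos = math.exp(log_pos - m)
    neg = math.exp(log_neg - m)
    return pos / (pos + neg)
```

For example, `naive_bayes_posterior(0.5, [0.8, 0.6], [0.2, 0.4])` combines two features that both favor the operon class and returns a posterior above the prior of 0.5.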
Estimating Conditional Probabilities
• construct histogram for each feature from training data
• 150 data points / bin
• hence, bin width varies
• from histogram, compute Pr(D_i | Op), Pr(D_i | ¬Op)
[Figure: example histogram of positive (Pos) and negative (Neg) counts over Bin1 to Bin4; vertical axis from 0 to 140]
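The equal-frequency binning described above might look like the following sketch. Several details are assumptions that the slides do not state: the bin edges are taken from the pooled training values, a pseudocount keeps every probability nonzero, and all function names are hypothetical.

```python
import bisect

def equal_frequency_edges(values, points_per_bin=150):
    """Pick interior bin edges so each bin holds about points_per_bin values.

    Bin widths vary because edges follow the data, not a fixed step.
    """
    ordered = sorted(values)
    edges = [ordered[k] for k in range(points_per_bin, len(ordered), points_per_bin)]
    # Repeated values can produce duplicate edges; keep each edge once.
    return sorted(set(edges))

def bin_index(edges, x):
    """Index of the bin containing x; bin j covers [edges[j-1], edges[j])."""
    return bisect.bisect_right(edges, x)

def conditional_probs(edges, class_values, pseudocount=1.0):
    """Estimate Pr(bin | class) from the training values of one class."""
    counts = [pseudocount] * (len(edges) + 1)
    for x in class_values:
        counts[bin_index(edges, x)] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

Calling `conditional_probs` once with the positive examples and once with the negative examples gives the two tables needed to look up Pr(D_i | Op) and Pr(D_i | ¬Op) for a new feature value.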
Features Used in Learned Models

• length and spacing features
• functional annotation features
• predicted promoters
• predicted terminators
• expression data features
Length and Spacing Features
[Figure: genes g1 to g5 laid out along the genome]
• number of genes in candidate
• mean and maximum within-operon space
• distance to neighboring genes
• strands of neighboring genes
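As a rough illustration of how these features could be computed, the sketch below assumes genes are stored in genome order as (start, end, strand) records. The `Gene` type, the coordinate convention, and the feature names are all hypothetical, and the circularity of the chromosome is ignored.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    start: int   # first base of the gene (assumed convention)
    end: int     # last base of the gene
    strand: str  # '+' or '-'

def spacing_features(genes, first, last):
    """Length and spacing features for the candidate genes[first..last]."""
    cand = genes[first:last + 1]
    # Gaps between consecutive genes inside the candidate.
    gaps = [b.start - a.end for a, b in zip(cand, cand[1:])]
    before = genes[first - 1] if first > 0 else None
    after = genes[last + 1] if last + 1 < len(genes) else None
    return {
        "num_genes": len(cand),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0,
        "dist_before": cand[0].start - before.end if before is not None else None,
        "dist_after": after.start - cand[-1].end if after is not None else None,
        "strand_before": before.strand if before is not None else None,
        "strand_after": after.strand if after is not None else None,
    }
```

A single-gene candidate has no internal gaps, so the sketch reports a gap of zero in that case; a real implementation might prefer a separate "not applicable" bin.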
Functional Annotation Features

• 1,668 genes have been assigned a functional
annotation code from a 3-level, 123-leaf hierarchy
• we expect the genes in an operon to have closely
related functions
[Figure: fragment of the annotation hierarchy: metabolism of small molecules → carbon energy metabolism → electron transport, fermentation, aerobic respiration]
Functional Annotation Features
• annotation distance between a pair of genes ∝ distance to common ancestor in hierarchy
• compute mean pairwise distances between:
– all genes in candidate operon
– gene before candidate and genes in candidate
– gene after candidate and genes in candidate
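One way to realize these annotation features is sketched below. It assumes each annotated gene carries its code as a tuple giving the path from the root of the hierarchy (for example `(1, 3, 2)` in a 3-level hierarchy), and it measures distance as the number of levels from the leaves up to the deepest common ancestor. Both the encoding and the function names are illustrative assumptions; genes without an annotation would have to be skipped before calling these functions.

```python
from itertools import combinations

def annotation_distance(code_a, code_b):
    """Levels from the leaves up to the deepest common ancestor."""
    shared = 0
    for x, y in zip(code_a, code_b):
        if x != y:
            break
        shared += 1
    return max(len(code_a), len(code_b)) - shared

def mean_pairwise_distance(codes):
    """Mean distance over all pairs of genes in the candidate operon."""
    pairs = list(combinations(codes, 2))
    if not pairs:
        return None  # fewer than two annotated genes
    return sum(annotation_distance(a, b) for a, b in pairs) / len(pairs)

def mean_distance_to(outside_code, codes):
    """Mean distance between a flanking gene and the genes in the candidate."""
    if not codes:
        return None
    return sum(annotation_distance(outside_code, c) for c in codes) / len(codes)
```

`mean_distance_to` would be called twice per candidate, once with the gene before it and once with the gene after it.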
Transcription Signal Features

[Figure: a promoter model applied upstream and a terminator model applied downstream of candidate g2 g3 g4]
• scan upstream of candidates looking for promoters
• scan downstream looking for terminators
• features represent highest scoring subsequence in each scan
Transcription Signal Features
• use position-specific Interpolated Markov Models to
predict promoters and terminators
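$$\Pr(S \mid \text{model}) = \prod_{i=1}^{n} \mathrm{IMM}(S_i)$$
Example, for a G that follows a C in ...A C G T C G A G A...:
$$\mathrm{IMM}(S_i = G) \propto \lambda_1 \Pr\nolimits_{i,1}(S_i = G \mid S_{i-1} = C) + \lambda_0 \Pr\nolimits_{i,0}(S_i = G)$$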
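The interpolation above can be illustrated with a simplified sketch. It is not the model used in the talk: it assumes fixed-length, aligned training sequences, mixes only order-0 and order-1 estimates, and uses one fixed weight `lam1` (with λ0 = 1 − λ1) where a full interpolated Markov model would choose the weights from the training counts. All names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def train_position_models(seqs, pseudocount=1.0):
    """Position-specific order-0 and order-1 estimates from aligned sequences."""
    n = len(seqs[0])
    p0, p1 = [], []  # p0[i][b] ~ Pr_{i,0}(S_i=b); p1[i][a][b] ~ Pr_{i,1}(S_i=b | S_{i-1}=a)
    for i in range(n):
        c0 = {b: pseudocount for b in ALPHABET}
        c1 = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for s in seqs:
            c0[s[i]] += 1
            if i > 0:
                c1[s[i - 1]][s[i]] += 1
        t0 = sum(c0.values())
        p0.append({b: c0[b] / t0 for b in ALPHABET})
        p1.append({a: {b: c1[a][b] / sum(c1[a].values()) for b in ALPHABET}
                   for a in ALPHABET})
    return p0, p1

def log_prob(seq, p0, p1, lam1=0.5):
    """log Pr(S | model) as a sum of log interpolated per-position terms."""
    lam0 = 1.0 - lam1
    total = math.log(p0[0][seq[0]])  # first position has no predecessor
    for i in range(1, len(seq)):
        mix = lam1 * p1[i][seq[i - 1]][seq[i]] + lam0 * p0[i][seq[i]]
        total += math.log(mix)
    return total

def best_window_score(region, p0, p1, lam1=0.5):
    """Highest-scoring model-length subsequence in a scanned region."""
    n = len(p0)
    return max(log_prob(region[k:k + n], p0, p1, lam1)
               for k in range(len(region) - n + 1))
```

`best_window_score` corresponds to the "highest scoring subsequence in each scan" feature; the region passed in must be at least as long as the model.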
Gene Expression Features

• microarray data from 39 experiments
• given a candidate operon we can ask: does it look like all expression measurements for each experiment come from some true underlying signal?
Gene Expression Features
• evaluate candidate operon c by
$$L(c) = \frac{\prod_{e \in \text{expts}} \Pr(\vec{a}_{ce} \mid Op)}{\prod_{e \in \text{expts}} \Pr(\vec{a}_{ce})}$$
where $\vec{a}_{ce}$ represents expression measurements for c for 1 channel, 1 experiment
• compute similar features for
– gene before candidate & first gene in candidate
– gene after candidate & last gene in candidate
Positive and Negative Examples

• 365 known operons
• no real negative examples
• putative negative examples generated by
exploiting regularity in domain
– operons rarely overlap
– any sequence of genes overlapping a known
operon is unlikely to be an operon itself
Generating Putative Negatives
[Figure: genes g1 to g5 with the known positive operon g2 g3 g4; putative negatives are other gene sequences that overlap it, such as g3 g4, g2 alone, or g3 alone]
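The generation of putative negatives can be sketched as an enumeration of contiguous gene runs. The sketch assumes genes are numbered in genome order and that every run overlapping the known operon, other than the operon itself, counts as a negative; the function name and the optional `max_len` cap are hypothetical additions.

```python
def putative_negatives(num_genes, known_first, known_last, max_len=None):
    """List (first, last) gene runs that overlap a known operon but differ from it.

    Genes are indexed 0..num_genes-1; runs are inclusive on both ends.
    """
    negatives = []
    for first in range(num_genes):
        for last in range(first, num_genes):
            if max_len is not None and last - first + 1 > max_len:
                break  # longer runs starting at `first` are also too long
            overlaps = first <= known_last and last >= known_first
            identical = (first, last) == (known_first, known_last)
            if overlaps and not identical:
                negatives.append((first, last))
    return negatives
```

With five genes and the known operon at indices 1 to 3 (g2 g3 g4), the sketch returns runs such as `(2, 3)` for g3 g4 and `(1, 1)` for g2 alone, and excludes both the known operon and runs that do not touch it.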
Constructing An Operon “Map” for Entire Genome
• assign every gene in given genome to its most
likely operon (i.e. partition the genome into
operons)
• given a method for scoring candidate operons, can
do this optimally using dynamic programming
Possible Operon Maps for a Run of Genes
[Figure: a run of genes g2 g3 g4, flanked by g1 and g5, and the possible operon maps that partition it]
Scoring An Operon: Revisited

• the score for a candidate operon is given by:
$$\mathrm{Score}(Op) = |Op| \times \log \Pr(Op \mid D)$$
where |Op| is the length of the operon (# genes) and Pr(Op | D) is the probability estimated by naïve Bayes
• see paper for justification
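Dynamic Programming to Construct An Operon Map
$$\mathrm{Map}(i) = \max \left\{ \begin{array}{l} \mathrm{Score}(i,1) + \mathrm{Map}(i-1), \\ \mathrm{Score}(i,2) + \mathrm{Map}(i-2), \\ \quad\vdots \\ \mathrm{Score}(i,i) + \mathrm{Map}(i-i). \end{array} \right.$$
[Figure: genes g_{i−2}, g_{i−1}, g_i annotated with Map(i − 2) and Score(i,2)]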
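The recurrence can be sketched as follows. The reading of Score(i, k) as the score of the candidate made of the k genes ending at gene i is an assumption based on the figure, Map(0) is taken to be 0, and `posterior` stands in for whatever model supplies Pr(Op | D); none of the names come from the talk.

```python
import math

def operon_score(num_genes_in_op, prob):
    """Score(Op) = |Op| * log Pr(Op | D); prob must be strictly positive."""
    return num_genes_in_op * math.log(prob)

def best_operon_map(n, posterior):
    """Partition genes 1..n into operons with maximal total score.

    posterior(first, last) returns Pr(Op | D) for genes first..last (inclusive).
    Returns (best total score, list of (first, last) operons in genome order).
    """
    best = [0.0] * (n + 1)  # best[i] plays the role of Map(i); Map(0) = 0
    back = [0] * (n + 1)    # size of the last operon in the best map of 1..i
    for i in range(1, n + 1):
        best[i] = -math.inf
        for k in range(1, i + 1):
            s = operon_score(k, posterior(i - k + 1, i)) + best[i - k]
            if s > best[i]:
                best[i], back[i] = s, k
    # Follow the back-pointers from gene n to recover the chosen operons.
    operons, i = [], n
    while i > 0:
        k = back[i]
        operons.append((i - k + 1, i))
        i -= k
    operons.reverse()
    return best[n], operons
```

The double loop evaluates every contiguous candidate once, so the cost grows quadratically with the length of the run of genes; capping the candidate size k would reduce it if long operons can be ruled out.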
Experiments
• compare predictive accuracy of
– classifying candidate operons with naïve Bayes
– constructing operon map with DP + naïve Bayes
– randomly selected operon maps
• evaluate predictive accuracy of methods when
– using only a single feature group
– leaving out a single feature group
Operon Map Accuracy
[Figure: bar chart of accuracy, false positive rate, and true positive rate (0% to 100%) for the naïve Bayes classifier, the operon map, and a random map]
Operon Maps Made with Individual Feature Groups
[Figure: bar chart of false positive rate and true positive rate (0% to 80%) for maps built from all features and from each single feature group: annotation, spacing, expression data, promoter, terminator, operon size, neighboring genes]
Operon Maps Made Leaving Out Individual Feature Groups
[Figure: bar chart of false positive rate and true positive rate (0% to 80%) for maps built leaving out none of the feature groups and leaving out each one in turn: annotation, spacing, expression data, promoter, terminator, operon size, neighboring genes]
Conclusions
• new method for predicting operons in prokaryotes
– score candidate operons using learned models
– construct operon map using dynamic program
• learned models combine evidence from diverse
sources
• approach is complementary to those that predict
functionally coupled genes via genome
comparisons
[Overbeek et al. 1999; Tamames et al. 1997]
