Hidden Markov Models for Sequence Parsing: Gene Structure in Eukaryotes
Gene finding
• Given un-annotated genomic sequences,
• delineate the genes they contain: their locations and internal structure (exons, introns, UTRs, promoters, poly-A sites, etc.).
[Figure: a stretch of raw, un-annotated genomic DNA beginning GAGAACGTGTGAGAGAGAGG…]
Example: The Dishonest Casino
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2
Question # 1 – Evaluation
GIVEN
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
How likely is this sequence, given our model of how the casino works?
Question # 2 – Decoding
GIVEN
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
What portion of the sequence was generated with the fair die, and what portion with the loaded die?
Question # 3 – Learning
GIVEN
124552646214614613613666166466163661636616361651561511514612356234
QUESTION
How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back?
A stochastic generative model
Observed sequence: x = 1 4 3 6 6 4
Hidden sequence (a parse or segmentation): y = B B A A A B
• Transition probabilities: ai,j = P(yt = j | yt-1 = i), with ai,1 + … + ai,K = 1, for all states i = 1…K
• Start probabilities: a0,i, with a0,1 + … + a0,K = 1
• Emission probabilities: ei(b) = P(xt = b | yt = i), with ei(b1) + … + ei(bM) = 1, for all states i = 1…K, where b1…bM are the possible observed symbols
The dishonest casino model
[Figure: a two-state HMM. States FAIR and LOADED; each state stays in place with probability 0.95 and switches with probability 0.05. FAIR emits each face 1–6 with probability 1/6; LOADED emits 6 with probability 1/2 and each other face with probability 1/10, as used in the worked example below.]
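In code this model is just three small arrays. A minimal NumPy sketch; the state order FAIR = 0, LOADED = 1 and all variable names are my own, and the loaded-die emission probabilities are the ones implied by the worked example on the next slide:

import numpy as np

# states: 0 = FAIR, 1 = LOADED
a0 = np.array([0.5, 0.5])                       # start probabilities (1/2 each, as in the example below)
A  = np.array([[0.95, 0.05],                    # transition probabilities a_kl:
               [0.05, 0.95]])                   # stay with prob 0.95, switch with prob 0.05
E  = np.array([[1/6, 1/6, 1/6, 1/6, 1/6, 1/6],  # FAIR: each face 1..6 with prob 1/6
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]) # LOADED: 6 with prob 1/2, others 1/10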
Likelihood of a parse
Given a sequence x = x1…xT and a parse π = π1…πT, the joint likelihood is
P(x, π) = a0,π1 eπ1(x1) × aπ1,π2 eπ2(x2) × … × aπT-1,πT eπT(xT)
i.e., a start probability, then an emission probability for each position and a transition probability between consecutive hidden states.
Example: the dishonest casino
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9
For the all-Loaded parse: ½ × (1/2)^2 × (1/10)^8 × (0.95)^9 ≈ 7.9 × 10^-10
Therefore, it is after all about 6.6 times more likely that the die is fair all the way than that it is loaded all the way.
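The same arithmetic as a sketch, reusing the arrays a0, A, E defined above; running it reproduces the ratio:

def joint_likelihood(x, path):
    """P(x, pi): start prob, then an emission and a transition per position."""
    p = a0[path[0]] * E[path[0], x[0]]
    for t in range(1, len(x)):
        p *= A[path[t-1], path[t]] * E[path[t], x[t]]
    return p

x = [r - 1 for r in (1, 2, 1, 5, 6, 2, 1, 6, 2, 4)]    # rolls as 0-based face indices
p_fair   = joint_likelihood(x, [0] * 10)               # ~5.21e-9
p_loaded = joint_likelihood(x, [1] * 10)               # ~7.88e-10
print(p_fair / p_loaded)                               # ~6.6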
Example: the dishonest casino
x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
y = L, L, …, L?
Now: ½ × (1/2)^6 × (1/10)^4 × (0.95)^9 ≈ 4.9 × 10^-7 for the all-Loaded parse, versus ≈ 5.2 × 10^-9 for the all-Fair parse, so here Loaded all the way is about 95 times more likely.
The three main questions on HMMs:
1. Evaluation – how likely is a given sequence under the model? (compute P(x))
2. Decoding – which portions of the sequence were generated by which state? (find the best π)
3. Learning – how do we estimate the model parameters θ from data?
Applications of HMMs
Some facts about human genes
GENSCAN (Burge & Karlin)
[Figure: the GENSCAN state diagram, superimposed on a long stretch of raw genomic DNA. The hidden states model exons in each reading-frame phase (E0, E1, E2), initial and terminal exons (Ei, Et), single-exon genes (Es), introns (I0, I1, I2), the 5'UTR and 3'UTR, promoter, poly-A signal, and the intergenic region; the whole state set is duplicated for the forward (+) and reverse (-) strands. Each state y emits sequence according to its own model p( • | y ), with state-specific parameters θ1, θ2, θ3, θ4, ….]
Questions:
The Forward Algorithm
We want to calculate P(x), the probability of the observed sequence given the model. Define the forward variable fk(t) = P(x1…xt, πt = k). Decomposing over the previous state πt-1:
fk(t) = Σπt-1 P(x1…xt-1, πt-1) P(πt = k | πt-1, x1…xt-1) P(xt | πt = k, πt-1, x1…xt-1)
= Σl P(x1…xt-1, πt-1 = l) alk ek(xt) (Markov property and conditional independence of emissions)
= ek(xt) Σl fl(t-1) alk
The Forward Algorithm
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fl(t) = el(xt) Σk fk(t-1) akl (a0k is the vector of initial probabilities; with f0(0) = 1, the t = 1 step reduces to fl(1) = a0l el(x1))
Termination:
P(x) = Σ k f k(T)
We start by computing fl(1) = a0l el(x1), then fill in the table of fk(t) values from left to right.
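As a sketch, continuing the NumPy session started above (plain probability space, which is fine for short sequences; long sequences need scaling or logs, as noted for Viterbi below):

def forward(x, a0, A, E):
    """Forward table f[t, k] = P(x_1..x_t, pi_t = k); returns (f, P(x))."""
    T, K = len(x), len(a0)
    f = np.zeros((T, K))
    f[0] = a0 * E[:, x[0]]                   # initialization: f_l(1) = a_0l e_l(x_1)
    for t in range(1, T):
        f[t] = E[:, x[t]] * (f[t-1] @ A)     # f_l(t) = e_l(x_t) * sum_k f_k(t-1) a_kl
    return f, f[-1].sum()                    # termination: P(x) = sum_k f_k(T)

f, px = forward(x, a0, A, E)                 # x, a0, A, E as defined earlier
print(px)                                    # P(x): likelihood summed over all parses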
The Backward Algorithm – derivation
Define the backward variable bk(t) = P(xt+1…xT | πt = k), the probability of the remaining observations given that position t is in state k.
Initialization:
bk(T) = 1, for all k
Iteration:
bk(t) = Σl akl el(xt+1) bl(t+1)
Termination:
P(x) = Σl a0l el(x1) bl(1)
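A matching sketch of the backward pass, same session and model arrays as the forward sketch:

def backward(x, a0, A, E):
    """Backward table b[t, k] = P(x_{t+1}..x_T | pi_t = k); returns (b, P(x))."""
    T, K = len(x), len(a0)
    b = np.ones((T, K))                          # initialization: b_k(T) = 1
    for t in range(T - 2, -1, -1):
        b[t] = A @ (E[:, x[t+1]] * b[t+1])       # b_k(t) = sum_l a_kl e_l(x_{t+1}) b_l(t+1)
    return b, (a0 * E[:, x[0]] * b[0]).sum()     # termination: P(x) = sum_l a0_l e_l(x_1) b_l(1)

b, px = backward(x, a0, A, E)
print(px)                                        # same P(x) as the forward pass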
Posterior Decoding
P(πt = k | x) = fk(t) bk(t) / P(x)
π*t = argmaxk P(πt = k | x)
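Combining the two tables gives posterior decoding; a sketch in the same session as above:

f, px = forward(x, a0, A, E)
b, _  = backward(x, a0, A, E)
posterior = f * b / px                 # posterior[t, k] = P(pi_t = k | x)
pi_star = posterior.argmax(axis=1)     # per-position best state: 0 = FAIR, 1 = LOADED
print(pi_star)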
Decoding
We want to find π = π1, …, πT such that P(π|x) is maximized:
π* = argmaxπ P(π|x) = argmaxπ P(π,x)
We can use dynamic programming!
[Figure: the decoding trellis: K states per column, one column for each position x1, x2, x3, …, xT]
Decoding – main idea
Define Vk(t) = max{π1,…,πt-1} P(x1…xt-1, π1,…,πt-1, xt, πt = k), the probability of the most likely parse of the prefix x1…xt that ends in state k. What is Vl(t+1)?
From the definition,
Vl(t+1) = max{π1,…,πt} P(x1…xt, π1,…,πt, xt+1, πt+1 = l)
= max{π1,…,πt} P(xt+1, πt+1 = l | x1…xt, π1,…,πt) P(x1…xt, π1,…,πt)
= max{π1,…,πt} P(xt+1, πt+1 = l | πt) P(x1…xt-1, π1,…,πt-1, xt, πt) (Markov property)
= maxk P(xt+1, πt+1 = l | πt = k) max{π1,…,πt-1} P(x1…xt-1, π1,…,πt-1, xt, πt = k)
= el(xt+1) maxk akl Vk(t)
Input: x = x1…xT
Initialization:
V0(0) = 1 (0 is the imaginary first position)
Vk(0) = 0, for all k > 0
Iteration:
Vj(t) = ej(xt) × maxk akj Vk(t-1)
Ptrj(t) = argmaxk akj Vk(t-1)
Termination:
P(x, π*) = maxk Vk(T)
Traceback:
πT* = argmaxk Vk(T)
πt-1* = Ptrπt*(t)
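A log-space sketch of the algorithm, same session and model as above (the logs anticipate the practical detail on the next slide):

def viterbi(x, a0, A, E):
    """Returns the most likely path pi* and log P(x, pi*)."""
    T, K = len(x), len(a0)
    logA, logE = np.log(A), np.log(E)
    V = np.zeros((T, K))
    ptr = np.zeros((T, K), dtype=int)
    V[0] = np.log(a0) + logE[:, x[0]]
    for t in range(1, T):
        scores = V[t-1][:, None] + logA        # scores[k, l] = V_k(t-1) + log a_kl
        ptr[t] = scores.argmax(axis=0)         # Ptr_l(t) = argmax_k a_kl V_k(t-1)
        V[t] = logE[:, x[t]] + scores.max(axis=0)
    path = np.zeros(T, dtype=int)              # traceback
    path[-1] = V[-1].argmax()                  # pi*_T = argmax_k V_k(T)
    for t in range(T - 1, 0, -1):
        path[t-1] = ptr[t, path[t]]            # pi*_{t-1} = Ptr_{pi*_t}(t)
    return path, V[-1].max()

rolls = [r - 1 for r in (1, 6, 6, 5, 6, 2, 6, 6, 3, 6)]   # the second example above
path, logp = viterbi(rolls, a0, A, E)
print(path)                                    # 1s mark positions decoded as LOADED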
Viterbi Algorithm – a practical detail
The V values underflow quickly on long sequences (they are products of many numbers smaller than 1), so in practice one works with logarithms, replacing products by sums:
Vl(t) = log el(xt) + maxk [log akl + Vk(t-1)]
Computational Complexity
What is the running time, and space required, for Forward and Backward?
Each of the K × T table cells is filled with O(K) work, so the time is O(K²T) and the space is O(KT); Viterbi has the same complexity.
Model learning: two scenarios
1. Supervised learning: estimation when the “right answer” (the true path π) is known.
Example:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
2. Unsupervised learning: estimation when the “right answer” is unknown.
Examples:
GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor what their composition is
GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice
Supervised ML estimation
Given x = x1…xN for which the true π = π1…πN is known,
Define:
Akl = number of times the transition k→l occurs in π
Ek(b) = number of times state k emits symbol b in x
Then the maximum-likelihood estimates are:
akl = Akl / Σi Aki
ek(b) = Ek(b) / Σc Ek(c)
(Homework!)
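A counting sketch in the same session; the r argument is a pseudocount knob of my own, anticipating the Pseudocounts slide below:

def supervised_mle(x, pi, K, M, r=0.0):
    """MLE of transitions/emissions by counting in a labeled path; r = pseudocount."""
    A_counts = np.full((K, K), float(r))
    E_counts = np.full((K, M), float(r))
    for t in range(len(x)):
        if t > 0:
            A_counts[pi[t-1], pi[t]] += 1              # A_kl
        E_counts[pi[t], x[t]] += 1                     # E_k(b)
    with np.errstate(invalid="ignore"):                # unvisited states give 0/0 = nan
        a = A_counts / A_counts.sum(axis=1, keepdims=True)
        e = E_counts / E_counts.sum(axis=1, keepdims=True)
    return a, e

# the 10-roll example from the next slide: every position labeled Fair (state 0)
x  = [r - 1 for r in (2, 1, 5, 6, 1, 2, 3, 6, 2, 3)]
pi = [0] * 10
a, e = supervised_mle(x, pi, K=2, M=6)
print(e[0])                                            # [.2, .3, .2, 0, .1, .2] — note the 0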
Supervised ML estimation
Drawback:
Given little data, there may be overfitting:
P(x|θ) is maximized, but θ is unreasonable
0 probabilities – VERY BAD
Example:
Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
(The Loaded state is never visited, so its parameters are unconstrained, and eF(4) = 0 would rule out any sequence containing a 4.)
Pseudocounts
Add pseudocounts:
Akl = (# times the transition k→l occurs in π) + rkl
Ek(b) = (# times state k emits b in x) + rk(b)
The rkl, rk(b) > 0 encode our prior beliefs; larger pseudocounts mean a stronger prior, and even small ones (e.g., r = 1) keep every probability away from 0.
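With the counting sketch above, pseudocounts are just the r argument:

a, e = supervised_mle(x, pi, K=2, M=6, r=1.0)   # add 1 to every count
print(e[0])                                     # no zeros now: eF(4) = 1/16 > 0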
Unsupervised ML estimation
Given x = x1…xN for which the true π = π1…πN is unknown,
EXPECTATION MAXIMIZATION (the Baum-Welch algorithm):
1. Start from an initial guess of the parameters θ = (akl, ek(b))
2. E-step: run Forward and Backward and compute the expected counts
Akl = Σt fk(t) akl el(xt+1) bl(t+1) / P(x)
Ek(b) = Σ{t: xt = b} fk(t) bk(t) / P(x)
3. M-step: re-estimate akl and ek(b) from these counts, exactly as in the supervised case
4. Iterate until convergence
We can get to a provably more (or equally) likely parameter set θ each iteration.
Time Complexity: O(K²T) per iteration, times the number of iterations
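A single EM iteration as a sketch, reusing forward() and backward() from the earlier sketches and the most recently defined roll sequence x (illustrative only; real implementations add scaling, convergence checks, and multiple training sequences):

def baum_welch_step(x, a0, A, E):
    """One Baum-Welch iteration: expected counts (E-step), then re-estimation (M-step)."""
    T = len(x)
    f, px = forward(x, a0, A, E)
    b, _  = backward(x, a0, A, E)
    A_counts = np.zeros_like(A)
    E_counts = np.zeros_like(E)
    for t in range(T - 1):   # expected transition counts A_kl
        A_counts += f[t][:, None] * A * (E[:, x[t+1]] * b[t+1])[None, :] / px
    for t in range(T):       # expected emission counts E_k(b)
        E_counts[:, x[t]] += f[t] * b[t] / px
    newA = A_counts / A_counts.sum(axis=1, keepdims=True)
    newE = E_counts / E_counts.sum(axis=1, keepdims=True)
    return newA, newE, px    # (start probabilities a0 are kept fixed in this sketch)

for _ in range(50):          # each iteration provably does not decrease P(x)
    A, E, px = baum_welch_step(x, a0, A, E)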