Hidden Markov Models
Based on
• “Foundations of Statistical NLP” by C. Manning & H.
Schütze, ch. 9, MIT Press, 2002
• “Biological Sequence Analysis”, R. Durbin et al., ch. 3 and
11.6, Cambridge University Press, 1998
1.
PLAN
1 Markov Models
Markov assumptions
2 Hidden Markov Models
3 Fundamental questions for HMMs
3.1 Probability of an observation sequence:
the Forward algorithm, the Backward algorithm
3.2 Finding the “best” sequence: the Viterbi algorithm
3.3 HMM parameter estimation:
the Forward-Backward (EM) algorithm
4 HMM extensions
5 Applications
2.
1 Markov Models (generally)
Markov Models are used to model a sequence of random variables in which each element depends on previous elements.
X = ⟨X_1 … X_T⟩, X_t ∈ S = {s_1, …, s_N}
X is also called a Markov Process or Markov Chain.
S = set of states
Π = initial state probabilities:
π_i = P(X_1 = s_i); Σ_{i=1}^N π_i = 1
A = transition probabilities:
a_ij = P(X_{t+1} = s_j | X_t = s_i); Σ_{j=1}^N a_ij = 1, ∀i
3.
Markov assumptions
• Limited Horizon: P(X_{t+1} = s_i | X_1 … X_t) = P(X_{t+1} = s_i | X_t)
(first-order Markov model)
• Time Invariance: P(X_{t+1} = s_j | X_t = s_i) = a_ij, ∀t
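Under these two assumptions, the probability of a whole state sequence factors as P(X) = π_{x_1} Π_t a_{x_t x_{t+1}}. A minimal sketch; the two-state weather chain is made up for illustration:

```python
# Probability of a state sequence under a first-order, time-invariant
# Markov chain: P(X) = pi_{x_1} * prod_t a_{x_t x_{t+1}}.
# The two-state "weather" chain below is a made-up example.
states = ["rain", "sun"]
pi = {"rain": 0.3, "sun": 0.7}                 # initial state probabilities
a = {"rain": {"rain": 0.6, "sun": 0.4},        # transition probabilities
     "sun":  {"rain": 0.2, "sun": 0.8}}

def seq_prob(xs):
    """P(X) for a state sequence xs, using the chain factorization."""
    p = pi[xs[0]]
    for s, t in zip(xs, xs[1:]):
        p *= a[s][t]
    return p

print(seq_prob(["sun", "sun", "rain"]))  # 0.7 * 0.8 * 0.2 = 0.112
```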
[Figure: a Markov chain over the DNA alphabet, with one state for each of A, C, G, T. Note: here we leave the transition probabilities unspecified.]
5.
The maximum-likelihood transition probabilities of the ‘+’ (CpG island) and ‘−’ (background) models are estimated from the transition counts c⁺_st and c⁻_st:
a⁺_st = c⁺_st / Σ_{t′} c⁺_{st′}        a⁻_st = c⁻_st / Σ_{t′} c⁻_{st′}
+ A C G T − A C G T
A 0.180 0.274 0.426 0.120 A 0.300 0.205 0.285 0.210
C 0.171 0.368 0.274 0.188 C 0.322 0.298 0.078 0.302
G 0.161 0.339 0.375 0.125 G 0.248 0.246 0.298 0.208
T 0.079 0.355 0.384 0.182 T 0.177 0.239 0.292 0.292
6.
S(x) = log₂ [P(x | model +) / P(x | model −)] = Σ_{i=1}^L log₂ (a⁺_{x_{i−1} x_i} / a⁻_{x_{i−1} x_i}) = Σ_{i=1}^L β_{x_{i−1} x_i}
β A C G T
A −0.740 0.419 0.580 −0.803
C −0.913 0.302 1.812 −0.685
G −0.624 0.461 0.331 −0.730
T −1.169 0.573 0.393 −0.679
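The score S(x) can be computed directly from the β table above; a minimal sketch (the test sequence is arbitrary):

```python
# Log-odds scores beta[s][t] = log2(a+_st / a-_st), from the table above.
beta = {
    "A": {"A": -0.740, "C": 0.419, "G": 0.580, "T": -0.803},
    "C": {"A": -0.913, "C": 0.302, "G": 1.812, "T": -0.685},
    "G": {"A": -0.624, "C": 0.461, "G": 0.331, "T": -0.730},
    "T": {"A": -1.169, "C": 0.573, "G": 0.393, "T": -0.679},
}

def log_odds_score(x):
    """S(x) = sum of beta over consecutive pairs; S(x) > 0 favours CpG island."""
    return sum(beta[s][t] for s, t in zip(x, x[1:]))

print(round(log_odds_score("ACGT"), 3))  # 0.419 + 1.812 - 0.730 = 1.501
```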
7.
2 Hidden Markov Models
Remark:
In all our subsequent examples bijk is independent of j.
8.
[Figure: the "occasionally dishonest casino" HMM with two states, F (fair die) and L (loaded die). F emits 1…6 each with probability 1/6; L emits 1…5 with probability 1/10 each and 6 with probability 1/2. Transitions: F→F 0.95, F→L 0.05, L→L 0.9, L→F 0.1; the values 0.99 and 0.01 in the diagram are presumably the initial-state probabilities of F and L.]
11.
A second HMM example: the crazy soft drink machine
(from [Manning & Schütze, 2000])
[Figure: the machine's state diagram; the state shown emits Coke with probability 0.6, iced tea 0.1 and lemonade 0.3; the dangling value 0.3 is a transition probability.]
[Figure: the induction step of the Forward algorithm on the trellis: α_j(t+1) is computed from the values α_i(t), i = 1…N, using the transition probabilities a_ij and emission probabilities b_ijO.]
18.
Trellis
[Figure: the trellis: states s_1 … s_N on the vertical axis, time 1, 2, …, t, …, T+1 on the horizontal axis.]
19.
Induction:
β_i(t) = P(O_t O_{t+1} … O_T | X_t = i, μ)
= Σ_{j=1}^N P(O_t O_{t+1} … O_T, X_{t+1} = j | X_t = i, μ)
= Σ_{j=1}^N P(O_t O_{t+1} … O_T | X_t = i, X_{t+1} = j, μ) P(X_{t+1} = j | X_t = i, μ)
= Σ_{j=1}^N P(O_{t+1} … O_T | O_t, X_t = i, X_{t+1} = j, μ) P(O_t | X_t = i, X_{t+1} = j, μ) a_ij
= Σ_{j=1}^N P(O_{t+1} … O_T | X_{t+1} = j, μ) b_{ijO_t} a_ij = Σ_{j=1}^N β_j(t+1) b_{ijO_t} a_ij
Total: P(O|μ) = Σ_{i=1}^N P(O_1 O_2 … O_T | X_1 = i, μ) P(X_1 = i | μ) = Σ_{i=1}^N β_i(1) π_i
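The backward recursion above can be sketched directly in the slides' arc-emission notation (O_t is emitted on the transition X_t → X_{t+1}, and β_i(T+1) = 1). The two-state model and observation sequence below are made up for illustration, and the result is checked against direct summation over all state paths:

```python
from itertools import product

# Backward-algorithm sketch, arc-emission notation: O_t is emitted on the
# transition X_t -> X_{t+1}, so T observations involve states X_1..X_{T+1}.
# The tiny model below (2 states, 2 symbols) is a made-up example.
N = 2
pi = [0.6, 0.4]                               # pi_i = P(X_1 = s_i)
a = [[0.7, 0.3], [0.4, 0.6]]                  # a_ij
b = [[[0.5, 0.5], [0.9, 0.1]],                # b[i][j][o] = b_{ijo}
     [[0.2, 0.8], [0.3, 0.7]]]
O = [0, 1, 1]                                 # observations O_1..O_T (T = 3)
T = len(O)

# Induction: beta_i(t) = sum_j beta_j(t+1) * b_{ij O_t} * a_ij
beta = [None] * (T + 2)
beta[T + 1] = [1.0] * N                       # empty suffix
for t in range(T, 0, -1):
    beta[t] = [sum(beta[t + 1][j] * b[i][j][O[t - 1]] * a[i][j]
                   for j in range(N)) for i in range(N)]

# Total: P(O|mu) = sum_i beta_i(1) * pi_i
total = sum(beta[1][i] * pi[i] for i in range(N))

# Sanity check: direct summation over all state paths X_1..X_{T+1}
brute = 0.0
for x in product(range(N), repeat=T + 1):
    p = pi[x[0]]
    for t in range(T):
        p *= a[x[t]][x[t + 1]] * b[x[t]][x[t + 1]][O[t]]
    brute += p
print(total, brute)  # the two values agree
```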
21.
Combining Forward and Backward probabilities
P (O, Xt = i|µ) = αi (t)βi (t)
P(O|μ) = Σ_{i=1}^N α_i(t) β_i(t), for 1 ≤ t ≤ T+1
Proofs:
P (O, Xt = i|µ) = P (O1 . . . OT , Xt = i|µ)
= P (O1 . . . Ot−1 , Xt = i, Ot . . . OT |µ)
= P (O1 . . . Ot−1 , Xt = i|µ)P (Ot . . . OT |O1 . . . Ot−1 , Xt = i, µ)
= αi (t)P (Ot . . . OT |Xt = i, µ)
= αi (t)βi (t)
P(O|μ) = Σ_{i=1}^N P(O, X_t = i|μ) = Σ_{i=1}^N α_i(t) β_i(t)
Note: The “total” forward and backward formulae are special cases of
the above one (for t = T + 1 and respectively t = 1).
22.
3.2 Finding the “best” state sequence
3.2.1 Posterior decoding
One way to find the most likely state sequence underlying the
observation sequence: choose the states individually
p_t(i, j) = α_i(t) a_ij b_{ijO_t} β_j(t+1) / Σ_{m=1}^N Σ_{n=1}^N α_m(t) a_mn b_{mnO_t} β_n(t+1)
Summing over t:
Σ_{t=1}^T p_t(i, j) = expected number of transitions from s_i to s_j in O
Σ_{j=1}^N Σ_{t=1}^T p_t(i, j) = expected number of transitions from s_i in O
29.
[Figure: the quantities entering p_t(i, j) on the trellis: α_i(t) at state s_i and time t, the transition probability a_ij with emission probability b_{ijO_t}, and β_j(t+1) at state s_j and time t+1.]
30.
[Figure: the crazy soft drink machine after parameter re-estimation: the state shown now emits Coke with probability 0.4037, iced tea 0.1376 and lemonade 0.4587; the dangling value 0.4514 is a re-estimated transition probability.]
Given a valid model θt , we want to estimate a new and better model θt+1 ,
i.e. one for which
log P (x | θt+1 ) > log P (x | θt )
35.
Chain rule
P(x, y | θ) = P(y | x, θ) P(x | θ) ⇒ log P(x | θ) = log P(x, y | θ) − log P(y | x, θ)
By multiplying the last equality by P(y | x, θ^t) and summing over y, it follows (since Σ_y P(y | x, θ^t) = 1):
log P(x | θ) = Σ_y P(y | x, θ^t) log P(x, y | θ) − Σ_y P(y | x, θ^t) log P(y | x, θ)
Writing Q(θ | θ^t) = Σ_y P(y | x, θ^t) log P(x, y | θ) for the first term, the difference that should be positive becomes
log P(x | θ) − log P(x | θ^t) = Q(θ | θ^t) − Q(θ^t | θ^t) + Σ_y P(y | x, θ^t) log [P(y | x, θ^t) / P(y | x, θ)]
Note that the last sum is the relative entropy of P(y | x, θ^t) with respect to P(y | x, θ), therefore it is non-negative. So,
log P(x | θ) − log P(x | θ^t) ≥ Q(θ | θ^t) − Q(θ^t | θ^t)
with equality only if θ = θ^t, or if P(x | θ) = P(x | θ^t) for some other θ ≠ θ^t.
36.
Note: Since the likelihood increases at each iteration, the procedure will
always reach a local (or maybe global) maximum asymptotically as t → ∞.
37.
Note:
For many models, such as HMM, both of these steps can be carried out
analytically.
If the second step cannot be carried out exactly, we can use some numerical
optimisation technique to maximise Q.
In fact, it is enough to make Q(θt+1 | θt ) > Q(θt | θt ), thus getting generalised
EM algorithms. See [Dempster, Laird, Rubin, 1977], [Meng, Rubin, 1992],
[Neal, Hinton, 1993].
38.
3.3.3.2 Derivation of EM steps for HMM
In this case, the ‘missing data’ are the state paths π. We want to maximize
Q(θ | θ^t) = Σ_π P(π | x, θ^t) log P(x, π | θ)
For a given path π, each parameter of the model will appear some number of times in P(x, π | θ), computed as usual. We will denote this number A_kl(π) for transitions and E_k(b, π) for emissions. Then
P(x, π | θ) = Π_{k=1}^M Π_b [e_k(b)]^{E_k(b,π)} · Π_{k=0}^M Π_{l=1}^M [a_kl]^{A_kl(π)}
Taking the expectation over P(π | x, θ^t) replaces A_kl(π) and E_k(b, π) by their expected values A_kl and E_k(b), so
Q(θ | θ^t) = Σ_{k=1}^M Σ_b E_k(b) log e_k(b) + Σ_{k=0}^M Σ_{l=1}^M A_kl log a_kl
Let a⁰_kl = A_kl / Σ_{l′} A_{kl′}. For the transition part of Q, the difference between its value at a⁰ and at any other a is
Σ_{k=0}^M Σ_{l=1}^M A_kl log (a⁰_kl / a_kl) = Σ_{k=0}^M (Σ_{l′} A_{kl′}) Σ_{l=1}^M a⁰_kl log (a⁰_kl / a_kl)
The last (inner) sum is a relative entropy, and thus it is larger than 0 unless a_kl = a⁰_kl. This proves that the maximum is at a⁰_kl.
Exactly the same procedure can be used for the E term.
Exactly the same procedure can be used for the E term.
40.
Therefore, the M-step just consists of plugging Akl and Ek (b) into
the re-estimation formulae for akl and ek (b). (See formulae (3.18)
in the R. Durbin et al. BSA book.)
41.
4 HMM extensions
5 Applications
◦ Speech Recognition
• Text Processing: Part-of-Speech Tagging
• Probabilistic Information Retrieval
◦ Bioinformatics: genetic sequence analysis
43.
[Figure: fragment of an HMM for POS tagging: states det, adj and noun, with π_det = 1; emission probabilities P(a) = 0.245 and P(the) = 0.586 for det, P(large) = 0.004 and P(small) = 0.005 for adj, P(house) = 0.001 and P(stock) = 0.001 for noun; the transition probabilities shown are 0.218, 0.45, 0.016 and 0.475.]
46.
From David Austin, “How Google finds your needle in the web’s
haystack”, Monthly Essays on Mathematical Topics, 2006.
51.
Notations
Let n = the number of pages on Internet, and
H and A two n × n matrices defined by
h_ij = 1 if page j points to page i (notation: P_j ∈ B_i), and 0 otherwise
a_ij = 1 if page i contains no outgoing links, and 0 otherwise
α ∈ [0, 1] (this is a parameter that was initially set to 0.85)
G is a stochastic matrix (g_ij ∈ [0, 1], Σ_{i=1}^n g_ij = 1),
therefore λ_1, the greatest eigenvalue of G, is 1, and
G has a stationary vector I (i.e., GI = I).
G is also primitive (|λ_2| < 1, where λ_2 is the second eigenvalue of G)
and irreducible (I > 0).
From matrix calculus it follows that
I can be computed using the power method:
if I^1 = G I^0, I^2 = G I^1, …, I^k = G I^{k−1}, then I^k → I.
I gives the relative importance of pages.
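The power method is a few lines of code. A sketch on a made-up 4-page link graph; the construction G = αS + (1−α)/n · ones follows the standard PageRank setup (the slides defining G are not in this excerpt, so that formula is an assumption here):

```python
import numpy as np

# Power-method sketch for PageRank on a hypothetical 4-page web.
# Assumed construction of the (column-stochastic, primitive) Google
# matrix: G = alpha*S + (1-alpha)/n * ones, where S is the normalized
# link matrix with dangling columns replaced by 1/n.
alpha, n = 0.85, 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page j -> pages it points to
S = np.zeros((n, n))
for j, outs in links.items():
    if outs:
        for i in outs:
            S[i, j] = 1.0 / len(outs)
    else:
        S[:, j] = 1.0 / n                    # dangling page
G = alpha * S + (1 - alpha) / n * np.ones((n, n))

I = np.full(n, 1.0 / n)                      # I^0
for _ in range(100):                         # I^k = G I^{k-1}
    I = G @ I
print(I)  # relative importance of the pages; the entries sum to 1
```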
54.
Suggested readings
“Using Google’s PageRank algorithm to identify important attributes
of genes”, G.M. Osmani, S.M. Rahman, 2006
55.
ADDENDA
Note
A begin state was introduced. The transition probability a_0k from this begin state to state k can be thought of as the probability of starting in state k.
An end state is assumed, which is the reason for ak0 in the termination step.
If ends are not modelled, this ak0 will disappear.
For convenience we label both begin and end states as 0. There is no conflict
because you can only transit out of the begin state and only into the end
state, so variables are not used more than once.
The emission probabilities are considered independent of the origin state.
(Thus the emission of (pairs of) symbols can be seen as being done when
reaching the non-end states.) The begin and end states are silent.
56.
Forward:
1. Initialization (i = 0): f_0(0) = 1; f_k(0) = 0, for k > 0
2. Induction (i = 1 … L): f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_kl
3. Total: P(x) = Σ_k f_k(L) a_k0
Backward:
1. Initialization (i = L): b_k(L) = a_k0, for all k
2. Induction (i = L−1, …, 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
3. Total: P(x) = Σ_l a_0l e_l(x_1) b_l(1)
Combining f and b: P(π_i = k, x) = f_k(i) b_k(i)
57.
Viterbi:
1. Initialization (i = 0): v_0(0) = 1; v_k(0) = 0, for k > 0
2. Induction (i = 1 … L):
v_l(i) = e_l(x_i) max_k (v_k(i−1) a_kl);
ptr_i(l) = argmax_k (v_k(i−1) a_kl)
3. Termination and readout of best path:
P(x, π*) = max_k (v_k(L) a_k0);
π*_L = argmax_k (v_k(L) a_k0), and π*_{i−1} = ptr_i(π*_i), for i = L … 1.
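A runnable sketch of this recursion for the fair/loaded dice HMM shown earlier. Assumptions: the 0.99/0.01 values in that state diagram are read as start probabilities a_0k, and ends are not modelled, so the a_k0 factor in the termination step is dropped:

```python
# Viterbi sketch in the Addenda's notation, for the fair (F) / loaded (L)
# dice HMM. Assumed: start probabilities a_0k = 0.99/0.01; no end state,
# so the a_k0 factor is dropped in the termination step.
states = "FL"
a0 = {"F": 0.99, "L": 0.01}                      # a_0k (assumed)
a = {"F": {"F": 0.95, "L": 0.05},                # a_kl
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {r: 1 / 6 for r in "123456"},          # e_k(b)
     "L": dict({r: 0.1 for r in "12345"}, **{"6": 0.5})}

def viterbi(x):
    """Most probable state path for a string of die rolls, e.g. '162636'."""
    v = {k: a0[k] * e[k][x[0]] for k in states}  # v_k(1)
    ptrs = []
    for sym in x[1:]:
        new_v, ptr = {}, {}
        for l in states:
            k_best = max(states, key=lambda k: v[k] * a[k][l])
            ptr[l] = k_best
            new_v[l] = e[l][sym] * v[k_best] * a[k_best][l]
        v = new_v
        ptrs.append(ptr)
    # Termination and traceback (no a_k0 since ends are not modelled)
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(ptrs):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("111111"))  # a run of ones is best explained by the fair die
```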
58.
Baum-Welch:
1. Initialization: pick arbitrary model parameters
2. Induction:
For each sequence j = 1 … n, calculate f_k^j(i) and b_k^j(i) for sequence j using the forward and respectively backward algorithms.
Calculate the expected number of times each transition or emission is used, given the training sequences:
A_kl = Σ_j [1 / P(x^j)] Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)
E_k(b) = Σ_j [1 / P(x^j)] Σ_{i | x^j_i = b} f_k^j(i) b_k^j(i)
Calculate the new model parameters:
a_kl = A_kl / Σ_{l′} A_{kl′}   and   e_k(b) = E_k(b) / Σ_{b′} E_k(b′)
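One Baum-Welch iteration can be sketched in the Addenda's notation (f, b, a_kl, e_k(b)). Assumptions: a made-up 2-state, binary-alphabet model and training set; the begin-state probabilities a_0k are held fixed, and ends are not modelled (so P(x) = Σ_k f_k(L)), which still yields a non-decreasing likelihood:

```python
import math

# One Baum-Welch iteration sketch. Assumed: 2 states, binary alphabet,
# made-up parameters and training sequences; a_0k is not re-estimated,
# and ends are not modelled (no a_k0 terms).
K, SYM = 2, 2
a0 = [0.5, 0.5]                              # a_0k
a = [[0.8, 0.2], [0.3, 0.7]]                 # a_kl
e = [[0.6, 0.4], [0.1, 0.9]]                 # e_k(b)
seqs = [[0, 1, 1, 0], [1, 1, 0, 1, 1]]       # training sequences x^j

def forward(x):
    f = [[a0[k] * e[k][x[0]] for k in range(K)]]
    for sym in x[1:]:
        f.append([e[l][sym] * sum(f[-1][k] * a[k][l] for k in range(K))
                  for l in range(K)])
    return f, sum(f[-1])                     # (f, P(x))

def backward(x):
    bwd = [[1.0] * K]                        # b_k(L) = 1 (no end state)
    for sym in reversed(x[1:]):
        bwd.insert(0, [sum(a[k][l] * e[l][sym] * bwd[0][l] for l in range(K))
                       for k in range(K)])
    return bwd

# E-step: expected transition counts A_kl and emission counts E_k(b)
A = [[0.0] * K for _ in range(K)]
E = [[0.0] * SYM for _ in range(K)]
old_ll = 0.0
for x in seqs:
    f, px = forward(x)
    bwd = backward(x)
    old_ll += math.log(px)
    for i in range(len(x)):
        for k in range(K):
            E[k][x[i]] += f[i][k] * bwd[i][k] / px
            if i < len(x) - 1:
                for l in range(K):
                    A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * bwd[i + 1][l] / px

# M-step: plug A_kl and E_k(b) into the re-estimation formulae
a = [[A[k][l] / sum(A[k]) for l in range(K)] for k in range(K)]
e = [[E[k][s] / sum(E[k]) for s in range(SYM)] for k in range(K)]

new_ll = sum(math.log(forward(x)[1]) for x in seqs)
print(old_ll, new_ll)  # the log-likelihood does not decrease
```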