
A Revealing Introduction to

Hidden Markov Models

Mark Stamp

A Revealing Introduction to HMMs 1


Hidden Markov Models
 What is a hidden Markov model (HMM)?
o A machine learning technique…
o …trained via a discrete hill climb technique
o Two for the price of one!
 Where are HMMs used?
o Speech recognition, information security, and
far too many other things to list
o Q: Why are HMMs so useful?
o A: Widely applicable and efficient algorithms

A Revealing Introduction to HMMs 2


Markov Chain
 Markov chain
o “Memoryless random process”
o Transitions depend only on current state
(Markov chain of order 1)…
o …and transition probability matrix
 Example?
o See next slide…

A Revealing Introduction to HMMs 3


Markov Chain
 Suppose we’re interested in average annual temperature
o Only consider Hot (H) and Cold (C)
 From recorded history, obtain probabilities for…
o Year-to-year transitions
o Based on thermometer readings for recent years

[State diagram: states H and C, with transitions H→H 0.7, H→C 0.3, C→H 0.4, C→C 0.6]

A Revealing Introduction to HMMs 4


Markov Chain
 Transition probability matrix (from the diagram above):

          H     C
    H  [ 0.7   0.3 ]
    C  [ 0.4   0.6 ]

 Matrix is denoted as A
 Note, A is “row stochastic”
o That is, each row sums to 1

A Revealing Introduction to HMMs 5


Markov Chain
 Can also include begin, end states
 Matrix for begin state denoted π
o In this example, π = (0.6, 0.4)
 Note that π is also row stochastic

[State diagram as before, extended with a begin state (begin → H 0.6, begin → C 0.4) and
edges from H and C to an end state]

A Revealing Introduction to HMMs 6


Hidden Markov Model
 HMM includes a Markov chain
o But the Markov process is “hidden”, i.e., we
can’t directly observe the Markov process
o Instead, observe things that are
probabilistically related to hidden states
o It’s as if there is a “curtain” between
Markov chain and the observations
 Example on next few slides…

A Revealing Introduction to HMMs 7


HMM Example
 Consider H/C annual temp example
 Suppose we want to know H or C annual
temperature in distant past
o Before thermometers were invented
o Note, we only distinguish between H and C
 We’ll assume transition between Hot
and Cold years is same as today
o Then the A matrix is known

A Revealing Introduction to HMMs 8


HMM Example
 Assume temps follow a Markov process
o But we cannot observe temperature in past
 We find modern evidence that tree ring
size is related to temperature
o Discover this by looking at recent data
 We only consider 3 tree ring sizes
o Small, Medium, Large (S, M, L, respectively)
 Measure tree ring sizes and recorded
temperatures to determine relationship

A Revealing Introduction to HMMs 9


HMM Example
 We find that tree ring sizes and temperature are related by

          S     M     L
    H  [ 0.1   0.4   0.5 ]
    C  [ 0.7   0.2   0.1 ]

 This is known as the B matrix

 The matrix B is also row stochastic

A Revealing Introduction to HMMs 10


HMM Example
 Can we now find H/C temps in past?
 We cannot measure (observe) temps
 But we can measure tree ring sizes…
 …and tree ring sizes related to temps
o By the probabilities in the B matrix
 Can we say something intelligent about
temps over some interval in the past?

A Revealing Introduction to HMMs 11


HMM Notation
A lot of notation is required
o Notation may be the hardest part…
 T = length of the observation sequence
 N = number of states in the model
 M = number of observation symbols
 Q = {q0,q1,…,qN-1} = states of the Markov process
 V = {0,1,…,M-1} = set of possible observations
 A = state transition probabilities
 B = observation probability matrix
 π = initial state distribution
 O = (O0,O1,…,OT-1) = observation sequence

A Revealing Introduction to HMMs 12


HMM Notation
 Note that for simplicity, observations
are taken from the set V = {0,1,…,M-1}
 That is, Ot ∈ V for t = 0,1,…,T-1
 The matrix A = {aij} is N x N, where
o aij = P(state qj at t+1 | state qi at t)
 The matrix B = {bj(k)} is N x M, where
o bj(k) = P(observation k at t | state qj at t)

A Revealing Introduction to HMMs 13


HMM Example
 Consider our temperature example…
 What are the observations?
o V = {0,1,2}, corresponding to S,M,L
 What are states of Markov process?
o Q = {H,C}
 What are A,B, π, and T?
o A,B, π on previous slides
o T is number of tree rings measured
 What are N and M?
o N = 2 and M = 3

A Revealing Introduction to HMMs 14


Generic HMM
 Generic view of HMM

[Figure: hidden state sequence X0 → X1 → … → XT-1, governed by the A matrix, with each
hidden state Xt producing the observation Ot according to the B matrix]
 HMM defined by A,B, and π


 We denote HMM “model” as λ = (A,B,π)
A Revealing Introduction to HMMs 15
HMM Example
 Suppose that we observe tree ring sizes
o For a 4-year period of interest: S,M,S,L
o Then O = (0, 1, 0, 2)
 Most likely (hidden) state sequence?
o That is, most likely X = (X0, X1, X2, X3)
 Let πX0 be prob. of starting in state X0
 Note bX0(O0) is prob. of initial observation
o And aX0,X1 is prob. of transition X0 to X1
 And so on…
A Revealing Introduction to HMMs 16
HMM Example
 Bottom line?
 We can compute P(X) for any given X
 For X = (X0, X1, X2, X3) we have
P(X) = πX0 bX0(O0) aX0,X1 bX1(O1) aX1,X2 bX2(O2) aX2,X3 bX3(O3)
o Strictly, this is the joint probability of the
state sequence X and the observations O
 Assuming we observe O = (0,1,0,2), then
what is probability of, say, HHCC?
 Plug into formula above to find
P(HHCC) = .6(.1)(.7)(.4)(.3)(.7)(.6)(.1) ≈ .000212

A Revealing Introduction to HMMs 17
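For concreteness, this direct computation can be written in a few lines of Python. This is only a sketch: the matrix values are the ones reconstructed from the example above, and the variable names are illustrative.

import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # states: H = 0, C = 1
B  = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])  # observations: S = 0, M = 1, L = 2
pi = np.array([0.6, 0.4])

O = [0, 1, 0, 2]   # observed: S, M, S, L
X = [0, 0, 1, 1]   # candidate hidden sequence: H, H, C, C

p = pi[X[0]] * B[X[0], O[0]]
for t in range(1, len(O)):
    p *= A[X[t-1], X[t]] * B[X[t], O[t]]
print(p)           # about .000212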


HMM Example
 Do same for all
4-state seq’s
 We find that
the winner is…
o CCCH
 Not so fast my
friend!

A Revealing Introduction to HMMs 18


HMM Example
 The path CCCH scores the highest
 In dynamic programming (DP), we find
highest scoring path
 But, in HMM we maximize expected
number of correct states
o Sometimes called “EM algorithm”
o For “expectation maximization”
 How does HMM work in this example?
A Revealing Introduction to HMMs 19
HMM Example
 For first position in sequence…
o Sum probabilities for all paths that have H
in 1st position, compare to sum of probs for
paths with C in 1st position  biggest wins
 Repeat for each position, and we find
the most likely states are C, H, C, H
o A brute-force sketch of this computation
appears below
A Revealing Introduction to HMMs 20
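The following brute-force sketch (not the efficient algorithm of Solution 2, and assuming the example matrices from earlier) sums P(O,X) over all 16 state sequences and, at each position, compares the total mass of sequences with H there versus C there.

import itertools
import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # H = 0, C = 1
B  = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])  # S = 0, M = 1, L = 2
pi = np.array([0.6, 0.4])
O  = [0, 1, 0, 2]

def joint(X):                       # P(O, X) for one state sequence X
    p = pi[X[0]] * B[X[0], O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1], X[t]] * B[X[t], O[t]]
    return p

mass = np.zeros((len(O), 2))        # mass[t, state] = total prob of sequences with that state at t
for X in itertools.product([0, 1], repeat=len(O)):
    p = joint(X)
    for t, s in enumerate(X):
        mass[t, s] += p

best = ['H' if mass[t, 0] > mass[t, 1] else 'C' for t in range(len(O))]
print(''.join(best))                # CHCH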


HMM Example

 So, HMM solution gives us CHCH


 While DP solution is CCCH
 Which solution is better?
 Neither solution is better!
o Just using different definitions of “best”

A Revealing Introduction to HMMs 21


HMM Paradox?
 HMM maximizes expected number of
correct states
o Whereas DP chooses “best” overall path
 Possible for HMM to choose a “path”
that is impossible
o Could be a transition probability of 0
 Cannot get impossible path with DP
 Is this a flaw with HMM?
o No, it’s a feature

A Revealing Introduction to HMMs 22


Probability of
Observations
 Table computed for O = (0,1,0,2)
 For this sequence,
P(O) = .000412 + .000035
+ .000706 + … + .000847
= left to the reader
 Similarly for other
observations
A Revealing Introduction to HMMs 23
HMM Model
 An HMM is defined by the three
matrices, A, B, and π
 Note that M and N are implied, since
they are the dimensions of matrices
 So, we denote an HMM “model” as
λ = (A,B,π)

A Revealing Introduction to HMMs 24


The Three Problems
 HMMs used to solve 3 problems
 Problem 1: Given a model λ = (A,B,π) and
observation sequence O, find P(O|λ)
o That is, we can score an observation sequence
to see how well it fits a given model
 Problem 2: Given λ = (A,B,π) and O, find an
optimal state sequence (in HMM sense)
o Uncover hidden part (like previous example)
 Problem 3: Given O, N, and M, find the
model λ that maximizes probability of O
o That is, train a model to fit observations

A Revealing Introduction to HMMs 25


HMMs in Practice
 Often, we use HMMs as follows:
 Given an observation sequence…
o Assume that (hidden) Markov process exists
 Train a model based on observations
o That is, solve Problem 3
o “Best” N can be found by trial and error
 Then given a sequence of observations,
score it versus the model we trained
o This is Problem 1  high score implies similar
to training data, low score says it’s not similar

A Revealing Introduction to HMMs 26


HMMs in Practice
 Previousslide gives sense in which
HMM is a “machine learning” technique
o To train model, we do not need to specify
anything except the parameter N
o “Best” N often found by trial and error
 So, we don’t need to think too much
o Just train HMM and then use it
o Practical, since efficient algorithms for
training HMM and scoring

A Revealing Introduction to HMMs 27


The Three Solutions
 We give detailed solutions to 3 problems
o Note: We must find efficient solutions
 The three problems:
o Problem 1: Score an observation sequence versus a
given model
o Problem 2: Given a model, “uncover” hidden part
o Problem 3: Given an observation sequence, train a
model
 Recall that we worked examples of Problems 2 and 1 above
o But direct solutions are soooooo… inefficient

A Revealing Introduction to HMMs 28


Solution 1
 Score observations versus a given model
o Given model λ = (A,B,π) and observation
sequence O=(O0,O1,…,OT-1), find P(O|λ)
 Denote hidden states as
X = (X0, X1, . . . , XT-1)
 Then from definition of B,
P(O|X,λ)=bX0(O0) bX1(O1) … bXT-1(OT-1)
 And from definition of A and π,
P(X|λ)=πX0 aX0,X1 aX1,X2 … aXT-2,XT-1

A Revealing Introduction to HMMs 29


Solution 1
 Elementary conditional probability fact:
P(O,X|λ) = P(O|X,λ) P(X|λ)
 Sum over all possible state sequences X,
P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
= ΣπX0bX0(O0)aX0,X1bX1(O1)…aXT-2,XT-1bXT-1(OT-1)
 This “works” but way too costly
 Requires about 2T·N^T multiplications
o Why?
 There better be a better way…
A Revealing Introduction to HMMs 30
Forward Algorithm
 Instead, use forward algorithm
o Or “alpha pass”
 For t = 0,1,…,T-1 and i=0,1,…,N-1, let
αt(i) = P(O0,O1,…,Ot,Xt=qi|λ)
 Probability of the partial observation
sequence up to time t, with the Markov
process in state qi at step t
 Can be computed recursively, efficiently

A Revealing Introduction to HMMs 31


Forward Algorithm
 Let α0(i) = πibi(O0) for i = 0,1,…,N-1
 For t = 1,2,…,T-1 and i=0,1,…,N-1, let

αt(i) = (Σαt-1(j)aji)bi(Ot)
o Where the sum is from j = 0 to N-1
 From definition of αt(i) we see
P(O|λ) = ΣαT-1(i)
o Where the sum is from i = 0 to N-1
 This requires only about N^2·T multiplications

A Revealing Introduction to HMMs 32
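A minimal (unscaled) Python sketch of the alpha pass, using the example matrices from earlier; underflow is ignored here and handled later by scaling.

import numpy as np

def forward(A, B, pi, O):
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                    # alpha_0(i) = pi_i b_i(O_0)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]  # (sum_j alpha_{t-1}(j) a_ji) b_i(O_t)
    return alpha

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])
alpha = forward(A, B, pi, [0, 1, 0, 2])
print(alpha[-1].sum())                            # P(O|lambda), about .009629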


Solution 2
 Given a model, find hidden states
o Given λ = (A,B,π) and O, find an optimal state
sequence
o Recall that optimal means “maximize
expected number of correct states”
o In contrast, DP finds best scoring path
 For temp/tree ring example, solved this
o But hopelessly inefficient approach
 A better way: backward algorithm
o Or “beta pass”
A Revealing Introduction to HMMs 33
Backward Algorithm
 For t = 0,1,…,T-1 and i = 0,1,…,N-1, let βt(i)
= P(Ot+1,Ot+2,…,OT-1|Xt=qi,λ)
 Probability of the partial observation
sequence from t+1 to the end, given that the
Markov process is in state qi at step t
 Analogous to the forward algorithm
 As with forward algorithm, this can be
computed recursively and efficiently

A Revealing Introduction to HMMs 34


Backward Algorithm
 Let βT-1(i) = 1 for i = 0,1,…,N-1
 For t = T-2,T-3,…,0 and i = 0,1,…,N-1, let
βt(i) = Σaijbj(Ot+1)βt+1(j)
 Where the sum is from j = 0 to N-1
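A matching sketch of the (unscaled) beta pass, in the same style as the forward sketch above:

import numpy as np

def backward(A, B, O):
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[T-1] = 1.0                               # beta_{T-1}(i) = 1
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])  # sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    return beta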

A Revealing Introduction to HMMs 35


Solution 2
 For t = 0,1,…,T-1 and i=0,1,…,N-1 define
γt(i) = P(Xt=qi|O,λ)
o Most likely state at t is qi that maximizes γt(i)
 Note that γt(i) = αt(i)βt(i)/P(O|λ)
o And recall P(O|λ) = ΣαT-1(i)
 The bottom line?
o Forward algorithm solves Problem 1
o Forward/backward algorithms solve Problem 2
(how?)
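A sketch of Solution 2 built from the two passes; the forward and backward functions from the earlier sketches are assumed.

import numpy as np

def most_likely_states(A, B, pi, O):
    alpha = forward(A, B, pi, O)        # alpha pass (sketch above)
    beta  = backward(A, B, O)           # beta pass (sketch above)
    prob_O = alpha[-1].sum()            # P(O|lambda)
    gamma = alpha * beta / prob_O       # gamma_t(i) = alpha_t(i) beta_t(i) / P(O|lambda)
    return gamma.argmax(axis=1)         # most likely state index at each t

# For the temperature example this returns [1, 0, 1, 0], i.e., C,H,C,H (with H = 0, C = 1)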
A Revealing Introduction to HMMs 36
Solution 3
 Train a model: Given O, N, and M, find
λ that maximizes probability of O
 We’ll iteratively adjust λ = (A,B,π) to
better fit the given observations O
o The sizes of the matrices are fixed (N and M)
o But elements of matrices can change
 It is nice that this works…
o …and amazing that it’s efficient!

A Revealing Introduction to HMMs 37


Solution 3
 For t=0,1,…,T-2 and i,j in {0,1,…,N-1},
define “di-gammas” as
γt(i,j) = P(Xt=qi, Xt+1=qj|O,λ)
 Note γt(i,j) is prob of being in state qi at
time t and transiting to state qj at t+1
 Then γt(i,j) = αt(i)aijbj(Ot+1)βt+1(j)/P(O|λ)
 And γt(i) = Σγt(i,j)
o Where sum is from j = 0 to N – 1
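A sketch of the di-gamma computation, again assuming the (unscaled) forward and backward functions from the earlier sketches:

import numpy as np

def di_gammas(A, B, pi, O):
    alpha = forward(A, B, pi, O)
    beta  = backward(A, B, O)
    prob_O = alpha[-1].sum()
    T, N = len(O), A.shape[0]
    digamma = np.zeros((T-1, N, N))
    for t in range(T-1):
        # digamma[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O|lambda)
        digamma[t] = alpha[t][:, None] * A * B[:, O[t+1]] * beta[t+1] / prob_O
    gamma = digamma.sum(axis=2)         # gamma_t(i) for t = 0,...,T-2
    return digamma, gamma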
A Revealing Introduction to HMMs 38
Model Re-estimation
 Given di-gammas and gammas…
 For i = 0,1,…,N-1 let πi = γ0(i)
 For i = 0,1,…,N-1 and j = 0,1,…,N-1
aij = Σγt(i,j)/Σγt(i)
o Where both sums are from t = 0 to T-2
 For j = 0,1,…,N-1 and k = 0,1,…,M-1
bj(k) = Σγt(j)/Σγt(j)
o Both sums are from t = 0 to T-1, but only those t
for which Ot = k are counted in the numerator
 Why does this work?
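A sketch of one re-estimation step built from these quantities; forward, backward, and di_gammas from the earlier sketches are assumed. Here gamma is extended to t = T-1 via gamma_{T-1}(i) = alpha_{T-1}(i)/P(O|lambda), which is what the bj(k) formula needs.

import numpy as np

def reestimate(A, B, pi, O):
    alpha = forward(A, B, pi, O)
    prob_O = alpha[-1].sum()
    digamma, gamma = di_gammas(A, B, pi, O)            # gamma rows t = 0,...,T-2
    gamma = np.vstack([gamma, alpha[-1] / prob_O])     # append gamma_{T-1}

    new_pi = gamma[0]
    new_A  = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B  = np.zeros_like(B)
    for k in range(B.shape[1]):
        # numerator: only those t with O_t = k; denominator: all t
        new_B[:, k] = gamma[np.asarray(O) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi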
A Revealing Introduction to HMMs 39
Solution 3
 To summarize…
1. Initialize λ = (A,B,π)
2. Compute αt(i), βt(i), γt(i,j), γt(i)
3. Re-estimate the model λ = (A,B,π)
4. If P(O|λ) increases by more than ε
(where ε is small), goto 2

A Revealing Introduction to HMMs 40


Solution 3
 Some fine points…
 Model initialization
o If we have a good guess for λ = (A,B,π) then we
can use it for initialization
o If not, let πi ≈ 1/N, ai,j ≈ 1/N, bj(k) ≈ 1/M
o Subject to row stochastic conditions
o But do not initialize to exactly uniform values
 Stopping conditions
o Stop after some number of iterations and/or…
o Stop if increase in P(O|λ) is too small
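One way to do such an initialization (a sketch; the noise level 0.01 is an arbitrary illustrative choice):

import numpy as np

def init_model(N, M, seed=0):
    rng = np.random.default_rng(seed)
    def noisy_rows(rows, cols):
        # near-uniform values, perturbed slightly, then re-normalized to be row stochastic
        X = np.full((rows, cols), 1.0 / cols) + rng.uniform(-0.01, 0.01, (rows, cols))
        return X / X.sum(axis=1, keepdims=True)
    A  = noisy_rows(N, N)
    B  = noisy_rows(N, M)
    pi = noisy_rows(1, N)[0]
    return A, B, pi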
A Revealing Introduction to HMMs 41
HMM as Discrete Hill Climb
 Algorithm
on previous slides shows
that HMM is a “discrete hill climb”
 HMM consists of discrete states, Xt
o “Climb” on the elements of the matrices
 And re-estimation process improves
model by modifying parameters
o So, “climbs” toward improved model
o This happens in a high-dimensional space

A Revealing Introduction to HMMs 42


Dynamic Programming
 Brief detour…
 For λ = (A,B,π) as above, it’s easy to
define a dynamic program (DP)
 Executive summary:
o DP is forward algorithm, with “sum”
replaced by “max”
 Details on next few slides

A Revealing Introduction to HMMs 43


Dynamic Programming
 Let δ0(i) = πi bi(O0) for i=0,1,…,N-1
 For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j)aji)bi(Ot)
o Where the max is over j in {0,1,…,N-1}
 Note that at each t, the DP computes best
path for each state, up to that point
 So, probability of best path is max δT-1(j)
 This max gives the highest probability
o Not the best path, for that, see next slide
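A sketch of this dynamic program with back pointers, using plain probabilities as on this slide (the log-space version comes later):

import numpy as np

def viterbi(A, B, pi, O):
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    ptr   = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        for i in range(N):
            scores = delta[t-1] * A[:, i]        # delta_{t-1}(j) a_ji for each j
            ptr[t, i]   = scores.argmax()        # remember the best predecessor
            delta[t, i] = scores.max() * B[i, O[t]]
    path = [int(delta[T-1].argmax())]            # best final state
    for t in range(T-1, 0, -1):
        path.append(int(ptr[t, path[-1]]))       # trace back pointers
    return delta[T-1].max(), path[::-1]

# For the example observation (0,1,0,2) this gives score about .002822 and path C,C,C,H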

A Revealing Introduction to HMMs 44


Dynamic Programming
 To determine optimal path
o While computing deltas, keep track of
pointers to previous state
o When finished, construct optimal path by
tracing back the pointers
 For example, consider temp example:
recall that we observe (0,1,0,2)
 Probabilities for path of length 1:
P(H) = .6(.1) = .06 and P(C) = .4(.7) = .28
 These are the only “paths” of length 1

A Revealing Introduction to HMMs 45


Dynamic Programming
 Probabilities for each path of length 2:
P(HH) = .06(.7)(.4) = .0168    P(CH) = .28(.4)(.4) = .0448
P(HC) = .06(.3)(.2) = .0036    P(CC) = .28(.6)(.2) = .0336
 Best path of length 2 ending with H is CH

 Best path of length 2 ending with C is CC

A Revealing Introduction to HMMs 46


Dynamic Program
 Continuing, we compute best path
ending at H and C at each step
 And save pointers  why?

A Revealing Introduction to HMMs 47


Dynamic Program

 Best final score is .002822


o And thanks to pointers, best path is CCCH
 But what about underflow?
o A serious problem in bigger cases

A Revealing Introduction to HMMs 48


Underflow Resistant DP
 Common trick to prevent underflow:
o Instead of multiplying probabilities…
o …add logarithms of probabilities
 Why does this work?
o Because log(xy) = log x + log y
o Adding logs does not tend to 0
 Note that these logs are negative…
 …and we must avoid 0 probabilities

A Revealing Introduction to HMMs 49


Underflow Resistant DP
 Underflow resistant DP algorithm:
 Let δ0(i) = log(πi bi(O0))
for i=0,1,…,N-1
 For t=1,2,…,T-1 and i=0,1,…,N-1 compute
δt(i) = max (δt-1(j) + log(aji) + log(bi(Ot)))
o Where the max is over j in {0,1,…,N-1}
 And score of best path is max δT-1(j)
o As before, must also keep track of paths
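A compact log-space sketch of the same DP (all probabilities are assumed nonzero, as noted on the previous slide):

import numpy as np

def viterbi_log(A, B, pi, O):
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    ptr   = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, O[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] + logA      # scores[j, i] = delta_{t-1}(j) + log a_ji
        ptr[t]   = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, O[t]]
    path = [int(delta[T-1].argmax())]
    for t in range(T-1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return delta[T-1].max(), path[::-1]          # log of best-path probability, and the path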

A Revealing Introduction to HMMs 50


HMM Scaling
 Trickier to prevent underflow in HMM
 We consider solution 3
o Since it includes solutions 1 and 2
 Recall for t = 1,2,…,T-1, i=0,1,…,N-1,
αt(i) = (Σαt-1(j)aj,i)bi(Ot)
 The idea is to normalize alphas so that
they sum to 1
o Algorithm on next slide

A Revealing Introduction to HMMs 51


HMM Scaling

 Given αt(i) = (Σαt-1(j)aj,i)bi(Ot)


 Let α̂0(i) = α0(i) for i=0,1,…,N-1
 Let c0 = 1/Σα̂0(j)
 For i = 0,1,…,N-1, let α̂0(i) = c0α̂0(i)
 This takes care of t = 0 case
 Algorithm continued on next slide…

A Revealing Introduction to HMMs 52


HMM Scaling
 For t = 1,2,…,T-1 do the following:
 For i = 0,1,…,N-1,
α̂t(i) = (Σα̂t-1(j)aj,i)bi(Ot)
 Let ct = 1/Σα̂t(j)
 For i = 0,1,…,N-1 let α̂t(i) = ctα̂t(i)

A Revealing Introduction to HMMs 53
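A sketch of the scaled alpha pass described on the last two slides:

import numpy as np

def forward_scaled(A, B, pi, O):
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))                     # scaled alphas
    c = np.zeros(T)                              # scaling factors
    alpha[0] = pi * B[:, O[0]]
    c[0] = 1.0 / alpha[0].sum()
    alpha[0] *= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]                         # each row now sums to 1
    return alpha, c

# As shown on the next slides, log P(O|lambda) = -sum(log(c)), so no underflow-prone product is needed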


HMM Scaling
 Easy to show α̂t(i) = c0c1…ct αt(i) (♯)
o Simple proof by induction
 So, c0c1…ct is the scaling factor at step t
 Also, easy to show that
α̂t(i) = αt(i)/Σαt(j)
 Which implies Σα̂T-1(i) = 1 (♯♯)

A Revealing Introduction to HMMs 54


HMM Scaling
 By combining (♯) and (♯♯), we have
1 = Σα̂T-1(i) = c0c1…cT-1 ΣαT-1(i)
= c0c1…cT-1 P(O|λ)
 Therefore, P(O|λ) = 1 / (c0c1…cT-1)
 To avoid underflow, we compute
log P(O|λ) = -Σ log(cj)
o Where sum is from j = 0 to T-1

A Revealing Introduction to HMMs 55


HMM Scaling
 Similarly, scale betas as ctβt(i)
 For re-estimation,
o Compute γt(i,j) and γt(i) using original
formulas, but with scaled alphas, betas
 This gives us new values for λ = (A,B,π)
 “Easy exercise” to show re-estimate is exact
when scaled alphas and betas used
 Also, P(O|λ) cancels from formula
o Use log P(O|λ) = -Σ log(cj) to decide if
iterate improves
A Revealing Introduction to HMMs 56
All Together Now
 Complete pseudo code for Solution 3
 Given: (O0,O1,…,OT-1) and N and M
 Initialize: λ = (A,B,π)
o A is NxN, B is NxM and π is 1xN
o πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row
stochastic, but not uniform
 Initialize:
o maxIters = max number of re-estimation steps
o iters = 0
o oldLogProb = -∞

A Revealing Introduction to HMMs 57


Forward Algorithm
 Forward algorithm
o With scaling

A Revealing Introduction to HMMs 58


Backward Algorithm
 Backward algorithm
or “beta pass”
o With scaling
 Note: same scaling
factor as alphas

A Revealing Introduction to HMMs 59


Gammas
 Using scaled
alphas and
betas

A Revealing Introduction to HMMs 60


Re-Estimation
 Again, using
scaled gammas
 So formulas
unchanged

A Revealing Introduction to HMMs 61


Stopping Criteria
 Check that
probability
increases
o In practice, want
logProb >
oldLogProb + ε
 And don’t
exceed max
iterations
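Putting slides 57-62 together, here is a sketch of the complete training loop with scaling, for a single observation sequence O of integer symbols. The names maxIters, oldLogProb, and epsilon follow slide 57; everything else is an illustrative implementation choice, not the original pseudocode.

import numpy as np

def train_hmm(O, N, M, maxIters=100, epsilon=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    T = len(O)
    O = np.asarray(O)
    def noisy_rows(rows, cols):                  # near-uniform, row stochastic, not exactly uniform
        X = np.full((rows, cols), 1.0/cols) + rng.uniform(-0.01, 0.01, (rows, cols))
        return X / X.sum(axis=1, keepdims=True)
    A, B, pi = noisy_rows(N, N), noisy_rows(N, M), noisy_rows(1, N)[0]

    oldLogProb = -np.inf
    for _ in range(maxIters):
        # alpha pass, with scaling
        alpha = np.zeros((T, N)); c = np.zeros(T)
        alpha[0] = pi * B[:, O[0]]
        c[0] = 1.0 / alpha[0].sum(); alpha[0] *= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
            c[t] = 1.0 / alpha[t].sum(); alpha[t] *= c[t]

        # beta pass, scaled with the same c's
        beta = np.zeros((T, N))
        beta[T-1] = c[T-1]
        for t in range(T-2, -1, -1):
            beta[t] = c[t] * (A @ (B[:, O[t+1]] * beta[t+1]))

        # gammas and di-gammas; with scaled alphas and betas, P(O|lambda) cancels
        digamma = np.zeros((T-1, N, N))
        for t in range(T-1):
            digamma[t] = alpha[t][:, None] * A * B[:, O[t+1]] * beta[t+1]
        gamma = np.vstack([digamma.sum(axis=2), alpha[T-1]])

        # re-estimate pi, A, B
        pi = gamma[0]
        A  = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)

        # stopping criterion: require log P(O|lambda) to keep increasing
        logProb = -np.log(c).sum()
        if logProb <= oldLogProb + epsilon:
            break
        oldLogProb = logProb
    return A, B, pi, logProb

# Example (hypothetical) usage: A, B, pi, lp = train_hmm([0, 1, 0, 2] * 25, N=2, M=3)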

A Revealing Introduction to HMMs 62


References
 M. Stamp, A revealing introduction
to hidden Markov models
 L.R. Rabiner, A tutorial on hidden
Markov models and selected
applications in speech recognition
 R.L. Cave & L.P. Neuwirth, Hidden
Markov models for English

A Revealing Introduction to HMMs 63
