CS 4705

Hidden Markov Models

Slides adapted from Dan Jurafsky and James Martin

Hidden Markov Models

• What we’ve described with these two kinds of probabilities is a Hidden Markov Model
• Now we will tie this approach into the model
• Definitions

Definitions
• A weighted finite-state automaton adds probabilities to the arcs
– The probabilities on the arcs leaving any state must sum to one
• A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
• Markov chains can’t represent inherently ambiguous problems
– They can only assign probabilities to unambiguous sequences

Markov chain for weather

Markov chain for words

Markov chain = “First-order observable Markov Model”

• a set of states
– Q = q1, q2 … qN; the state at time t is qt
• Transition probabilities:
– a set of probabilities A = a01, a02, …, an1, …, ann
– Each aij represents the probability of transitioning from state i to state j
– The set of these is the transition probability matrix A

a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N

\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N

• Distinguished start and end states
Markov chain = “First-order observable Markov Model”

• Current state only depends on previous state

P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Another representation for start state

• Instead of a start state
• Special initial probability vector π

\pi_i = P(q_1 = i), \quad 1 \le i \le N

– An initial probability distribution over start states

• Constraints:

\sum_{j=1}^{N} \pi_j = 1
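
To make these pieces concrete, here is a minimal sketch (not from the slides) of a Markov chain represented as an initial vector π and a transition matrix A in NumPy. The state names and numeric values are illustrative assumptions, except that π for RAINY (0.2) and the RAINY self-loop (0.6) match the numbers used a few slides below. The point is the two constraints: every row of A sums to one, and π sums to one.

```python
import numpy as np

# Hypothetical 3-state weather chain; values are illustrative only.
states = ["HOT", "COLD", "RAINY"]

# Initial probability vector pi: pi[i] = P(q_1 = i)
pi = np.array([0.5, 0.3, 0.2])

# Transition matrix A: A[i, j] = P(q_t = j | q_{t-1} = i)
A = np.array([
    [0.6, 0.2, 0.2],   # from HOT
    [0.3, 0.5, 0.2],   # from COLD
    [0.2, 0.2, 0.6],   # from RAINY
])

# The defining constraints of a Markov chain:
assert np.isclose(pi.sum(), 1.0)          # sum_j pi_j = 1
assert np.allclose(A.sum(axis=1), 1.0)    # sum_j a_ij = 1 for every state i
```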
The weather figure using pi

The weather figure: specific example

Markov chain for weather

• What is the probability of 4 consecutive rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• P(3,3,3,3) = \pi_3 \, a_{33} \, a_{33} \, a_{33} = 0.2 \times (0.6)^3 = 0.0432

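The same product can be computed programmatically. The sketch below is an illustration, not part of the original slides: the function name markov_sequence_prob is mine, and π and A reuse the illustrative values from the earlier sketch, of which only π for rainy (0.2) and the rainy self-loop (0.6) come from the slide above.

```python
import numpy as np

# Illustrative pi and A (only pi[RAINY]=0.2 and A[RAINY, RAINY]=0.6 come from the slide).
pi = np.array([0.5, 0.3, 0.2])
A = np.array([
    [0.6, 0.2, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

def markov_sequence_prob(pi, A, states):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t] for a first-order Markov chain."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

RAINY = 2  # state 3 in the slide's 1-based numbering
print(markov_sequence_prob(pi, A, [RAINY, RAINY, RAINY, RAINY]))  # ≈ 0.0432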
How about?

• Hot hot hot hot
• Cold hot cold hot
• What does the difference in these probabilities tell you about the real-world weather information encoded in the figure?

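Written out in the same way as the rainy example (the figure’s actual numbers are not reproduced here), the two quantities being compared are:

P(\text{hot}, \text{hot}, \text{hot}, \text{hot}) = \pi_{\text{hot}} \, a_{\text{hot},\text{hot}} \, a_{\text{hot},\text{hot}} \, a_{\text{hot},\text{hot}}

P(\text{cold}, \text{hot}, \text{cold}, \text{hot}) = \pi_{\text{cold}} \, a_{\text{cold},\text{hot}} \, a_{\text{hot},\text{cold}} \, a_{\text{cold},\text{hot}}

Comparing the two products shows how strongly the encoded weather tends to persist in one state (large self-loop probabilities) versus flip back and forth (large cross-transitions).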
Hidden Markov Models

• We don’t observe POS tags
– We infer them from the words we see
• Observed events: the words
• Hidden events: the POS tags

HMM for Ice Cream

• You are a climatologist in the year 2799
• Studying global warming
• You can’t find any records of the weather in Baltimore, MD for the summer of 2007
• But you find Jason Eisner’s diary
• Which lists how many ice creams Jason ate every day that summer
• Our job: figure out how hot it was
Hidden Markov Model
• For Markov chains, the output symbols are the same as the states
– See hot weather: we’re in state hot
• But in part-of-speech tagging (and other things)
– The output symbols are words
– But the hidden states are part-of-speech tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
• This means we don’t know which state we are in
Hidden Markov Models
• States Q = q1, q2 … qN
• Observations O = o1, o2 … oN
– Each observation is a symbol from a vocabulary V = {v1, v2, … vV}
• Transition probabilities
– Transition probability matrix A = {a_{ij}}

a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N

• Observation likelihoods
– Output probability matrix B = {b_i(k)}

b_i(k) = P(X_t = o_k \mid q_t = i)

• Special initial probability vector π

\pi_i = P(q_1 = i), \quad 1 \le i \le N
Hidden Markov Models

• Some constraints

\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N

\pi_i = P(q_1 = i), \quad 1 \le i \le N

\sum_{k=1}^{M} b_i(k) = 1

\sum_{j=1}^{N} \pi_j = 1
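
Putting the definitions together, here is a minimal sketch (not from the slides) of an HMM λ = (A, B, π) for the ice cream task: two hidden states (HOT, COLD) and three observable symbols (1, 2, or 3 ice creams). The numeric values are illustrative assumptions, chosen only to satisfy the constraints above.

```python
import numpy as np

states = ["HOT", "COLD"]          # hidden states Q
vocab = [1, 2, 3]                 # observable symbols V (ice creams eaten)

# Illustrative parameters (assumed values, not the textbook's):
pi = np.array([0.8, 0.2])         # pi_i = P(q_1 = i)
A = np.array([[0.7, 0.3],         # a_ij = P(q_t = j | q_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],    # b_i(k) = P(o_t = v_k | q_t = i)
              [0.5, 0.4, 0.1]])

# The constraints from the slide above:
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```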
Assumptions

• Markov assumption:

P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})

• Output-independence assumption:

P(o_t \mid O_1^{t-1}, q_1^t) = P(o_t \mid q_t)


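Taken together, these two assumptions are what make HMM computations tractable: the joint probability of a state sequence Q = q_1 … q_T and an observation sequence O = o_1 … o_T factors into a product of transition and emission terms (a standard identity, added here for reference):

P(O, Q) = \pi_{q_1} \, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} \, b_{q_t}(o_t)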
Eisner task

• Given
– Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
• Produce:
– Weather Sequence: H,C,H,H,H,C…

HMM for ice cream

Different types of HMM structure

Ergodic = fully-connected
Bakis = left-to-right
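
As a quick illustration (an assumed example, not from the slides): in an ergodic HMM every state can transition to every other state, so the transition matrix has no structural zeros, while in a Bakis (left-to-right) HMM states can only stay put or move forward, so A is upper-triangular.

```python
import numpy as np

# Ergodic (fully-connected): all transitions allowed.
A_ergodic = np.array([[0.4, 0.3, 0.3],
                      [0.2, 0.5, 0.3],
                      [0.3, 0.3, 0.4]])

# Bakis (left-to-right): no transitions back to earlier states.
A_bakis = np.array([[0.6, 0.3, 0.1],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])

assert np.allclose(A_ergodic.sum(axis=1), 1.0)
assert np.allclose(A_bakis.sum(axis=1), 1.0)
assert np.allclose(A_bakis, np.triu(A_bakis))   # left-to-right structure
```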
Transitions between the hidden states of the HMM, showing A probabilities

B observation likelihoods for POS HMM
Three fundamental problems for HMMs

• Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ)
• Decoding: Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B
Decoding
• The best hidden state sequence
• Weather sequence in the ice cream task
• POS sequence given an input sentence
• We could take the argmax over the probability of each possible hidden state sequence
– Why not?
• Viterbi algorithm
– Dynamic programming algorithm
– Uses a dynamic programming trellis
• Each trellis cell v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most likely state sequence

Viterbi intuition: we are looking for the best ‘path’

[Figure: trellis of candidate POS tags (NNP, VBN, VBD, VB, TO, JJ, DT, RB, NN) over the observation sequence “promised to back the bill”, columns S1–S5]


Slide from Dekang Lin
Intuition

• The value in each cell is computed by taking the MAX over all paths that lead to this cell
• An extension of a path from state i at time t-1 is computed by multiplying:
– the previous path probability v_{t-1}(i)
– the transition probability a_{ij} from state i to the current state j
– the observation likelihood b_j(o_t) of the current observation given the current state j
The Viterbi Algorithm

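The original slide shows the textbook pseudocode as a figure, which is not reproduced here. As a substitute, here is a minimal NumPy sketch of the Viterbi algorithm using the trellis cells v_t(j) described above; the function name viterbi is mine, and the small 2-state ice cream HMM at the bottom uses illustrative parameter values.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for obs under HMM (pi, A, B)."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((T, N))                 # v[t, j]: best path probability ending in state j at time t
    backptr = np.zeros((T, N), dtype=int)

    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # extend each path by the transition prob a_ij and the observation likelihood b_j(o_t)
        scores = v[t - 1][:, None] * A * B[:, obs[t]][None, :]   # scores[i, j]
        backptr[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0)

    # follow back-pointers from the best final state
    best = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1], v[-1].max()

# Illustrative 2-state (HOT=0, COLD=1) ice cream HMM; values are assumptions.
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])   # columns: 1, 2, 3 ice creams
path, prob = viterbi(pi, A, B, obs=[2, 0, 2])      # observations 3, 1, 3 (0-indexed symbols)
print(path, prob)
```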
The A matrix for the POS HMM

The B matrix for the POS HMM

Viterbi example

Computing the Likelihood of an observation

• Forward algorithm
• Exactly like the Viterbi algorithm, except
– To compute the probability of a state, sum the probabilities from each path

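A minimal sketch of the forward algorithm, assuming the same illustrative ice cream HMM as above (the function name forward_likelihood is mine): it is the Viterbi recurrence with the max replaced by a sum, and it returns P(O | λ).

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(O | lambda): sum over all state paths instead of taking the max."""
    alpha = pi * B[:, obs[0]]            # alpha[j] = P(o_1, q_1 = j)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # sum over predecessors, then emit o
    return alpha.sum()

# Illustrative parameters (assumed values):
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(forward_likelihood(pi, A, B, obs=[2, 0, 2]))   # P(observing 3, 1, 3 ice creams)
```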
Error Analysis: ESSENTIAL!!!
• Look at a confusion matrix
• See what errors are causing problems
– Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
– Adverb (RB) vs Prep (IN) vs Noun (NN)
– Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

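A small sketch (illustrative, not from the slides) of how such a confusion matrix can be tallied from gold and predicted tag sequences; the function name confusion_matrix and the example tag sequences are hypothetical.

```python
from collections import Counter

def confusion_matrix(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs; off-diagonal entries are the errors to inspect."""
    return Counter(zip(gold_tags, predicted_tags))

# Hypothetical tagger output:
gold = ["NN", "NNP", "JJ", "RB", "IN", "VBD"]
pred = ["NN", "NN",  "JJ", "IN", "IN", "VBN"]
for (g, p), count in confusion_matrix(gold, pred).items():
    if g != p:
        print(f"gold {g} tagged as {p}: {count}")
```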
Learning HMMs
• Learn the parameters of an HMM
– A and B matrices
• Input
– An unlabeled sequence of observations (e.g., words)
– A vocabulary of potential hidden states (e.g., POS tags)
• Training algorithm
– Forward-backward (Baum-Welch) algorithm
– A special case of the Expectation-Maximization (EM) algorithm
– Intuitions
  • Iteratively estimate the counts
  • Estimated probabilities are derived by computing the forward probability for an observation and dividing that probability mass among all the different contributing paths

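To make the forward-backward intuition concrete, here is a compact NumPy sketch (an illustration under the assumptions noted in the comments, not the textbook's pseudocode) of a single Baum-Welch re-estimation step for a discrete HMM. The E-step spreads the probability mass of the observation over all contributing paths via the forward and backward probabilities; the M-step renormalizes the resulting expected counts into new A, B, and π. All function names and starting values are mine.

```python
import numpy as np

def forward(pi, A, B, obs):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha                                  # alpha[t, j] = P(o_1..o_t, q_t = j)

def backward(pi, A, B, obs):
    beta = np.zeros((len(obs), len(pi)))
    beta[-1] = 1.0
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta                                   # beta[t, i] = P(o_{t+1}..o_T | q_t = i)

def baum_welch_step(pi, A, B, obs):
    """One EM iteration: expected counts (E-step) followed by renormalization (M-step)."""
    obs = np.asarray(obs)
    alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
    likelihood = alpha[-1].sum()                  # P(O | lambda), computed by the forward pass

    # gamma[t, i]: probability of being in state i at time t, given O
    gamma = alpha * beta / likelihood
    # xi[t, i, j]: probability of transitioning i -> j at time t, given O
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / likelihood

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                   # expected counts of emitting symbol k
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

# Illustrative starting point (assumed values) and an ice cream observation sequence:
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
for _ in range(10):                               # iterate until the estimates stabilize
    pi, A, B = baum_welch_step(pi, A, B, obs=[2, 2, 1, 0, 0, 1, 2])
```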
Other Classification Methods

• Maximum Entropy Model (MaxEnt)
• MEMM (Maximum Entropy Markov Model)

