
Markov Models

• Markov Models
• Hidden Markov Models
• The Three Fundamental Questions for HMMs
• Finding the probability of an observation
• Finding the best state sequence and parameter estimation
Problem Statement
✔ Often, we want to consider a sequence (perhaps through time) of random variables that
aren’t independent, but rather the value of each variable depends on previous elements in
the sequence.
✔ For many such systems, it seems reasonable to assume that all we need to predict the
future random variables is the value of the present random variable, and we don’t need to
know the values of all the past random variables in the sequence.
Markov Model Definition
A Markov chain is a mathematical model used to describe random processes that evolve
over time. It consists of a set of states and the transitions between them. These
transitions are probabilistic: the probability of moving from one state to another
depends solely on the current state and not on any past events.

• If the random variables measure the number of books in the University library, then,
knowing how many books were in the library today might be an adequate predictor of
how many books there will be tomorrow, and we don’t really need to additionally
know how many books the library had last week, let alone last year.
• That is, future elements of the sequence are conditionally independent of past
elements, given the present element.
Markov Assumption
• Suppose X = (X1, ..., XT) is a sequence of random variables taking values in
some finite set S = {s1, ..., sN}, the state space. Then the Markov
Assumptions are:
• Limited Horizon: P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)
• Time invariant (stationary process): P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1)

X is then said to be a Markov chain, or to have the Markov property.


Markov Model in NLP
✔ Markov models can be used whenever one wants to model the probability of a linear
sequence of events.
✔ For example, MMs can also be used in NLP for modeling valid phone sequences in
speech recognition, and for sequences of speech acts in dialog systems.
✔ One can represent a Markov chain by a state diagram.
Here, the states are shown as circles around the state name,
and the single start state is indicated with an incoming arrow.
Possible transitions are shown by arrows connecting states,
and these arcs are labeled with the probability of this
transition being followed, given that you are in the state at the
tail of the arrow.
Transitions with zero probability are omitted from the
diagram.
Note that the probabilities of the outgoing arcs from each
state sum to 1.
Markov Model
• The Markov properties ensure that we have a finite state automaton.
• There are no long-distance dependencies: where one ends up next depends
simply on what state one is in.
• In a visible Markov model, we know what state the machine is passing through, so the
state sequence or some deterministic function of it can be regarded as the output.
• The probability of a sequence of states (that is, a sequence of random variables)
X1, ..., XT is easily calculated for a Markov chain.
• We need merely calculate the product of the probabilities that
occur on the arcs or in the stochastic matrix:

P(X1, ..., XT) = P(X1) P(X2 | X1) ··· P(XT | X_{T−1}) = π_{X1} ∏_{t=1}^{T−1} a_{X_t X_{t+1}}

where π_i = P(X1 = s_i) and a_{ij} = P(X_{t+1} = s_j | X_t = s_i).
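Since this is just a product of probabilities down the chain, it can be sketched in a
few lines of Python. The two-state weather chain and all of its numbers below are
made-up illustration values, not taken from any example in these notes:

```python
# A minimal visible Markov chain sketch; states and probabilities are
# invented for illustration only.
initial = {"hot": 0.6, "cold": 0.4}                  # pi_i = P(X1 = s_i)
transition = {"hot": {"hot": 0.7, "cold": 0.3},      # a[i][j] = P(j | i)
              "cold": {"hot": 0.4, "cold": 0.6}}

def sequence_probability(states):
    """P(X1, ..., XT) = pi(X1) * product of the arc probabilities."""
    p = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        p *= transition[prev][curr]
    return p

print(sequence_probability(["hot", "cold", "cold", "hot"]))
# 0.6 * 0.3 * 0.6 * 0.4 = 0.0432
```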
N-gram vs Markov Model
• It is possible to reformulate any n-gram model as a visible Markov
model by simply encoding the appropriate amount of history into the
state space: states are then (n−1)-grams, so, e.g., a trigram of
preceding words would be a state in a fourgram model.
• Any fixed finite amount of history can always be encoded in this way
by simply elaborating the state space as a cross-product of multiple
previous states.
• In such cases, we sometimes talk of an mth-order Markov model,
where m is the number of previous states that we are using to predict
the next state.
• An n-gram model is equivalent to an (n−1)th-order Markov model.
Hidden Markov Model (HMM)
A Hidden Markov Model (HMM) is a statistical model used in various fields, including
speech recognition, natural language processing, and bioinformatics, to represent
sequences of observable events that are generated by an underlying set of hidden states.
HMMs are characterized by four key components:
1. States (Hidden States): A finite set of hidden states, often represented as {S1, S2,
..., Sn}, where n is the number of states. These states are not directly observable.
2. Observations (Emissions): A finite set of observable events or emissions,
represented as {E1, E2, ..., Em}, where m is the number of possible observations.
These are the events we can measure or observe.
3. Transition Probabilities: Probabilities that define the likelihood of moving from
one hidden state to another. These are typically represented as a state transition
matrix, where the (i, j)-th entry represents the probability of transitioning from state
Si to state Sj.
4. Emission Probabilities: Probabilities that define the likelihood of observing an
event given the current hidden state. These are typically represented as an emission
matrix, where the (i, j)-th entry represents the probability of observing event Ej when
in state Si.
The "weather" example
We model the weather as either "sunny" or "rainy" based on whether a
person carries an umbrella or not.

States:
• Sunny (S)
• Rainy (R)

Observations:
• Carrying an umbrella (U)
• Not carrying an umbrella (N)

Transition Probabilities:
• P(Sunny → Sunny) = 0.7
• P(Sunny → Rainy) = 0.3
• P(Rainy → Sunny) = 0.4
• P(Rainy → Rainy) = 0.6

Emission Probabilities:
• P(U | Sunny) = 0.2
• P(N | Sunny) = 0.8
• P(U | Rainy) = 0.6
• P(N | Rainy) = 0.4
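The example translates directly into code. Below is a minimal Python sketch using the
numbers from this slide; only the initial distribution is an assumption (the slide does
not give one), so a uniform start is used:

```python
import random

# Weather HMM from the slide. Only `initial` is an assumption here;
# the transition and emission numbers are the ones given above.
states = ["Sunny", "Rainy"]
observations = ["U", "N"]                        # umbrella / no umbrella
initial = {"Sunny": 0.5, "Rainy": 0.5}           # assumed uniform start
transition = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
              "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
emission = {"Sunny": {"U": 0.2, "N": 0.8},
            "Rainy": {"U": 0.6, "N": 0.4}}

def sample(T):
    """Generate T (hidden state, observation) pairs from the model."""
    pairs = []
    state = random.choices(states, weights=[initial[s] for s in states])[0]
    for _ in range(T):
        obs = random.choices(observations,
                             weights=[emission[state][o] for o in observations])[0]
        pairs.append((state, obs))
        state = random.choices(states,
                               weights=[transition[state][s] for s in states])[0]
    return pairs

print(sample(5))   # e.g. [('Sunny', 'N'), ('Rainy', 'U'), ...]
```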
Stochastic Model
• A stochastic model is a discrete-time process, indexed at times 1, 2, 3, ...,
that takes values, called states, which are observed.
• For example, the states could be S = {hot, cold}.
• A state sequence over time is then z ∈ S^T.
• The weather for 4 days can be the sequence {z1 = hot, z2 = cold,
z3 = cold, z4 = hot}.

Markov and Hidden Markov models are engineered to handle data which
can be represented as a 'sequence' of observations over time.
Markov vs Hidden Markov Models
• For a stochastic model to be a Markov model, the next state must depend only on
the system's current state. This is known as the Markov Property.
• A stochastic process with a discrete state space is a Markov Chain.
However, if the state space is continuous, then it is a Markov Process.
• The states of a Markov chain can be observed directly. HMMs are used when we can
only observe a secondary sequence; that is, the underlying sequence of states
is hidden.
• Significantly, this secondary sequence depends on the sequence of hidden
states. Therefore, the observed sequence gives us information about the
hidden sequence.

Just as the hidden states have a transition matrix, the secondary
states have an emission matrix. Each matrix element gives the
probability that a given hidden state produces a given observed state.
Hidden Markov Models
• In an HMM, you don’t know the state sequence that the model
passes through, but only some probabilistic function of it.

• Suppose you have a crazy soft drink machine: it can be in two states,
cola preferring (CP) and iced tea preferring (IP), but it switches
between them randomly after each purchase.
• Now, if, when you put in your coin, the machine always put out a
cola when it was in the cola preferring state, and an iced tea when it was
in the iced tea preferring state, then we would have a visible Markov
model.
Hidden Markov Models
• We need symbol emission probabilities for the observations:

P(O_t = k | X_t = s_i, X_{t+1} = s_j) = b_{ijk}

• For this machine, the output is actually independent of s_j, and so can
be described by the following probability matrix:

           cola    ice-t    lem
   CP      0.6     0.1      0.3
   IP      0.1     0.7      0.2

• What is the probability of seeing the output sequence {lem, ice-t} if
the machine always starts off in the cola preferring state?
Hidden Markov Models
• We need to consider all paths that might be taken through the HMM, and then
sum over them.
• We know that the machine starts in state CP.
• There are then four possibilities, depending on which of the two states the
machine is in at the other two time instants.
• Output sequence: {lem, ice-t}, starting in the cola preferring state.
• Possible combinations: CP-CP-CP, CP-CP-IP, CP-IP-CP, CP-IP-IP
What is the probability of seeing the output sequence {lem, ice-t} if the
machine always starts off in the cola preferring state?

Path        Factors (transition, emission, transition, emission)    Product
CP-CP-CP    0.7 × 0.3 × 0.7 × 0.1                                   0.0147
CP-CP-IP    0.7 × 0.3 × 0.3 × 0.1                                   0.0063
CP-IP-CP    0.3 × 0.3 × 0.5 × 0.7                                   0.0315
CP-IP-IP    0.3 × 0.3 × 0.5 × 0.7                                   0.0315

Total: 0.0147 + 0.0063 + 0.0315 + 0.0315 = 0.084

Each path alternates a transition probability with the probability of emitting
lem, then ice-t, from the state at the tail of the corresponding arc.
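This path-by-path sum is easy to check by brute force. In the sketch below, the
emission matrix is the one given earlier for the machine; the transition probabilities
(0.7/0.3 out of CP, 0.5/0.5 out of IP) are read off the path factors above:

```python
from itertools import product

# Soft drink machine as an arc-emission HMM in which the output depends
# only on the source state of each arc.
transition = {"CP": {"CP": 0.7, "IP": 0.3},
              "IP": {"CP": 0.5, "IP": 0.5}}
emission = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
            "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

def observation_probability(obs, start):
    """P(O | start state): sum over every possible state path."""
    total = 0.0
    for path in product(transition, repeat=len(obs)):   # states at times 2..T+1
        p, state = 1.0, start
        for symbol, nxt in zip(obs, path):
            p *= transition[state][nxt] * emission[state][symbol]
            state = nxt
        total += p
    return total

print(observation_probability(["lem", "ice_t"], "CP"))  # 0.084
```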
Hidden Markov Models

What is the probability of seeing the output sequence {cola, lem} if the
machine always starts off in the iced tea preferring state?
Hidden Markov Models

Why use HMMs?


• When this general model is suitable, the reason that HMMs are very
useful is that they are one of a class of models for which there exist
efficient methods of training through use of the Expectation
Maximization (EM) algorithm.
• Given plenty of data that we assume to be generated by some HMM –
where the model architecture is fixed but not the probabilities on the
arcs
• This algorithm allows us to automatically learn the model parameters
that best account for the observed data.
Hidden Markov Models
Why use HMMs?
This HMM assigns the same probabilities as the earlier interpolation equation: there
are three ways for wc to follow wa wb, and the total probability of seeing wc next is
then the sum of each of the n-gram probabilities that label the arcs, multiplied by
the corresponding parameter λi.
The HMM training algorithm can then be applied to this network, and used to
improve initial estimates for the parameters λiab.
There are two things to note.
First, this conversion works by adding epsilon transitions – that is,
transitions that we wish to say do not produce an output symbol.
Secondly, as presented, we now have separate parameters
λiab for each word pair.
But we would not want to adjust these parameters separately,
as this would make our sparse data problem worse, not better.
Rather, for a fixed i, we wish to keep all (or at least classes of)
the λiab parameters having the same value, which we do by
using tied states.
General form of an HMM
An HMM is specified by a five-tuple (S, K, Π, A, B), where S and K are
the set of states and the output alphabet, and Π, A, and B are the
probabilities for the initial state, state transitions, and symbol emissions,
respectively.
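The five-tuple maps naturally onto a small container type. The dict-based layout
below is one possible encoding (an implementation choice, not something the notes
prescribe), filled in with the soda machine values used earlier:

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """(S, K, Pi, A, B): states, alphabet, initial, transition, emission."""
    S: list     # set of states
    K: list     # output alphabet
    Pi: dict    # initial state probabilities
    A: dict     # state transition probabilities A[i][j]
    B: dict     # symbol emission probabilities B[i][k]

soda_machine = HMM(
    S=["CP", "IP"],
    K=["cola", "ice_t", "lem"],
    Pi={"CP": 1.0, "IP": 0.0},   # "always starts in the cola preferring state"
    A={"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}},
    B={"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
       "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}},
)
```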
HMM: Emission Models
Arc Emission (Continuous Emission):
In arc emission HMMs, the emission probabilities are modeled as continuous
probability distributions. This means that for each hidden state, the emission
probability distribution is a continuous distribution. The observable symbols are
continuous variables, and the emission probabilities model the likelihood of observing a
particular value from the continuous distribution given the current hidden state.
This type of emission model is often used in speech recognition where the acoustic
features, such as spectral coefficients, are continuous values.
State Emission (Discrete Emission):
In state emission HMMs, the emission probabilities are modeled as discrete probability
distributions. Each hidden state is associated with a set of discrete symbols or
observations, and the emission probabilities represent the probability of emitting one of
these symbols when in that state.
This type of emission model is common in natural language processing tasks, where the
observable symbols are discrete words and the emission probabilities model the
likelihood of generating each word from the underlying hidden state (e.g., part of
speech).
General form of an HMM
Given a specification of an HMM, it is perfectly straightforward to
simulate the running of a Markov process and to produce an output
sequence.
The interest in HMMs comes from assuming that some set of data was
generated by an HMM, and then being able to calculate probabilities and
probable underlying state sequences.
Hidden Markov Models
The Three Fundamental Questions for HMMs

1. Given a model µ = (A, B, Π), how do we efficiently compute how likely a
certain observation is, that is, P(O|µ)?
2. Given the observation sequence O and a model µ, how do we choose a
state sequence (X1, ..., XT+1) that best explains the observations?
3. Given an observation sequence O, and a space of possible models found
by varying the model parameters µ = (A, B, Π), how do we find the model
that best explains the observed data?
The Three Fundamental Questions for HMMs
Finding the probability of an observation
Given the observation sequence O = (o1, ..., oT) and a model µ = (A, B, Π), we wish to
know how to efficiently compute P(O|µ) – the probability of the observation given the
model.
This is the evaluation (or likelihood) problem; the term decoding is reserved for
finding the best state sequence, below.
For any state sequence X = (X1, ..., XT+1),

P(O | X, µ) = ∏_{t=1}^{T} P(o_t | X_t, X_{t+1}, µ) = b_{X1 X2 o1} · b_{X2 X3 o2} ··· b_{XT XT+1 oT}

and, summing over all state sequences,

P(O|µ) = Σ_X P(O | X, µ) P(X | µ) = Σ_{X1 ··· XT+1} π_{X1} ∏_{t=1}^{T} a_{X_t X_{t+1}} b_{X_t X_{t+1} o_t}

For the general case (where one can start in any state, and move to any
other at each step), this direct calculation requires (2T + 1) · N^(T+1) multiplications.
The Three Fundamental Questions for HMMs
Finding the probability of an observation

To avoid this complexity, we use the general technique of dynamic programming or
memoization, by which we remember partial results rather than recomputing them.
This general concept crops up in many other places in computational linguistics, such
as chart parsing, and in computer science more generally.
For algorithms such as HMMs, the dynamic programming problem is generally
described in terms of trellises (also called lattices).
Here, we make a square array of states versus time, and compute the probabilities of
being at each state at each time in terms of the probabilities for being in each state at
the preceding time instant.
A trellis can record the probability of all initial subpaths of the HMM that end in a
certain state at a certain time.
The probability of longer subpaths can then be worked out in terms of the shorter
subpaths.
The Three Fundamental Questions for HMMs
Finding the probability of an observation
Trellis algorithms:
• The trellis is a square array of states versus times.
• A node at (s_i, t) can store information about state
sequences which include X_t = i.
• The lines show the connections between nodes.
• Here we have a fully interconnected HMM, where
one can move from any state to any other at each
step.

Trellis – the forward procedure:
This form of caching is called the forward procedure.
We describe it in terms of forward variables:

α_i(t) = P(o1 o2 ··· o_{t−1}, X_t = i | µ)

The forward variable α_i(t) is stored at (s_i, t) in the trellis and expresses the total
probability of ending up in state s_i at time t, given that the observations o1 ··· o_{t−1}
were seen.
The Three Fundamental Questions for HMMs
Finding the probability of an observation
Trellis – the forward procedure:
It is calculated by summing probabilities for all incoming arcs at a
trellis node.
We calculate the forward variables in the trellis left to right using
the following procedure:

1. Initialization: α_i(1) = π_i, for 1 ≤ i ≤ N
2. Induction: α_j(t+1) = Σ_{i=1}^{N} α_i(t) a_{ij} b_{ij o_t}, for 1 ≤ t ≤ T, 1 ≤ j ≤ N
3. Total: P(O|µ) = Σ_{i=1}^{N} α_i(T+1)

– Closeup of the computation of forward probabilities at one node:
the forward probability α_j(t+1) is calculated by summing the product
of the probabilities on each incoming arc with the forward probability
of the originating node.

This is a much cheaper algorithm, requiring only 2N²T multiplications.
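These three steps translate into a short sketch for the soda machine model (same
assumed transition matrix as in the brute-force sketch earlier); it reproduces the
0.084 answer with far fewer multiplications:

```python
# Forward procedure for the soda machine HMM (arc emission, output
# independent of the target state).
states = ["CP", "IP"]
pi = {"CP": 1.0, "IP": 0.0}     # machine always starts in CP
transition = {"CP": {"CP": 0.7, "IP": 0.3},
              "IP": {"CP": 0.5, "IP": 0.5}}
emission = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
            "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

def forward(obs):
    """P(O | mu) via the three steps above."""
    alpha = {s: pi[s] for s in states}          # 1. alpha_i(1) = pi_i
    for symbol in obs:                          # 2. induction over t = 1..T
        alpha = {j: sum(alpha[i] * transition[i][j] * emission[i][symbol]
                        for i in states)
                 for j in states}
    return sum(alpha.values())                  # 3. sum of alpha_i(T+1)

print(forward(["lem", "ice_t"]))                # 0.084, as computed by hand
print(forward(["lem", "ice_t", "cola"]))        # the three-symbol example below
```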
The Three Fundamental Questions for HMMs
Finding the probability of an observation
Example
• Visualizing the computation of a single element α_t(i) in the trellis:
it is obtained by summing all the previous values α_{t−1}(j), weighted by
their transition probabilities a_{ji}, and multiplying by the observation
probability b_i(o_t).
• For many applications of HMMs, many of the transition probabilities
are 0, so not all previous states will contribute to the forward
probability of the current state.
• Hidden states are in circles, observations in squares. Shaded nodes
are included in the probability computation for α_t(i).
• Note that this visualization uses state-emission notation b_i(o_t),
whereas the procedure above was written with arc emissions b_{ij o_t}.
What is the probability of seeing the output sequence {lem, ice-t, cola} if the
machine always starts off in the cola preferring state?
(Trellis diagram: states CP and IP at times 1–4; observed outputs lem, ice-t, cola.)

Forward Procedure
Forward Procedure Output Summary
The Three Fundamental Questions for HMMs
Finding the probability of an observation
Trellis – the backward procedure:
It should be clear that we do not need to cache results working
forward through time like this; we could also work backward.
The backward procedure computes backward variables, which give the
total probability of seeing the rest of the observation sequence given
that we were in state s_i at time t:

β_i(t) = P(o_t ··· o_T | X_t = i, µ)

We can then calculate the backward variables working from
right to left through the trellis as follows:

1. Initialization: β_i(T+1) = 1, for 1 ≤ i ≤ N
2. Induction: β_i(t) = Σ_{j=1}^{N} a_{ij} b_{ij o_t} β_j(t+1), for 1 ≤ t ≤ T, 1 ≤ i ≤ N
3. Total: P(O|µ) = Σ_{i=1}^{N} π_i β_i(1)

The real reason for introducing this less intuitive calculation, though,
is that a combination of forward and backward probabilities is vital for
solving the third problem of parameter reestimation.
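The backward recurrences can be sketched the same way, reusing the `states`, `pi`,
`transition`, and `emission` tables from the forward sketch above; forward and
backward must agree on P(O|µ):

```python
def backward(obs):
    """P(O | mu) via backward variables (tables from the forward sketch)."""
    beta = {s: 1.0 for s in states}             # 1. beta_i(T+1) = 1
    for symbol in reversed(obs):                # 2. induction, right to left
        beta = {i: sum(transition[i][j] * emission[i][symbol] * beta[j]
                       for j in states)
                for i in states}
    return sum(pi[i] * beta[i] for i in states) # 3. sum of pi_i * beta_i(1)

print(backward(["lem", "ice_t"]))               # 0.084, agreeing with forward()
```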
The Three Fundamental Questions for HMMs
Finding the probability of an observation
More generally, in fact, we can use any combination of forward and
backward caching to work out the probability of an observation sequence:

P(O|µ) = Σ_{i=1}^{N} α_i(t) β_i(t), for any 1 ≤ t ≤ T + 1
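As a quick check of this identity, we can cut the computation at an arbitrary time t,
running forward up to t and backward down to t (tables again from the forward sketch):

```python
def combined(obs, t):
    """P(O | mu) as the sum over i of alpha_i(t) * beta_i(t), 1 <= t <= T+1."""
    alpha = {s: pi[s] for s in states}
    for symbol in obs[:t - 1]:                  # forward variables up to time t
        alpha = {j: sum(alpha[i] * transition[i][j] * emission[i][symbol]
                        for i in states) for j in states}
    beta = {s: 1.0 for s in states}
    for symbol in reversed(obs[t - 1:]):        # backward variables down to t
        beta = {i: sum(transition[i][j] * emission[i][symbol] * beta[j]
                       for j in states) for i in states}
    return sum(alpha[i] * beta[i] for i in states)

print(combined(["lem", "ice_t"], 2))            # 0.084 for any cut point t
```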
What is the probability of seeing the output sequence {lem, ice-t, cola} if the
machine always starts off in the cola preferring state?
(Trellis diagram: states CP and IP at times 1–4; observed outputs lem, ice-t, cola.)

Backward Procedure
Forward & Backward Procedure
Output Summary
The Three Fundamental Questions for HMMs
Finding the best state sequence
The second problem is "finding the state sequence that best explains the
observations."
One way to proceed would be to choose the states individually.
That is, for each t, 1 ≤ t ≤ T + 1, we would find the X_t that maximizes
P(X_t | O, µ). This can be computed from the forward and backward variables:

γ_i(t) = P(X_t = i | O, µ) = α_i(t) β_i(t) / Σ_{j=1}^{N} α_j(t) β_j(t)

This quantity maximizes the expected number of states that will be guessed correctly.
However, it may yield a quite unlikely state sequence.
Therefore, this is not the method that is normally used; instead we use the Viterbi
algorithm, which efficiently computes the most likely complete state sequence.
What is the probability of seeing the output sequence {lem, ice-t, cola} if the
machine always starts off in the cola preferring state?
(Trellis diagram: states CP and IP at times 1–4; observed outputs lem, ice-t, cola.)

Best Sequence
Best State Sequence Calculation

CP-IP-CP-CP
Forward/Backward Procedure & Best State Sequence Output Summary
Finding the probability of an observation: Example

What is the probability of seeing the output sequence {lem, ice-t, cola} if
the machine always starts off in the cola preferring state?

Homework!

AAB?
The Three Fundamental Questions for HMMs
Viterbi Algorithm
• For any model, such as an HMM, that contains hidden variables, the task of
determining which sequence of variables is the underlying source of some
sequence of observations is called the decoding task.
• In the ice-cream domain, given a sequence of ice-cream observations 3 1 3
and an HMM, the task of the decoder is to find the best hidden weather
sequence (H H H).
• More formally, Decoding: Given as input an HMM µ = (A, B) and a sequence
of observations O = o1, o2, ..., oT, find the most probable sequence of
states Q = q1 q2 q3 ... qT.

• We might propose to find the best sequence as follows: for each possible
hidden state sequence (HHH, HHC, HCH, etc.), we could run the forward
algorithm and compute the likelihood of the observation sequence given
that hidden state sequence.
• Then we could choose the hidden state sequence with the maximum
observation likelihood.
• It should be clear from the previous section that we cannot do this,
because there is an exponentially large number of state sequences.
The Three Fundamental Questions for HMMs
Viterbi Algorithm
Using dynamic programming, we calculate the most probable path through the whole
trellis as follows. Define

δ_j(t) = max_{X1 ··· X_{t−1}} P(X1 ··· X_{t−1}, o1 ··· o_{t−1}, X_t = j | µ)

1. Initialization: δ_j(1) = π_j, for 1 ≤ j ≤ N
2. Induction: δ_j(t+1) = max_{1 ≤ i ≤ N} δ_i(t) a_{ij} b_{ij o_t}; store the backtrace
ψ_j(t+1) = argmax_{1 ≤ i ≤ N} δ_i(t) a_{ij} b_{ij o_t}
3. Termination and path readout (by backtracking): X̂_{T+1} = argmax_{1 ≤ j ≤ N} δ_j(T+1),
then X̂_t = ψ_{X̂_{t+1}}(t+1)
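A sketch of these recurrences, once more reusing the soda machine tables from the
forward sketch. Ties are broken by state order, which is an arbitrary implementation
choice:

```python
def viterbi(obs):
    """Most probable state path and its probability."""
    delta = {s: pi[s] for s in states}                 # delta_j(1) = pi_j
    backptrs = []                                      # psi, one dict per step
    for symbol in obs:
        psi, new_delta = {}, {}
        for j in states:
            # best predecessor i for being in state j at the next time step
            best = max(states, key=lambda i: delta[i] * transition[i][j]
                                             * emission[i][symbol])
            psi[j] = best
            new_delta[j] = (delta[best] * transition[best][j]
                            * emission[best][symbol])
        backptrs.append(psi)
        delta = new_delta
    last = max(states, key=lambda j: delta[j])         # best state at time T+1
    path = [last]
    for psi in reversed(backptrs):                     # read out by backtracking
        path.append(psi[path[-1]])
    return list(reversed(path)), delta[last]

print(viterbi(["lem", "ice_t", "cola"]))
# (['CP', 'IP', 'CP', 'CP'], 0.01323) -- the CP-IP-CP-CP path from the slides
```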
Finding the best state sequence: Viterbi
(Trellis diagram: states CP and IP at times 1–4; observed outputs lem, ice-t, cola.)
Output Summary
The Three Fundamental Questions for HMMs
The third problem: Parameter estimation
• Given a certain observation sequence, we want to find the values of the model
parameters µ = (A, B, Π) which best explain what we observed.
• Using Maximum Likelihood Estimation, that means we want to find the values
that maximize P(O|µ):

argmax_µ P(O_training | µ)

• There is no known analytic method to choose µ to maximize P(O|µ).
• But we can locally maximize it by an iterative hill-climbing algorithm.
• This algorithm is the Baum-Welch or Forward-Backward algorithm, which is a
special case of the Expectation Maximization method.

• We don’t know what the model is, but we can work out the probability of the observation sequence
using some (perhaps randomly chosen) model.
• Looking at that calculation, we can see which state transitions and symbol emissions were probably
used the most.
• By increasing the probability of those, we can choose a revised model which gives a higher probability
to the observation sequence.
• This maximization process is often referred to as training the model and is performed on training data.
What is the probability of seeing the output sequence {lem, ice-t, cola} if the
machine always starts off in the cola preferring state?
(Trellis diagram: states CP and IP at times 1–4; observed outputs lem, ice-t, cola.)
The Three Fundamental Questions for HMMs
The third problem: Parameter estimation
Define p_t(i, j), 1 ≤ t ≤ T, 1 ≤ i, j ≤ N, as shown below. This is the
probability of traversing a certain arc at time t, given the observation
sequence O:

p_t(i, j) = P(X_t = i, X_{t+1} = j | O, µ) = α_i(t) a_{ij} b_{ij o_t} β_j(t+1) / Σ_{m=1}^{N} α_m(t) β_m(t)
The Three Fundamental Questions for HMMs
The third problem: Parameter estimation
What is the probability of seeing the output sequence {lem, ice-t, cola} if the
machine always starts off in the cola preferring state?

(Trellis diagram: states CP and IP at times 1–4; observed outputs lem, ice-t, cola.)
The Three Fundamental Questions for HMMs
The third problem: Parameter estimation
Now, if we sum over the time index, this gives us expectations (counts):

Σ_{t=1}^{T} γ_i(t) = expected number of transitions from state i in O
Σ_{t=1}^{T} p_t(i, j) = expected number of transitions from state i to state j in O

where γ_i(t) = Σ_{j=1}^{N} p_t(i, j), as before.

We begin with some model µ (perhaps preselected, perhaps just chosen randomly).
We then run O through the current model to estimate the expectations of each
model parameter.
We then change the model to maximize the values of the paths that are used a lot
(while still respecting the stochastic constraints).
We then repeat this process, hoping to converge on optimal values for the model
parameters µ.

The reestimation formulas are as follows:

π̂_i = γ_i(1) = expected frequency in state i at time t = 1
â_{ij} = Σ_{t=1}^{T} p_t(i, j) / Σ_{t=1}^{T} γ_i(t)
b̂_{ijk} = Σ_{t: o_t = k} p_t(i, j) / Σ_{t=1}^{T} p_t(i, j)
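One E-step and M-step can be sketched compactly for the soda machine model, reusing
the tables from the forward sketch. Because that machine's output is independent of
the target state, the emission reestimate below is kept per source state (b̂[i][k])
rather than per arc; this is a single illustrative iteration, not a full Baum-Welch
implementation (no convergence loop, and it assumes every state has a nonzero
expected count):

```python
def expectations(obs):
    """E-step: gamma_i(t) and p_t(i, j) for one observation sequence."""
    T = len(obs)
    alphas = [{s: pi[s] for s in states}]          # alpha(1) .. alpha(T+1)
    for symbol in obs:
        prev = alphas[-1]
        alphas.append({j: sum(prev[i] * transition[i][j] * emission[i][symbol]
                              for i in states) for j in states})
    betas = [{s: 1.0 for s in states}]             # built right to left
    for symbol in reversed(obs):
        nxt = betas[0]
        betas.insert(0, {i: sum(transition[i][j] * emission[i][symbol] * nxt[j]
                                for j in states) for i in states})
    prob = sum(alphas[-1].values())                # P(O | mu)
    p = [{(i, j): alphas[t][i] * transition[i][j] * emission[i][obs[t]]
                  * betas[t + 1][j] / prob
          for i in states for j in states} for t in range(T)]
    gamma = [{i: sum(p[t][i, j] for j in states) for i in states}
             for t in range(T)]
    return gamma, p

def reestimate(obs):
    """M-step: revised pi, A, B from the expected counts."""
    gamma, p = expectations(obs)
    T = len(obs)
    new_pi = {i: gamma[0][i] for i in states}
    from_i = {i: sum(gamma[t][i] for t in range(T)) for i in states}
    new_A = {i: {j: sum(p[t][i, j] for t in range(T)) / from_i[i]
                 for j in states} for i in states}
    new_B = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k) / from_i[i]
                 for k in ["cola", "ice_t", "lem"]} for i in states}
    return new_pi, new_A, new_B

print(reestimate(["lem", "ice_t", "cola"]))
```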
