4-Hidden Markov Models


A Tutorial on Hidden Markov Models

KAIST
2006.2.2, jyha@kangwon.ac.kr

Contents
- Introduction
- Markov model
- Hidden Markov model (HMM)
- Three algorithms of HMM: model evaluation, most probable path decoding, model training
- Pattern classification using HMMs
- HMM applications and software
- Summary
- References

Sequential Data
Examples

Speech data

Handwriting data
Characteristics of such data
- Data are generated sequentially in time or along an index
- Spatial information is laid out along the time or index axis
- Often highly variable, but with an embedded structure
- The information is contained in that structure

Advantages of HMMs for Sequential Data

Natural model structure: a doubly stochastic process
- Transition parameters model temporal variability
- Output distributions model spatial variability

An efficient and effective modeling tool for
- sequences with temporal constraints
- spatial variability along the sequence
- complex real-world processes

Efficient evaluation, decoding, and training algorithms
- Mathematically sound
- Computationally efficient

Proven technology!
- Success stories in many applications

Tools already exist
- HTK (Hidden Markov Model Toolkit)
- HMM toolbox for Matlab

Successful Application Areas of HMM


- On-line handwriting recognition
- Speech recognition and segmentation
- Gesture recognition
- Language modeling
- Motion video analysis and tracking
- Protein/gene sequence alignment
- Stock price prediction

What's an HMM?

Hidden Markov Model

Hidden

Markov Model

What is hidden?

What is a Markov model?


Markov Model
- Scenario
- Graphical representation
- Definition
- Sequence probability
- State probability

Markov Model: Scenario


Classify the weather into three states
- State 1: rain or snow
- State 2: cloudy
- State 3: sunny

By carefully examining the weather of some city for a long time, we found the following weather change pattern:

Today \ Tomorrow   Rain/Snow   Cloudy   Sunny
Rain/Snow            0.4        0.3      0.3
Cloudy               0.2        0.6      0.2
Sunny                0.1        0.1      0.8

Assumption: tomorrow's weather depends only on today's weather!

Markov Model: Graphical Representation


Visual illustration with a state diagram

[Diagram: three states (1: rain/snow, 2: cloudy, 3: sunny) connected by arcs labeled with the transition probabilities above]

- Each state corresponds to one observation
- The outgoing edge weights of each state sum to one

Markov Model: Definition


Observable states: {1, 2, ..., N}
Observed sequence: q_1, q_2, ..., q_T

1st-order Markov assumption:
P(q_t = j | q_{t-1} = i, q_{t-2} = k, ...) = P(q_t = j | q_{t-1} = i)

Bayesian network representation: q_1 → q_2 → ... → q_{t-1} → q_t

Stationarity (time-invariant transitions):
P(q_t = j | q_{t-1} = i) = P(q_{t+l} = j | q_{t+l-1} = i)

Markov Model: Definition (Cont.)


State transition matrix

    [ a_11  a_12  ...  a_1N ]
A = [ a_21  a_22  ...  a_2N ]
    [  ...   ...  ...   ... ]
    [ a_N1  a_N2  ...  a_NN ]

where a_ij = P(q_t = j | q_{t-1} = i),  1 ≤ i, j ≤ N

with the constraints a_ij ≥ 0 and Σ_{j=1}^{N} a_ij = 1

Initial state probability
π_i = P(q_1 = i),  1 ≤ i ≤ N

Markov Model: Sequence Prob.


Conditional probability:  P(A, B) = P(A | B) P(B)

Sequence probability of a Markov model:
P(q_1, q_2, ..., q_T)
  = P(q_1) P(q_2 | q_1) ... P(q_{T-1} | q_1, ..., q_{T-2}) P(q_T | q_1, ..., q_{T-1})    (chain rule)
  = P(q_1) P(q_2 | q_1) ... P(q_{T-1} | q_{T-2}) P(q_T | q_{T-1})    (1st-order Markov assumption)

Markov Model: Sequence Prob. (Cont.)


Question: What is the probability that the weather for the next 7 days will be sun-sun-rain-rain-sun-cloudy-sun when today is sunny?

S_1: rain,  S_2: cloudy,  S_3: sunny

P(O | model) = P(S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | model)
  = P(S_3) P(S_3|S_3) P(S_3|S_3) P(S_1|S_3) P(S_1|S_1) P(S_3|S_1) P(S_2|S_3) P(S_3|S_2)
  = π_3 · a_33 · a_33 · a_31 · a_11 · a_13 · a_32 · a_23
  = 1 · (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
  = 1.536 × 10^-4
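A minimal Python sketch of this computation, assuming the transition table above is stored as a matrix A (rows = today, columns = tomorrow); the helper name markov_sequence_prob is just illustrative:

import numpy as np

# A[i][j] = P(tomorrow = j | today = i); states 0: rain/snow, 1: cloudy, 2: sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def markov_sequence_prob(states, A, pi):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

# "Today is sunny" is encoded as pi = [0, 0, 1]; the week sun-sun-rain-rain-sun-cloudy-sun
# is appended after today's sunny state.
pi = np.array([0.0, 0.0, 1.0])
week = [2, 2, 2, 0, 0, 2, 1, 2]
print(markov_sequence_prob(week, A, pi))  # 1.536e-04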

Markov Model: State Probability


State probability at time t:  P(q_t = i)

Simple but slow algorithm:
- A path that ends at state i at time t:  Q_t(i) = (q_1, q_2, ..., q_t = i)
- Its probability:  P(Q_t(i)) = π_{q_1} ∏_{k=2}^{t} P(q_k | q_{k-1})
- Sum the probabilities of all paths that end at i at time t:
  P(q_t = i) = Σ over all Q_t(i) of P(Q_t(i))
- Exponential time complexity: O(N^t)

Markov Model: State Prob. (Cont.)

State probability at time t:  P(q_t = i)

Efficient algorithm (lattice algorithm)
- Each node stores the sum of the probabilities of the partial paths reaching it
- Recursive path probability calculation (sketched in code below):
  P(q_t = i) = Σ_{j=1}^{N} P(q_{t-1} = j, q_t = i)
             = Σ_{j=1}^{N} P(q_{t-1} = j) P(q_t = i | q_{t-1} = j)
             = Σ_{j=1}^{N} P(q_{t-1} = j) a_ji
- Time complexity: O(N^2 t)
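A small Python sketch of the lattice recursion; the "cloudy today" start distribution is only for illustration:

import numpy as np

def state_probability(A, pi, t):
    """Distribution over states at time t (t = 1 is the initial step),
    using the recursion P(q_t = i) = sum_j P(q_{t-1} = j) * a_ji."""
    p = pi.copy()
    for _ in range(t - 1):
        p = p @ A          # one O(N^2) step of the lattice recursion
    return p

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.0, 1.0, 0.0])          # today is cloudy
print(state_probability(A, pi, t=3))    # P(q_3 = i) for i = rain/snow, cloudy, sunny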

What's an HMM?

Hidden Markov Model

Hidden

Markov Model

What is hidden?

What is a Markov model?


Hidden Markov Model


- Example
- Generation process
- Definition
- Model evaluation algorithm
- Path decoding algorithm
- Training algorithm

Time Series Example Representation


X = x_1 x_2 x_3 x_4 x_5 ... x_{T-1} x_T = s p iy iy iy ... ch ch ch ch

Analysis Methods

Probability-based analysis?
P(s p iy iy iy ... ch ch ch ch) = ?

Method I
P(s) P(p) P(iy)^3 ... P(ch)^4
- Observations are treated as independent; no time/order information
- A poor model of temporal structure
- Model size = |V| = N

Analysis Methods: Method II

P(s) P(s|s) ... P(iy|p) P(iy|iy)^2 ... P(ch|ch)^2 ...

- A simple model of an ordered sequence
- A symbol depends only on the immediately preceding one:
  P(x_t | x_1 x_2 x_3 ... x_{t-1}) = P(x_t | x_{t-1})
- A |V| × |V| matrix model: 50 × 50 is not too bad; 10^5 × 10^5 is doubly outrageous!!

The Problem: "What you see is the truth"
- Not quite a valid assumption; there are often errors or noise
  (noisy sound, sloppy handwriting, ungrammatical sentences)
- There may be some underlying "truth" process
  - an underlying hidden sequence
  - obscured by the incomplete observation

Another Analysis Method: Method III

- What you see is a clue to what lies behind, which is not known a priori
  - the source that generated the observation
  - the source evolves and generates characteristic observation sequences

Hidden state sequence: q_0 q_1 q_2 ... q_T

Σ_Q P(s, q_1) P(s, q_2 | q_1) P(p, q_3 | q_2) ... P(ch, q_T | q_{T-1}) = Σ_Q ∏_t P(x_t, q_t | q_{t-1})

The Auxiliary Variable

q_t ∈ S = {1, ..., N}
- N is also conjectured
- {q_t : t ≥ 0} is conjectured, not visible
- The sequence Q = q_1 q_2 ... q_T is Markovian:
  P(q_1 q_2 ... q_T) = P(q_1) P(q_2 | q_1) ... P(q_T | q_{T-1})    (a Markov chain)

Summary of the Concept


P(X) = Σ_Q P(X, Q)
     = Σ_Q P(Q) P(X | Q)
     = Σ_Q P(q_1 q_2 ... q_T) P(x_1 x_2 ... x_T | q_1 q_2 ... q_T)
     = Σ_Q [ ∏_{t=1}^{T} P(q_t | q_{t-1}) ] [ ∏_{t=1}^{T} p(x_t | q_t) ]

where the first product is the Markov chain process and the second is the output process.

A hidden Markov model is a doubly stochastic process
- stochastic chain process: { q(t) }
- output process: { f(x|q) }

It is also called
- a hidden Markov chain
- a probabilistic function of a Markov chain

HMM Characterization: λ = (A, B, π)

- A : state transition probabilities
  { a_ij | a_ij = p(q_{t+1} = j | q_t = i) }
- B : symbol output/observation probabilities
  { b_j(v) | b_j(v) = p(x = v | q_t = j) }
- π : initial state distribution
  { π_i | π_i = p(q_1 = i) }

P(X | λ) = Σ_Q P(Q | λ) P(X | Q, λ)
         = Σ_Q π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T} · b_{q_1}(x_1) b_{q_2}(x_2) ... b_{q_T}(x_T)

Graphical Example
π = [ 1.0  0  0  0 ]

[State diagram: a left-to-right model; state 1 (s) with self-loop 0.6 and 1→2 = 0.4, state 2 (p) with self-loop 0.5 and 2→3 = 0.5, state 3 (iy) with self-loop 0.7 and 3→4 = 0.3, state 4 (ch) with self-loop 1.0]

         1    2    3    4                     s    p    iy   ch
A =  1  0.6  0.4  0.0  0.0           B =  1  0.6  0.0  0.2  0.2
     2  0.0  0.5  0.5  0.0                2  0.3  0.5  0.2  0.0
     3  0.0  0.0  0.7  0.3                3  0.1  0.1  0.8  0.0
     4  0.0  0.0  0.0  1.0                4  0.2  0.2  0.0  0.6

Data interpretation
P(s s p p iy iy iy ch ch ch | λ)
  = Σ_Q P(s s p p iy iy iy ch ch ch, Q | λ)
  = Σ_Q P(Q | λ) p(s s p p iy iy iy ch ch ch | Q, λ)

Let Q = 1 1 2 2 3 3 3 4 4 4. Then

P(Q | λ) p(s s p p iy iy iy ch ch ch | Q, λ)
  = P(1 1 2 2 3 3 3 4 4 4 | λ) p(s s p p iy iy iy ch ch ch | 1 1 2 2 3 3 3 4 4 4, λ)
  = (1 × .6)(.6 × .6)(.4 × .5)(.5 × .5)(.5 × .8)(.7 × .8)^2 (.3 × .6)(1 × .6)^2
  ≈ 0.0000878

# multiplications for the full sum ≈ 2T N^T
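A Python sketch of the joint probability P(X, Q | λ) for one fixed path, using the π, A, and B reconstructed on the graphical-example slide above (the column order s, p, iy, ch in B is an assumption of that reconstruction):

import numpy as np

pi = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 1.0]])
B = np.array([[0.6, 0.0, 0.2, 0.2],   # state 1
              [0.3, 0.5, 0.2, 0.0],   # state 2
              [0.1, 0.1, 0.8, 0.0],   # state 3
              [0.2, 0.2, 0.0, 0.6]])  # state 4
SYM = {'s': 0, 'p': 1, 'iy': 2, 'ch': 3}

def joint_prob(obs, path, pi, A, B):
    """P(X, Q | lambda) = pi_{q1} b_{q1}(x1) * prod_t a_{q_{t-1} q_t} b_{q_t}(x_t)."""
    x = [SYM[o] for o in obs]
    q = [s - 1 for s in path]           # states are 1-based on the slide
    p = pi[q[0]] * B[q[0], x[0]]
    for t in range(1, len(x)):
        p *= A[q[t - 1], q[t]] * B[q[t], x[t]]
    return p

obs = "s s p p iy iy iy ch ch ch".split()
path = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
print(joint_prob(obs, path, pi, A, B))  # ~8.78e-05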

Issues in HMMs

Intuitive decisions
1. number of states (N)
2. topology (state interconnections)
3. number of observation symbols (V)

Difficult problems
4. efficient computation methods
5. probability parameters (λ)

The Number of States

How many states?
- Model size
- Model topology/structure

Factors
- Pattern complexity/length and variability
- The number of samples

Ex:  r r g b b g b b b r

(1) The Simplest Model: Model I

N = 1,  a_11 = 1.0,  B = [P(r), P(g), P(b)] = [1/3, 1/6, 1/2]

P(r r g b b g b b b r | λ_1)
  = (1/3)(1/3)(1/6)(1/2)(1/2)(1/6)(1/2)(1/2)(1/2)(1/3)
  ≈ 0.0000322  (< 0.0000338)

(2) Two-State Model: Model II

N = 2
A = [ 0.6  0.4        B = [ 1/2  1/3  1/6      (B rows: states 1, 2;
      0.6  0.4 ]            1/6  1/6  2/3 ]     columns: r, g, b)

P(r r g b b g b b b r | λ_2) = a sum over all 2^10 state paths, each term a product of
transition and output probabilities (e.g. .5 · (.6 · ...)(.4 · ...) + ...) = ?

(3) Three state models


N = 3:

[Two candidate three-state topologies, drawn as state diagrams with the transition probabilities above]

The Criterion

Obtain the best model (λ) that maximizes P(X | λ)

The best topology comes from insight and experience
- given the number of classes/symbols/samples

A trained HMM

[State diagram: a three-state left-to-right model with the output distribution of each state as given below, e.g. state 1 emits R .6, G .2, B .2]

π = [ 1.  0.  0. ]

         1   2   3                 R   G   B
A =  1  .5  .4  .1          B = 1  .6  .2  .2
     2  .0  .6  .4              2  .2  .5  .3
     3  .0  .0  .0              3  .0  .3  .7

(State 3 is shown with no outgoing transitions; the numerical examples below use a_33 = 0.)

Hidden Markov Model: Example


[Diagram: pots (states) connected by transition probabilities, each pot containing colored balls]

- N pots containing colored balls
- M distinct colors
- Each pot contains a different mix of colored balls

HMM: Generation Process


Sequence generating algorithm (sketched in code below)
- Step 1: Pick an initial pot according to some random process
- Step 2: Randomly pick a ball from the pot, observe its color, and replace it
- Step 3: Select another pot according to a random selection process
- Step 4: Repeat steps 2 and 3

Markov process: {q(t)}    Output process: {f(x|q)}
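A Python sketch of the pot/ball generation process; it reuses the four-state model from the graphical example above purely as sample parameters:

import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(pi, A, B, T, symbols):
    """Generate T observations: pick a start state from pi, emit from B[state], move by A[state]."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                             # Step 1: initial pot
    for _ in range(T):
        obs.append(symbols[rng.choice(B.shape[1], p=B[q])])   # Step 2: draw a ball
        states.append(q + 1)
        q = rng.choice(len(pi), p=A[q])                       # Step 3: next pot
    return states, obs

pi = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([[.6, .4, 0, 0], [0, .5, .5, 0], [0, 0, .7, .3], [0, 0, 0, 1.0]])
B = np.array([[.6, 0, .2, .2], [.3, .5, .2, 0], [.1, .1, .8, 0], [.2, .2, 0, .6]])
print(sample_hmm(pi, A, B, T=10, symbols=['s', 'p', 'iy', 'ch']))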

HMM: Hidden Information

Now, what is hidden?
- We can only see the chosen balls
- We cannot see which pot is selected at each time
- So the pot selection (state transition) information is hidden

HMM: Formal Definition

Notation: λ = (A, B, π)

(1) N : number of states
(2) M : number of symbols observable in states
    V = { v_1, ..., v_M }
(3) A : state transition probability distribution
    A = { a_ij },  1 ≤ i, j ≤ N
(4) B : observation symbol probability distribution
    B = { b_i(v_k) },  1 ≤ i ≤ N, 1 ≤ k ≤ M
(5) π : initial state distribution
    π_i = P(q_1 = i),  1 ≤ i ≤ N

Three Problems

1. Model evaluation problem
   - What is the probability of the observation? → Forward algorithm
2. Path decoding problem
   - What is the best state sequence for the observation? → Viterbi algorithm
3. Model training problem
   - How do we estimate the model parameters? → Baum-Welch reestimation algorithm

Solution to the Model Evaluation Problem
- Forward algorithm
- Backward algorithm

Definition
- Given a model λ and an observation sequence X = x_1, x_2, ..., x_T, compute P(X | λ)
- P(X | λ) = Σ_Q P(X, Q | λ) = Σ_Q P(X | Q, λ) P(Q | λ)
  (a path or state sequence: Q = q_1, ..., q_T)

Solution

Easy but slow solution: exhaustive enumeration (a brute-force sketch follows below)
P(X | λ) = Σ_Q P(X, Q | λ) = Σ_Q P(X | Q, λ) P(Q | λ)
         = Σ_Q b_{q_1}(x_1) b_{q_2}(x_2) ... b_{q_T}(x_T) · π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}

Exhaustive enumeration = combinatorial explosion!  O(N^T)

Does a smarter solution exist?
- Yes! Dynamic programming
- Lattice-structure-based computation
- Highly efficient -- linear in the frame length
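A brute-force Python sketch that literally enumerates all N^T paths, run here on the trained R/G/B model used in the numerical examples below; it is only practical for tiny T:

import itertools
import numpy as np

def prob_bruteforce(x, pi, A, B):
    """P(X | lambda) by summing P(X, Q | lambda) over all N^T state paths."""
    N, T = len(pi), len(x)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):   # N^T paths
        p = pi[Q[0]] * B[Q[0], x[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], x[t]]
        total += p
    return total

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[.5, .4, .1], [.0, .6, .4], [.0, .0, .0]])
B = np.array([[.6, .2, .2], [.2, .5, .3], [.0, .3, .7]])   # columns: R, G, B
print(prob_bruteforce([0, 0, 1, 2], pi, A, B))             # R R G B -> ~0.0284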

Forward Algorithm
Key idea
- Span a lattice of N states and T times
- Keep the sum of the probabilities of all paths coming into each state j at time t

Forward probability
α_t(j) = P(x_1 x_2 ... x_t, q_t = S_j | λ)
       = Σ_{Q_t} P(x_1 x_2 ... x_t, Q_t = q_1 ... q_t | λ)
       = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(x_t)

Forward Algorithm
Initialization:  α_1(i) = π_i b_i(x_1),   1 ≤ i ≤ N

Induction:  α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(x_t),   1 ≤ j ≤ N,  t = 2, 3, ..., T

Termination:  P(X | λ) = Σ_{i=1}^{N} α_T(i)
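A compact Python sketch of the forward algorithm; run on the trained R/G/B model, it reproduces the α lattice of the numerical example on the next slide:

import numpy as np

def forward(x, pi, A, B):
    """alpha[t, j] = P(x_1..x_{t+1}, q = j | lambda); returns (alpha, P(X | lambda))."""
    T, N = len(x), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]                        # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]    # induction
    return alpha, alpha[-1].sum()                     # termination

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[.5, .4, .1], [.0, .6, .4], [.0, .0, .0]])
B = np.array([[.6, .2, .2], [.2, .5, .3], [.0, .3, .7]])   # columns: R, G, B
alpha, p = forward([0, 0, 1, 2], pi, A, B)                 # R R G B
print(alpha)   # rows t = 1..4: [.6 0 0], [.18 .048 0], [.018 .0504 .01116], [.0018 .01123 .01537]
print(p)       # ~0.0284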

Numerical Example: P(RRGB | λ)

π = [1 0 0]^T, with A and B from the trained HMM above.

Forward lattice (α_t(i)):

            x1 = R    x2 = R    x3 = G    x4 = B
state 1     .6        .18       .018      .0018
state 2     .0        .048      .0504     .01123
state 3     .0        .0        .01116    .01537

P(RRGB | λ) = .0018 + .01123 + .01537 ≈ .0284

Backward Algorithm (1)


Key idea
- Span a lattice of N states and T times
- Keep the sum of the probabilities of all paths going out of each state i at time t

Backward probability
β_t(i) = P(x_{t+1} x_{t+2} ... x_T | q_t = S_i, λ)
       = Σ_{Q_{t+1}} P(x_{t+1} x_{t+2} ... x_T, Q_{t+1} = q_{t+1} ... q_T | q_t = S_i, λ)
       = Σ_{j=1}^{N} a_ij b_j(x_{t+1}) β_{t+1}(j)

Backward Algorithm (2)


Initialization:  β_T(i) = 1,   1 ≤ i ≤ N

Induction:  β_t(i) = Σ_{j=1}^{N} a_ij b_j(x_{t+1}) β_{t+1}(j),   1 ≤ i ≤ N,  t = T-1, T-2, ..., 1
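A matching Python sketch of the backward algorithm; combining β_1 with π and b_i(x_1) gives the same P(X | λ) as the forward pass:

import numpy as np

def backward(x, A, B):
    """beta[t, i] = P(x_{t+2}..x_T | q = i, lambda), with beta at the last frame set to 1."""
    T, N = len(x), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # induction
    return beta

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[.5, .4, .1], [.0, .6, .4], [.0, .0, .0]])
B = np.array([[.6, .2, .2], [.2, .5, .3], [.0, .3, .7]])
x = [0, 0, 1, 2]                                      # R R G B
beta = backward(x, A, B)
print((pi * B[:, x[0]] * beta[0]).sum())              # ~0.0284, same as the forward result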

Solution to Path Decoding Problem


- State sequence
- Optimal path
- Viterbi algorithm
- Sequence segmentation

The Most Probable Path
- Given a model λ and an observation sequence X = x_1, x_2, ..., x_T
- Q* = argmax_Q P(X, Q | λ) = argmax_Q P(X | Q, λ) P(Q | λ)
  (a path or state sequence: Q = q_1, ..., q_T)

Viterbi Algorithm
Purpose
- An analysis of the internal processing result
- The best (most likely) state sequence
- Internal segmentation

Viterbi algorithm
- Alignment of observations and state transitions
- A dynamic programming technique

Viterbi Path Idea


Key idea
- Span a lattice of N states and T times
- Keep the probability and the previous node of the most probable path coming into each state j at time t

Recursive path selection
- Path probability:  δ_{t+1}(j) = max_{1≤i≤N} [ δ_t(i) a_ij ] · b_j(x_{t+1})
- Path node (backpointer):  ψ_{t+1}(j) = argmax_{1≤i≤N} δ_t(i) a_ij

Viterbi Algorithm
Initialization:   δ_1(i) = π_i b_i(x_1),   ψ_1(i) = 0,   1 ≤ i ≤ N

Recursion:   δ_{t+1}(j) = max_{1≤i≤N} [ δ_t(i) a_ij ] · b_j(x_{t+1})
             ψ_{t+1}(j) = argmax_{1≤i≤N} δ_t(i) a_ij,    1 ≤ j ≤ N,  1 ≤ t ≤ T-1

Termination:   P* = max_{1≤i≤N} δ_T(i),   q*_T = argmax_{1≤i≤N} δ_T(i)

Path backtracking:   q*_t = ψ_{t+1}(q*_{t+1}),   t = T-1, ..., 1
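A Python sketch of the Viterbi recursion with backtracking; on the trained R/G/B model it reproduces the numerical example on the next slide:

import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state path for observation x; returns (best path, its probability)."""
    T, N = len(x), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, x[0]]                        # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # backpointers
        delta[t] = scores.max(axis=0) * B[:, x[t]]    # recursion
    path = [int(delta[-1].argmax())]                  # termination
    for t in range(T - 1, 0, -1):                     # backtracking
        path.append(int(psi[t][path[-1]]))
    return [s + 1 for s in reversed(path)], delta[-1].max()

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[.5, .4, .1], [.0, .6, .4], [.0, .0, .0]])
B = np.array([[.6, .2, .2], [.2, .5, .3], [.0, .3, .7]])
print(viterbi([0, 0, 1, 2], pi, A, B))   # ([1, 1, 2, 3], 0.01008)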

Numerical Example: P(RRGB, Q* | λ)

π = [1 0 0]^T, with A and B from the trained HMM above.

Viterbi lattice (δ_t(i)):

            x1 = R    x2 = R    x3 = G    x4 = B
state 1     .6        .18       .018      .0018
state 2     .0        .048      .036      .00648
state 3     .0        .0        .00576    .01008

P(RRGB, Q* | λ) = max_i δ_4(i) = .01008; backtracking gives Q* = 1 1 2 3.

Solution to the Model Training Problem
- HMM training algorithm
- Maximum likelihood estimation
- Baum-Welch reestimation

HMM Training Algorithm
- Given an observation sequence X = x_1, x_2, ..., x_T
- Find the model parameters λ* = (A, B, π) such that P(X | λ*) ≥ P(X | λ) for all λ
- Adapt the HMM parameters maximally to the training samples
- Likelihood of a sample:  P(X | λ) = Σ_Q P(X | Q, λ) P(Q | λ)
- The state transitions are hidden! → no analytical solution
- Baum-Welch reestimation (EM)
  - an iterative procedure that locally maximizes P(X | λ)
  - convergence proven
  - an MLE statistical estimation

[Figure: P(X | λ) plotted against λ (the HMM parameters), converging to a local maximum]

Maximum Likelihood Estimation


MLE selects the parameters that maximize the probability of the observed sample.

[Definition] Maximum likelihood estimate
- θ : a set of distribution parameters
- Given X, θ* is the maximum likelihood estimate of θ if f(X | θ*) = max_θ f(X | θ)

MLE Example
Scenario
- Known: 3 balls inside the pot (some red, some white)
- Unknown: R = # of red balls
- Observation: two balls drawn, both red

Two models
P(two reds | R = 2) = C(2,2) C(1,0) / C(3,2) = 1/3
P(two reds | R = 3) = C(3,2) / C(3,2) = 1

Which model?
L(R = 3) > L(R = 2)  →  Model(R = 3) is our choice
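A two-line check of the two likelihoods with Python's math.comb (drawing without replacement, i.e. a hypergeometric count):

from math import comb

def likelihood(R, n_total=3, n_drawn=2, n_red_drawn=2):
    """P(drawing n_red_drawn reds | R red balls among n_total)."""
    return comb(R, n_red_drawn) * comb(n_total - R, n_drawn - n_red_drawn) / comb(n_total, n_drawn)

print(likelihood(2))  # 1/3
print(likelihood(3))  # 1.0  -> the MLE picks R = 3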

MLE Example (Cont.)


Model(R=3) is the more likely choice, unless we have a priori knowledge of the system.
- However, without the observation of two red balls, there is no reason to prefer R=3 over R=2.

The ML method chooses the set of parameters that maximizes the likelihood of the given observation. It makes the parameters maximally adapted to the training data.

EM Algorithm for Training

With λ(t) = < {a_ij}, {b_ik}, π_i >, estimate the EXPECTATION of the following quantities:
- Expected number of visits to state i
- Expected number of transitions from i to j

With those quantities, obtain the MAXIMUM LIKELIHOOD estimate

λ(t+1) = < {a_ij}, {b_ik}, π_i >

Expected Number of Visits to S_i

γ_t(i) = P(q_t = S_i | X, λ)
       = P(q_t = S_i, X | λ) / P(X | λ)
       = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j)

Expected Number of Transitions

ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | X, λ)
          = α_t(i) a_ij b_j(x_{t+1}) β_{t+1}(j) / Σ_i Σ_j α_t(i) a_ij b_j(x_{t+1}) β_{t+1}(j)

Parameter Reestimation

MLE parameter estimation (a one-step code sketch follows):

a_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

b_j(v_k) = Σ_{t=1, s.t. x_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

π_i = γ_1(i)

Iterative: P(X | λ(t+1)) ≥ P(X | λ(t)); convergence proven, arriving at a local optimum.
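A self-contained Python sketch of one Baum-Welch (EM) reestimation step built from the γ and ξ quantities above; the 2-state, 2-symbol model at the bottom is a purely hypothetical toy:

import numpy as np

def forward(x, pi, A, B):
    alpha = np.zeros((len(x), len(pi)))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, len(x)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha

def backward(x, A, B):
    beta = np.ones((len(x), A.shape[0]))
    for t in range(len(x) - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(x, pi, A, B):
    """One EM step: gamma/xi expectations, then the MLE reestimation formulas."""
    T, N = len(x), len(pi)
    alpha, beta = forward(x, pi, A, B), backward(x, A, B)
    px = alpha[-1].sum()
    gamma = alpha * beta / px                                   # gamma_t(i)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, x[t + 1]] * beta[t + 1] / px
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(x) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
x = [0, 1, 1, 0, 0, 1, 0]
for _ in range(5):
    print(forward(x, pi, A, B)[-1].sum())   # P(X | lambda) never decreases
    pi, A, B = baum_welch_step(x, pi, A, B)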

Other Issues

Other methods of training
- MAP (maximum a posteriori) estimation for adaptation
- MMI (maximum mutual information) estimation
- MDI (minimum discrimination information) estimation
- Viterbi training
- Discriminant/reinforcement training

Other types of parametric structure


Continuous density HMM (CHMM)
- More accurate, but many more parameters to train

Semi-continuous HMM
- A mix of CHMM and DHMM, using parameter sharing

State-duration HMM
- More accurate temporal behavior

Other extensions
- HMM+NN, autoregressive HMM
- 2D models: MRF, hidden mesh model, pseudo-2D HMM

Graphical DHMM and CHMM


Models for 5 and 2


Pattern Classification using HMMs


- Pattern classification
- Extension of HMM structure
- Extension of HMM training method
- Practical issues of HMM
- HMM history

Pattern Classification

- Construct one HMM λ_k per class k: λ_1, ..., λ_N
- Train each HMM λ_k with its samples D_k
  - Baum-Welch reestimation algorithm
- Calculate the model likelihoods of λ_1, ..., λ_N for the observation X
  - Forward algorithm: P(X | λ_k)
- Find the model with the maximum a posteriori probability:
  k* = argmax_k P(λ_k | X) = argmax_k P(λ_k) P(X | λ_k) / P(X) = argmax_k P(λ_k) P(X | λ_k)
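A minimal Python sketch of this classification rule; the two 2-state models and the equal priors are hypothetical placeholders:

import numpy as np

def forward_prob(x, pi, A, B):
    alpha = pi * B[:, x[0]]
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]
    return alpha.sum()

def classify(x, models, priors):
    """Pick the class k maximizing P(lambda_k) * P(X | lambda_k);
    models is a list of (pi, A, B) tuples, one trained HMM per class."""
    scores = [p * forward_prob(x, *m) for m, p in zip(models, priors)]
    return int(np.argmax(scores)), scores

m0 = (np.array([1., 0.]), np.array([[.8, .2], [.2, .8]]), np.array([[.9, .1], [.2, .8]]))
m1 = (np.array([.5, .5]), np.array([[.5, .5], [.5, .5]]), np.array([[.3, .7], [.6, .4]]))
print(classify([0, 0, 1, 0], [m0, m1], priors=[0.5, 0.5]))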

Extension of HMM Structure


Extension of state transition parameters
Duration modeling HMM
More accurate temporal behavior

Transition-output HMM
HMM output functions are attached to transitions rather than states

Extension of observation parameter


Segmental HMM
More accurate modeling of the trajectory at each state, but higher computational cost

Continuous density HMM (CHMM)


The output distribution is modeled with a mixture of Gaussians

Semi-continuous HMM (Tied mixture HMM)


Mix of continuous HMM and discrete HMM by sharing Gaussian components


Extension of HMM Training Method


Maximum Likelihood Estimation (MLE)*
- maximize the probability of the observed samples

Maximum Mutual Information (MMI) method
- an information-theoretic measure
- maximize the average mutual information:
  I* = max_λ Σ_{v=1}^{V} [ log P(X^v | λ_v) - log Σ_{w=1}^{V} P(X^v | λ_w) ]
- maximizes discrimination power by training the models together

Minimum Discrimination Information (MDI) method
- minimize the DI (cross entropy) between pd(signal) and pd(HMM)
- uses the generalized Baum algorithm

Practical Issues of HMM


Architectural and behavioral choices
- the unit of modeling -- a design choice
- type of model: ergodic, left-right, parallel path
- number of states
- observation symbols: discrete vs. continuous; number of mixtures

Initial estimates
- A, π : random or uniform initial values are adequate
- B : good initial estimates are essential for CHMMs

[Figure: example ergodic, left-right, and parallel-path topologies]

Practical Issues of HMM (Cont.)


Scaling
- each term of α_t(i) is of the form ∏_{s=1}^{t-1} a_{q_s q_{s+1}} · ∏_{s=1}^{t} b_{q_s}(x_s), which heads exponentially to zero
- → scaling (or using log likelihoods; see the sketch below)

Multiple observation sequences
- accumulate the expected frequencies with weight P(X^(k) | λ)

Insufficient training data
- deleted interpolation between the desired model and a smaller model
- output probability smoothing (by local perturbation of symbols)
- output probability tying between different states
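A Python sketch of the scaled forward pass that accumulates a log likelihood instead of the raw product; on the R/G/B example it matches the unscaled result:

import numpy as np

def forward_scaled(x, pi, A, B):
    """Forward algorithm with per-frame scaling; returns log P(X | lambda).
    Each alpha_t is normalized to sum to 1, and the log scale factors are accumulated."""
    log_p = 0.0
    alpha = pi * B[:, x[0]]
    for t in range(len(x)):
        if t > 0:
            alpha = (alpha @ A) * B[:, x[t]]
        c = alpha.sum()           # scale factor for frame t
        log_p += np.log(c)
        alpha = alpha / c         # keep alpha in a safe numeric range
    return log_p

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[.5, .4, .1], [.0, .6, .4], [.0, .0, .0]])
B = np.array([[.6, .2, .2], [.2, .5, .3], [.0, .3, .7]])
print(np.exp(forward_scaled([0, 0, 1, 2], pi, A, B)))   # ~0.0284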

Practical Issues of HMM (Cont.)


HMM topology optimization

What to optimize
- # of states
- # of Gaussian mixtures per state
- transitions

Methods
- Heuristic methods
  - # of states from the average (or mode) length of the input frames
- Split / merge
  - # of states from iterative splitting / merging
- Model selection criteria
  - # of states and mixtures at the same time
  - ML (maximum likelihood), BIC (Bayesian information criterion), HBIC (HMM-oriented BIC), DIC (discriminative information criterion), ...

HMM applications and Software


- On-line handwriting recognition
- Speech applications
- HMM toolbox for Matlab
- HTK (Hidden Markov Model Toolkit)

HMM Applications
On-line handwriting recognition
BongNet: HMM network-based handwriting recognition system

Speech applications
- CMU Sphinx : speech recognition toolkit
- Dr. Speaking : English pronunciation correction system

BongNet
- A consortium project of CAIR (Center for Artificial Intelligence Research) at KAIST
- The name BongNet comes from its major inventor, Bong-Kee Sin
- Prominent performance for unconstrained on-line Hangul recognition
- Modeling of Hangul handwriting
  - considers ligatures between letters as well as consonants and vowels
  - (initial consonant) + (ligature) + (vowel)
  - (initial consonant) + (ligature) + (vowel) + (ligature) + (final consonant)
  - connects letter models and ligature models using the Hangul composition principle
- Further improvements
  - BongNet+ : incorporates structural information explicitly
  - Circular BongNet : successive character recognition
  - Unified BongNet : Hangul and alphanumeric recognition, dictionary look-up

Network structure

[Figure: BongNet network structure]

A Modification to BongNet
16-direction chain code → structure code generation

Structure code sequence
- carries structural information not easily acquired from a chain code sequence, including length, direction, and bending

[Table: example structure-code features per stroke segment -- distance, rotation, straightness, direction]

Dr. Speaking



[System diagram: speech → feature extraction & acoustic analysis → decoder → acoustic score → score estimation → evaluation score. Resources: target speech DBs spoken by native and non-native speakers (mis-pronunciations), acoustic models (native and non-native phoneme units), a language model (phoneme unit), and target pronunciation / mis-pronunciation dictionaries.]

Acoustic modeling
- Native HMM and non-native HMM

Language modeling
- standard pronunciation network, extended with deletion, replacement, and insertion error modeling

[Figures: example acoustic-model and error-modeling language-model networks]


Software Tools for HMM


HMM toolbox for Matlab
- Developed by Kevin Murphy
- Freely downloadable SW written in Matlab (though Matlab itself is not free!)
- Easy to use: flexible data structures and fast prototyping in Matlab
- Somewhat slow performance due to Matlab
- Download: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

HTK (Hidden Markov Model Toolkit)
- Developed by the Speech Vision and Robotics Group of Cambridge University
- Freely downloadable SW written in C
- Useful for speech recognition research: a comprehensive set of programs for training, recognizing, and analyzing speech signals
- Powerful and comprehensive, but somewhat complicated
- Download: http://htk.eng.cam.ac.uk/

What is HTK ?
Hidden Markov Model Toolkit
- A set of tools for training and evaluating HMMs
- Primarily used in automatic speech recognition and economic modeling
- Modular implementation, (relatively) easy to extend

HTK Software Architecture


- HShell : user input/output & interaction with the OS
- HLabel : label files
- HLM : language models
- HNet : networks and lattices
- HDict : dictionaries
- HVQ : VQ codebooks
- HModel : HMM definitions
- HMem : memory management
- HGraf : graphics
- HAdapt : adaptation
- HRec : main recognition processing functions

Generic Properties of an HTK Tool

- Designed to run with a traditional command-line style interface
- Each tool has a number of required arguments plus optional arguments
  HFoo -T 1 -f 34.3 -a -s myfile file1 file2
  - This tool has two main arguments, file1 and file2, plus four optional arguments
    (-f : real number, -T : integer, -s : string, -a : no following value)
  HFoo -C config -f 34.3 -a -s myfile file1 file2
  - HFoo will load the parameters stored in the configuration file config during its initialization
  - Configuration parameters can sometimes be used as an alternative to command-line arguments

The Toolkit
There are 4 main phases:
- data preparation, training, testing, and analysis

The Toolkit
- Data preparation tools
- Training tools
- Recognition tools
- Analysis tools

< HTK Processing Stages >

Data Preparation Tools


- A set of speech data files and their associated transcriptions are required
- They must be converted into the appropriate parametric form
- HSLab : used both to record speech and to manually annotate it with any required transcriptions
- HCopy : encoding is performed by simply copying each file
- HList : used to check the contents of any speech file
- HLEd : outputs label files to a single Master Label File (MLF), which is usually more convenient for subsequent processing
- HLStats : gathers and displays statistics on label files where required
- HQuant : used to build a VQ codebook in preparation for building a discrete-probability HMM system

Training Tools
- If speech data is available for which the locations of the sub-word boundaries have been marked, it can be used as bootstrap data
- HInit and HRest provide isolated-word style training using the fully labeled bootstrap data
- Each of the required HMMs is generated individually

Training Tools (contd)


- HInit : iteratively computes an initial set of parameter values using a segmental k-means procedure
- HRest : processes fully labeled bootstrap data using a Baum-Welch re-estimation procedure
- HCompV : all of the phone models are initialized to be identical, with state means and variances equal to the global speech mean and variance
- HERest : performs a single Baum-Welch re-estimation of the whole set of HMM phone models simultaneously
- HHEd : applies a variety of parameter tyings and increments the number of mixture components in specified distributions
- HEAdapt : adapts HMMs to better model the characteristics of particular speakers using a small amount of training or adaptation data

Recognition Tools
- HVite : uses the token-passing algorithm to perform Viterbi-based speech recognition
- HBuild : allows sub-networks to be created and used within higher-level networks
- HParse : converts an EBNF grammar into the equivalent word network
- HSGen : computes the empirical perplexity of the task
- HDMan : dictionary management tool

Analysis Tools
HResults
- Uses dynamic programming to align the two transcriptions and count substitution, deletion, and insertion errors
- Provides speaker-by-speaker breakdowns, confusion matrices, and time-aligned transcriptions
- Computes Figure of Merit scores and Receiver Operating Curve information

HTK Example
Isolated word recognition

[Figures: HTK isolated-word recognition example]

Speech Recognition Example using HTK


Recognizer for voice dialing application
Goal of our system
Provide a voice-operated interface for phone dialing

Recognizer
- digit strings, a limited set of names
- sub-word based

1> Create the grammar file "gram".
------------------------ gram -------------------------
$digit = ... | ... ;
$name = ... | ... ;
( SENT-START ( <$digit> | $name ) SENT-END )
--------------------------------------------------------
($ defines a variable, < > denotes one or more repetitions, | means "or", and SENT-START / SENT-END mark the sentence boundaries.)

2> Compile the grammar into a word network:
HParse gram wdnet

3> Create the pronunciation dictionary "dict", one word per line with its phone sequence.
-------------------- dict --------------------------
SENT-END    []  sil
SENT-START  []  sil
...             kc oxc ngc sp
...             kc uxc sp
....
...             jeoc ngc hc euic sp
....
...             phc axc lc sp
...             hc oxc chc uxc lc sp
-----------------------------------------------------
(Each pronunciation ends with the short-pause phone sp.)

4> HSGen -l -n 200 wdnet dict
- HSGen generates 200 random sentences from the word network wdnet and the dictionary dict, to be used as recording prompts.

5> Record the generated prompt sentences.
- HSLab can be used to record and label the speech.

6> Create the word-level master label file words.mlf.
--------------------- words.mlf ---------------------
#!MLF!#
"*/s0001.lab"
...
.
"*/s0002.lab"
...
.
.....
------------------------------------------------------

7> Create the edit script mkphones0.led, used to expand words.mlf to phone level.
------------- mkphones0.led ---------------
EX
IS sil sil
DE sp
--------------------------------------------
(EX expands words to phones using the dictionary, IS inserts sil at the start and end of each utterance, and DE deletes the sp labels.)

8> HLEd -d dict -i phones0.mlf mkphones0.led words.mlf
- HLEd applies mkphones0.led to words.mlf and writes the phone-level label file phones0.mlf.
------------------- phones0.mlf --------------------------
#!MLF!#
"*/s0001.lab"
sil
nc
uxc
rc
...
oxc
ngc
kc
.
"*/s0002.lab"
.....
------------------------------------------------------------

9> Create the configuration file "config" for converting the waveforms to MFC feature files.
-------------- config --------------------
TARGETKIND = MFCC_0
TARGETRATE = 100000.0
SOURCEFORMAT = NOHEAD
SOURCERATE = 1250
WINDOWSIZE = 250000.0
......
-------------------------------------------

10> Create the script file codetr.scp listing each source waveform and its target *.mfc file.
------------- codetr.scp ----------------
DB\s0001.wav DB\s0001.mfc
DB\s0002.wav DB\s0002.mfc
...
DB\s0010.wav DB\s0010.mfc
......
------------------------------------------

11> HCopy -T 1 -C config -S codetr.scp
- HCopy encodes every file listed in codetr.scp into MFC features according to config.

12> Create the prototype HMM definition "proto" and the training script train.scp.
- proto defines the topology: a 3-state left-right model (5 states including the non-emitting entry and exit states).
------------------------------ proto --------------------------
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 ....
    <Variance> 39
      ...
  ...
  <TransP> 5
    ....
<EndHMM>
----------------------------------------------------------------
- train.scp lists the training *.mfc files.

13> Create config1, a copy of config with TARGETKIND changed from MFCC_0 to MFCC_0_D_A, so the HMMs are trained on MFCCs plus deltas and accelerations.

14> HCompV -C config1 -f 0.01 -m -S train.scp -M hmm0 proto
- HCompV writes the initialized proto and vFloors into hmm0; from these, the macros and hmmdefs files are built, with the proto definition copied once per phone in hmmdefs.
-------------------- hmmdefs ------------------
~h "axc"
<BeginHMM>
...
<EndHMM>
~h "chc"
<BeginHMM>
...
<EndHMM>
...
-------------------------------------------------

- The macros file is built from an ~o header plus the contents of hmm0/vFloors.
----------------- macros ----------------------
~o <VecSize> 39 <MFCC_0_D_A>
~v "varFloor1"
  <Variance> 39
    ...
-----------------------------------------------

15> HERest -C config1 -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp
      -H hmm0\macros -H hmm0\hmmdefs -M hmm1 monophones0
- HERest re-estimates the whole set of monophone models and writes the new macros and hmmdefs into hmm1.
- Running HERest a second time (reading hmm1, writing hmm2) gives another re-estimation; repeat for hmm3, hmm4, ...

16> HVite -H hmm7/macros -H hmm7/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict monophones
- HVite recognizes the test data using the final models, the word network wdnet, and the dictionary, writing the results to recout.mlf.


Summary
Markov model
- 1st-order Markov assumption on state transitions
- Visible: the observation sequence determines the state transition sequence

Hidden Markov model
- 1st-order Markov assumption on state transitions
- Hidden: an observation sequence may result from many possible state transition sequences
- Fits the modeling of spatio-temporally variable signals very well
- Three algorithms: model evaluation, most probable path decoding, model training

HMM applications and software
- Handwriting and speech applications
- HMM toolbox for Matlab
- HTK


References
Hidden Markov Model
- L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, pp. 257-286, 1989.
- L.R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Trans. PAMI, pp. 179-190, May 1983.
- M. Ostendorf, "From HMMs to Segment Models: a Unified View of Stochastic Modeling for Speech Recognition," IEEE Trans. Speech and Audio Processing, pp. 360-378, Sep. 1996.

HMM Tutorials
- HMM Theory and Applications, tutorial, 2003.
- ILVB Tutorial, 2005.04.16.
- Sam Roweis, Hidden Markov Models (SCIA Tutorial 2003), http://www.cs.toronto.edu/~roweis/notes/scia03h.pdf
- Andrew Moore, Hidden Markov Models, http://www-2.cs.cmu.edu/~awm/tutorials/hmm.html

References (Cont.)

HMM Applications
- B.-K. Sin, J.-Y. Ha, S.-C. Oh, Jin H. Kim, "Network-Based Approach to Online Cursive Script Recognition," IEEE Trans. Systems, Man, and Cybernetics, Part B, Vol. 29, No. 2, pp. 321-328, 1999.
- J.-Y. Ha, "Structure Code for HMM Network-Based Hangul Recognition," 18th International Conference on Computer Processing of Oriental Languages, pp. 165-170, 1999.

HMM Topology Optimization
- H. Singer and M. Ostendorf, "Maximum likelihood successive state splitting," ICASSP, 1996, pp. 601-604.
- A. Stolcke and S. Omohundro, "Hidden Markov model induction by Bayesian model merging," Advances in NIPS, 1993, pp. 11-18. San Mateo, CA: Morgan Kaufmann.
- A. Biem, J.-Y. Ha, J. Subrahmonia, "A Bayesian Model Selection Criterion for HMM Topology Optimization," ICASSP, pp. I-989 - I-992, IEEE Signal Processing Society, 2002.
- A. Biem, "A Model Selection Criterion for Classification: Application to HMM Topology Optimization," ICDAR 2003, pp. 204-210, 2003.

HMM Software
- Kevin Murphy, HMM toolbox for Matlab, freely downloadable SW written in Matlab, http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
- Speech Vision and Robotics Group, Cambridge University, HTK (Hidden Markov Model Toolkit), freely downloadable SW written in C, http://htk.eng.cam.ac.uk/
- Sphinx at CMU, http://cmusphinx.sourceforge.net/html/cmusphinx.php
