Machine Learning: Expectation Maximization
Expectation Maximization
Zhiyao Duan & Bryan Pardo, EECS 349 Fall 2012
We’ve seen the update equations of GMM, but
• How are they derived?
General Settings for EM
• Given data $X = (x_1, \ldots, x_n)$, where $x_i \sim p(x; \theta)$.
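The rest of the setup is not on this extracted slide; presumably it is the usual one, stated here as a sketch (the latent variable $z$ and the objective below are my addition):

```latex
% Sketch of the standard EM problem statement: some variables z are
% unobserved, so the marginal likelihood has a sum inside the log.
\hat{\theta} = \arg\max_{\theta} \log p(X; \theta)
             = \arg\max_{\theta} \sum_{i=1}^{n} \log \sum_{z} p(x_i, z; \theta)
```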
Basic Idea of EM
• Since maximizing the data log-likelihood $\log p(X; \theta)$ is hard, but maximizing the complete data log-likelihood $\log p(X, Z; \theta)$ would be easy (if we observed $Z$), our bet is to maximize the latter and hope that it also increases the former. (A concrete contrast for GMM is sketched below.)
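To see the contrast concretely for a GMM (a sketch using the mixture notation $w_j, \mu_j, \sigma_j$ from the later slides):

```latex
% Data log-likelihood: a log of a sum -- the components are coupled,
% and there is no closed-form maximizer.
\log p(x_i; \theta) = \log \sum_{j=1}^{K} w_j \, \mathcal{N}(x_i; \mu_j, \sigma_j^2)
% Complete data log-likelihood: the log splits into separate terms,
% each easy to maximize on its own.
\log p(x_i, z_i = j; \theta) = \log w_j + \log \mathcal{N}(x_i; \mu_j, \sigma_j^2)
```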
EM Algorithm in General
• 1. Initialize parameters $\theta^{\text{old}}$.
• 2. E step: evaluate the posterior distribution of the latent variables, $p(Z \mid X; \theta^{\text{old}})$, using the old parameters. Then the expected complete data log-likelihood under this distribution is
$$Q(\theta; \theta^{\text{old}}) = \sum_{Z} p(Z \mid X; \theta^{\text{old}}) \log p(X, Z; \theta)$$
(taking the expectation of the complete data log-likelihood yields a function of $\theta$).
• 3. M step: update the parameters to maximize the expected complete data log-likelihood:
$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta; \theta^{\text{old}})$$
• 4. Check the convergence criterion; return to step 2 if it is not satisfied. (A minimal code skeleton of this loop follows.)
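A minimal sketch of this loop in Python (the callback structure, the names `e_step`/`m_step`, and the tolerance are my assumptions, not from the slides):

```python
import numpy as np

def em(X, theta_init, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the data
    log-likelihood stops improving. e_step(X, theta) returns the
    posterior over latent variables; m_step(X, q) returns new theta."""
    theta = theta_init
    prev_ll = -np.inf
    for _ in range(max_iter):
        q = e_step(X, theta)           # step 2: posterior p(Z|X; theta_old)
        theta = m_step(X, q)           # step 3: maximize Q(theta; theta_old)
        ll = log_likelihood(X, theta)  # step 4: convergence check
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```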
GMM Revisited – E step
• Evaluate the posterior distribution of the latent variables, $p(Z \mid X; \theta^{\text{old}})$, using the old parameters.
– Since the data are i.i.d., we evaluate it for each $i$:
$$q_i^{(j)} \equiv p(z_i = j \mid x_i) = \frac{p(z_i = j, x_i)}{p(x_i)} = \frac{p(x_i \mid z_i = j)\, p(z_i = j)}{\sum_{l=1}^{K} p(x_i \mid z_i = l)\, p(z_i = l)}$$
where, for the GMM, $p(x_i \mid z_i = j)\, p(z_i = j) = w_j \cdot \frac{1}{\sqrt{2\pi}\, \sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$.
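In code, the E step for a 1-D GMM might look like this sketch (array shapes and variable names are my assumptions):

```python
import numpy as np

def e_step(x, w, mu, sigma):
    """Responsibilities q[i, j] = p(z_i = j | x_i) for a 1-D GMM.
    x: (N,) data; w, mu, sigma: (K,) mixture parameters."""
    # Joint p(x_i, z_i = j) = w_j * N(x_i; mu_j, sigma_j^2), shape (N, K)
    joint = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
              / (np.sqrt(2 * np.pi) * sigma)
    # Normalize over the components j to get the posterior
    return joint / joint.sum(axis=1, keepdims=True)
```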
GMM Revisited – E step
• Then the expected complete data log-likelihood under this distribution is
$$\begin{aligned} Q(\theta; \theta^{\text{old}}) &= \sum_{i=1}^{N} \sum_{j=1}^{K} p(z_i = j \mid x_i) \log p(x_i, z_i = j; \theta) \\ &= \sum_{i=1}^{N} \sum_{j=1}^{K} q_i^{(j)} \log p(x_i, z_i = j; \theta) \\ &= \sum_{i=1}^{N} \sum_{j=1}^{K} q_i^{(j)} \log \big( p(z_i = j)\, p(x_i \mid z_i = j) \big) \\ &= \sum_{i=1}^{N} \sum_{j=1}^{K} q_i^{(j)} \log \left( w_j \cdot \frac{1}{\sqrt{2\pi}\, \sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \right) \end{aligned}$$
(Note that $Q$ is a function of $\theta = \{w_j, \mu_j, \sigma_j\}$; the old parameters enter only through $q_i^{(j)}$.)
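Numerically, $Q$ is cheap to evaluate; a sketch, following the same conventions as the `e_step` sketch above:

```python
import numpy as np

def expected_complete_ll(x, q, w, mu, sigma):
    """Q(theta; theta_old): expectation of the complete data
    log-likelihood under the responsibilities q from the E step."""
    # log( w_j * N(x_i; mu_j, sigma_j^2) ), shape (N, K)
    log_joint = np.log(w) - 0.5 * np.log(2 * np.pi) - np.log(sigma) \
                - (x[:, None] - mu) ** 2 / (2 * sigma ** 2)
    return np.sum(q * log_joint)
```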
GMM Revisited – M step (for $\mu_j$)
• Update the parameters to maximize the expected complete data log-likelihood:
$$Q(\theta; \theta^{\text{old}}) = \sum_{i=1}^{N} \sum_{j=1}^{K} q_i^{(j)} \log \left( w_j \cdot \frac{1}{\sqrt{2\pi}\, \sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \right)$$
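The derivative calculation itself falls on slides not included in this excerpt; as a sketch, setting $\partial Q / \partial \mu_j = 0$ recovers the familiar weighted-mean update:

```latex
% Only the squared-error term in Q depends on mu_j:
\frac{\partial Q}{\partial \mu_j}
  = \sum_{i=1}^{N} q_i^{(j)} \, \frac{x_i - \mu_j}{\sigma_j^2} = 0
\quad \Longrightarrow \quad
\mu_j = \frac{\sum_{i=1}^{N} q_i^{(j)} x_i}{\sum_{i=1}^{N} q_i^{(j)}}
```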
GMM Revisited – M step (for $w_j$)
• Maximizing $Q$ subject to the constraint $\sum_{j=1}^{K} w_j = 1$ (with Lagrange multiplier $\beta$), we get
$$w_j = \frac{\sum_{i=1}^{N} q_i^{(j)}}{-\beta}$$
• Summing over $j$ and using $\sum_{j=1}^{K} w_j = 1$, we get
$$-\beta = \sum_{j=1}^{K} \sum_{i=1}^{N} q_i^{(j)} = \sum_{i=1}^{N} \sum_{j=1}^{K} q_i^{(j)} = N$$
since the posteriors $q_i^{(j)}$ sum to 1 over $j$ for each $i$.
• So the update equation for the weights is
$$w_j = \frac{1}{N} \sum_{i=1}^{N} q_i^{(j)}$$
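Collecting the M-step updates in code (a sketch; the $\sigma_j$ update is derived the same way on slides not included in this excerpt):

```python
import numpy as np

def m_step(x, q):
    """M step for a 1-D GMM given responsibilities q of shape (N, K)."""
    n_k = q.sum(axis=0)                      # effective count per component
    w = n_k / len(x)                         # w_j = (1/N) sum_i q_i^(j)
    mu = (q * x[:, None]).sum(axis=0) / n_k  # weighted means
    # sigma_j^2 = sum_i q_i^(j) (x_i - mu_j)^2 / sum_i q_i^(j)
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    return w, mu, sigma
```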
More Theoretical Questions
• We’ve seen how the update equations of GMM are derived via the EM algorithm. But why does the EM procedure itself work, i.e., why does it converge?
Answers
• We will show that the data log-likelihood never decreases from one iteration to the next.
Expected data log-likelihood increases
• Recall that in the M step, we maximize the expected complete data log-likelihood:
$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta; \theta^{\text{old}})$$
• So, subtracting the same constant $\sum_{Z} p(Z \mid X; \theta^{\text{old}}) \log p(Z \mid X; \theta^{\text{old}})$ from both sides of $Q(\theta^{\text{new}}; \theta^{\text{old}}) \geq Q(\theta^{\text{old}}; \theta^{\text{old}})$, we also have
$$\sum_{Z} p(Z \mid X; \theta^{\text{old}}) \log \frac{p(X, Z; \theta^{\text{new}})}{p(Z \mid X; \theta^{\text{old}})} \geq \sum_{Z} p(Z \mid X; \theta^{\text{old}}) \log \frac{p(X, Z; \theta^{\text{old}})}{p(Z \mid X; \theta^{\text{old}})} = \log p(X; \theta^{\text{old}}) \qquad (1)$$
where the last equality holds because $p(X, Z; \theta^{\text{old}}) / p(Z \mid X; \theta^{\text{old}}) = p(X; \theta^{\text{old}})$ for every $Z$.
Continue…
• By Jensen’s inequality (the logarithm is concave),
$$\sum_{Z} p(Z \mid X; \theta^{\text{old}}) \log \frac{p(X, Z; \theta^{\text{new}})}{p(Z \mid X; \theta^{\text{old}})} \leq \log \sum_{Z} p(Z \mid X; \theta^{\text{old}}) \, \frac{p(X, Z; \theta^{\text{new}})}{p(Z \mid X; \theta^{\text{old}})} = \log \sum_{Z} p(X, Z; \theta^{\text{new}}) = \log p(X; \theta^{\text{new}}) \qquad (2)$$
Finally…
• Putting (1) and (2) together, we get
$$\log p(X; \theta^{\text{old}}) \leq \sum_{Z} p(Z \mid X; \theta^{\text{old}}) \log \frac{p(X, Z; \theta^{\text{new}})}{p(Z \mid X; \theta^{\text{old}})} \leq \log p(X; \theta^{\text{new}})$$
• So the data log-likelihood never decreases from $\theta^{\text{old}}$ to $\theta^{\text{new}}$.
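A quick numerical sanity check of this guarantee on a 1-D GMM (a sketch; the synthetic data, initialization, and iteration count are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Rough initialization (assumed, not from the slides)
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(50):
    # E step: responsibilities q[i, j]
    joint = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
              / (np.sqrt(2 * np.pi) * sigma)
    q = joint / joint.sum(axis=1, keepdims=True)
    # Data log-likelihood at the current parameters
    ll = np.log(joint.sum(axis=1)).sum()
    assert ll >= prev_ll - 1e-9  # never decreases, up to rounding
    prev_ll = ll
    # M step: the closed-form updates derived above
    n_k = q.sum(axis=0)
    w, mu = n_k / len(x), (q * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
```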
You Should Know
• The EM algorithm is an efficient way to do maximum likelihood estimation when there are latent variables or missing data.
• The general algorithm of EM:
– E step: calculate the posterior distribution of the latent variables.
– M step: update the parameters by maximizing the expected complete data log-likelihood.
• How to derive the EM update equations of GMM?
– Can you derive the EM update equations for parameter estimation of a mixture of categorical distributions? (A hint is sketched below.)
• Why does EM always converge (to some local optimum)?
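As a hint for the categorical-mixture exercise (my sketch, not the course’s solution; $\theta_{j,v}$ denotes component $j$’s probability of category $v$): the E step keeps the same form with the categorical pmf in place of the Gaussian, and the M step gives weighted category frequencies:

```latex
% E step: same form as the GMM responsibilities, with a categorical pmf
q_i^{(j)} = \frac{w_j \, \theta_{j, x_i}}{\sum_{l=1}^{K} w_l \, \theta_{l, x_i}}
% M step: weighted counts, normalized over categories v for each component
\theta_{j,v} = \frac{\sum_{i=1}^{N} q_i^{(j)} \, \mathbb{1}[x_i = v]}
                    {\sum_{i=1}^{N} q_i^{(j)}},
\qquad
w_j = \frac{1}{N} \sum_{i=1}^{N} q_i^{(j)}
```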