
Parameter Estimation in Latent Variable Models

Piyush Rai

Introduction to Machine Learning (CS771A)

September 25, 2018



Some Mid-Sem Statistics

[Figure: distribution of mid-sem scores, with score ranges annotated "Normal", "Also normal", "Need not worry", "Need not worry too much", and "Can recover".]


Latent Variable Models



A Simple Generative Model

All observations {x_1, ..., x_N} generated from a distribution p(x|θ)

Unknowns: Parameters θ of the assumed data distribution p(x|θ)

Many ways to estimate the parameters (MLE, MAP, or Bayesian inference)


Generative Model with Latent Variables

Assume each observation x_n to be associated with a latent variable z_n

In this "latent variable model" of data, the data x also depends on some latent variable(s) z

z_n is akin to a latent representation or "encoding" of x_n; it controls what the data "looks like". E.g.,
z_n ∈ {1, ..., K} denotes the cluster x_n belongs to
z_n ∈ R^K denotes a low-dimensional latent representation or latent "code" for x_n

Unknowns: {z_1, ..., z_N} and (θ, φ). The z_n's are called "local" variables; (θ, φ) are called "global" variables
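
As a concrete illustration of the clustering case, here is a minimal simulation sketch (the mixing weights and component parameters below are made-up values, not from the slides): each observation x_n is generated by first drawing its "local" variable z_n and then drawing x_n from the component that z_n selects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "global" parameters: mixing weights (phi) and per-component Gaussians (theta)
pi = np.array([0.5, 0.3, 0.2])                            # p(z_n = k)
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])     # component means
Sigmas = np.stack([np.eye(2) for _ in range(3)])          # component covariances

# Generate N observations: draw the local latent z_n, then x_n given z_n
N = 500
z = rng.choice(len(pi), size=N, p=pi)
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
```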


Brief Detour/Recap: Gaussian Parameter Estimation


MLE for Multivariate Gaussian

Multivariate Gaussian in D dimensions:

    p(x|µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

Goal: Given N i.i.d. observations {x_n}_{n=1}^N from this Gaussian, estimate the parameters µ and Σ

MLE for the D × 1 mean µ ∈ R^D and the D × D p.s.d. covariance matrix Σ:

    µ̂ = (1/N) ∑_{n=1}^N x_n    and    Σ̂ = (1/N) ∑_{n=1}^N (x_n − µ̂)(x_n − µ̂)^T

Note: Σ̂ depends on µ̂, but µ̂ doesn't depend on Σ̂ ⇒ no need for alternating optimization.

Note: log works nicely with the exp of the Gaussian, which simplifies the MLE expressions in this case.

In general, when the distribution is an exponential family distribution, MLE is usually very easy.
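
A minimal NumPy sketch of these closed-form MLE expressions (assuming the data is stored as an N × D array X):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian: sample mean and the (1/N)-normalized sample covariance."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)                    # (1/N) * sum_n x_n
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / N      # (1/N) * sum_n (x_n - mu_hat)(x_n - mu_hat)^T
    return mu_hat, Sigma_hat
```

(Note that np.cov uses the unbiased 1/(N − 1) normalization by default, whereas the MLE uses 1/N.)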


Brief Detour: Exponential Family Distributions

An exponential family distribution is of the form

    p(x|θ) = h(x) exp[θ^T φ(x) − A(θ)]

θ is called the natural parameter of the family

h(x), φ(x), and A(θ) are known functions. Note: Don't confuse φ with kernel mappings!

φ(x) is called the sufficient statistics: knowing this is sufficient to estimate θ

Every exp. family distribution also has a conjugate distribution (often also in the exp. family)

Many other nice properties (especially useful in Bayesian inference)

Also, MLE/MAP is usually quite simple (note that log p(x|θ) will typically have a simple form)

Many well-known distributions (Bernoulli, Binomial, multinoulli, beta, gamma, Gaussian, etc.) are exponential family distributions: https://en.wikipedia.org/wiki/Exponential_family
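
For example, the Bernoulli distribution p(x|µ) = µ^x (1 − µ)^{1−x} with x ∈ {0, 1} can be written in this form as

    p(x|θ) = exp[θ x − log(1 + e^θ)]

with natural parameter θ = log(µ/(1 − µ)), sufficient statistic φ(x) = x, h(x) = 1, and log-partition function A(θ) = log(1 + e^θ).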


MLE for Generative Classification with Gaussian Class-conditionals

Each class k modeled using a Gaussian with mean µ_k and covariance matrix Σ_k

Note: Can assume label y_n to be one-hot, so that y_nk = 1 if y_n = k, and y_nk = 0 otherwise

Assuming p(y_n = k) = π_k, this model has parameters Θ = {π_k, µ_k, Σ_k}_{k=1}^K

(We have done this before) Given {x_n, y_n}_{n=1}^N, the MLE for Θ will be

    π̂_k = (1/N) ∑_{n=1}^N y_nk = N_k / N

    µ̂_k = (1/N_k) ∑_{n=1}^N y_nk x_n

    Σ̂_k = (1/N_k) ∑_{n=1}^N y_nk (x_n − µ̂_k)(x_n − µ̂_k)^T

Basically, we are estimating K Gaussians instead of just 1 (each using data only from that class)
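
A minimal NumPy sketch of these labeled-data MLE updates (assuming X is an N × D array, y holds integer class labels in {0, ..., K − 1}, and every class appears at least once):

```python
import numpy as np

def gaussian_class_conditional_mle(X, y, K):
    """MLE for generative classification with Gaussian class-conditionals."""
    N, D = X.shape
    Y = np.eye(K)[y]                            # one-hot labels y_nk
    Nk = Y.sum(axis=0)                          # class counts N_k (assumed > 0 for every k)
    pi_hat = Nk / N                             # pi_k = N_k / N
    mu_hat = (Y.T @ X) / Nk[:, None]            # mu_k = (1/N_k) sum_n y_nk x_n
    Sigma_hat = np.empty((K, D, D))
    for k in range(K):
        centered = X - mu_hat[k]
        # Sigma_k = (1/N_k) sum_n y_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigma_hat[k] = (Y[:, k, None] * centered).T @ centered / Nk[k]
    return pi_hat, mu_hat, Sigma_hat
```
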
MLE for Generative Classification with Gaussian Class-conditionals

Let's look at the "formal" procedure of deriving MLE in this case

MLE for Θ = {π_k, µ_k, Σ_k}_{k=1}^K in this case can be written as (assuming i.i.d. data)

    Θ̂ = arg max_Θ p(X, y|Θ) = arg max_Θ ∏_{n=1}^N p(x_n, y_n|Θ) = arg max_Θ ∏_{n=1}^N p(y_n|Θ) p(x_n|y_n, Θ)

      = arg max_Θ ∏_{n=1}^N ∏_{k=1}^K [p(y_n = k|Θ) p(x_n|y_n = k, Θ)]^{y_nk}

      = arg max_Θ log ∏_{n=1}^N ∏_{k=1}^K [p(y_n = k|Θ) p(x_n|y_n = k, Θ)]^{y_nk}

      = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log p(y_n = k|Θ) + log p(x_n|y_n = k, Θ)]

      = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log π_k + log N(x_n|µ_k, Σ_k)]

Given (X, y), optimizing it w.r.t. π_k, µ_k, Σ_k will give us the solution we saw on the previous slide

MLE When Labels Go Missing..

So the MLE problem for generative classification with Gaussian class-conditionals was

    Θ̂ = arg max_Θ log p(X, y|Θ) = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log π_k + log N(x_n|µ_k, Σ_k)]

This problem has a nice separable structure, and a straightforward solution as we saw

What if we don't know the label y_n (i.e., don't know if y_nk is 0 or 1)? How to estimate Θ now?

When might we need to solve such a problem?

Mixture density estimation: Given N inputs x_1, ..., x_N, model p(x) as a mixture of distributions

Probabilistic clustering: Same as density estimation; can get cluster ids once Θ is estimated

Semi-supervised generative classification: In training data, some y_n's are known, some not known


MLE When Labels Go Missing..

Recall the MLE problem for Θ when the labels are known

    Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log π_k + log N(x_n|µ_k, Σ_k)]

Will estimating Θ via MLE be as easy if y_n's are unknown? We only have X = {x_1, ..., x_N}

The MLE problem for Θ = {π_k, µ_k, Σ_k}_{k=1}^K in this case would be (assuming i.i.d. data)

    Θ̂ = arg max_Θ log p(X|Θ) = arg max_Θ log ∏_{n=1}^N p(x_n|Θ) = arg max_Θ ∑_{n=1}^N log p(x_n|Θ)

Computing each likelihood p(x_n|Θ) in this case requires summing over all possible values of y_n

    p(x_n|Θ) = ∑_{k=1}^K p(x_n, y_n = k|Θ) = ∑_{k=1}^K p(y_n = k|Θ) p(x_n|y_n = k, Θ) = ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)

The MLE problem for Θ when the labels are unknown

    Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)
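
A sketch of how this incomplete-data log-likelihood (the objective above) can be evaluated, using SciPy's Gaussian log-pdf and the log-sum-exp trick for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k), computed in log space."""
    K = len(pi)
    # log_joint[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
        axis=1,
    )
    return logsumexp(log_joint, axis=1).sum()
```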


MLE When Labels Go Missing..

So we saw that the MLE problem for Θ when the labels are unknown is

    Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)

Solving this would enable us to learn a Gaussian Mixture Model (GMM)

Note: The Gaussian can be replaced by other distributions too (e.g., Poisson mixture model)

A small issue now: the log can't go inside the summation, so the expressions won't be simple anymore

Note: Can still take (partial) derivatives and do GD/SGD etc., but these are iterative methods

Recall that we didn't need GD/SGD etc. when doing MLE with fully observed y_n's

One workaround: Can try doing alternating optimization

MLE for Gaussian Mixture Model using ALT-OPT

Based on the fact that MLE is simple when the labels are known

Notation change: We will now use z_n instead of y_n and z_nk instead of y_nk

MLE for Gaussian Mixture Model using ALT-OPT:

1. Initialize Θ as Θ̂

2. For n = 1, ..., N, find the best z_n

       ẑ_n = arg max_{k ∈ {1,...,K}} p(x_n, z_n = k|Θ̂) = arg max_{k ∈ {1,...,K}} p(z_n = k|x_n, Θ̂)

3. Given Ẑ = {ẑ_1, ..., ẑ_N}, re-estimate Θ using MLE

       Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ẑ_nk [log π_k + log N(x_n|µ_k, Σ_k)]

4. Go to step 2 if not yet converged
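
A minimal sketch of this ALT-OPT procedure (a hard assignment of each z_n followed by per-cluster MLE). Initialization and convergence checking are kept deliberately simple, covariances are lightly regularized, and it is assumed that no cluster ever becomes empty:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_alt_opt(X, K, n_iters=50, seed=0):
    """ALT-OPT for a GMM: hard-assign each z_n, then re-estimate Theta by MLE."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, size=K, replace=False)].copy()     # initialize means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)  # initialize all covariances alike
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # Step 2: z_n = argmax_k p(z_n = k | x_n, Theta)  (same argmax as the joint p(x_n, z_n = k))
        log_joint = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
            axis=1,
        )
        z = log_joint.argmax(axis=1)
        # Step 3: re-estimate Theta by MLE, treating the hard assignments as observed labels
        for k in range(K):
            Xk = X[z == k]                                   # assumed non-empty
            pi[k] = len(Xk) / N
            mus[k] = Xk.mean(axis=0)
            centered = Xk - mus[k]
            Sigmas[k] = centered.T @ centered / len(Xk) + 1e-6 * np.eye(D)
    return pi, mus, Sigmas, z
```

This is essentially the hard-assignment counterpart of EM (and closely related to K-means when the covariances are fixed and shared).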


Is ALT-OPT Doing The Correct Thing?

Our original problem was

    Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)

What ALT-OPT did was the following

    Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ẑ_nk [log π_k + log N(x_n|µ_k, Σ_k)]

We clearly aren't solving the original problem!

    arg max_Θ log p(X|Θ)    vs    arg max_Θ log p(X, Ẑ|Θ)

Also, we updated ẑ_n as follows

    ẑ_n = arg max_{k ∈ {1,...,K}} p(z_n = k|x_n, Θ̂)

Why choose ẑ_n to be this (makes intuitive sense, but is there a formal justification)?

It turns out (as we will see) that this ALT-OPT is an approximation of the Expectation Maximization (EM) algorithm for GMM

Expectation Maximization (EM)

A very popular algorithm for parameter estimation in latent variable models

The EM algorithm is based on the following identity (exercise: verify)

    log p(X|Θ) = E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ] + KL[ q(Z) || p(Z|X, Θ) ]

The above is true for any choice of the distribution q(Z)

Since the KL divergence is non-negative, we must have

    log p(X|Θ) ≥ E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ]

So L(Θ) = E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ] is a lower bound on what we want to maximize, i.e., log p(X|Θ)

Also, if we choose q(Z) = p(Z|X, Θ), then log p(X|Θ) = E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ] (the KL term vanishes)
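
As a quick numerical sanity check of this identity, here is a toy sketch with a single observation and a discrete latent variable z ∈ {1, ..., K} (the probability tables are made-up random numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

joint = rng.random(K)
joint /= 2 * joint.sum()                 # p(x, z|Theta) for one fixed x (need not sum to 1 over z)
q = rng.random(K)
q /= q.sum()                             # an arbitrary distribution q(z)

log_px = np.log(joint.sum())             # log p(x|Theta)
elbo = np.sum(q * np.log(joint / q))     # E_q[ log p(x, z|Theta) / q(z) ], the lower bound
posterior = joint / joint.sum()          # p(z|x, Theta)
kl = np.sum(q * np.log(q / posterior))   # KL[ q(z) || p(z|x, Theta) ]

assert np.isclose(log_px, elbo + kl)     # the identity holds; elbo <= log_px since kl >= 0
```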


EM for GMM

The EM algorithm for GMM does the following

    Θ̂_new = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K E[z_nk] [log π_k + log N(x_n|µ_k, Σ_k)]

.. which is nothing but maximizing E_{q(Z)}[log p(X, Z|Θ)] with q(Z) = p(Z|X, Θ̂_old)

Here E[z_nk] is the expectation of z_nk w.r.t. the posterior p(z_n|x_n) and is given by

    E[z_nk] = 0 × p(z_nk = 0|x_n) + 1 × p(z_nk = 1|x_n)
            = p(z_nk = 1|x_n)
            ∝ p(z_nk = 1) p(x_n|z_nk = 1)    (from Bayes Rule)

Thus E[z_nk] ∝ π_k N(x_n|µ_k, Σ_k) (the posterior probability that x_n was generated by the k-th Gaussian)

Next class: Details of EM for GMM, special cases, and the general EM algorithm and its properties
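
A minimal sketch of this E-step computation, i.e., the responsibilities E[z_nk] normalized over k so that each row sums to 1:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Responsibilities r[n, k] = E[z_nk], proportional to pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    log_r = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
        axis=1,
    )
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # normalize so that sum_k E[z_nk] = 1 for each n
```

Plugging these responsibilities in place of the hard assignments ẑ_nk in the weighted objective above gives the M-step; its closed-form updates are derived next class.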
