
Parameter Estimation in Latent Variable Models

Piyush Rai

Introduction to Machine Learning (CS771A)

September 25, 2018



Some Mid-Sem Statistics

[Figure: distribution of mid-sem scores, with score ranges annotated "Normal", "Also normal", "Need not worry", "Need not worry too much", and "Can recover".]


Latent Variable Models



A Simple Generative Model

All observations {x_1, ..., x_N} generated from a distribution p(x|θ)

Unknowns: Parameters θ of the assumed data distribution p(x|θ)

Many ways to estimate the parameters (MLE, MAP, or Bayesian inference)


Generative Model with Latent Variables

Assume each observation x_n to be associated with a latent variable z_n

In this "latent variable model" of data, the data x also depends on some latent variable(s) z

z_n is akin to a latent representation or "encoding" of x_n; it controls what the data "looks like". E.g.,
z_n ∈ {1, ..., K} denotes the cluster x_n belongs to
z_n ∈ R^K denotes a low-dimensional latent representation or latent "code" for x_n

Unknowns: {z_1, ..., z_N} and (θ, φ). The z_n's are called "local" variables; (θ, φ) are called "global" variables
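
As a concrete illustration of the clustering case, here is a minimal simulation sketch (the mixing weights and component parameters below are made-up values, not from the slides): each observation x_n is generated by first drawing its "local" variable z_n and then drawing x_n from the component that z_n selects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "global" parameters: mixing weights (phi) and per-component Gaussians (theta)
pi = np.array([0.5, 0.3, 0.2])                            # p(z_n = k)
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])     # component means
Sigmas = np.stack([np.eye(2) for _ in range(3)])          # component covariances

# Generate N observations: draw the local latent z_n, then x_n given z_n
N = 500
z = rng.choice(len(pi), size=N, p=pi)
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
```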


Brief Detour/Recap: Gaussian Parameter Estimation


MLE for Multivariate Gaussian

Multivariate Gaussian in D dimensions:

    p(x|µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

Goal: Given N i.i.d. observations {x_n}_{n=1}^N from this Gaussian, estimate the parameters µ and Σ

MLE for the D × 1 mean µ ∈ R^D and the D × D p.s.d. covariance matrix Σ:

    µ̂ = (1/N) ∑_{n=1}^N x_n    and    Σ̂ = (1/N) ∑_{n=1}^N (x_n − µ̂)(x_n − µ̂)^T

Note: Σ̂ depends on µ̂, but µ̂ doesn't depend on Σ̂ ⇒ no need for alternating optimization.

Note: log works nicely with the exp of the Gaussian, which simplifies the MLE expressions in this case.

In general, when the distribution is an exponential family distribution, MLE is usually very easy.
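
A minimal NumPy sketch of these closed-form MLE expressions (assuming the data is stored as an N × D array X):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian: sample mean and the (1/N)-normalized sample covariance."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)                    # (1/N) * sum_n x_n
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / N      # (1/N) * sum_n (x_n - mu_hat)(x_n - mu_hat)^T
    return mu_hat, Sigma_hat
```

(Note that np.cov uses the unbiased 1/(N − 1) normalization by default, whereas the MLE uses 1/N.)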


Brief Detour: Exponential Family Distributions

An exponential family distribution is of the form

    p(x|θ) = h(x) exp[θ^T φ(x) − A(θ)]

θ is called the natural parameter of the family

h(x), φ(x), and A(θ) are known functions. Note: Don't confuse φ with kernel mappings!

φ(x) is called the sufficient statistics: knowing this is sufficient to estimate θ

Every exp. family distribution also has a conjugate distribution (often also in the exp. family)

Many other nice properties (especially useful in Bayesian inference)

Also, MLE/MAP is usually quite simple (note that log p(x|θ) will typically have a simple form)

Many well-known distributions (Bernoulli, Binomial, multinoulli, beta, gamma, Gaussian, etc.) are exponential family distributions: https://en.wikipedia.org/wiki/Exponential_family
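
For example, the Bernoulli distribution p(x|µ) = µ^x (1 − µ)^{1−x} with x ∈ {0, 1} can be written in this form as

    p(x|θ) = exp[θ x − log(1 + e^θ)]

with natural parameter θ = log(µ/(1 − µ)), sufficient statistic φ(x) = x, h(x) = 1, and log-partition function A(θ) = log(1 + e^θ).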


MLE for Generative Classification with Gaussian Class-conditionals

Each class k modeled using a Gaussian with mean µ_k and covariance matrix Σ_k

Note: Can assume label y_n to be one-hot, so that y_nk = 1 if y_n = k, and y_nk = 0 otherwise

Assuming p(y_n = k) = π_k, this model has parameters Θ = {π_k, µ_k, Σ_k}_{k=1}^K

(We have done this before) Given {x_n, y_n}_{n=1}^N, the MLE for Θ will be

    π̂_k = (1/N) ∑_{n=1}^N y_nk = N_k / N

    µ̂_k = (1/N_k) ∑_{n=1}^N y_nk x_n

    Σ̂_k = (1/N_k) ∑_{n=1}^N y_nk (x_n − µ̂_k)(x_n − µ̂_k)^T

Basically, we are estimating K Gaussians instead of just 1 (each using data only from that class)
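
A minimal NumPy sketch of these labeled-data MLE updates (assuming X is an N × D array, y holds integer class labels in {0, ..., K − 1}, and every class appears at least once):

```python
import numpy as np

def gaussian_class_conditional_mle(X, y, K):
    """MLE for generative classification with Gaussian class-conditionals."""
    N, D = X.shape
    Y = np.eye(K)[y]                            # one-hot labels y_nk
    Nk = Y.sum(axis=0)                          # class counts N_k (assumed > 0 for every k)
    pi_hat = Nk / N                             # pi_k = N_k / N
    mu_hat = (Y.T @ X) / Nk[:, None]            # mu_k = (1/N_k) sum_n y_nk x_n
    Sigma_hat = np.empty((K, D, D))
    for k in range(K):
        centered = X - mu_hat[k]
        # Sigma_k = (1/N_k) sum_n y_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigma_hat[k] = (Y[:, k, None] * centered).T @ centered / Nk[k]
    return pi_hat, mu_hat, Sigma_hat
```
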
MLE for Generative Classification with Gaussian Class-conditionals

Let's look at the "formal" procedure of deriving MLE in this case

MLE for Θ = {π_k, µ_k, Σ_k}_{k=1}^K in this case can be written as (assuming i.i.d. data)

    Θ̂ = arg max_Θ p(X, y|Θ) = arg max_Θ ∏_{n=1}^N p(x_n, y_n|Θ) = arg max_Θ ∏_{n=1}^N p(y_n|Θ) p(x_n|y_n, Θ)

      = arg max_Θ ∏_{n=1}^N ∏_{k=1}^K [p(y_n = k|Θ) p(x_n|y_n = k, Θ)]^{y_nk}

      = arg max_Θ log ∏_{n=1}^N ∏_{k=1}^K [p(y_n = k|Θ) p(x_n|y_n = k, Θ)]^{y_nk}

      = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log p(y_n = k|Θ) + log p(x_n|y_n = k, Θ)]

      = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log π_k + log N(x_n|µ_k, Σ_k)]

Given (X, y), optimizing it w.r.t. π_k, µ_k, Σ_k will give us the solution we saw on the previous slide

MLE When Labels Go Missing..

So the MLE problem for generative classification with Gaussian class-conditionals was

    Θ̂ = arg max_Θ log p(X, y|Θ) = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log π_k + log N(x_n|µ_k, Σ_k)]

This problem has a nice separable structure, and a straightforward solution as we saw

What if we don't know the label y_n (i.e., don't know if y_nk is 0 or 1)? How to estimate Θ now?

When might we need to solve such a problem?

Mixture density estimation: Given N inputs x_1, ..., x_N, model p(x) as a mixture of distributions

Probabilistic clustering: Same as density estimation; can get cluster ids once Θ is estimated

Semi-supervised generative classification: In training data, some y_n's are known, some not known


MLE When Labels Go Missing..

Recall the MLE problem for Θ when the labels are known

    Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K y_nk [log π_k + log N(x_n|µ_k, Σ_k)]

Will estimating Θ via MLE be as easy if y_n's are unknown? We only have X = {x_1, ..., x_N}

The MLE problem for Θ = {π_k, µ_k, Σ_k}_{k=1}^K in this case would be (assuming i.i.d. data)

    Θ̂ = arg max_Θ log p(X|Θ) = arg max_Θ log ∏_{n=1}^N p(x_n|Θ) = arg max_Θ ∑_{n=1}^N log p(x_n|Θ)

Computing each likelihood p(x_n|Θ) in this case requires summing over all possible values of y_n

    p(x_n|Θ) = ∑_{k=1}^K p(x_n, y_n = k|Θ) = ∑_{k=1}^K p(y_n = k|Θ) p(x_n|y_n = k, Θ) = ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)

The MLE problem for Θ when the labels are unknown

    Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)
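
A sketch of how this incomplete-data log-likelihood (the objective above) can be evaluated, using SciPy's Gaussian log-pdf and the log-sum-exp trick for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k), computed in log space."""
    K = len(pi)
    # log_joint[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
        axis=1,
    )
    return logsumexp(log_joint, axis=1).sum()
```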


MLE When Labels Go Missing..

So we saw that the MLE problem for Θ when the labels are unknown is

    Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)

Solving this would enable us to learn a Gaussian Mixture Model (GMM)

Note: The Gaussian can be replaced by other distributions too (e.g., Poisson mixture model)

A small issue now: the log can't go inside the summation, so the expressions won't be simple anymore

Note: Can still take (partial) derivatives and do GD/SGD etc., but these are iterative methods

Recall that we didn't need GD/SGD etc. when doing MLE with fully observed y_n's

One workaround: Can try doing alternating optimization

MLE for Gaussian Mixture Model using ALT-OPT

Based on the fact that MLE is simple when the labels are known

Notation change: We will now use z_n instead of y_n and z_nk instead of y_nk

MLE for Gaussian Mixture Model using ALT-OPT:

1. Initialize Θ as Θ̂

2. For n = 1, ..., N, find the best z_n

       ẑ_n = arg max_{k ∈ {1,...,K}} p(x_n, z_n = k|Θ̂) = arg max_{k ∈ {1,...,K}} p(z_n = k|x_n, Θ̂)

3. Given Ẑ = {ẑ_1, ..., ẑ_N}, re-estimate Θ using MLE

       Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ẑ_nk [log π_k + log N(x_n|µ_k, Σ_k)]

4. Go to step 2 if not yet converged
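
A minimal sketch of this ALT-OPT procedure (a hard assignment of each z_n followed by per-cluster MLE). Initialization and convergence checking are kept deliberately simple, covariances are lightly regularized, and it is assumed that no cluster ever becomes empty:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_alt_opt(X, K, n_iters=50, seed=0):
    """ALT-OPT for a GMM: hard-assign each z_n, then re-estimate Theta by MLE."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, size=K, replace=False)].copy()     # initialize means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)  # initialize all covariances alike
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # Step 2: z_n = argmax_k p(z_n = k | x_n, Theta)  (same argmax as the joint p(x_n, z_n = k))
        log_joint = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
            axis=1,
        )
        z = log_joint.argmax(axis=1)
        # Step 3: re-estimate Theta by MLE, treating the hard assignments as observed labels
        for k in range(K):
            Xk = X[z == k]                                   # assumed non-empty
            pi[k] = len(Xk) / N
            mus[k] = Xk.mean(axis=0)
            centered = Xk - mus[k]
            Sigmas[k] = centered.T @ centered / len(Xk) + 1e-6 * np.eye(D)
    return pi, mus, Sigmas, z
```

This is essentially the hard-assignment counterpart of EM (and closely related to K-means when the covariances are fixed and shared).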


Is ALT-OPT Doing The Correct Thing?

Our original problem was

    Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n|µ_k, Σ_k)

What ALT-OPT did was the following

    Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ẑ_nk [log π_k + log N(x_n|µ_k, Σ_k)]

We clearly aren't solving the original problem!

    arg max_Θ log p(X|Θ)    vs    arg max_Θ log p(X, Ẑ|Θ)

Also, we updated ẑ_n as follows

    ẑ_n = arg max_{k ∈ {1,...,K}} p(z_n = k|x_n, Θ̂)

Why choose ẑ_n to be this (makes intuitive sense, but is there a formal justification)?

It turns out (as we will see) that this ALT-OPT is an approximation of the Expectation Maximization (EM) algorithm for GMM

Expectation Maximization (EM)

A very popular algorithm for parameter estimation in latent variable models

The EM algorithm is based on the following identity (exercise: verify)

    log p(X|Θ) = E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ] + KL[ q(Z) || p(Z|X, Θ) ]

The above is true for any choice of the distribution q(Z)

Since the KL divergence is non-negative, we must have

    log p(X|Θ) ≥ E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ]

So L(Θ) = E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ] is a lower bound on what we want to maximize, i.e., log p(X|Θ)

Also, if we choose q(Z) = p(Z|X, Θ), then log p(X|Θ) = E_{q(Z)}[ log( p(X, Z|Θ) / q(Z) ) ] (the KL term vanishes)
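
As a quick numerical sanity check of this identity, here is a toy sketch with a single observation and a discrete latent variable z ∈ {1, ..., K} (the probability tables are made-up random numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

joint = rng.random(K)
joint /= 2 * joint.sum()                 # p(x, z|Theta) for one fixed x (need not sum to 1 over z)
q = rng.random(K)
q /= q.sum()                             # an arbitrary distribution q(z)

log_px = np.log(joint.sum())             # log p(x|Theta)
elbo = np.sum(q * np.log(joint / q))     # E_q[ log p(x, z|Theta) / q(z) ], the lower bound
posterior = joint / joint.sum()          # p(z|x, Theta)
kl = np.sum(q * np.log(q / posterior))   # KL[ q(z) || p(z|x, Theta) ]

assert np.isclose(log_px, elbo + kl)     # the identity holds; elbo <= log_px since kl >= 0
```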


EM for GMM

The EM algorithm for GMM does the following

    Θ̂_new = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K E[z_nk] [log π_k + log N(x_n|µ_k, Σ_k)]

.. which is nothing but maximizing E_{q(Z)}[log p(X, Z|Θ)] with q(Z) = p(Z|X, Θ̂_old)

Here E[z_nk] is the expectation of z_nk w.r.t. the posterior p(z_n|x_n) and is given by

    E[z_nk] = 0 × p(z_nk = 0|x_n) + 1 × p(z_nk = 1|x_n)
            = p(z_nk = 1|x_n)
            ∝ p(z_nk = 1) p(x_n|z_nk = 1)    (from Bayes Rule)

Thus E[z_nk] ∝ π_k N(x_n|µ_k, Σ_k) (the posterior probability that x_n was generated by the k-th Gaussian)

Next class: Details of EM for GMM, special cases, and the general EM algorithm and its properties
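
A minimal sketch of this E-step computation, i.e., the responsibilities E[z_nk] normalized over k so that each row sums to 1:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Responsibilities r[n, k] = E[z_nk], proportional to pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    log_r = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
        axis=1,
    )
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # normalize so that sum_k E[z_nk] = 1 for each n
```

Plugging these responsibilities in place of the hard assignments ẑ_nk in the weighted objective above gives the M-step; its closed-form updates are derived next class.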
