Intro to Machine Learning (CS771A), Lecture 15: Parameter Estimation in Latent Variable Models
Piyush Rai
In this "latent variable model" of data, each data point x_n also depends on some latent variable(s) z_n
z_n is akin to a latent representation or "encoding" of x_n; it controls what the data "looks like". E.g.,
z_n ∈ {1, . . . , K} denotes the cluster x_n belongs to
z_n ∈ R^K denotes a low-dimensional latent representation or latent "code" for x_n
Unknowns: {z_1, . . . , z_N} and (θ, φ). The z_n's are called "local" variables; (θ, φ) are called "global" variables
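For concreteness, here is a small sketch (my own code, not from the slides) of the generative story when z_n is a cluster indicator, as in a Gaussian mixture: first sample the latent z_n, then sample x_n conditioned on it.

import numpy as np

rng = np.random.default_rng(0)

def sample_latent_variable_model(N, pi, mu, Sigma):
    """Ancestral sampling from a mixture-type latent variable model:
    z_n ~ multinoulli(pi), then x_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    Z = rng.choice(len(pi), size=N, p=pi)   # latent cluster assignments z_n
    X = np.stack([rng.multivariate_normal(mu[z], Sigma[z]) for z in Z])
    return X, Z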
Many well-known distributions (Bernoulli, Binomial, multinoulli, beta, gamma, Gaussian, etc.) are exponential family distributions: https://en.wikipedia.org/wiki/Exponential_family
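As a quick illustration (not on the slides), the Bernoulli distribution can be put in the standard exponential-family form p(x|η) = h(x) exp(η T(x) − A(η)):

p(x|µ) = µ^x (1 − µ)^{1−x} = exp( x log(µ/(1−µ)) + log(1−µ) )

so η = log(µ/(1−µ)), T(x) = x, A(η) = log(1 + e^η), and h(x) = 1.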
Basically estimating K Gaussians instead of just 1 (each using data only from that class)
MLE for Generative Classification with Gaussian Class-conditionals
Let's look at the "formal" procedure of deriving the MLE in this case
Given (X, y), optimizing the log-likelihood log p(X, y|Θ) w.r.t. π_k, µ_k, Σ_k gives us the solution we saw on the previous slide
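A minimal NumPy sketch of that closed-form MLE, i.e., estimating each class's Gaussian from that class's data only (illustrative; the function and variable names are my own):

import numpy as np

def mle_gaussian_class_conditionals(X, y, K):
    """Closed-form MLE for class priors and per-class Gaussian parameters.
    X: (N, D) data matrix; y: (N,) integer labels in {0, ..., K-1}."""
    N, D = X.shape
    pi = np.zeros(K)             # class priors pi_k
    mu = np.zeros((K, D))        # class means mu_k
    Sigma = np.zeros((K, D, D))  # class covariances Sigma_k
    for k in range(K):
        Xk = X[y == k]                       # data from class k only
        pi[k] = len(Xk) / N                  # fraction of points in class k
        mu[k] = Xk.mean(axis=0)              # sample mean of class k
        diff = Xk - mu[k]
        Sigma[k] = diff.T @ diff / len(Xk)   # sample covariance of class k
    return pi, mu, Sigma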
MLE When Labels Go Missing..
So the MLE problem for generative classification with Gaussian class-conditionals was

Θ̂ = arg max_Θ log p(X, y|Θ) = arg max_Θ Σ_{n=1}^{N} Σ_{k=1}^{K} y_nk [log π_k + log N(x_n | µ_k, Σ_k)]

This problem has a nice separable structure, and a straightforward solution as we saw
What if we don't know the label y_n (i.e., don't know if y_nk is 0 or 1)? How to estimate Θ now?
When might we need to solve such a problem?
Mixture density estimation: Given N inputs x_1, . . . , x_N, model p(x) as a mixture of distributions
Probabilistic clustering: Same as density estimation; can get cluster ids once Θ is estimated
Semi-supervised generative classification: In training data, some y_n's are known, some are not
Computing each likelihood p(x_n|Θ) in this case requires summing over all possible values of y_n

p(x_n|Θ) = Σ_{k=1}^{K} p(x_n, y_n = k|Θ) = Σ_{k=1}^{K} p(y_n = k|Θ) p(x_n|y_n = k, Θ) = Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k)
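A small NumPy/SciPy sketch of this marginal likelihood computation (illustrative only; function names are mine, and multivariate_normal from scipy.stats is assumed available):

import numpy as np
from scipy.stats import multivariate_normal

def mixture_likelihood(x_n, pi, mu, Sigma):
    """p(x_n | Theta) = sum_k pi_k * N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    return sum(pi[k] * multivariate_normal.pdf(x_n, mean=mu[k], cov=Sigma[k])
               for k in range(K))

def log_likelihood(X, pi, mu, Sigma):
    """Total log-likelihood of the data: sum_n log p(x_n | Theta)."""
    return sum(np.log(mixture_likelihood(x, pi, mu, Sigma)) for x in X)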
Note: The Gaussian can be replaced by other distributions too (e.g., Poisson mixture model)
A small issue now: the log can't go inside the summation, so the expressions won't be simple anymore
Note: Can still take (partial) derivatives and do GD/SGD etc., but these are iterative methods
Recall that we didn't need GD/SGD etc. when doing MLE with fully observed y_n's
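If one did want to optimize this objective directly with GD/SGD, the log-of-a-sum is usually evaluated in the log domain with the log-sum-exp trick for numerical stability; a sketch (my own code, not from the slides):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pi, mu, Sigma):
    """log p(X | Theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k),
    computed in the log domain for numerical stability."""
    N, K = len(X), len(pi)
    # log_p[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_p = np.array([[np.log(pi[k]) +
                       multivariate_normal.logpdf(X[n], mean=mu[k], cov=Sigma[k])
                       for k in range(K)] for n in range(N)])
    # The log cannot be pushed inside the sum over k, so use log-sum-exp instead
    return logsumexp(log_p, axis=1).sum()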
... which is nothing but maximizing E_{q(Z)}[log p(X, Z|Θ)] with q(Z) = p(Z|X, Θ̂_old)
Here E[z_nk] is the expectation of z_nk w.r.t. the posterior p(z_n|x_n), and is given by

E[z_nk] = 0 × p(z_nk = 0|x_n) + 1 × p(z_nk = 1|x_n)
        = p(z_nk = 1|x_n)
        ∝ p(z_nk = 1) p(x_n|z_nk = 1)   (from Bayes rule)

Thus E[z_nk] ∝ π_k N(x_n | µ_k, Σ_k)   (posterior prob. that x_n is generated by the k-th Gaussian)
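A minimal sketch of computing these posterior probabilities for all points, normalized over k so that each E[z_nk] = π_k N(x_n|µ_k, Σ_k) / Σ_j π_j N(x_n|µ_j, Σ_j) (again illustrative; names are mine):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """E-step quantities E[z_nk]: posterior probability that x_n came from the
    k-th Gaussian, proportional to pi_k * N(x_n | mu_k, Sigma_k)."""
    N, K = len(X), len(pi)
    R = np.zeros((N, K))
    for k in range(K):
        R[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    R /= R.sum(axis=1, keepdims=True)   # normalize over k; each row sums to 1
    return R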
Next class: Details of EM for GMM, special cases, and the general EM algorithm and its properties