
The EM Algorithm

Still more powerful generative models?


Suppose we are concerned that a single Gaussian cannot capture all the
variations in our data
Can we learn 2 (or more) Gaussians to represent our data instead?
Such a generative model is often called a mixture of Gaussians
The Expectation Maximization (EM) algorithm is a very powerful
technique for performing this and several other tasks
Soft clustering, learning Gaussian mixture models (GMM)
Robust learning, Mixed Regression
Also underlies more powerful variational algorithms such as VAE
Learning a Mixture of Two Gaussians
We suspect that instead of one Gaussian, two Gaussians are involved in generating our feature vectors
Let us call them $\mathcal{N}(\boldsymbol\mu_1, I)$ and $\mathcal{N}(\boldsymbol\mu_2, I)$; each of these is called a component of the GMM
General covariance matrices, as well as more than two components, can also be incorporated
Since we are unsure which data point came from which component, we introduce a latent variable $z_i$ per data point to denote this
If someone tells us that $z_i = 1$, this means that the first Gaussian is responsible for that data point and consequently the likelihood expression is $\mathbb{P}[\mathbf{x}^i \mid z_i = 1] = \mathcal{N}(\mathbf{x}^i; \boldsymbol\mu_1, I)$. Similarly, if someone tells us that $z_i = 2$, this means that the second Gaussian is responsible for that data point and the likelihood expression is $\mathcal{N}(\mathbf{x}^i; \boldsymbol\mu_2, I)$.
The English word “latent” means hidden or dormant or concealed
A nice name, since this variable describes something that was hidden from us
These latent variables may seem similar to the ones we used in (soft) k-means
Not an accident – the connections will be clear soon!
Latent variables can be discrete or continuous
MLE with Latent Variables
We wish to obtain the maximum (log-)likelihood models, i.e.
$$\hat{\boldsymbol\mu}_1, \hat{\boldsymbol\mu}_2 = \arg\max_{\boldsymbol\mu_1, \boldsymbol\mu_2} \sum_{i=1}^n \ln \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\mu_1, \boldsymbol\mu_2]$$
Since we do not know the values of the latent variables, force them into the expression using the law of total probability
$$\sum_{i=1}^n \ln \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\mu_1, \boldsymbol\mu_2] = \sum_{i=1}^n \ln \left( \sum_{c=1}^2 \mathbb{P}[\mathbf{x}^i, z_i = c \mid \boldsymbol\mu_1, \boldsymbol\mu_2] \right)$$
We did a similar thing (introducing models) in predictive posterior calculations
This is a very difficult optimization problem – NP-hard in general
However, two heuristics exist which work reasonably well in practice
They are also theoretically sound if the data is “nice” (details in a learning theory course)
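To make the “sum of log of sum” structure concrete, here is a minimal Python sketch that evaluates this log-likelihood for the two-component, identity-covariance GMM with equal mixing weights assumed above. The function name and the NumPy-based layout are illustrative choices, not something prescribed by the slides.

```python
import numpy as np

def gmm_log_likelihood(X, mu1, mu2):
    """Log-likelihood of a 2-component GMM with identity covariances
    and equal, fixed mixing weights P[z=1] = P[z=2] = 1/2.

    X   : (n, d) array of feature vectors
    mu1 : (d,) mean of the first component
    mu2 : (d,) mean of the second component
    """
    d = X.shape[1]
    # log N(x; mu, I) = -0.5 * ||x - mu||^2 - (d/2) * log(2*pi)
    log_norm = -0.5 * d * np.log(2 * np.pi)
    log_p1 = -0.5 * np.sum((X - mu1) ** 2, axis=1) + log_norm + np.log(0.5)
    log_p2 = -0.5 * np.sum((X - mu2) ** 2, axis=1) + log_norm + np.log(0.5)
    # the "sum of log of sum": the inner sum over the two components
    # sits inside a log, which is what makes direct optimization hard
    return np.sum(np.logaddexp(log_p1, log_p2))
```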
Heuristic 1: Alternating Optimization
Convert the original optimization problem into a double maximization problem (assume $\mathbb{P}[z_i = c]$ is a known constant)
$$\max_{\boldsymbol\mu_1, \boldsymbol\mu_2}\; \max_{z_1, \ldots, z_n \in \{1,2\}}\; \sum_{i=1}^n \ln \mathbb{P}[\mathbf{x}^i, z_i \mid \boldsymbol\mu_1, \boldsymbol\mu_2]$$
The intuition behind reducing things to a double optimization is that it may mostly be the case that only one of the terms in the summation dominates, and if this is the case, then approximating the sum by the largest term should be okay, i.e.
$$\ln \left( \sum_{c=1}^2 \mathbb{P}[\mathbf{x}^i, z_i = c \mid \boldsymbol\mu_1, \boldsymbol\mu_2] \right) \approx \max_{c \in \{1,2\}} \ln \mathbb{P}[\mathbf{x}^i, z_i = c \mid \boldsymbol\mu_1, \boldsymbol\mu_2]$$
The most important difference between the original and the new problem is that the original has a sum of log of sum, which is very difficult to optimize, whereas the new problem gets rid of the sum inside the log and looks simply like an MLE problem. We know how to solve MLE problems very easily!
In several ML problems with latent variables, although the above double optimization problem is (still) difficult, the following two problems are easy
Step 1: Fix $\boldsymbol\mu_1, \boldsymbol\mu_2$ and update the latent variables to their optimal values
Step 2: Fix the latent variables and update $\boldsymbol\mu_1, \boldsymbol\mu_2$ to their optimal values
Keep alternating between step 1 and step 2 till you are tired or till the process has converged!
Heuristic 1 at Work
As discussed before, we assume a mixture of two Gaussians $\mathcal{N}(\boldsymbol\mu_1, I)$ and $\mathcal{N}(\boldsymbol\mu_2, I)$
Step 1 becomes $\hat z_i = \arg\min_{c \in \{1,2\}} \lVert \mathbf{x}^i - \boldsymbol\mu_c \rVert_2$, i.e. assign each point to the closer mean
Step 2 becomes $\boldsymbol\mu_c = \frac{1}{n_c} \sum_{i:\, \hat z_i = c} \mathbf{x}^i$
Thus, $\boldsymbol\mu_1$ and $\boldsymbol\mu_2$ become the averages of the points assigned to them, where $n_c$ is the number of data points for which we have $\hat z_i = c$
Repeat!
Isn’t this like the k-means clustering algorithm?
Not just “like” – this is the k-means algorithm! This means that the k-means algorithm is one heuristic way to compute an MLE which is difficult to compute directly! (A code sketch follows below.)
Indeed! Notice that even here, instead of choosing just one value of the latent variables at each time step, we can instead use a distribution over their support
I have a feeling that the second heuristic will also give us something familiar!
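A minimal Python sketch of Heuristic 1 for this two-Gaussian model, assuming identity covariances and equal mixing weights as above. The function name and the random initialization scheme are illustrative, not prescribed by the slides; note how step 1 is exactly the k-means assignment step and step 2 the k-means mean update.

```python
import numpy as np

def alt_opt_two_gaussians(X, n_iters=100, seed=0):
    """Alternating optimization for a 2-component GMM with identity
    covariances -- which is exactly 2-means clustering.

    X : (n, d) array of feature vectors
    """
    rng = np.random.default_rng(seed)
    # initialize the two means with two distinct random data points
    mu = X[rng.choice(len(X), size=2, replace=False)].copy()  # (2, d)
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 1: fix the means, set each latent variable to its optimal
        # value, i.e. assign the point to the closer mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (n, 2)
        z = np.argmin(dists, axis=1)
        # Step 2: fix the latent variables, update each mean to the average
        # of the points assigned to it (the MLE for a Gaussian mean)
        new_mu = np.array([X[z == c].mean(axis=0) if np.any(z == c) else mu[c]
                           for c in range(2)])
        if np.allclose(new_mu, mu):
            break  # converged
        mu = new_mu
    return mu, z
```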
Heuristic 2: Expectation Maximization
Original problem: $\arg\max_{\boldsymbol\mu_1, \boldsymbol\mu_2} \sum_{i=1}^n \ln \left( \sum_{c=1}^2 \mathbb{P}[\mathbf{x}^i, z_i = c \mid \boldsymbol\mu_1, \boldsymbol\mu_2] \right)$
Step 1 (E step) consists of two sub-steps
Step 1.1: Assume our current model estimates are $\boldsymbol\mu_1^t, \boldsymbol\mu_2^t$. Use the current models to ascertain how likely different values of $z_i$ are for the $i$-th data point, i.e. compute $w^t_{i,c} = \mathbb{P}[z_i = c \mid \mathbf{x}^i, \boldsymbol\mu_1^t, \boldsymbol\mu_2^t]$ for both $c = 1, 2$
Step 1.2: Use the weights $w^t_{i,c}$ to set up a new objective function (as before, assume $\mathbb{P}[z_i = c]$ is constant for the sake of simplicity)
$$\sum_{i=1}^n \sum_{c=1}^2 w^t_{i,c} \ln \mathbb{P}[\mathbf{x}^i, z_i = c \mid \boldsymbol\mu_1, \boldsymbol\mu_2]$$
Yet again, the new problem gets rid of the treacherous “sum of log of sum” terms which are difficult to optimize. The new problem instead looks simply like a weighted MLE problem with weights $w^t_{i,c}$, and we know how to solve MLE problems very easily!
Step 2 (M step): Maximize the new objective function to get the new models
Repeat!
Derivation of the E Step
Let $\Theta$ denote the models, to avoid clutter. Also let $\Theta^t$ denote our current estimate of the model
Jensen’s inequality tells us that $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for any convex function $f$. We use the fact that $\ln$ is a concave function, so the inequality reverses, since every concave function is the negative of a convex function
We just need to see the derivation for a single point, say the $i$-th point:
$$\ln \mathbb{P}[\mathbf{x}^i \mid \Theta] = \ln \sum_{c=1}^2 \mathbb{P}[\mathbf{x}^i, z_i = c \mid \Theta] \quad \text{(law of total probability)}$$
$$= \ln \sum_{c=1}^2 \mathbb{P}[z_i = c \mid \mathbf{x}^i, \Theta^t] \cdot \frac{\mathbb{P}[\mathbf{x}^i, z_i = c \mid \Theta]}{\mathbb{P}[z_i = c \mid \mathbf{x}^i, \Theta^t]} \quad \text{(simply multiply and divide by the same term)}$$
$$\ge \sum_{c=1}^2 \mathbb{P}[z_i = c \mid \mathbf{x}^i, \Theta^t] \, \ln \frac{\mathbb{P}[\mathbf{x}^i, z_i = c \mid \Theta]}{\mathbb{P}[z_i = c \mid \mathbf{x}^i, \Theta^t]} \quad \text{(Jensen’s inequality)}$$
$$= \sum_{c=1}^2 w^t_{i,c} \ln \mathbb{P}[\mathbf{x}^i, z_i = c \mid \Theta] + \text{const} \quad \text{(just renaming; the denominator does not depend on } \Theta \text{, it depends on } \Theta^t \text{ instead)}$$
The EM Algorithm
Note: assumptions such as $\mathbb{P}[z_i = c]$ being constant are made for the sake of simplicity. We can execute EM perfectly well without making these assumptions; however, the updates then get more involved – be careful not to make mistakes
If we instantiate the EM algorithm with the GMM likelihoods, we recover the soft k-means algorithm (a code sketch follows after the box)
EM for GMM
1. Initialize the means $\boldsymbol\mu_1^0, \boldsymbol\mu_2^0$
2. For $t = 0, 1, 2, \ldots$, update using
   1. Let $\tilde w^t_{i,c} = \exp\left(-\tfrac{1}{2}\lVert \mathbf{x}^i - \boldsymbol\mu_c^t \rVert_2^2\right)$
   2. Let $w^t_{i,c} = \tilde w^t_{i,c} / (\tilde w^t_{i,1} + \tilde w^t_{i,2})$ (normalize)
   3. Let $n_c^t = \sum_i w^t_{i,c}$
   4. Update $\boldsymbol\mu_c^{t+1} = \frac{1}{n_c^t} \sum_i w^t_{i,c}\, \mathbf{x}^i$
   5. Repeat until convergence
Thus, the soft k-means algorithm is yet another heuristic way (the k-means algorithm was the first) to compute an MLE which is difficult to compute directly!
The EM algorithm has pros and cons compared to alternating optimization
Con: EM is usually more expensive to execute than alternating optimization
Pro: EM ensures that the objective value of the original problem, i.e. $\sum_i \ln \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\mu_1^t, \boldsymbol\mu_2^t]$, always keeps going up at every iteration – monotonic progress!!
However, there is no guarantee that we will ever reach the global maximum
We may converge to, and get stuck at, a local maximum instead
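A minimal Python sketch of the EM / soft k-means updates in the box above, under the same simplifying assumptions (identity covariances, fixed equal mixing weights). Function and variable names are illustrative.

```python
import numpy as np

def em_two_gaussians(X, n_iters=100, seed=0):
    """EM for a 2-component GMM with identity covariances and equal,
    fixed mixing weights -- the soft k-means update from the slide.

    X : (n, d) array of feature vectors
    """
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=2, replace=False)].copy()  # (2, d)
    w = np.full((len(X), 2), 0.5)
    for _ in range(n_iters):
        # E step: responsibilities w[i, c] = P[z_i = c | x_i, mu]
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, 2)
        log_w = -0.5 * sq_dists
        log_w -= log_w.max(axis=1, keepdims=True)   # for numerical stability
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)           # normalize over components
        # M step: each mean becomes a weighted average of all the points
        new_mu = (w.T @ X) / w.sum(axis=0)[:, None]  # (2, d)
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, w
```

Compared with the alternating-optimization sketch earlier, the only change is that every point contributes to both means, weighted by its responsibilities, instead of being assigned to exactly one of them.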
The Q Function
Let $Q^t(\Theta)$ be the new objective function constructed at time step $t$, i.e. the lower bound built in the E step:
$$Q^t(\Theta) = \sum_{i=1}^n \sum_{c} w^t_{i,c} \ln \frac{\mathbb{P}[\mathbf{x}^i, z_i = c \mid \Theta]}{w^t_{i,c}}$$
(the division by $w^t_{i,c}$ contributes only a constant that does not depend on $\Theta$, so it does not affect the M step)
We call it $Q^t$ instead of just $Q$ since it uses the weights $w^t_{i,c}$, which are calculated using $\Theta^t$. Thus, the function keeps changing
The EM algorithm constructs a new function $Q^t$ at each time $t$ during the E-step and maximizes it during the M-step
We have already seen that $Q^t(\Theta) \le \sum_i \ln \mathbb{P}[\mathbf{x}^i \mid \Theta]$ for all $\Theta$
It is easy enough to show that $Q^t(\Theta^t) = \sum_i \ln \mathbb{P}[\mathbf{x}^i \mid \Theta^t]$, i.e. the bound is tight at $\Theta^t$
This gives some indication as to why EM increases the likelihood at each iteration (see the chain of inequalities below)
Alternating optimization can be seen as a cousin of EM that uses a function of the form $\tilde Q^t(\Theta) = \sum_i \ln \mathbb{P}[\mathbf{x}^i, \hat z_i^t \mid \Theta]$, where $\hat z_i^t$ is the single most likely value of each latent variable rather than a distribution over its support
The Generic EM Algorithm
1. Initialize the model $\Theta^0$
2. For every latent variable $z_i$ and every possible value $c$ it could take, compute $w^t_{i,c} = \mathbb{P}[z_i = c \mid \mathbf{x}^i, \Theta^t]$
3. Compute the Q-function $Q^t(\Theta)$
4. Update $\Theta^{t+1} = \arg\max_\Theta Q^t(\Theta)$
5. Repeat until convergence
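A short worked chain showing how the two properties of $Q^t$ stated above combine to give the monotonic-progress guarantee:

```latex
% Why the likelihood never decreases: chain the two properties of Q^t
\begin{align*}
\sum_i \ln \mathbb{P}[\mathbf{x}^i \mid \Theta^{t+1}]
  &\geq Q^t(\Theta^{t+1})
     && \text{($Q^t$ lies below the log-likelihood for all } \Theta\text{)} \\
  &\geq Q^t(\Theta^{t})
     && \text{(the M-step picks } \Theta^{t+1} = \arg\max_\Theta Q^t(\Theta)\text{)} \\
  &= \sum_i \ln \mathbb{P}[\mathbf{x}^i \mid \Theta^{t}]
     && \text{(the bound is tight at } \Theta^t\text{).}
\end{align*}
```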
A pictorial depiction of the EM algorithm
[Figure (just an illustration): the log-likelihood curve is drawn in red together with successive $Q^t$-curves. The $Q^t$-curves always lie below the red curve and always touch it at $\Theta^t$, because the bound is tight there; the M-step maximizes $Q^t$. The red curve is not necessarily an inverted quadratic function, so the iterates can get stuck at a local maximum.]
Mixed Regression
An example of latent variables in a supervised learning task
We have regression training data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
Example: $\mathbf{x}$ denotes age and $y$ denotes time spent on a website
There are two subpopulations in the data (gender) which behave differently even if age is the same
This is an indication that our features may be incomplete/latent
Sure, we could try clustering this data first and then apply regression models separately on both clusters. However, using latent variables may be beneficial since 1) clustering, e.g. k-means, may not necessarily work well, since the points here are really not close to two centroids (instead, they lie close to two lines, which k-means is really not meant to handle), and 2) using latent variables, we can elegantly cluster and learn regression models jointly!!
Latent Variables to the Rescue
As before, if we believe that our data is best explained using two linear regression models instead of one, we should work with a mixed model (aka mixture of experts)
We will fit two regression models $\mathbf{w}_1, \mathbf{w}_2$ to the data and use a latent variable $z_i$ per data point to keep track of which data point belongs to which model
Let us use Gaussian likelihoods since we are comfortable with them:
$$\mathbb{P}[y^i \mid \mathbf{x}^i, z_i = c, \mathbf{w}_1, \mathbf{w}_2] = \mathcal{N}\!\left(y^i; \langle \mathbf{w}_c, \mathbf{x}^i \rangle, 1\right)$$
We could have had separate noise variances for the two components as well, which we could also learn. However, this would make things more tedious, so for now let us assume unit variance and also that $\mathbb{P}[z_i = c]$ is constant.
Note: this is not generative learning since we are still learning discriminative distributions of the form $\mathbb{P}[y \mid \mathbf{x}]$
Will see soon how to perform generative learning in supervised settings
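For concreteness, here is a small Python sketch that samples toy data from this mixed-regression model under the stated simplifications (unit noise variance, equal mixing); all names here are hypothetical.

```python
import numpy as np

def sample_mixed_regression(n=200, d=5, seed=0):
    """Draw toy data from the mixed-regression model of the slide:
    each point picks one of two hidden regressors w1, w2 via a latent
    variable z_i, and y_i = <w_{z_i}, x_i> + unit-variance Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(2, d))                 # the two hidden regression models
    X = rng.normal(size=(n, d))                 # feature vectors
    z = rng.integers(0, 2, size=n)              # latent component of each point
    y = np.einsum('nd,nd->n', X, w[z]) + rng.normal(size=n)
    return X, y, z, w
```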
MLE for Mixed Regression
We wish to solve
$$\hat{\mathbf{w}}_1, \hat{\mathbf{w}}_2 = \arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \ln \mathbb{P}[y^i \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2]$$
which, upon introducing latent variables, gives
$$\arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \ln \left( \sum_{c=1}^2 \mathbb{P}[y^i, z_i = c \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2] \right)$$
Method 1: Alternating Optimization
As before, assume $\mathbb{P}[z_i = c]$ is constant for the sake of simplicity to get
$$\arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \max_{z_i \in \{1,2\}} \ln \mathbb{P}[y^i, z_i \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2]$$
Step 1: Fix $\mathbf{w}_1, \mathbf{w}_2$ and update all the $z_i$
Step 2: Fix all the $z_i$ and update the models $\mathbf{w}_1, \mathbf{w}_2$
Alternating Optimization for MR
As before, we assume the likelihood distributions are $\mathbb{P}[y^i \mid \mathbf{x}^i, z_i = c, \mathbf{w}_1, \mathbf{w}_2] = \mathcal{N}(y^i; \langle \mathbf{w}_c, \mathbf{x}^i \rangle, 1)$ for $c = 1, 2$
Step 1 becomes $\hat z_i = \arg\min_{c \in \{1,2\}} \left( y^i - \langle \mathbf{w}_c, \mathbf{x}^i \rangle \right)^2$
i.e. assign every data point to its “closest” line, or the line which fits it better
Step 2 becomes $\mathbf{w}_c = \arg\min_{\mathbf{w}} \sum_{i:\, \hat z_i = c} \left( y^i - \langle \mathbf{w}, \mathbf{x}^i \rangle \right)^2$
i.e. perform least squares on the data points assigned to each component
We may incorporate a prior as well to add a regularizer (ridge regression)
Repeat! (A code sketch follows below.)
AltOpt for MR
1. Initialize the models $\mathbf{w}_1^0, \mathbf{w}_2^0$
2. For $t = 0, 1, 2, \ldots$, update using
   1. Let $\hat z_i^t = \arg\min_{c} \left( y^i - \langle \mathbf{w}_c^t, \mathbf{x}^i \rangle \right)^2$
3. Update using
   1. Let $\mathbf{w}_c^{t+1} = \arg\min_{\mathbf{w}} \sum_{i:\, \hat z_i^t = c} \left( y^i - \langle \mathbf{w}, \mathbf{x}^i \rangle \right)^2$
4. Repeat until convergence
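A minimal Python sketch of the AltOpt-for-MR loop above. The names, the random initialization, and the tiny ridge term `lam` (which keeps the least-squares systems well-conditioned and echoes the optional regularizer mentioned above) are illustrative choices.

```python
import numpy as np

def alt_opt_mixed_regression(X, y, n_iters=100, lam=1e-8, seed=0):
    """Alternating optimization for mixed regression with two components.

    X : (n, d) features, y : (n,) targets.
    lam : small ridge term; increase it to get the regularized variant.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(2, d))              # initialize the two models
    I = np.eye(d)
    z = None
    for _ in range(n_iters):
        # Step 1: assign each point to the line that fits it better
        resid = y[:, None] - X @ W.T         # (n, 2) residuals
        new_z = np.argmin(resid ** 2, axis=1)
        if z is not None and np.array_equal(new_z, z):
            break                            # assignments stopped changing
        z = new_z
        # Step 2: least squares on the points assigned to each component
        for c in range(2):
            mask = (z == c)
            if mask.any():
                Xc, yc = X[mask], y[mask]
                W[c] = np.linalg.solve(Xc.T @ Xc + lam * I, Xc.T @ yc)
    return W, z
```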
EM for Mixed Regression
Original problem: $\arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \ln \left( \sum_{c=1}^2 \mathbb{P}[y^i, z_i = c \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2] \right)$
Step 1 (E step) consists of two sub-steps
Step 1.1: Assume our current model estimates are $\mathbf{w}_1^t, \mathbf{w}_2^t$. Use the current models to ascertain how likely different values of $z_i$ are for the $i$-th data point, i.e. compute $w^t_{i,c} = \mathbb{P}[z_i = c \mid \mathbf{x}^i, y^i, \mathbf{w}_1^t, \mathbf{w}_2^t]$ for both $c = 1, 2$
Step 1.2: Use the weights to set up a new objective function. As before, assume $\mathbb{P}[z_i = c]$ is constant for the sake of simplicity
Step 2 (M step): Maximize the new objective function to get the new models (apply first-order optimality)
Repeat! (A code sketch follows below.)
EM for MR
1. Initialize the models $\mathbf{w}_1^0, \mathbf{w}_2^0$ (one per component)
2. For $t = 0, 1, 2, \ldots$, update using
   1. Let $\tilde w^t_{i,c} = \exp\left( -\tfrac{1}{2} \left( y^i - \langle \mathbf{w}_c^t, \mathbf{x}^i \rangle \right)^2 \right)$
   2. Let $w^t_{i,c} = \tilde w^t_{i,c} / \left( \tilde w^t_{i,1} + \tilde w^t_{i,2} \right)$ (normalize)
3. Update (apply first-order optimality)
   $$\mathbf{w}_c^{t+1} = \left( \sum_{i=1}^n w^t_{i,c}\, \mathbf{x}^i (\mathbf{x}^i)^\top \right)^{-1} \left( \sum_{i=1}^n w^t_{i,c}\, y^i \mathbf{x}^i \right)$$
4. Repeat until convergence
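A minimal Python sketch of EM for mixed regression under the same assumptions (unit noise variance, fixed equal mixing weights). The M step implements the weighted least-squares update given by first-order optimality; names and the small ridge term are illustrative.

```python
import numpy as np

def em_mixed_regression(X, y, n_iters=100, lam=1e-8, seed=0):
    """EM for mixed regression with two components and unit-variance
    Gaussian likelihoods; lam is a tiny ridge term for numerical stability.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(2, d))                 # initialize the two models
    I = np.eye(d)
    for _ in range(n_iters):
        # E step: responsibilities r[i, c] proportional to N(y_i; <w_c, x_i>, 1)
        resid = y[:, None] - X @ W.T            # (n, 2)
        log_r = -0.5 * resid ** 2
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)           # normalize over components
        # M step: weighted least squares for each component,
        # w_c = (sum_i r_ic x_i x_i^T)^{-1} (sum_i r_ic y_i x_i)
        for c in range(2):
            Xw = X * r[:, c:c + 1]                  # weight each row of X
            W[c] = np.linalg.solve(X.T @ Xw + lam * I, Xw.T @ y)
    return W, r
```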
Quiz 2
Oct 19 (Wednesday), 6 PM, L18, L19
Only for registered students (no auditors)
Open notes (handwritten only)
No mobile phones, tablets, etc.
Bring your institute ID card
Syllabus: all content uploaded to YouTube/GitHub by 16th Oct
DS content (slides, code) for DS 1,2,3,4,5,6,7,8,9
See previous year’s GitHub for practice
