
Tutorial on Generalized Expectation Maximization

Javier R. Movellan
1 Preliminaries
The goal of this primer is to introduce the EM (expectation maximization) algorithm and
some of its modern generalizations, including variational approximations.

Notational conventions. Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for model parameters. We adhere to a Bayesian framework and treat model parameters as random variables with known priors. From this point of view, Maximum-Likelihood methods can be interpreted as using weak priors. The probability space in which random variables are defined is left implicit and assumed to be endowed with the conditions needed to support the derivations being presented. We present the results using discrete random variables. Conversion to continuous variables simply requires changing probability mass functions into probability density functions and sums into integrals. When the context makes it clear, we identify probability functions by their arguments and drop commas between arguments: e.g., p(xy) is shorthand for the joint probability mass or joint probability density that the random variable X takes the specific value x and the random variable Y takes the value y.
Let O, H be random vectors representing observable data and hidden states. Let Λ represent model parameters controlling the distribution of O, H. We treat Λ as a random variable with known prior. We have two problems of interest:
• For a fixed sample o from O find values of Λ with large posterior
• For a fixed sample o from O find values of H with large posterior
Both problems are formally identical, so we will focus on the first one. Note that
\[
p(\lambda \mid o) = \frac{p(o, \lambda)}{p(o)} \tag{1}
\]
Thus
\[
\operatorname*{argmax}_{\lambda}\, p(\lambda \mid o) = \operatorname*{argmax}_{\lambda}\, \log p(o, \lambda) \tag{2}
\]

Let q = {qθ(· | o) : θ ∈ ℝ^p} be a family of distributions of H parameterized by θ. We call q a variational family, and θ the variational parameters of that family. Note that
\[
\log p(o, \lambda) = \sum_h q_\theta(h \mid o) \log p(o, \lambda) \tag{3}
\]
\[
= \sum_h q_\theta(h \mid o) \log \frac{p(o, h, \lambda)}{q_\theta(h \mid o)}\, \frac{q_\theta(h \mid o)}{p(h \mid o, \lambda)} \tag{4}
\]
\[
= F(\theta, \lambda) + K(\theta, \lambda) \tag{5}
\]
where
\[
F(\theta, \lambda) \stackrel{\mathrm{def}}{=} \sum_h q_\theta(h \mid o) \log \frac{p(o, h, \lambda)}{q_\theta(h \mid o)} \tag{6}
\]
\[
K(\theta, \lambda) \stackrel{\mathrm{def}}{=} \sum_h q_\theta(h \mid o) \log \frac{q_\theta(h \mid o)}{p(h \mid o, \lambda)} \tag{7}
\]
Note that K(θ, λ) is the KL divergence between the distribution qθ(· | o) and p(· | o, λ). Since KL divergences are non-negative, it follows that F(θ, λ) is a lower bound on log p(o, λ), i.e.,
\[
\log p(o, \lambda) \geq F(\theta, \lambda) \tag{8}
\]
This inequality becomes an equality for values of θ for which K(θ, λ) = 0, i.e., values of θ such that qθ(h | o) = p(h | o, λ) for all h.
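The decomposition in equations (5)-(8) can be checked numerically. The following sketch (added for illustration; the two-state toy model and the particular q below are arbitrary choices, not part of the original text) verifies that F(θ, λ) + K(θ, λ) = log p(o, λ) and that F is a lower bound:

\begin{verbatim}
import numpy as np

# Toy model for one observed value o: p(o, h, lambda) = p(lambda) p(h | lambda) p(o | h, lambda)
p_lam = 0.3                          # prior p(lambda) at one particular value of lambda
p_h = np.array([0.6, 0.4])           # p(h | lambda) for h in {0, 1}
p_o_given_h = np.array([0.8, 0.1])   # p(o | h, lambda) at the observed value o

p_ohl = p_lam * p_h * p_o_given_h    # joint p(o, h, lambda), one entry per h
log_p_ol = np.log(p_ohl.sum())       # log p(o, lambda), summing out h

q = np.array([0.5, 0.5])             # an arbitrary variational distribution q_theta(h | o)
posterior = p_ohl / p_ohl.sum()      # exact posterior p(h | o, lambda)

F = np.sum(q * np.log(p_ohl / q))       # equation (6)
K = np.sum(q * np.log(q / posterior))   # equation (7), a KL divergence

print(F + K, log_p_ol)   # equal, as in equation (5)
print(F <= log_p_ol)     # True, as in equation (8)
\end{verbatim}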
2 The Generalized EM algorithm

We obtain a sequence (λ(1), θ(1)), (λ(2), θ(2)), . . . by iterating over two steps:

• E Step:
\[
\theta^{(k+1)} = \operatorname*{argmax}_{\theta}\, F(\theta, \lambda^{(k)}) \tag{9}
\]
Note that since
\[
F(\theta, \lambda) = \log p(o, \lambda) - K(\theta, \lambda) \tag{10}
\]
and since log p(o, λ) is a constant with respect to θ, this step amounts to minimizing K(θ, λ(k)) with respect to θ, i.e., choosing the member of the variational family q that is as close as possible to the current posterior p(· | o, λ(k)).
• M Step:
\[
\lambda^{(k+1)} = \operatorname*{argmax}_{\lambda}\, F(\theta^{(k+1)}, \lambda) \tag{11}
\]

Successive applications of the E and M steps never decrease the lower bound F on log p(o, λ), i.e.,
\[
F(\theta^{(k+1)}, \lambda^{(k)}) \geq F(\theta^{(k)}, \lambda^{(k)}) \tag{12}
\]
and
\[
F(\theta^{(k+1)}, \lambda^{(k+1)}) \geq F(\theta^{(k+1)}, \lambda^{(k)}) \tag{13}
\]
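As a concrete illustration of the two maximizations in (9) and (11), here is a minimal Python sketch of a generalized EM loop; it is not part of the original text. It uses the one-dimensional Gaussian mixture of Section 3 as the model, assumes a flat prior on λ (so that p(o, h, λ) is proportional to p(o, h | λ) and the prior can be dropped from F), and takes as variational family a product of Bernoulli distributions with one logit θi per observation. Both steps are carried out with a generic numerical optimizer rather than in closed form; the function names and parameterization are illustrative choices.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(0)
o = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(2.5, 1.0, 30)])  # toy data
pi = 0.5   # fixed mixture weight, as in Section 3

def log_joint(h, lam):
    """log p(o_i, h_i | lambda) for h_i in {0, 1} (flat prior on lambda assumed)."""
    mean = np.where(h == 1, lam, 0.0)
    mix = np.where(h == 1, np.log(pi), np.log(1.0 - pi))
    return mix + norm.logpdf(o, loc=mean, scale=1.0)

def free_energy(theta, lam):
    """F(theta, lambda) = sum_i sum_h q_i(h) [log p(o_i, h | lambda) - log q_i(h)]."""
    q1 = expit(theta)                       # q_theta(H_i = 1 | o_i)
    q = np.stack([1.0 - q1, q1])            # shape (2, n)
    lj = np.stack([log_joint(np.zeros_like(o), lam),
                   log_joint(np.ones_like(o), lam)])
    return np.sum(q * (lj - np.log(q + 1e-12)))

theta = np.zeros_like(o)   # variational logits
lam = 0.0                  # model parameter
for k in range(15):
    # E step, equation (9): maximize F over the variational parameters theta
    theta = minimize(lambda t: -free_energy(t, lam), theta).x
    # M step, equation (11): maximize F over the model parameter lambda
    lam = minimize(lambda l: -free_energy(theta, l[0]), np.array([lam])).x[0]
print("estimated component mean:", lam)
\end{verbatim}

With this particular variational family the E step can represent the exact posterior, so the loop behaves like standard EM; the same skeleton applies when q is restricted and the bound in (8) is no longer tight.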

2.1 Interpretation
• Optimizing F(θ, λ) with respect to λ is equivalent to optimizing
\[
\sum_h q_\theta(h \mid o) \log p(o, h, \lambda) \tag{14}
\]
and since the log function is concave,
\[
\sum_h q_\theta(h \mid o) \log p(o, h, \lambda) \leq \log \sum_h q_\theta(h \mid o)\, p(o, h, \lambda) \leq \log \sum_h p(o, h, \lambda) = \log p(o, \lambda) \tag{15}
\]

• Successive applications of EM increase a lower bound F on log p(o, λ).


• This lower bound is the difference of two terms: a data-driven term log p(o, λ) that measures how well the model p(·, λ) fits the observable data, and a term K(θ, λ) that penalizes the mismatch between the variational distribution qθ(· | o) and the exact posterior p(· | o, λ):
\[
F(\theta, \lambda) = \log p(o, \lambda) - K(\theta, \lambda) \tag{16}
\]
Thus we can think of the Generalized EM algorithm as solving a penalized maximum likelihood problem.
• Note that
\[
\log p(o, \lambda^{(k+1)}) - \log p(o, \lambda^{(k)}) \geq K(\theta^{(k+1)}, \lambda^{(k+1)}) - K(\theta^{(k+1)}, \lambda^{(k)}) \tag{17}
\]
Recall that qθ(k+1) was chosen to be as close as possible to p(· | o, λ(k)). Thus it is not unreasonable (but also not guaranteed) to expect that it is not as close to p(· | o, λ(k+1)), i.e., that
\[
K(\theta^{(k+1)}, \lambda^{(k+1)}) - K(\theta^{(k+1)}, \lambda^{(k)}) \geq 0 \tag{18}
\]
and thus
\[
\log p(o, \lambda^{(k+1)}) \geq \log p(o, \lambda^{(k)}) \tag{19}
\]
• An important special case occurs when the variational family {qθ(· | o)} equals the family of posteriors {p(· | o, λ)}. In this case θ(k+1) = λ(k) and we can guarantee that
\[
\log p(o, \lambda^{(k+1)}) - \log p(o, \lambda^{(k)}) \geq K(\lambda^{(k)}, \lambda^{(k+1)}) \geq 0 \tag{20}
\]
Moreover, in this case to maximize F with respect to λ(k+1) we just need to maximize
\[
Q(\lambda^{(k)}, \lambda^{(k+1)}) \stackrel{\mathrm{def}}{=} \sum_h p(h \mid o, \lambda^{(k)}) \log \left[ p(\lambda^{(k+1)})\, p(o, h \mid \lambda^{(k+1)}) \right] \tag{21}
\]
and if we use an uninformative prior, then we just need to maximize
\[
Q(\lambda^{(k)}, \lambda^{(k+1)}) \stackrel{\mathrm{def}}{=} \sum_h p(h \mid o, \lambda^{(k)}) \log p(o, h \mid \lambda^{(k+1)}) \tag{22}
\]
which is the objective function maximized by the standard EM algorithm.


• In the same vein, note that F(θ, λ) has the form of a (negative) free energy: the entropy of the distribution under which expectations are computed minus the expected energy of the states, where the energy of a state h is −log p(o, h, λ). Thus if there are no further constraints on the variational family, the optimal distribution qθ(· | o) is the Boltzmann distribution
\[
q_\theta(h \mid o) \propto \exp\big(\log p(o, h, \lambda)\big) = p(o, h, \lambda) \tag{23}
\]
\[
q_\theta(h \mid o) = \frac{p(o, h, \lambda)}{\sum_{h'} p(o, h', \lambda)} = p(h \mid o, \lambda) \tag{24}
\]

• Consider the case in which we are given a set of iid observations o = (o1, . . . , on). If we directly optimize log p(o | λ) with respect to λ, the stationarity condition is
\[
\nabla_\lambda \log p(o \mid \lambda) = \sum_{i=1}^n \nabla_\lambda \log p(o_i \mid \lambda) = \sum_{i=1}^n \frac{1}{p(o_i \mid \lambda)} \nabla_\lambda \sum_h p(o_i, h \mid \lambda) \tag{25}
\]
\[
= \sum_{i=1}^n \frac{1}{p(o_i \mid \lambda)} \sum_h p(o_i, h \mid \lambda)\, \nabla_\lambda \log p(o_i, h \mid \lambda) \tag{26}
\]
\[
= \sum_{i=1}^n \sum_h p(h \mid o_i, \lambda)\, \nabla_\lambda \log p(o_i, h \mid \lambda) = 0 \tag{27}
\]
In contrast, when using the EM method we have
\[
\nabla_\lambda Q(\lambda, \tilde{\lambda}) = \sum_{i=1}^n \sum_h p(h \mid o_i, \tilde{\lambda})\, \nabla_\lambda \log p(o_i, h \mid \lambda) = 0 \tag{28}
\]
where λ̃, and thus p(h | oi, λ̃), is no longer a function of λ. A numerical check of the identity in (27) is given in the sketch following this list.
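The identity used in (25)-(27), namely that the gradient of the observed-data log likelihood equals the posterior-weighted gradient of the complete-data log joint, is easy to verify numerically. The sketch below is added for illustration and uses the Gaussian mixture of Section 3; it compares a finite-difference estimate of ∇λ log p(o | λ) with the right-hand side of (27):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
o = rng.normal(1.5, 1.0, size=50)   # some observations
pi, lam = 0.4, 0.7                  # fixed mixture weight and a test value of lambda

def log_lik(lam):
    """log p(o | lambda) for the mixture of Section 3."""
    return np.sum(np.log((1 - pi) * norm.pdf(o, 0.0, 1.0) + pi * norm.pdf(o, lam, 1.0)))

# Left-hand side of (25)-(27): direct gradient, estimated by central differences.
eps = 1e-6
lhs = (log_lik(lam + eps) - log_lik(lam - eps)) / (2 * eps)

# Right-hand side of (27): posterior-weighted gradient of log p(o_i, h | lambda).
# For h = 0 the gradient is 0; for h = 1 it is (o_i - lambda).
post1 = pi * norm.pdf(o, lam, 1.0) / ((1 - pi) * norm.pdf(o, 0.0, 1.0) + pi * norm.pdf(o, lam, 1.0))
rhs = np.sum(post1 * (o - lam))

print(lhs, rhs)   # the two agree up to finite-difference error
\end{verbatim}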

3 Example 1

Consider a simple Gaussian mixture model and a vector of independent observations o = (o1, . . . , on)^T from that model:
\[
\log p(o \mid \lambda) = \sum_{i=1}^n \log p(o_i \mid \lambda) \tag{29}
\]
where
\[
p(o_i \mid \lambda) = (1 - \pi)\, p(o_i \mid H = 0, \lambda) + \pi\, p(o_i \mid H = 1, \lambda) = (1 - \pi)\, g(o_i, 0) + \pi\, g(o_i, \lambda) \tag{30}
\]
\[
g(o_i, \lambda) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(o_i - \lambda)^2} \tag{31}
\]
where the prior mixture weight π is fixed. Taking derivatives with respect to λ we get
\[
\frac{\partial \log p(o \mid \lambda)}{\partial \lambda} = \sum_i \frac{1}{p(o_i \mid \lambda)}\, \pi\, p(o_i \mid H = 1, \lambda)\, (o_i - \lambda) \tag{32}
\]
\[
= \sum_i p(H = 1 \mid o_i, \lambda)\, (o_i - \lambda) = 0 \tag{33}
\]
which is a non-linear equation that is difficult to solve. However, EM asks us to optimize Q, whose only λ-dependent term is
\[
\sum_i p(H = 1 \mid o_i, \tilde{\lambda}) \log p(o_i, H = 1 \mid \lambda) \tag{34}
\]
Taking derivatives we get
\[
\sum_i p(H = 1 \mid o_i, \tilde{\lambda})\, (o_i - \lambda) = 0 \tag{35}
\]
which is easily solved:
\[
\lambda = \frac{\sum_i o_i\, p(H = 1 \mid o_i, \tilde{\lambda})}{\sum_i p(H = 1 \mid o_i, \tilde{\lambda})} \tag{36}
\]
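For completeness, here is a short Python sketch (added for illustration, not from the original document) that iterates the update (36) on synthetic data; the E step computes the responsibilities p(H = 1 | oi, λ̃) and the M step applies (36):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
pi = 0.5                                        # fixed mixture weight
o = np.concatenate([rng.normal(0.0, 1.0, 50),   # component H = 0
                    rng.normal(2.0, 1.0, 50)])  # component H = 1

lam = 1.0                                       # initial guess for lambda
for _ in range(50):
    # E step: responsibilities p(H = 1 | o_i, lambda), as in equation (33)
    num = pi * norm.pdf(o, lam, 1.0)
    r = num / ((1.0 - pi) * norm.pdf(o, 0.0, 1.0) + num)
    # M step: closed-form update of equation (36)
    lam = np.sum(r * o) / np.sum(r)
print("estimated lambda:", lam)
\end{verbatim}

Since the variational family here equals the posterior family, each iteration is guaranteed not to decrease log p(o | λ), in line with (19) and (20).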
4 History
• The first version of this document was written by Javier R. Movellan in January
2005, as part of the Kolmogorov project.
