Tutorial on Generalized Expectation Maximization

Javier R. Movellan
1 Preliminaries
The goal of this primer is to introduce the EM (expectation maximization) algorithm and
some of its modern generalizations, including variational approximations.
Notational conventions. Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for model parameters. We adhere to a Bayesian framework and treat model parameters as random variables with a known prior. From this point of view, Maximum-Likelihood methods can be interpreted as using weak priors. The probability space in which random variables are defined is left implicit and assumed to be endowed with the conditions needed to support the derivations being presented. We present the results using discrete random variables. Conversion to continuous variables simply requires changing probability mass functions into probability density functions and sums into integrals. When the context makes it clear, we identify probability functions by their arguments and drop commas between arguments: e.g., p(xy) is shorthand for the joint probability mass or joint probability density that the random variable X takes the specific value x and the random variable Y takes the value y.
Let O, H be random vectors representing observable data and hidden states. Let Λ represent model parameters controlling the distribution of O, H. We treat Λ as a random variable with a known prior. We have two problems of interest:
• For a fixed sample o from O, find values λ of Λ with a large posterior p(λ | o).
• For a fixed sample o from O, find values h of H with a large posterior p(h | o).
Both problems are formally identical so we will focus on the first one. Note
p(λ | o) = p(o, λ) / p(o)    (1)
Thus
argmax_λ p(λ | o) = argmax_λ log p(o, λ)    (2)
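To make (2) concrete, the following short sketch (assuming, purely for illustration, a Bernoulli likelihood with a Beta prior on λ; the variable names and hyperparameters are hypothetical, not part of the tutorial) maximizes log p(o, λ) = log p(o | λ) + log p(λ) over a grid of values of λ and compares the result with the known closed-form MAP estimate for this model.

import numpy as np

# Illustrative sketch of eq. (2): maximize the posterior p(lam | o) by
# maximizing the joint log probability log p(o, lam).
# Assumed model (not from the tutorial): o_i ~ Bernoulli(lam), lam ~ Beta(a, b).
rng = np.random.default_rng(0)
a, b = 2.0, 2.0                       # Beta prior hyperparameters (assumed)
o = rng.binomial(1, 0.3, size=50)     # observed sample

lam = np.linspace(1e-3, 1 - 1e-3, 10_000)
log_prior = (a - 1) * np.log(lam) + (b - 1) * np.log(1 - lam)   # up to a constant
log_lik = o.sum() * np.log(lam) + (len(o) - o.sum()) * np.log(1 - lam)
log_joint = log_lik + log_prior       # log p(o, lam) up to an additive constant

lam_map_grid = lam[np.argmax(log_joint)]
lam_map_closed = (o.sum() + a - 1) / (len(o) + a + b - 2)   # known Beta-Bernoulli MAP
print(lam_map_grid, lam_map_closed)   # the two estimates should agree closely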
We obtain a sequence (λ(1), θ(1)), (λ(2), θ(2)), . . . by iterating over the following two steps, where θ indexes a variational family of distributions qθ(h | o) over the hidden states and F(θ, λ) is the corresponding lower bound on log p(o, λ):
• E Step:
θ(k+1) = argmax_θ F(θ, λ(k))    (9)
Note since
F(θ, λ) = log p(o, λ) + K(θ, λ) (10)
and since log p(o, λ) is a constant with respect to θ, this step amounts to making K(θ, λ(k)) as large as possible with respect to θ, i.e., choosing the member of the variational family q that is as close as possible to the current posterior p(h | o, λ(k)).
• M Step:
λ(k+1) = argmax_λ F(θ(k+1), λ)    (11)
Successive applications of the E and M steps never decrease the lower bound F on log p(o, λ), i.e.,

F(θ(k+1), λ(k+1)) ≥ F(θ(k+1), λ(k)) ≥ F(θ(k), λ(k))
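The following is a minimal numerical sketch of the two steps (9) and (11), assuming a two-component one-dimensional Gaussian mixture with a flat prior on λ (so that log p(o, λ) coincides with the log likelihood up to an additive constant) and an unrestricted variational family, so that the E step sets q to the exact posterior p(h | o, λ(k)). The model, data, and function names (e_step, m_step, lower_bound) are illustrative assumptions, not part of the tutorial; the loop checks that F never decreases across iterations.

import numpy as np

rng = np.random.default_rng(1)
# Synthetic data from a two-component 1-D Gaussian mixture (assumed model).
o = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

def log_joint(o, lam):
    # log p(o_i, h, lam) for every observation i and hidden component h.
    # lam = (mixing weights pi, means mu, standard deviations sigma); a flat
    # prior on lam is assumed, so the prior enters only as an additive constant.
    pi, mu, sigma = lam
    return (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma ** 2)
            - 0.5 * (o[:, None] - mu) ** 2 / sigma ** 2)

def e_step(o, lam):
    # Eq. (9): with q unrestricted, the maximizing q is the posterior p(h | o, lam).
    lj = log_joint(o, lam)
    lj -= lj.max(axis=1, keepdims=True)          # subtract the row maximum for stability
    q = np.exp(lj)
    return q / q.sum(axis=1, keepdims=True)

def m_step(o, q):
    # Eq. (11): maximize the expected complete-data log likelihood over lam.
    nk = q.sum(axis=0)
    pi = nk / len(o)
    mu = (q * o[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((q * (o[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

def lower_bound(o, q, lam):
    # F(theta, lam) = sum_i sum_h q(h | o_i) [log p(o_i, h, lam) - log q(h | o_i)]
    return np.sum(q * (log_joint(o, lam) - np.log(q + 1e-300)))

lam = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
prev = -np.inf
for k in range(50):
    q = e_step(o, lam)                           # E step
    lam = m_step(o, q)                           # M step
    F = lower_bound(o, q, lam)
    assert F >= prev - 1e-9                      # the lower bound never decreases
    prev = F
print("final means:", lam[1], "final lower bound:", prev)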
2.1 Interpretation
• Optimizing F(θ, λ) with respect to λ is equivalent to optimizing

Σ_h qθ(h | o) log p(o, h, λ)    (14)
Moreover, in this case, to maximize F with respect to λ and obtain λ(k+1) we just need to maximize the following function with respect to λ(k+1) (a small numerical sketch of this maximization appears at the end of this section):

Q(λ(k), λ(k+1)) = Σ_h p(h | o, λ(k)) log[ p(λ(k+1)) p(o, h | λ(k+1)) ]    (21)
• Consider the case in which we are given a set of iid observations o = (o1, . . . , on). If we directly optimize log p(o | λ) with respect to λ by setting its gradient to zero, we get
∇λ log p(o | λ) = Σ_{i=1}^n ∇λ log p(oi | λ) = Σ_{i=1}^n (1 / p(oi | λ)) Σ_h ∇λ p(oi, h | λ)    (25)

= Σ_{i=1}^n (1 / p(oi | λ)) Σ_h p(oi, h | λ) ∇λ log p(oi, h | λ)    (26)

= Σ_{i=1}^n Σ_h p(h | oi, λ) ∇λ log p(oi, h | λ) = 0    (27)
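As a quick numerical check of (25)–(27), the sketch below (an assumed toy model, not from the tutorial: a two-component mixture whose only free parameter λ is the mean of the first component) compares a finite-difference estimate of ∇λ log p(o | λ) with the posterior-weighted expected complete-data gradient on the right-hand side of (27).

import numpy as np

rng = np.random.default_rng(3)
o = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])

# Assumed model (not from the tutorial):
# p(o_i | lam) = 0.5 N(o_i; lam, 1) + 0.5 N(o_i; 2, 1),
# so the only free parameter lam is the mean of the first component.
def log_lik(lam):
    n0 = np.exp(-0.5 * (o - lam) ** 2) / np.sqrt(2 * np.pi)
    n1 = np.exp(-0.5 * (o - 2.0) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(0.5 * n0 + 0.5 * n1))

def expected_complete_grad(lam):
    # Right-hand side of (27): sum_i sum_h p(h | o_i, lam) d/dlam log p(o_i, h | lam).
    # Only h = 0 depends on lam, with d/dlam log p(o_i, h=0 | lam) = (o_i - lam).
    n0 = 0.5 * np.exp(-0.5 * (o - lam) ** 2)
    n1 = 0.5 * np.exp(-0.5 * (o - 2.0) ** 2)
    resp0 = n0 / (n0 + n1)                      # p(h = 0 | o_i, lam)
    return np.sum(resp0 * (o - lam))

lam, eps = 0.7, 1e-5
finite_diff = (log_lik(lam + eps) - log_lik(lam - eps)) / (2 * eps)
print(finite_diff, expected_complete_grad(lam))   # the two values should match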
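To make (21) concrete, here is a small sketch for an assumed model (not from the tutorial): each observation comes from N(−2, 1) with probability λ and from N(2, 1) with probability 1 − λ, with a Beta(a, b) prior on λ. The code evaluates Q(λ(k), ·) on a grid and confirms that the closed-form M-step update, a posterior-weighted count plus the Beta pseudo-counts, maximizes it; all names are illustrative.

import numpy as np

rng = np.random.default_rng(2)
# Assumed model (not from the tutorial): o_i ~ N(-2, 1) with probability lam,
# o_i ~ N(+2, 1) with probability 1 - lam; prior lam ~ Beta(a, b).
a, b = 3.0, 3.0
mu = np.array([-2.0, 2.0])
o = np.where(rng.random(400) < 0.7, rng.normal(-2, 1, 400), rng.normal(2, 1, 400))

def log_normal(x, m):
    # log density of N(m, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - m) ** 2

def posterior_h(lam):
    # p(h = 0 | o_i, lam): responsibility of the N(-2, 1) component
    w0 = lam * np.exp(log_normal(o, mu[0]))
    w1 = (1 - lam) * np.exp(log_normal(o, mu[1]))
    return w0 / (w0 + w1)

def Q(lam_old, lam_new):
    # Eq. (21): expected complete-data log posterior under p(h | o, lam_old)
    r = posterior_h(lam_old)
    log_prior = (a - 1) * np.log(lam_new) + (b - 1) * np.log(1 - lam_new)
    return log_prior + np.sum(
        r * (np.log(lam_new) + log_normal(o, mu[0]))
        + (1 - r) * (np.log(1 - lam_new) + log_normal(o, mu[1])))

lam_old = 0.5
grid = np.linspace(1e-3, 1 - 1e-3, 2001)
lam_grid = grid[np.argmax([Q(lam_old, g) for g in grid])]
# Closed-form maximizer of Q for this model: posterior-weighted count of the
# first component plus the Beta pseudo-counts.
r = posterior_h(lam_old)
lam_closed = (r.sum() + a - 1) / (len(o) + a + b - 2)
print(lam_grid, lam_closed)   # should agree to within the grid spacing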
3 Example 1