A Beginner's Guide to Variational Methods: Mean-Field Approximation
Eric Jang
Sunday, August 7, 2016
Variational Bayesian (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).

This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).
This article assumes that the reader is familiar with concepts like random variables, probability distributions, and
expectations. Here's a refresher if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well,
so it's helpful to be really precise with notation in this post:
Lowercase x ∼ P(X) denotes a value x sampled (∼) from the probability distribution P(X) via some generative process.
Lowercase p(X) is the density function of the distribution of X. It is a scalar function over the measure space of X.
p(X = x) (shorthand p(x)) denotes the density function evaluated at a particular value x.
Many academic papers use the terms "variables", "distributions", "densities", and even "models" interchangeably. This is not necessarily wrong per se, since X, P(X), and p(X) all imply each other via a one-to-one correspondence. However, it's confusing to mix these words together because their types are different (it doesn't make sense to sample a function, nor does it make sense to integrate a distribution).
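To make the type distinction concrete, here is a minimal Python sketch (my own illustration, using scipy.stats) of the three different objects: the distribution P(X), the density function p(⋅), and a sample x ∼ P(X).

```python
# Minimal sketch (not from the original post): the distribution P(X),
# its density function p(.), and a sample x ~ P(X) are three different objects.
from scipy import stats

P_X = stats.norm(loc=0.0, scale=1.0)   # the distribution P(X), here a standard Gaussian
x = P_X.rvs()                          # a value x ~ P(X), sampled via a generative process
density_at_x = P_X.pdf(x)              # p(X = x): the density function evaluated at x

print(f"sampled x = {x:.3f}, p(x) = {density_at_x:.3f}")
# We sample from the distribution P_X and evaluate the density p(.) at a point;
# it makes no sense to "sample" the function P_X.pdf or to "integrate" a raw sample.
```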
We model systems as a collection of random variables, where some variables (X) are "observable", while other variables (Z) are "hidden". We can draw this relationship via the following graph:

[Figure: a two-node graphical model with an edge from the hidden variable Z to the observed variable X.]

The edge drawn from Z to X relates the two variables together via the conditional distribution P(X|Z).
Here's a more concrete example: X might represent the "raw pixel values of an image", while Z is a binary variable such that Z = 1 "if X is an image of a cat".

[Figure: example images X and their corresponding probabilities, e.g. P(Z = 1) = 1 (definitely a cat).]
Bayes' Theorem gives us a general relationship between any pair of random variables:
$$p(Z|X) = \frac{p(X|Z)\, p(Z)}{p(X)}$$
p(Z|X) is the posterior probability: "given the image, what is the probability that this is of a cat?" If we can sample from z ∼ P(Z|X), we can use this to make a cat classifier that tells us whether a given image is a cat or not.
p(X|Z) is the likelihood: given a value of Z, it computes how "probable" this image X is under that category ({"is-a-cat" / "is-not-a-cat"}). If we can sample from x ∼ P(X|Z), then we can generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [1], [2].
p(Z) is the prior probability. This captures any prior information we know about Z - for example, if we think that 1/3 of all images in existence are of cats, then p(Z = 1) = 1/3 and p(Z = 0) = 2/3.
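As a quick sanity check of Bayes' Theorem on the cat example, here is a numerical sketch. The prior is the 1/3 figure from above; the likelihood values are made-up numbers purely for illustration.

```python
# Hypothetical numbers (not from the post): plug the cat example into Bayes' rule.
p_z1 = 1.0 / 3.0          # prior p(Z = 1): a third of all images are cats
p_z0 = 2.0 / 3.0          # prior p(Z = 0)

# Assumed likelihoods of one particular image x under each category:
p_x_given_z1 = 0.20       # p(x | Z = 1): how probable this image is if it is a cat
p_x_given_z0 = 0.05       # p(x | Z = 0): how probable this image is if it is not a cat

p_x = p_x_given_z1 * p_z1 + p_x_given_z0 * p_z0   # evidence p(x)
p_z1_given_x = p_x_given_z1 * p_z1 / p_x          # posterior p(Z = 1 | x)
print(f"p(Z=1 | x) = {p_z1_given_x:.3f}")         # ~0.667: probably a cat
```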
Hidden Variables as Priors

This is an aside for interested readers. Skip to the next section to continue with the tutorial.
The previous cat example presents a very conventional example of observed variables, hidden variables,
and priors. However, it's important to realize that the distinction between hidden / observed variables is
somewhat arbitrary, and you're free to factor the graphical model however you like.
$$p(X|Z) = \frac{p(Z|X)\, p(X)}{p(Z)}$$
Hidden variables can be interpreted from a Bayesian Statistics framework as prior beliefs attached to the observed variables. For example, if we believe X is a multivariate Gaussian, the hidden variable Z might represent the mean and variance of the Gaussian distribution. The distribution over parameters P(Z) is then a prior distribution to P(X).

You are also free to choose which values X and Z represent. For example, Z could instead be "mean, cube root of variance, and X + Y where Y ∼ N(0, 1)". This is somewhat unnatural and weird, but the structure is still valid, as long as P(X|Z) is modified accordingly.
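Here is a minimal sketch of that interpretation in numpy (the specific priors are arbitrary assumptions of mine, not anything from the post): first draw the parameters z = (μ, σ) from a prior P(Z), then draw observations x ∼ P(X|Z).

```python
# Sketch with arbitrary assumed priors: Z = (mu, sigma) parameterizes the Gaussian over X.
import numpy as np

rng = np.random.default_rng(0)

# z ~ P(Z): a prior over the parameters of the distribution of X
mu = rng.normal(0.0, 10.0)          # assumed prior on the mean
sigma = abs(rng.normal(0.0, 2.0))   # assumed prior on the standard deviation

# x ~ P(X | Z): the observed variable, conditioned on the hidden one
x = rng.normal(mu, sigma, size=5)
print(f"z = (mu={mu:.2f}, sigma={sigma:.2f}), x = {x}")
```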
You can even "add" variables to your system. The prior itself might be dependent on other random variables via P(Z|θ), which have prior distributions of their own P(θ), and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, it's priors all the way down.
Problem Formulation
The key problem we are interested in is posterior inference, or computing functions on the hidden variable Z. Some canonical examples of posterior inference:
Given this surveillance footage X, did the suspect show up in it?
Given this twitter feed X, is the author depressed?
Given historical stock prices X_{1:t-1}, what will X_t be?
We usually assume that we know how to evaluate the likelihood function P(X|Z) and the prior P(Z).

The problem is, for complicated tasks like the ones above, we often don't know how to sample from P(Z|X) or compute p(Z|X). Alternatively, we might know the form of p(Z|X), but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like MCMC, but these are slow to converge.
The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution Qϕ(Z|X) (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters ϕ so that Qϕ is as close to P as possible.

This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization.
What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence as the distance metric between two distributions:

$$KL(Q_\phi(Z|X)\,\|\,P(Z|X)) = \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z|x)}$$
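To see the whole idea end-to-end, here is a small self-contained sketch (my own toy example, not from the post): discretize z on a grid, pick an arbitrary skewed density to stand in for the true posterior p(z|x), and fit the parameters ϕ = (μ, log σ) of a Gaussian Qϕ by minimizing the reverse KL with an off-the-shelf optimizer.

```python
# Toy sketch: fit a Gaussian q_phi to an arbitrary non-Gaussian target density
# by minimizing the reverse KL, KL(q_phi || p), evaluated on a discrete grid of z.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

z = np.linspace(1e-3, 20.0, 2000)
dz = z[1] - z[0]

# An assumed stand-in for the "true posterior" p(z|x): a skewed Gamma density.
p = stats.gamma.pdf(z, a=3.0, scale=1.5)
p = p / (p.sum() * dz) + 1e-12          # renormalize on the grid

def reverse_kl(phi):                    # phi = (mu, log_sigma)
    q = stats.norm.pdf(z, loc=phi[0], scale=np.exp(phi[1]))
    q = q / (q.sum() * dz) + 1e-12      # renormalize on the grid
    return np.sum(q * np.log(q / p)) * dz   # KL(q || p)

res = minimize(reverse_kl, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
print(f"fitted q: mu = {res.x[0]:.2f}, sigma = {np.exp(res.x[1]):.2f}, KL = {res.fun:.4f}")
```

This is exactly the inference-as-optimization recipe: the "inference" has been reduced to finding good values of ϕ.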
Reverse KL divergence measures the amount of information (in nats, or units of 1/log(2) bits) required to "distort" P(Z) into Qϕ(Z).
By definition of a conditional distribution, p(z|x) = p(x,z)/p(x). Let's substitute this expression into our original KL expression, and then expand:

$$
\begin{aligned}
KL(Q\,\|\,P) &= \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)\, p(x)}{p(z,x)} && (1) \\
&= \sum_{z \in Z} q_\phi(z|x) \left( \log \frac{q_\phi(z|x)}{p(z,x)} + \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \sum_z q_\phi(z|x) \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \log p(x) \sum_z q_\phi(z|x) \right) && \text{note: } \textstyle\sum_z q(z) = 1 \\
&= \log p(x) + \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right)
\end{aligned}
$$
To minimize KL(Q‖P) with respect to the variational parameters ϕ, we only need to minimize the remaining sum, since log p(x) is fixed with respect to ϕ. Let's re-write this quantity as an expectation over the distribution Qϕ(Z|X):

$$
\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} = \mathbb{E}_{z \sim Q_\phi(Z|X)}\left[ \log \frac{q_\phi(z|x)}{p(z,x)} \right]
$$

Minimizing this is equivalent to maximizing its negation, which (using p(z,x) = p(x|z) p(z)) we can write as:

$$
\begin{aligned}
\text{maximize } \mathcal{L} &= -\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \\
&= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] && (2)
\end{aligned}
$$
In the literature, L is known as the variational lower bound, and it is computationally tractable if we can evaluate p(x|z), p(z), and q(z|x). We can further re-arrange the terms in a way that yields an intuitive formula:

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] + \sum_z q(z|x) \log \frac{p(z)}{q_\phi(z|x)} && \text{Definition of expectation} \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] - KL(Q(Z|X)\,\|\,P(Z)) && \text{Definition of KL divergence} \quad (3)
\end{aligned}
$$
If sampling z ∼ Q(Z|X) is an "encoding" process that converts an observation x to latent code z, then sampling x ∼ Q(X|Z) is a "decoding" process that reconstructs the observation from z.

It follows that L is the expected "decoding" likelihood (how well our variational distribution can decode a sample of Z back to a sample of X), minus the KL divergence between the variational approximation and the prior on Z. If we assume Q(Z|X) is conditionally Gaussian, then the prior on Z is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1.
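For that common choice (a diagonal-Gaussian Q(Z|X) and a standard-normal prior on Z), the KL term in L has a well-known closed form. A small sketch, under those assumptions, with hypothetical encoder outputs:

```python
# Sketch: closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the KL-to-prior term in L
# when Q(Z|X) is a diagonal Gaussian and the prior on Z is a standard normal.
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    # KL = 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) ), summed over latent dimensions
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

mu = np.array([0.3, -1.2])     # hypothetical Q(Z|X) mean for one data point x
sigma = np.array([0.8, 1.5])   # hypothetical Q(Z|X) standard deviations
print(kl_diag_gaussian_to_standard_normal(mu, sigma))   # >= 0, zero iff mu=0 and sigma=1
```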
Why is L called the variational lower bound? Substituting L back into Eq. (1), we have:

$$KL(Q\,\|\,P) = \log p(x) - \mathcal{L}$$

$$\log p(x) = \mathcal{L} + KL(Q\,\|\,P) \qquad (4)$$

The meaning of Eq. (4), in plain language, is that log p(x), the log-likelihood of a data point x under the true distribution, is L, plus an error term KL(Q‖P) that captures the distance between Q(Z|X = x) and P(Z|X = x) at that particular value of X.

Since KL(Q‖P) ≥ 0, log p(x) must be greater than or equal to L. Therefore L is a lower bound for log p(x). L is also referred to as the evidence lower bound (ELBO), via the alternate formulation:

$$\mathcal{L} = \log p(x) - KL(Q(Z|X)\,\|\,P(Z|X))$$

Note that L itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in log p(x).
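Eq. (4) is easy to verify numerically. Here is a toy sketch of my own (with made-up probabilities for a binary Z) that checks log p(x) = L + KL(Q‖P) and that L never exceeds log p(x):

```python
# Toy check: for a discrete Z in {0, 1} with arbitrary assumed probabilities,
# verify log p(x) = L + KL(Q(Z|x) || P(Z|x)) and L <= log p(x).
import numpy as np

p_z = np.array([1/3, 2/3])            # prior p(Z)
p_x_given_z = np.array([0.20, 0.05])  # likelihood p(x | Z = z) for one fixed x

p_xz = p_x_given_z * p_z              # joint p(x, z)
p_x = p_xz.sum()                      # evidence p(x)
p_z_given_x = p_xz / p_x              # true posterior p(Z | x)

q = np.array([0.5, 0.5])              # an arbitrary variational distribution Q(Z | x)

L = np.sum(q * (np.log(p_xz) - np.log(q)))            # variational lower bound
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))    # KL(Q || P(Z|x))

print(np.log(p_x), L + kl)   # identical up to floating point
print(L <= np.log(p_x))      # True: L is a lower bound on log p(x)
```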
KL divergence is not a symmetric distance function, i.e. KL(P‖Q) ≠ KL(Q‖P) (except when Q ≡ P). The first is known as the "forward KL", while the latter is the "reverse KL". So why do we use the reverse KL? Because if we used the forward KL, the resulting derivation would require us to know how to compute p(Z|X), which is exactly the quantity we are trying to approximate in the first place.
I really like Kevin Murphy's explanation in the PML textbook, which I shall attempt to re-phrase here:

Let's consider the forward KL first. As we saw from the above derivations, we can write KL as the expectation of a "penalty" function log p(z)/q(z) over a weighting function p(z):

$$
KL(P\,\|\,Q) = \sum_z p(z) \log \frac{p(z)}{q(z)} = \mathbb{E}_{p(z)}\left[ \log \frac{p(z)}{q(z)} \right]
$$
The penalty function contributes loss to the total KL wherever p(Z) > 0. For p(Z) > 0, lim_{q(z)→0} log(p(z)/q(z)) = ∞. This means that the forward KL will be large wherever Q(Z) fails to "cover up" P(Z).

Therefore, the forward KL is minimized when we ensure that q(z) > 0 wherever p(z) > 0. The optimized variational distribution Q(Z) is known as "zero-avoiding" (the density q avoids being zero wherever p(Z) is nonzero).
Now consider the reverse KL, which has the same penalty-style ratio but uses q(z) as the weighting function:

$$
KL(Q\,\|\,P) = \sum_z q(z) \log \frac{q(z)}{p(z)} = \mathbb{E}_{q(z)}\left[ \log \frac{q(z)}{p(z)} \right]
$$

Since lim_{p(z)→0} log(q(z)/p(z)) = ∞, we must ensure that the weighting function q(Z) = 0 wherever the denominator p(Z) = 0; otherwise the KL blows up. This is known as "zero-forcing".
So in summary, minimizing forward-KL "stretches" your variational distribution Q(Z) to cover over the entire P(Z) like a tarp, while minimizing reverse-KL "squeezes" the Q(Z) under P(Z).
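Here is a small sketch of that "tarp vs. squeeze" behavior (again a toy example of my own, with an assumed bimodal P(Z)): we fit the same Gaussian family by minimizing the forward KL and the reverse KL on a grid, and compare the resulting fits.

```python
# Sketch (not from the post): fit the same Gaussian family to a bimodal target
# under forward KL vs. reverse KL, to see zero-avoiding vs. zero-forcing behavior.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

z = np.linspace(-12, 12, 3001)
dz = z[1] - z[0]
p = 0.5 * stats.norm.pdf(z, -3.0, 1.0) + 0.5 * stats.norm.pdf(z, 3.0, 1.0)
p = p / (p.sum() * dz) + 1e-12          # assumed bimodal "true" distribution P(Z)

def q_of(phi):                          # Gaussian family, phi = (mu, log_sigma)
    q = stats.norm.pdf(z, phi[0], np.exp(phi[1]))
    return q / (q.sum() * dz) + 1e-12

forward_kl = lambda phi: np.sum(p * np.log(p / q_of(phi))) * dz        # KL(P || Q)
reverse_kl = lambda phi: np.sum(q_of(phi) * np.log(q_of(phi) / p)) * dz  # KL(Q || P)

for name, objective in [("forward", forward_kl), ("reverse", reverse_kl)]:
    res = minimize(objective, x0=np.array([0.5, 0.0]), method="Nelder-Mead")
    print(f"{name}-KL fit: mu = {res.x[0]:.2f}, sigma = {np.exp(res.x[1]):.2f}")
# Typical outcome: the forward-KL fit stretches across both modes (large sigma),
# while the reverse-KL fit squeezes under a single mode and ignores the other.
```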
It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there are regions where P(Z) actually has probability mass but Q(Z) assigns none).
Variational methods are really important for Deep Learning. I will elaborate more in a later post, but
here's a quick spoiler:
1. Deep learning is really good at optimization (specifically, gradient descent) over very large
parameter spaces using lots of data.
2. Variational Bayes gives us a framework with which we can re-write statistical inference problems as optimization problems.
Combining Deep Learning and VB methods allows us to perform inference on extremely complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower bound derived in this post!
15 comments:

There should be a minus in equation (3) for E[log p(x|z)], i.e. E[−log p(x|z)], otherwise your definition of KL-divergence isn't consistent.
Ankur.

Do you mind explaining where that negative comes from? I was anticipating a plus...
Given the title of your post, it's worth giving some motivation behind the name "mean-field approximation".
From a statistical physics point of view, "mean-field" refers to the relaxation of a difficult optimization
problem to a simpler one which ignores second-order effects. For example, in the context of graphical
models, one can approximate the partition function of a Markov random field via maximization of the Gibbs
free energy (i.e., log partition function minus relative entropy) over the set of product measures, which is
significantly more tractable than global optimization over the space of all probability measures (see, e.g.,
M. Mezard and A. Montanari, Sect 4.4.2).
From an algorithmic point of view, "mean-field" refers to the naive mean field algorithm for computing
marginals of a Markov random field. Recall that the fixed points of the naive mean field algorithm are
optimizers of the mean-field approximation to the Gibbs variational problem. This approach is "mean" in
that it is the average/expectation/LLN version of the Gibbs sampler, hence ignoring second-order
(stochastic) effects (see, e.g., M. Wainwright and M. Jordan, (2.14) and (2.15)).