A Beginner's Guide to Variational Methods: Mean-Field Approximation
Eric Jang
Sunday, August 7, 2016
Variational Bayesian (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).

This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).
This article assumes that the reader is familiar with concepts like random variables, probability distributions, and
expectations. Here's a refresher if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well,
so it's helpful to be really precise with notation in this post:
Lowercase x ∼ P(X) denotes a value x sampled (∼) from the probability distribution P(X) via some generative process.
Lowercase p(X) is the density function of the distribution of X. It is a scalar function over the measure space of X.
p(X = x) (shorthand p(x)) denotes the density function evaluated at a particular value x.
Many academic papers use the terms "variables", "distributions", "densities", and even "models" interchangeably. This is not necessarily wrong per se, since X, P(X), and p(X) all imply each other via a one-to-one correspondence. However, it's confusing to mix these words together because their types are different (it doesn't make sense to sample a function, nor does it make sense to integrate a distribution).
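To make the type distinction concrete, here is a minimal Python sketch (my own illustration, using scipy.stats) of the three different objects: the distribution P(X), the density function p(⋅), and a sample x ∼ P(X).

```python
# Minimal sketch (not from the original post): the distribution P(X),
# its density function p(.), and a sample x ~ P(X) are three different objects.
from scipy import stats

P_X = stats.norm(loc=0.0, scale=1.0)   # the distribution P(X), here a standard Gaussian
x = P_X.rvs()                          # a value x ~ P(X), sampled via a generative process
density_at_x = P_X.pdf(x)              # p(X = x): the density function evaluated at x

print(f"sampled x = {x:.3f}, p(x) = {density_at_x:.3f}")
# We sample from the distribution P_X and evaluate the density p(.) at a point;
# it makes no sense to "sample" the function P_X.pdf or to "integrate" a raw sample.
```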
We model systems as a collection of random variables, where some variables (X) are "observable", while other variables (Z) are "hidden". We can draw this relationship via the following graph:

[Figure: a two-node graphical model with an edge from the hidden variable Z to the observed variable X.]

The edge drawn from Z to X relates the two variables together via the conditional distribution P(X|Z).
Here's a more concrete example: X might represent the "raw pixel values of an image", while Z is a binary variable such that Z = 1 "if X is an image of a cat".

[Figure: example images X and their corresponding probabilities, e.g. P(Z = 1) = 1 (definitely a cat).]
Bayes' Theorem gives us a general relationship between any pair of random variables:
$$p(Z|X) = \frac{p(X|Z)\, p(Z)}{p(X)}$$
p(Z|X) is the posterior probability: "given the image, what is the probability that this is of a cat?" If we can sample from z ∼ P(Z|X), we can use this to make a cat classifier that tells us whether a given image is a cat or not.
p(X|Z) is the likelihood: given a value of Z, it computes how "probable" this image X is under that category ({"is-a-cat" / "is-not-a-cat"}). If we can sample from x ∼ P(X|Z), then we can generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [1], [2].
p(Z) is the prior probability. This captures any prior information we know about Z - for example, if we think that 1/3 of all images in existence are of cats, then p(Z = 1) = 1/3 and p(Z = 0) = 2/3.
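As a quick sanity check of Bayes' Theorem on the cat example, here is a numerical sketch. The prior is the 1/3 figure from above; the likelihood values are made-up numbers purely for illustration.

```python
# Hypothetical numbers (not from the post): plug the cat example into Bayes' rule.
p_z1 = 1.0 / 3.0          # prior p(Z = 1): a third of all images are cats
p_z0 = 2.0 / 3.0          # prior p(Z = 0)

# Assumed likelihoods of one particular image x under each category:
p_x_given_z1 = 0.20       # p(x | Z = 1): how probable this image is if it is a cat
p_x_given_z0 = 0.05       # p(x | Z = 0): how probable this image is if it is not a cat

p_x = p_x_given_z1 * p_z1 + p_x_given_z0 * p_z0   # evidence p(x)
p_z1_given_x = p_x_given_z1 * p_z1 / p_x          # posterior p(Z = 1 | x)
print(f"p(Z=1 | x) = {p_z1_given_x:.3f}")         # ~0.667: probably a cat
```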
Hidden Variables as Priors

This is an aside for interested readers. Skip to the next section to continue with the tutorial.
The previous cat example presents a very conventional example of observed variables, hidden variables,
and priors. However, it's important to realize that the distinction between hidden / observed variables is
somewhat arbitrary, and you're free to factor the graphical model however you like.
$$p(X|Z) = \frac{p(Z|X)\, p(X)}{p(Z)}$$
Hidden variables can be interpreted from a Bayesian Statistics framework as prior beliefs attached to the observed variables. For example, if we believe X is a multivariate Gaussian, the hidden variable Z might represent the mean and variance of the Gaussian distribution. The distribution over parameters P(Z) is then a prior distribution to P(X).

You are also free to choose which values X and Z represent. For example, Z could instead be "mean, cube root of variance, and X + Y where Y ∼ N(0, 1)". This is somewhat unnatural and weird, but the structure is still valid, as long as P(X|Z) is modified accordingly.
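Here is a minimal sketch of that interpretation in numpy (the specific priors are arbitrary assumptions of mine, not anything from the post): first draw the parameters z = (μ, σ) from a prior P(Z), then draw observations x ∼ P(X|Z).

```python
# Sketch with arbitrary assumed priors: Z = (mu, sigma) parameterizes the Gaussian over X.
import numpy as np

rng = np.random.default_rng(0)

# z ~ P(Z): a prior over the parameters of the distribution of X
mu = rng.normal(0.0, 10.0)          # assumed prior on the mean
sigma = abs(rng.normal(0.0, 2.0))   # assumed prior on the standard deviation

# x ~ P(X | Z): the observed variable, conditioned on the hidden one
x = rng.normal(mu, sigma, size=5)
print(f"z = (mu={mu:.2f}, sigma={sigma:.2f}), x = {x}")
```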
You can even "add" variables to your system. The prior itself might be dependent on other random variables via P(Z|θ), which have prior distributions of their own P(θ), and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, it's priors all the way down.
Problem Formulation
The key problem we are interested in is posterior inference, or computing functions on the hidden variable Z. Some canonical examples of posterior inference:
Given this surveillance footage X, did the suspect show up in it?
Given this twitter feed X, is the author depressed?
Given historical stock prices X_{1:t-1}, what will X_t be?
We usually assume that we know how to evaluate the likelihood function P(X|Z) and the prior P(Z).

The problem is, for complicated tasks like the ones above, we often don't know how to sample from P(Z|X) or compute p(Z|X). Alternatively, we might know the form of p(Z|X), but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like MCMC, but these are slow to converge.
The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution Qϕ(Z|X) (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters ϕ so that Qϕ is as close to P as possible.

This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization.
What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence as the distance metric between two distributions:

$$KL(Q_\phi(Z|X)\,\|\,P(Z|X)) = \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z|x)}$$
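To see the whole idea end-to-end, here is a small self-contained sketch (my own toy example, not from the post): discretize z on a grid, pick an arbitrary skewed density to stand in for the true posterior p(z|x), and fit the parameters ϕ = (μ, log σ) of a Gaussian Qϕ by minimizing the reverse KL with an off-the-shelf optimizer.

```python
# Toy sketch: fit a Gaussian q_phi to an arbitrary non-Gaussian target density
# by minimizing the reverse KL, KL(q_phi || p), evaluated on a discrete grid of z.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

z = np.linspace(1e-3, 20.0, 2000)
dz = z[1] - z[0]

# An assumed stand-in for the "true posterior" p(z|x): a skewed Gamma density.
p = stats.gamma.pdf(z, a=3.0, scale=1.5)
p = p / (p.sum() * dz) + 1e-12          # renormalize on the grid

def reverse_kl(phi):                    # phi = (mu, log_sigma)
    q = stats.norm.pdf(z, loc=phi[0], scale=np.exp(phi[1]))
    q = q / (q.sum() * dz) + 1e-12      # renormalize on the grid
    return np.sum(q * np.log(q / p)) * dz   # KL(q || p)

res = minimize(reverse_kl, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
print(f"fitted q: mu = {res.x[0]:.2f}, sigma = {np.exp(res.x[1]):.2f}, KL = {res.fun:.4f}")
```

This is exactly the inference-as-optimization recipe: the "inference" has been reduced to finding good values of ϕ.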
Reverse KL divergence measures the amount of information (in nats, or units of 1/log(2) bits) required to "distort" P(Z) into Qϕ(Z).
By definition of a conditional distribution, p(z|x) = p(x,z)/p(x). Let's substitute this expression into our original KL expression, and then expand:

$$
\begin{aligned}
KL(Q\,\|\,P) &= \sum_{z \in Z} q_\phi(z|x) \log \frac{q_\phi(z|x)\, p(x)}{p(z,x)} && (1) \\
&= \sum_{z \in Z} q_\phi(z|x) \left( \log \frac{q_\phi(z|x)}{p(z,x)} + \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \sum_z q_\phi(z|x) \log p(x) \right) \\
&= \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right) + \left( \log p(x) \sum_z q_\phi(z|x) \right) && \text{note: } \textstyle\sum_z q(z) = 1 \\
&= \log p(x) + \left( \sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \right)
\end{aligned}
$$
To minimize KL(Q‖P) with respect to the variational parameters ϕ, we only need to minimize the remaining sum, since log p(x) is fixed with respect to ϕ. Let's re-write this quantity as an expectation over the distribution Qϕ(Z|X):

$$
\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} = \mathbb{E}_{z \sim Q_\phi(Z|X)}\left[ \log \frac{q_\phi(z|x)}{p(z,x)} \right]
$$

Minimizing this is equivalent to maximizing its negation, which (using p(z,x) = p(x|z) p(z)) we can write as:

$$
\begin{aligned}
\text{maximize } \mathcal{L} &= -\sum_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z,x)} \\
&= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] && (2)
\end{aligned}
$$
In the literature, L is known as the variational lower bound, and it is computationally tractable if we can evaluate p(x|z), p(z), and q(z|x). We can further re-arrange the terms in a way that yields an intuitive formula:

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_Q\left[ \log p(x|z) + \log \frac{p(z)}{q_\phi(z|x)} \right] \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] + \sum_z q(z|x) \log \frac{p(z)}{q_\phi(z|x)} && \text{Definition of expectation} \\
&= \mathbb{E}_Q\left[ \log p(x|z) \right] - KL(Q(Z|X)\,\|\,P(Z)) && \text{Definition of KL divergence} \quad (3)
\end{aligned}
$$
If sampling z ∼ Q(Z|X) is an "encoding" process that converts an observation x to latent code z, then sampling x ∼ Q(X|Z) is a "decoding" process that reconstructs the observation from z.

It follows that L is the expected "decoding" likelihood (how well our variational distribution can decode a sample of Z back to a sample of X), minus the KL divergence between the variational approximation and the prior on Z. If we assume Q(Z|X) is conditionally Gaussian, then the prior on Z is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1.
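For that common choice (a diagonal-Gaussian Q(Z|X) and a standard-normal prior on Z), the KL term in L has a well-known closed form. A small sketch, under those assumptions, with hypothetical encoder outputs:

```python
# Sketch: closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the KL-to-prior term in L
# when Q(Z|X) is a diagonal Gaussian and the prior on Z is a standard normal.
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    # KL = 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) ), summed over latent dimensions
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

mu = np.array([0.3, -1.2])     # hypothetical Q(Z|X) mean for one data point x
sigma = np.array([0.8, 1.5])   # hypothetical Q(Z|X) standard deviations
print(kl_diag_gaussian_to_standard_normal(mu, sigma))   # >= 0, zero iff mu=0 and sigma=1
```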
Why is L called the variational lower bound? Substituting L back into Eq. (1), we have:

$$KL(Q\,\|\,P) = \log p(x) - \mathcal{L}$$

$$\log p(x) = \mathcal{L} + KL(Q\,\|\,P) \qquad (4)$$

The meaning of Eq. (4), in plain language, is that log p(x), the log-likelihood of a data point x under the true distribution, is L, plus an error term KL(Q‖P) that captures the distance between Q(Z|X = x) and P(Z|X = x) at that particular value of X.

Since KL(Q‖P) ≥ 0, log p(x) must be greater than or equal to L. Therefore L is a lower bound for log p(x). L is also referred to as the evidence lower bound (ELBO), via the alternate formulation:

$$\mathcal{L} = \log p(x) - KL(Q(Z|X)\,\|\,P(Z|X))$$

Note that L itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in log p(x).
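Eq. (4) is easy to verify numerically. Here is a toy sketch of my own (with made-up probabilities for a binary Z) that checks log p(x) = L + KL(Q‖P) and that L never exceeds log p(x):

```python
# Toy check: for a discrete Z in {0, 1} with arbitrary assumed probabilities,
# verify log p(x) = L + KL(Q(Z|x) || P(Z|x)) and L <= log p(x).
import numpy as np

p_z = np.array([1/3, 2/3])            # prior p(Z)
p_x_given_z = np.array([0.20, 0.05])  # likelihood p(x | Z = z) for one fixed x

p_xz = p_x_given_z * p_z              # joint p(x, z)
p_x = p_xz.sum()                      # evidence p(x)
p_z_given_x = p_xz / p_x              # true posterior p(Z | x)

q = np.array([0.5, 0.5])              # an arbitrary variational distribution Q(Z | x)

L = np.sum(q * (np.log(p_xz) - np.log(q)))            # variational lower bound
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))    # KL(Q || P(Z|x))

print(np.log(p_x), L + kl)   # identical up to floating point
print(L <= np.log(p_x))      # True: L is a lower bound on log p(x)
```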
KL divergence is not a symmetric distance function, i.e. KL(P‖Q) ≠ KL(Q‖P) (except when Q ≡ P). The first is known as the "forward KL", while the latter is the "reverse KL". So why do we use the reverse KL? Because if we used the forward KL, the resulting derivation would require us to know how to compute p(Z|X), which is exactly the quantity we are trying to approximate in the first place.
I really like Kevin Murphy's explanation in the PML textbook, which I shall attempt to re-phrase here:

Let's consider the forward KL first. As we saw from the above derivations, we can write KL as the expectation of a "penalty" function log p(z)/q(z) over a weighting function p(z):

$$
KL(P\,\|\,Q) = \sum_z p(z) \log \frac{p(z)}{q(z)} = \mathbb{E}_{p(z)}\left[ \log \frac{p(z)}{q(z)} \right]
$$
The penalty function contributes loss to the total KL wherever p(Z) > 0. For p(Z) > 0, lim_{q(z)→0} log(p(z)/q(z)) = ∞. This means that the forward KL will be large wherever Q(Z) fails to "cover up" P(Z).

Therefore, the forward KL is minimized when we ensure that q(z) > 0 wherever p(z) > 0. The optimized variational distribution Q(Z) is known as "zero-avoiding" (the density q avoids being zero wherever p(Z) is nonzero).
Now consider the reverse KL, which has the same penalty-style ratio but uses q(z) as the weighting function:

$$
KL(Q\,\|\,P) = \sum_z q(z) \log \frac{q(z)}{p(z)} = \mathbb{E}_{q(z)}\left[ \log \frac{q(z)}{p(z)} \right]
$$

Since lim_{p(z)→0} log(q(z)/p(z)) = ∞, we must ensure that the weighting function q(Z) = 0 wherever the denominator p(Z) = 0; otherwise the KL blows up. This is known as "zero-forcing".
So in summary, minimizing forward-KL "stretches" your variational distribution Q(Z) to cover over the entire P(Z) like a tarp, while minimizing reverse-KL "squeezes" the Q(Z) under P(Z).
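Here is a small sketch of that "tarp vs. squeeze" behavior (again a toy example of my own, with an assumed bimodal P(Z)): we fit the same Gaussian family by minimizing the forward KL and the reverse KL on a grid, and compare the resulting fits.

```python
# Sketch (not from the post): fit the same Gaussian family to a bimodal target
# under forward KL vs. reverse KL, to see zero-avoiding vs. zero-forcing behavior.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

z = np.linspace(-12, 12, 3001)
dz = z[1] - z[0]
p = 0.5 * stats.norm.pdf(z, -3.0, 1.0) + 0.5 * stats.norm.pdf(z, 3.0, 1.0)
p = p / (p.sum() * dz) + 1e-12          # assumed bimodal "true" distribution P(Z)

def q_of(phi):                          # Gaussian family, phi = (mu, log_sigma)
    q = stats.norm.pdf(z, phi[0], np.exp(phi[1]))
    return q / (q.sum() * dz) + 1e-12

forward_kl = lambda phi: np.sum(p * np.log(p / q_of(phi))) * dz        # KL(P || Q)
reverse_kl = lambda phi: np.sum(q_of(phi) * np.log(q_of(phi) / p)) * dz  # KL(Q || P)

for name, objective in [("forward", forward_kl), ("reverse", reverse_kl)]:
    res = minimize(objective, x0=np.array([0.5, 0.0]), method="Nelder-Mead")
    print(f"{name}-KL fit: mu = {res.x[0]:.2f}, sigma = {np.exp(res.x[1]):.2f}")
# Typical outcome: the forward-KL fit stretches across both modes (large sigma),
# while the reverse-KL fit squeezes under a single mode and ignores the other.
```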
It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there are regions where P(Z) actually has probability mass but Q(Z) assigns none).
Variational methods are really important for Deep Learning. I will elaborate more in a later post, but
here's a quick spoiler:
1. Deep learning is really good at optimization (specifically, gradient descent) over very large
parameter spaces using lots of data.
2. Variational Bayes gives us a framework with which we can re-write statistical inference problems as optimization problems.
Combining Deep Learning and VB methods allows us to perform inference on extremely complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower bound derived in this post!
15 comments:

There should be a minus in equation (3) for E[log p(x|z)], i.e. E[−log p(x|z)], otherwise your definition of KL-divergence isn't consistent.
Ankur.

Do you mind explaining where that negative comes from? I was anticipating a plus...
Given the title of your post, it's worth giving some motivation behind the name "mean-field approximation".
From a statistical physics point of view, "mean-field" refers to the relaxation of a difficult optimization
problem to a simpler one which ignores second-order effects. For example, in the context of graphical
models, one can approximate the partition function of a Markov random field via maximization of the Gibbs
free energy (i.e., log partition function minus relative entropy) over the set of product measures, which is
significantly more tractable than global optimization over the space of all probability measures (see, e.g.,
M. Mezard and A. Montanari, Sect 4.4.2).
From an algorithmic point of view, "mean-field" refers to the naive mean field algorithm for computing
marginals of a Markov random field. Recall that the fixed points of the naive mean field algorithm are
optimizers of the mean-field approximation to the Gibbs variational problem. This approach is "mean" in
that it is the average/expectation/LLN version of the Gibbs sampler, hence ignoring second-order
(stochastic) effects (see, e.g., M. Wainwright and M. Jordan, (2.14) and (2.15)).