FALL 2017
Dates
Python tutorial 12/13 September
TensorFlow tutorial 24/25 October
Midterm exam 19/23 October
Final project due 11 December
Office Hours Mon/Tue 5:30-7:30pm, Room 1025, Dept of Statistics, 10th floor SSW
Class Homepage
https://wendazhou.com/teaching/AdvancedMLFall17/
Homework
Some homework problems and final project require coding
Coding: Python
Homework due: Tue/Wed at 4pm; no late submissions
You can drop two homeworks from your final score
Grade
Homework + Midterm Exam + Final Project
20% 40% 40%
Email
All email to the TAs, please.
The instructors will not read your email unless it is forwarded by a TA.
Books (optional)
Today
Machine learning and statistics have become hard to tell apart.
Task
Balance the pendulum upright by moving the sled left and right.
The computer can control only the motion of the sled.
Formalization
State = 4 variables (sled location, sled velocity, angle, angular velocity)
Actions = sled movements
Note well
Learning how the world works is a regression problem.
A linear classifier with normal vector v_H and offset c decides between the two classes according to

    sgn(⟨v_H, x⟩ − c) > 0   or   sgn(⟨v_H, x⟩ − c) < 0 .

[Figure: a layer of units mapping inputs x_1, x_2, x_3 through weight vectors v_1, v_2, v_3 to outputs y_1, y_2, y_3, e.g. y_2 = φ(v^t x) for a nonlinearity φ.]
Neural networks
Representation of a function using a graph.

[Figure: a layered graph with inputs x_1, x_2, x_3, edge weights v_11, …, v_33, and units f and g; the graph symbolizes the composition f(g(x)).]
McCulloch-Pitts model
Collect the input signals x_1, x_2, x_3 into a vector x = (x_1, x_2, x_3) ∈ R^3.
Choose a fixed vector v ∈ R^3 and a constant c ∈ R.
Compute:

    y = I{⟨v, x⟩ > c}   for some c ∈ R .

[Figure: a single threshold unit with inputs x_1, x_2, x_3 (weights v_1, v_2, v_3) and a constant input 1, applying I{· > 0} to ⟨x, v⟩/‖v‖ minus a threshold.]

In other words, the unit evaluates a linear classifier f(x) = sgn(⟨v, x⟩ − c), i.e. it outputs y = I{v^t x > c}.
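The unit above can be sketched in a few lines of Python; `mp_unit` and the example weights are illustrative choices, not from the slides:

```python
import numpy as np

# A McCulloch-Pitts unit: fire (output 1) iff <v, x> exceeds the threshold c.
# v and c are fixed here; estimating them is the perceptron's job.
def mp_unit(x, v, c):
    return int(np.dot(v, x) > c)

v = np.array([1.0, -1.0, 0.5])
c = 0.0
print(mp_unit(np.array([2.0, 1.0, 0.0]), v, c))  # <v,x> = 1.0 > 0, so fires
print(mp_unit(np.array([0.0, 1.0, 0.0]), v, c))  # <v,x> = -1.0, so silent
```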
[Figure: surface plots of four criterion functions J(a), Jp(a), Jq(a), Jr(a) over the weight space (a_1, a_2), each marking the solution region. Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001]
The Perceptron Criterion
[Figure: surface plots of the criterion functions Jp(a) and Jr(a) over (a_1, a_2), marking the solution region; same source as above.]
Perceptron
Train the McCulloch-Pitts model (that is: estimate (c, v)) by applying gradient descent to the function

    C_p(c, v) := Σ_{i=1}^n I{ sgn(⟨v, x_i⟩ − c) ≠ y_i } · |⟨v, x_i⟩ − c| ,

called the Perceptron cost function. (Here the labels are y_i ∈ {−1, +1}, and |⟨v, x_i⟩ − c| = |⟨(v, −c), (x_i, 1)⟩| measures the size of the error on a misclassified point.)
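A minimal batch version of this training scheme on synthetic separable data (the data, step size, and iteration count are illustrative choices, not from the slides); for misclassified points, the (sub)gradient step moves w toward y_i x̃_i:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {-1, +1}.
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.5)
Xt = np.hstack([X, np.ones((100, 1))])    # append 1, so w plays the role of (v, -c)

w = np.zeros(3)
for _ in range(200):                      # batch (sub)gradient steps
    wrong = np.sign(Xt @ w) != y          # only misclassified points drive the update
    if not wrong.any():
        break
    w += 0.1 * (y[wrong, None] * Xt[wrong]).sum(axis=0)

acc = float((np.sign(Xt @ w) == y).mean())
print(acc)
```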
The sigmoid function

    σ(x) = 1 / (1 + e^{−x})

Note

    1 − σ(x) = e^{−x} / (1 + e^{−x}) = 1 / (e^{x} + 1) = σ(−x)

Derivative

    (d/dx) σ(x) = e^{−x} / (1 + e^{−x})^2 = σ(x)(1 − σ(x))

[Figure: the sigmoid (blue) and its derivative (red).]
In linear classification: The decision boundary is a discontinuity. The boundary can be represented either by the discontinuous indicator function or by a smooth approximation of it. The most important use of the sigmoid function in machine learning is as a smooth approximation to the indicator function.

Given a sigmoid σ and a data point x, we decide which side of the approximated boundary we are on by thresholding:

    σ(x) ≥ 1/2

[Figure: sigmoids of increasing steepness approximating the indicator function.]
Influence of θ
Scale the argument of the sigmoid as σ(θx), for θ > 0.
As θ increases, σ(θx) approximates I{x > 0} more closely.
For θ → ∞, the sigmoid converges to I pointwise, that is: for every x ≠ 0, we have σ(θx) → I{x > 0} as θ → +∞.
Note σ(θ · 0) = 1/2 always, regardless of θ.
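Numerically, the convergence σ(θx) → I{x > 0} is easy to observe (a small illustration with arbitrary values of x and θ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.5                               # any fixed x != 0
for theta in [1, 10, 100]:
    print(theta, sigmoid(theta * x))  # approaches I{x > 0} = 1 as theta grows

print(sigmoid(0.0))                   # exactly 1/2 at the boundary, for every theta
```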
The decision boundary of a linear classifier in R^2 is a discontinuous ridge. We can stretch σ into a ridge function on R^2: the function σ(⟨v, x⟩ − c) is a sigmoid ridge, where the ridge is orthogonal to the normal vector v, and c is an offset that shifts the ridge away from the origin.

[Figure: the plot on the right shows the normal vector (here: v = (1, 1)) in black.]

The parameters v and c have the same meaning for I and σ, that is, σ(⟨v, x⟩ − c) approximates I{⟨v, x⟩ > c}.
Setup
Two-class classification problem.
Observations x_1, …, x_n ∈ R^d, class labels y_i ∈ {0, 1}.

    P(y | x) := Bernoulli( σ(⟨v, x⟩ − c) ) .

Recall σ(⟨v, x⟩ − c) takes values in [0, 1] for all x, and value 1/2 on the class boundary. The logistic regression model interprets this value as the probability of being in class y. Since the model is defined by a parametric distribution, we can apply maximum likelihood.
Notation
Recall from Statistical Machine Learning: We collect the parameters in a vector w by writing

    w := (v, −c)   and   x̃ := (x, 1)   so that   ⟨w, x̃⟩ = ⟨v, x⟩ − c .
Negative log-likelihood

    L(w) := − Σ_{i=1}^n [ y_i log σ(⟨w, x̃_i⟩) + (1 − y_i) log( 1 − σ(⟨w, x̃_i⟩) ) ]

Gradient

    ∇L(w) = Σ_{i=1}^n ( σ(w^t x̃_i) − y_i ) x̃_i

Note
Each training data point x_i contributes to the sum proportionally to the approximation error σ(w^t x̃_i) − y_i incurred at x_i by approximating the linear classifier by a sigmoid.
Maximum likelihood
The ML estimator ŵ for w is the solution of

    ∇L(w) = 0 .

For logistic regression, this equation has no solution in closed form.
To find ŵ, we use numerical optimization.
The function L is convex (∪-shaped).
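A quick sketch of the negative log-likelihood and its gradient, checked against a finite-difference quotient; the function names and toy data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Negative log-likelihood of logistic regression and its gradient,
# using the augmented data matrix X (each row x_i with a trailing 1).
def nll(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)

# Sanity check: compare one gradient coordinate to a central difference.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])
y = (X[:, 0] > 0).astype(float)
w = rng.normal(size=3)
eps = 1e-6
e0 = np.zeros(3); e0[0] = eps
fd = (nll(w + e0, X, y) - nll(w - e0, X, y)) / (2 * eps)
print(abs(fd - grad(w, X, y)[0]))   # tiny: the gradient formula checks out
```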
Matrix notation
Write X for the data matrix (or design matrix) you know from linear regression: the n × (d + 1) matrix whose rows are the augmented data points x̃_i^t (each x_i with a constant entry 1). Write D for the n × n diagonal matrix with entries

    D_ii = σ(w^t x̃_i) ( 1 − σ(w^t x̃_i) ) .
Newton step

    w^{(k+1)} = (X^t D X)^{−1} X^t D u^{(k)}

where

    u^{(k)} := X w^{(k)} − D^{−1} ( σ(⟨w^{(k)}, x̃_1⟩) − y_1, …, σ(⟨w^{(k)}, x̃_n⟩) − y_n )^t .
Differences:
The vector y of regression responses is substituted by the vector u(k) above.
Newton: Cost
The size of the Hessian is (d + 1) (d + 1).
In high-dimensional problems, inverting HL can become problematic.
Other methods
Maximum likelihood only requires that we minimize the negative log-likelihood; we can choose
any numerical method, not just Newton. Alternatives include:
Pseudo-Newton methods (only invert HL once, for w(1) , but do not guarantee quadratic
convergence).
Gradient methods.
Approximate gradient methods, like stochastic gradient.
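A compact sketch of the Newton/IRLS update above in NumPy (illustrative only: no step-size control, regularization, or safeguards against separable data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Newton/IRLS iteration for logistic regression: each step solves a
# weighted least-squares problem with working response u.
def irls(X, y, steps=25):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        d = p * (1 - p)                       # diagonal of D
        u = X @ w - (p - y) / d               # working response u^(k)
        w = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * u))
    return w

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)
w_hat = irls(X, y)
print(w_hat)   # should be roughly in the direction of w_true
```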
[Figure: as the vector v gets longer, the sigmoid ridge σ(w^t x̃) becomes steeper and the error terms σ(w^t x̃_i) − y_i shrink, while the decision boundary stays put.]

Recall each training data point x_i contributes an error term σ(w^t x̃_i) − y_i to the log-likelihood. By increasing the length of w, we can make σ(w^t x̃_i) − y_i arbitrarily small without moving the decision boundary.
Solutions
Overfitting can be addressed by including an additive penalty of the form L(w) + λ‖w‖.
Logistic regression
Recall two-class logistic regression is defined by P(Y | x) = Bernoulli(σ(w^t x̃)).

Idea: To generalize logistic regression to K classes, choose a separate weight vector w_k for each class k, and define P(Y | x) by

    Multinomial( σ̄(w_1^t x̃), …, σ̄(w_K^t x̃) )

where σ̄ normalizes the sigmoid values,

    σ̄(w_k^t x̃) = σ(w_k^t x̃) / Σ_l σ(w_l^t x̃) .

The resulting class probabilities are

    P(y | x) = Π_{k=1}^K σ̄(w_k^t x̃)^{I{y = k}} .
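The normalization above can be sketched directly (the function name and the weight matrix are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# K-class generalization: one sigmoid score per class, normalized to sum to 1.
def class_probs(W, x):
    s = sigmoid(W @ x)        # one score sigma(w_k^t x) per class k
    return s / s.sum()        # normalize into a probability vector

W = np.array([[ 1.0, 0.0, 0.2],
              [-1.0, 2.0, 0.0],
              [ 0.5, 0.5, -1.0]])   # K = 3 weight vectors (rows)
p = class_probs(W, np.array([1.0, 1.0, 1.0]))
print(p, p.sum())             # a distribution over the 3 classes
```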
A graphical model represents the dependence structure within a set of random variables
as a graph.
Overview
Roughly speaking:
Each random variable is represented by a vertex.
Each direct dependence is represented by an edge, e.g. X → Y.
Reason
If X is discrete, L(X) is usually given by a mass function P(x).
If it is continuous, L(X) is usually given by a density p(x).
With the notation above, we do not have to distinguish between discrete and continuous
variables.
Recall
Two random variables are stochastically independent, or independent for short, if their joint
distribution factorizes:
L(X, Y) = L(X)L(Y)
For densities/mass functions:
P(x, y) = P(x)P(y) or p(x, y) = p(x)p(y)
Dependent means not independent.
Intuitively
X and Y are dependent if knowing the outcome of X provides any information about the
outcome of Y.
More precisely:
If someone draws (X, Y) simultaneously, and only discloses X = x to you, does that change what you can predict about Y? If so, X and Y are dependent.
Definition
Given random variables X, Y, Z, we say that X is conditionally independent of Y given Z if
L(X, Y|Z = z) = L(X|Z = z)L(Y|Z = z) .
That is equivalent to
L(X|Y = y, Z = z) = L(X|Z = z) .
Notation

    X ⫫ Y | Z

[Graph: Z with one edge to X and one edge to Y.]
Intuitively
X and Y are dependent given Z = z if, although Z is known, knowing the outcome of X provides
additional information about the outcome of Y.
Definition
Let X_1, …, X_n be random variables. A (directed) graphical model represents a factorization of the joint distribution L(X_1, …, X_n) as follows: factorize L(X_1, …, X_n) into a product of conditionals, represent each variable by a vertex, and draw an edge into X_i from each variable appearing in the conditional of X_i.
Lack of uniqueness
The factorization is usually not unique, since e.g.
L(X, Y) = L(X|Y)L(Y) = L(Y|X)L(X) .
That means the direction of edges is not generally determined.
Remark
If we use a graphical model to define a model or visualize a model, we decide on the
direction of the edges.
Estimating the direction of edges from data is a very difficult (and very important)
problem. This is one of the main subjects of a research field called causal inference or
causality.
[Graph: a layered model, Layer 1 feeding into Layer 2, ….]
All variables in the (k + 1)st layer are conditionally independent given the variables in the kth layer.
[Graph: Z with edges to both X and Y (X ← Z → Y).]

Important
X and Y are not independent; independence holds only conditionally on Z.
In other words: If we do not observe Z, X and Y are dependent, and we have to change the graph to

    X → Y   or   X ← Y .
Example
Suppose we start with two independent normal variables X and Y, and set Z = X + Y.
If we know Z, and someone reveals the value of Y to us, we know everything about X (namely X = Z − Y).
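This conditional dependence is easy to see by simulation: marginally X and Y are uncorrelated, but on the slice where Z ≈ 0 they are almost perfectly negatively correlated (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# X, Y independent normals; Z = X + Y couples them once Z is observed.
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
z = x + y

c_marginal = np.corrcoef(x, y)[0, 1]                   # ~ 0: independent
near0 = np.abs(z) < 0.1                                # condition on Z ~ 0
c_conditional = np.corrcoef(x[near0], y[near0])[0, 1]  # ~ -1: X ~ -Y on the slice
print(c_marginal, c_conditional)
```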
[Graph: a hidden Markov model — a chain Z_1 → Z_2 → ⋯ → Z_{n−1} → Z_n with emissions Z_i → X_i.]

Terminology: Belief network and Bayes net are alternative names for graphical models.

[Graph: the ways of coupling X and Y through Z, shown as directed and undirected graphs.]
[Graph: a neighborhood graph; vertices …, i−1, i, i+1, … with edge weights such as w_{i−1,i} and w_{i+1,j+1}.]

A random variable θ_i is associated with each vertex. Two random variables interact if they are neighbors in the graph. We write N = (V_N, W_N) for the graph, with vertex set V_N and edge weights W_N.

Neighborhoods
The set of all neighbors of v_i in the graph,

    ∂(i) := { j | w_ij ≠ 0 } ,

is called the neighborhood of v_i (drawn in purple in the figure).
In words
The Markov property says that each θ_i is conditionally independent of the remaining variables given its Markov blanket (here: its neighborhood ∂(i)).

Definition
A distribution L(θ_1, …, θ_n) which satisfies the Markov property for a given graph N is called a Markov random field.

MRF energy
In particular, we can write an MRF density for the random variables θ_{1:n} as

    p(θ_1, …, θ_n) = (1/Z) exp( −H(θ_1, …, θ_n) )

where H is an energy function and Z the normalization constant.
Graphical models factorize over the graph. How does that work for MRFs?

[Figure: a graph on six vertices, including vertices 1, 2, 5, 6.]
The cliques in this graph are: (i) the triangles (1, 2, 3) and (1, 3, 4); (ii) each pair of vertices connected by an edge (e.g. (2, 6)).
Theorem
Let N be a neighborhood graph with vertex set V_N. Suppose the random variables {θ_i, i ∈ V_N} take values in T, and their joint distribution has probability mass function P, so there is an energy function H such that

    P(θ_1, …, θ_n) = e^{−H(θ_1, …, θ_n)} / Σ_{θ̃ ∈ T^n} e^{−H(θ̃_1, …, θ̃_n)} .

Then H decomposes into clique-wise terms: there are non-negative functions H_C, each with |C| arguments, such that

    H(θ_1, …, θ_n) = Σ_{C ∈ C} H_C(θ_i, i ∈ C) ,

where C is the set of cliques in N. Hence,

    P(θ_1, …, θ_n) ∝ Π_{C ∈ C} e^{−H_C(θ_i, i ∈ C)} .
Definition
Suppose N = (V_N, W_N) is a neighborhood graph with n vertices and β > 0 a constant. Then

    p(θ_{1:n}) := (1/Z(β, W_N)) exp( β Σ_{i,j} w_ij I{θ_i = θ_j} )

defines a Markov random field (the Potts model).

Interpretation
If w_ij > 0: the overall probability increases if θ_i = θ_j.
If w_ij < 0: the overall probability decreases if θ_i = θ_j.
If w_ij = 0: no interaction between θ_i and θ_j.
Positive weights encourage smoothness.
Ising model
The simplest choice is w_ij = 1 if (i, j) is an edge:

    p(θ_{1:n}) = (1/Z(β)) exp( β Σ_{(i,j) is an edge} I{θ_i = θ_j} )

[Figure: a grid graph with rows …, i−1, i, i+1, … of vertices.]

Example
Samples from an Ising model on a 56 × 56 grid graph, for increasing β.
Spatial model
Suppose we model each X_i by a distribution L(X | θ_i), i.e. each location i has its own parameter variable θ_i. This model is Bayesian (the parameter is a random variable). We use an MRF as prior distribution.

[Figure: a two-layer graph; observed variables X_i on a grid, each attached to an unobserved parameter variable θ_i, with the θ_i forming an MRF and X_i ~ p(· | θ_i).]

Spatial smoothing
We can define the joint distribution of (θ_1, …, θ_n) as an MRF on the grid graph. For positive weights, the MRF will encourage the model to explain neighbors X_i and X_j by the same parameter value: spatial smoothing.
Problem 1: Sampling
Generate samples from the joint distribution of (θ_1, …, θ_n).
Problem 2: Inference
If the MRF is used as a prior, we have to compute or approximate the posterior distribution.
Solution
MRF distributions on grids are not analytically tractable. The only known exception is the
Ising model in 1 dimension.
Both sampling and inference are based on Markov chain sampling algorithms.
In general
A sampling algorithm is an algorithm that outputs samples X1 , X2 , . . . from a given
distribution P or density p.
Sampling algorithms can for example be used to approximate expectations:

    E_p[ f(X) ] ≈ (1/n) Σ_{i=1}^n f(X_i)
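For instance (an illustrative sketch with f(x) = x² and X ~ N(0, 1), where the true value E_p[f(X)] = Var(X) = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E_p[f(X)] by an average over samples from p.
# Here p = N(0, 1) and f(x) = x^2, so the true value is 1.
samples = rng.normal(size=200_000)
approx = np.mean(samples ** 2)
print(approx)   # close to 1
```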
Posterior expectations
If we are only interested in some statistic of the posterior of the form E_{Q_n}[ f(θ) ] (e.g. the posterior mean), we can again approximate by

    E_{Q_n}[ f(θ) ] ≈ (1/m) Σ_{i=1}^m f(θ_i) .
Example: Predictive distribution
The posterior predictive distribution is our best guess of what the next data point x_{n+1} looks like, given the posterior under previous observations. In terms of densities:

    p(x_{n+1} | x_{1:n}) := ∫_T p(x_{n+1} | θ) Q_n(dθ | X_{1:n} = x_{1:n}) .

This is one of the key quantities of interest in Bayesian statistics.
[Figure: a density p on an interval [a, b]; the area A under the curve, with a point (X_i, Y_i) sampled from it.]

Key observation
Suppose we can define a uniform distribution U_A on the blue area A under the curve. If we sample

    (X_1, Y_1), (X_2, Y_2), … ~ iid U_A

and discard the vertical coordinates Y_i, the X_i are distributed according to p,

    X_1, X_2, … ~ iid p .
[Figure: the density p on [a, b], bounded by a constant c; and the same construction with the bound kc, giving a taller enclosing box B.]

If we can only evaluate p up to a constant k, we simply draw Y_i ~ Uniform[0, kc] instead of Y_i ~ Uniform[0, c]; rescaling the box vertically does not change the distribution of the accepted X_i.
Consequence
For sampling, it is sufficient if p is known only up to normalization
(only the shape of p is known).
Sampling methods usually assume that we can evaluate the target distribution p up to a constant. That is:

    p(x) = (1/Z) p̃(x) ,

and we can compute p̃(x) for any given x, but we do not know Z.
We have to pause for a moment and convince ourselves that there are useful examples where
this assumption holds.
Provided that we can compute the numerator, we can sample without computing the
normalization integral Z.
We already know that we can discard the normalization constant, but can we evaluate the non-normalized posterior q̃_n?

The problem with computing q̃_n as a function of unknowns is that the product Π_{i=1}^n Σ_{k=1}^K c_k p(x_i | θ_k) blows up into K^n individual terms. If we instead evaluate q̃_n for specific values of c, x and θ, each sum Σ_{k=1}^K c_k p(x_i | θ_k) collapses to a single number, and the product is easy to compute.

For an MRF on n binary variables, by contrast, the normalization constant is a sum over 2^n terms; the general Potts model is even more difficult.
If the support is not an interval, sampling uniformly from an enclosing box is not possible (since there is no uniform distribution on all of R or R^d).

[Figure: a density p(x) with unbounded support.]
Factorization
We factorize the target distribution or density p as

    p(x) ∝ r(x) · A(x)

where r is a distribution from which we know how to sample, and A is a probability (the acceptance probability) we can evaluate once a specific value of x is given.

If we draw proposal samples X_i i.i.d. from r, the resulting sequence of accepted samples produced by rejection sampling is again i.i.d. with distribution p. Hence:

Rejection samplers produce i.i.d. sequences of samples.
Important consequence
If samples X_1, X_2, … are drawn by a rejection sampler, the sample average

    (1/m) Σ_{i=1}^m f(X_i)

approximates E_p[ f(X) ] just as it would for direct i.i.d. samples from p.

The fraction of accepted samples is the ratio |A|/|B| of the areas under the curves p and r.
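The box construction above can be sketched in a few lines (the triangle density is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rejection sampling from an unnormalized density p~ on [a, b]:
# throw uniform points into the box [a, b] x [0, c] and keep the
# x-coordinates of points that fall under the curve.
def rejection_sample(p_tilde, a, b, c, n_proposals):
    x = rng.uniform(a, b, size=n_proposals)
    y = rng.uniform(0.0, c, size=n_proposals)
    return x[y <= p_tilde(x)]

# Unnormalized triangle density p~(x) = x on [0, 1], i.e. p(x) = 2x.
xs = rejection_sample(lambda x: x, 0.0, 1.0, 1.0, 200_000)
rate = len(xs) / 200_000
print(rate)       # acceptance rate = area under p~ / box area = 0.5
print(xs.mean())  # E_p[X] = 2/3 for p(x) = 2x
```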
[Figure: example figures for sampling methods tend to show a smooth, low-dimensional density; a high-dimensional distribution of correlated random variables will look rather more like a collection of narrow, well-separated spikes.]
We can easily end up in situations where we accept only one in 10^6 (or 10^10, or 10^20, …) proposal samples. Especially in higher dimensions, we have to expect this to be not the exception but the rule.
The rejection problem can be fixed easily if we are only interested in approximating an
expectation Ep [ f (X)].
Importance sampling
We can sample X_1, X_2, … from q and approximate E_p[ f(X) ] as

    E_p[ f(X) ] ≈ (1/m) Σ_{i=1}^m f(X_i) p(X_i)/q(X_i)

Approximating E_p[ f(X) ] when only p̃ = Z_p p and q̃ = Z_q q can be evaluated:

    E_p[ f(X) ] ≈ (1/m) Σ_{i=1}^m f(X_i) p(X_i)/q(X_i)
                = (1/m) Σ_{i=1}^m (Z_q/Z_p) f(X_i) p̃(X_i)/q̃(X_i)
                ≈ ( Σ_{i=1}^m f(X_i) p̃(X_i)/q̃(X_i) ) / ( Σ_{j=1}^m p̃(X_j)/q̃(X_j) )
Conditions
Given are a target distribution p and a proposal distribution q, with

    p = (1/Z_p) p̃   and   q = (1/Z_q) q̃ .

We can evaluate p̃ and q̃, and we can sample from q.
The objective is to compute E_p[ f(X) ] for a given function f.
Algorithm
1. Sample X_1, …, X_m from q.
2. Approximate E_p[ f(X) ] as

    E_p[ f(X) ] ≈ ( Σ_{i=1}^m f(X_i) p̃(X_i)/q̃(X_i) ) / ( Σ_{j=1}^m p̃(X_j)/q̃(X_j) )
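A sketch of this self-normalized estimator (target, proposal, and f are arbitrary illustrative choices; here f(x) = x and the target is N(1, 1) known only up to normalization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Self-normalized importance sampling with proposal q = N(0, 2^2).
# Target (unnormalized): p~(x) = exp(-(x - 1)^2 / 2), i.e. p = N(1, 1).
p_tilde = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)
q_pdf = lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

m = 200_000
x = rng.normal(0.0, 2.0, size=m)        # samples from q
w = p_tilde(x) / q_pdf(x)               # unnormalized importance weights
est = np.sum(w * x) / np.sum(w)         # estimates E_p[X] = 1
print(est)
```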
region of interest
Once we have drawn a sample in the narrow region of interest, we would like to continue
drawing samples within the same region. That is only possible if each sample depends on the
location of the previous sample.
Proposals in rejection sampling are i.i.d. Hence, once we have found the region where p
concentrates, we forget about it for the next sample.
Note: For a Markov chain, Xi+1 can depend on Xi , so at least in principle, it is possible for an
MCMC sampler to "remember" the previous step and remain in a high-probability location.
The Markov chains we discussed so far had a finite state space X. For MCMC, state space now
has to be the domain of p, so we often need to work with continuous state spaces.
In the discrete case, t(y = i|x = j) is the entry pij of the transition matrix p.
Continuous case
X is now uncountable (e.g. X = Rd ).
The transition matrix p is substituted by the conditional probability t.
A distribution P_inv with density p_inv is invariant if

    ∫_X t(y|x) p_inv(x) dx = p_inv(y) .

This is simply the continuous analogue of the equation Σ_i p_ij (P_inv)_i = (P_inv)_j .
[Figure, three panels: (1) we run the Markov chain for n steps; each step moves from the current location x_i to a new location x_{i+1}; (2) we "forget" the order and regard the locations x_{1:n} as a random set of points; (3) if p (red contours) is both the invariant and the initial distribution, each X_i is distributed as X_i ~ p.]

Note: Making precise what aperiodic means in a continuous state space is a bit more technical than in the finite case, but the theorem still holds. We will not worry about the details here.
Implication
If we can show that P_inv = p, we do not have to know how to sample from p. Instead, we can start with any P_init, and the distribution of X_i will get arbitrarily close to p for sufficiently large i.

The number m of steps required until P_m ≈ P_inv = p is called the mixing time of the Markov chain. (In probability theory, there is a range of definitions for what exactly P_m ≈ P_inv means.)

In MCMC samplers, the first m samples are also called the burn-in phase, and are discarded from each run of the sampler:

    X_1, …, X_{m−1}   (burn-in: discard)      X_m, X_{m+1}, …   (samples from approximately p: keep)
Convergence diagnostics
In practice, we do not know how large the mixing time m is. There are a number of methods for assessing whether the sampler has mixed. Such heuristics are often referred to as convergence diagnostics.

Estimating the lag
The most common method uses the autocorrelation function

    Auto(x_i, x_j) := E[ (x_i − μ_i)(x_j − μ_j) ] / (σ_i σ_j) ,

where μ_i and σ_i denote the mean and standard deviation of x_i. We compute Auto(x_i, x_{i+L}) empirically from the sample for different values of L, and find the smallest lag L for which the autocorrelation is close to zero.

[Figure: two empirical autocorrelation plots against the lag L (0 to 25).]
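A sketch of this diagnostic on a synthetic slowly-mixing chain (the AR(1) chain and the lag values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical lag-L autocorrelation of a chain x_1, ..., x_n.
def auto(x, L):
    a, b = x[:-L], x[L:]
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())

# A slowly-mixing AR(1) chain: x_{i+1} = 0.9 x_i + noise.
x = np.zeros(50_000)
for i in range(1, len(x)):
    x[i] = 0.9 * x[i - 1] + rng.normal()

a1, a100 = auto(x, 1), auto(x, 100)
print(a1)     # large: consecutive samples are highly correlated
print(a100)   # near zero: samples 100 steps apart are nearly independent
```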
There are about half a dozen popular convergence criteria; the one below is an example.
Gelman-Rubin criterion
Start several chains at random. For each chain k, the sample X_i^k has a marginal distribution P_i^k.
The distributions P_i^k will differ between chains in early stages.
Once the chains have converged, all P_i^k = P_inv are identical.
Criterion: use a hypothesis test to compare P_i^k for different k (e.g. compare P_i^2 against the null hypothesis P_i^1). Once the test does not reject anymore, assume that the chains are past burn-in.
Reference: A. Gelman and D. B. Rubin: "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science, Vol. 7 (1992) 457-511.
If it decreases the probability, we still accept with a probability which depends on the
difference to the current probability.
Hill-climbing interpretation
The MH sampler somewhat resembles a gradient ascent algorithm on p, which tends to
move in the direction of increasing probability p.
However:
The actual steps are chosen at random.
The sampler can move "downhill" with a certain probability.
When it reaches a local maximum, it does not get stuck there.
More generally
For complicated posteriors (recall: small regions of concentration, large low-probability regions
in between) choosing q is much more difficult. To choose q with good performance, we already
need to know something about the posterior.
There are many strategies, e.g. mixture proposals (with one component for large steps and one
for small steps).
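A minimal random-walk Metropolis sampler illustrating these points (target, proposal scale, and burn-in length are arbitrary illustrative choices; with a symmetric proposal, the acceptance ratio reduces to p̃(proposal)/p̃(current)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-walk Metropolis sampler for an unnormalized target p~.
def metropolis(p_tilde, x0, steps, scale=1.0):
    x = x0
    chain = []
    for _ in range(steps):
        prop = x + rng.normal(0.0, scale)
        if rng.uniform() < p_tilde(prop) / p_tilde(x):  # accept; this can
            x = prop                                     # also move "downhill"
        chain.append(x)
    return np.array(chain)

# Target: p~(x) = exp(-(x - 3)^2 / 2), i.e. N(3, 1) up to normalization.
chain = metropolis(lambda x: np.exp(-0.5 * (x - 3.0) ** 2), 0.0, 50_000)
kept = chain[5_000:]          # discard the burn-in phase
print(kept.mean())            # close to the target mean 3
```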
By far the most widely used MCMC algorithm is the Gibbs sampler.
Full conditionals
Suppose L(X) is a distribution on RD , so X = (X1 , . . . , XD ). The conditional probability of the
entry Xd given all other entries,
L(Xd |X1 , . . . , Xd1 , Xd+1 , . . . , XD )
is called the full conditional distribution of Xd .
On RD , that means we are interested in a density
p(xd |x1 , . . . , xd1 , xd+1 , . . . , xD )
Gibbs sampling
The Gibbs sampler is the special case of the Metropolis-Hastings algorithm defined by

    proposal distribution for X_d = full conditional of X_d .

Gibbs sampling is only applicable if we can compute the full conditionals for each dimension d. If so, it provides us with a generic way to derive a proposal distribution.
Proposal distribution
Suppose p is a distribution on R^D, so each sample is of the form X_i = (X_{i,1}, …, X_{i,D}). We generate a proposal X_{i+1} coordinate by coordinate as follows:

    X_{i+1,1} ~ p( · | x_{i,2}, …, x_{i,D})
    ⋮
    X_{i+1,d} ~ p( · | x_{i+1,1}, …, x_{i+1,d−1}, x_{i,d+1}, …, x_{i,D})
    ⋮
    X_{i+1,D} ~ p( · | x_{i+1,1}, …, x_{i+1,D−1})

Note: Each new X_{i+1,d} is immediately used in the update of the next dimension d + 1.
No rejections
It is straightforward to show that the Metropolis-Hastings acceptance probability for each
xi+1,d+1 is 1, so proposals in Gibbs sampling are always accepted.
Full conditionals
In a grid with 4-neighborhoods, for instance, the Markov property implies that

    p(θ_d | θ_1, …, θ_{d−1}, θ_{d+1}, …, θ_D) = p(θ_d | θ_left, θ_right, θ_up, θ_down) ,

where θ_left, θ_right, θ_up, θ_down are the four grid neighbors of θ_d.

Gibbs sampler
Each step of the Gibbs sampler generates n updates according to

    θ_{i+1,d} ~ p( · | θ_{i+1,1}, …, θ_{i+1,d−1}, θ_{i,d+1}, …, θ_{i,D})
             ∝ exp( β ( I{θ_{i+1,d} = θ_left} + I{θ_{i+1,d} = θ_right} + I{θ_{i+1,d} = θ_up} + I{θ_{i+1,d} = θ_down} ) )
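The update above can be sketched for a binary (Ising-type) model on a grid with periodic boundary; the grid size, β, and number of sweeps are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampler for an Ising model (theta in {0, 1}) on an n x n torus grid
# with 4-neighborhoods: p(theta) ∝ exp(beta * sum over edges of I{theta_i = theta_j}).
def gibbs_ising(n=32, beta=1.0, sweeps=50):
    th = rng.integers(0, 2, size=(n, n))
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                nb = [th[(i - 1) % n, j], th[(i + 1) % n, j],
                      th[i, (j - 1) % n], th[i, (j + 1) % n]]
                e1 = beta * sum(v == 1 for v in nb)   # log-weight if theta_ij = 1
                e0 = beta * sum(v == 0 for v in nb)   # log-weight if theta_ij = 0
                p1 = np.exp(e1) / (np.exp(e1) + np.exp(e0))
                th[i, j] = int(rng.uniform() < p1)    # sample the full conditional
    return th

sample = gibbs_ising()
# Positive beta encourages neighbors to agree; a uniformly random field
# would only agree on about 50% of neighboring pairs.
agree = np.mean(sample == np.roll(sample, 1, axis=0))
print(agree)
```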
MRFs were introduced as tools for image smoothing and segmentation by D. and S. Geman in 1984. They sampled from a Potts model with a Gibbs sampler, discarding 200 iterations as burn-in. Such a sample (after 200 steps) is shown above, for a Potts model in which each variable can take one out of 5 possible values.

These patterns led computer vision researchers to conclude that MRFs are "natural" priors for image segmentation, since samples from the MRF resemble a segmented image.

[Figure: samples from several chains after 200 iterations (segmentation-like patterns) and after 10,000 iterations (large, nearly constant patches); Chain 1 through Chain 5.]

The "segmentation" patterns are not sampled from the MRF distribution p = P_inv, but rather from P_200 ≠ P_inv. The patterns occur not because MRFs are "natural" priors for segmentations, but because the sampler's Markov chain has not mixed. MRFs are smoothness priors, not segmentation priors.
Problem
We have to solve an inference problem where the correct solution is an intractable distribution with density p* (e.g. a complicated posterior in a Bayesian inference problem).

Variational approach
Approximate p* as

    q* := arg min_{q ∈ Q} D(q, p*)

where Q is a class of simple distributions and D is a cost function (small D means good fit). That turns the inference problem into a constrained optimization problem:

    min D(q, p*)   s.t. q ∈ Q
Derivatives of functionals
If F is a function space and ‖·‖ a norm on F, we can apply the same idea to a functional Φ : F → R:

    ∂Φ(f) := lim_{‖δf‖ → 0} ( Φ(f + δf) − Φ(f) ) / ‖δf‖

∂Φ(f) is called the Fréchet derivative of Φ at f. A function f* is a minimum of a Fréchet-differentiable functional only if ∂Φ(f*) = 0.
Obstacles
We have to represent the infinite-dimensional quantities f and ∂Φ(f) in some way.
Many interesting functionals are not Fréchet-differentiable as functionals on F. They only become differentiable when constrained to a much smaller subspace.
One solution is variational calculus, an analytic technique that addresses both problems. (We will not need the details.)
A standard simplification is to restrict the constraint q ∈ Q to a parametric family Q: each element of Q is of the form q(· | θ), for θ ∈ T ⊂ R^d. The problem then reduces back to optimization in R^d:

    min D(q(· | θ), p*)   s.t. θ ∈ T

We can apply gradient descent, Newton, etc.
Be careful
The differential entropy does not behave like the entropy (e.g. it can be negative). The KL divergence for densities has properties analogous to the mass function case.

DKL(p*‖q) emphasizes regions where the true model p* has high probability. That is what we should use if possible. However, we use VI precisely because p* is intractable, so we can usually not compute expectations under it. We therefore use DKL(q‖p*), an expectation under the approximating simpler model, instead. We have to understand the implications of this choice.
[Figure, panels (a) and (b): a target density p* (green contours) and Gaussian approximations q (red contours) over (z_1, z_2). Panel (a) shows the minimizer of DKL(q‖p*) = DKL(red‖green); panel (b) the minimizer of DKL(p*‖q) = DKL(green‖red).]
Summary: VI approximation

    min F(q)   s.t. q ∈ Q,   where   F(q) = E_q[log q(Z)] − E_q[log p(Z, x)]

Terminology
The function F is called a free energy in statistical physics. Since there are different forms of free energies, various authors attach different adjectives (variational free energy, Helmholtz free energy, etc).

Parts of the machine learning literature have renamed F, by maximizing the objective function −F and calling it an evidence lower bound, since

    −F(q) + DKL(q ‖ p(· | x)) = log p(x)   and hence   e^{−F(q)} ≤ p(x) ,

and p(x) is the evidence in the Bayes equation.
In the previous example
[Figure: of the two approximations shown before, the spherical Gaussian fit in panel (a) is a mean-field approximation; the other is not a mean field.]

Example: Mean Field for the Potts Model
Model
We consider an MRF distribution for X_1, …, X_n with values in {−1, +1}, given by

    P(X_1, …, X_n) = (1/Z(β)) exp( −β H(X_1, …, X_n) )   where   H = − Σ_{i,j=1}^n w_ij X_i X_j − Σ_{i=1}^n h_i X_i ,

where the h_i define an external field.

Variational approximation
We choose Q as the family of factorized distributions

    Q := { Π_{i=1}^n Q_{m_i} | m_i ∈ [−1, 1] }   where   Q_m(X) := (1 + m)/2 if X = +1 and (1 − m)/2 if X = −1 .

Each factor is a Bernoulli((1 + m)/2) variable, except that the range is {−1, 1} rather than {0, 1}.
Optimization Problem

    min DKL( Π_{i=1}^n Q_{m_i} ‖ P )
    s.t. m_i ∈ [−1, 1] for i = 1, …, n

That is: for given values of w_ij and h_i, we have to solve for the values m_i.

Interpretation
In the MRF P, the random variables X_i interact. There is no interaction in the approximation; instead, the effect of interactions is approximated by encoding them in the parameters m_i. This is somewhat like a single effect (a field) acting on all variables simultaneously (mean field).
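The resulting mean-field equations can be solved by fixed-point iteration; the sketch below updates each mean m_i to the tanh of the local field produced by its neighbors' current means (the graph, β, and field are illustrative, and the exact constant in the exponent depends on how the pairs (i, j) are counted in H):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean-field fixed-point iteration for an Ising-type MRF on a chain graph.
n = 20
W = np.zeros((n, n))
for i in range(n - 1):                 # chain: symmetric unit weights
    W[i, i + 1] = W[i + 1, i] = 1.0
h = rng.normal(scale=0.5, size=n)      # external field
beta = 0.3                             # small enough that the map is a contraction

m = np.zeros(n)
for _ in range(100):
    m = np.tanh(beta * (W @ m + h))    # update each mean from its local field

residual = np.max(np.abs(m - np.tanh(beta * (W @ m + h))))
print(residual)                        # ~ 0: m solves the fixed-point equations
print(np.max(np.abs(m)) < 1.0)         # the means stay inside (-1, 1)
```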
As a graphical model
[Figure: two equivalent diagrams of the mixture model; box notation indicates that c and θ are not random. Left: Z_1, …, Z_n with children X_1, …, X_n drawn explicitly. Right: the same model in plate notation, with a plate over the n pairs (Z, X).]
More precisely
Specifically, the term BMM implies that all priors are natural conjugate priors. That is:
The mixture components p(x|θ) are an exponential family model.
The prior on each θ_k is a natural conjugate prior of p.
The prior of the vector (C_1, …, C_K) is a Dirichlet distribution.

Definition
A model of the form

    π(x) = Σ_{k=1}^K C_k p(x | θ_k) = ∫_T p(x | θ) M(dθ)

is called a Bayesian mixture model if p(x|θ) is an exponential family model and M a random mixing distribution, where:
θ_1, …, θ_K ~ iid q( · | λ, y), where q is a natural conjugate prior for p, and (C_1, …, C_K) is Dirichlet-distributed.
As a graphical model
[Figure: plate diagram of the BMM, with a plate over the n observations X.]

Posterior distribution
The posterior density of a BMM under observations x_1, …, x_n is (up to normalization):

    Π(c_{1:K}, θ_{1:K} | x_{1:n}) ∝ ( Π_{i=1}^n Σ_{k=1}^K c_k p(x_i | θ_k) ) ( Π_{k=1}^K q(θ_k | λ, y) ) q_Dirichlet(c_{1:K})
This Gibbs sampler is a bit harder to derive, so we skip the derivation and only look at the algorithm.
Assignment probabilities
Each step of the Gibbs sampler computes an assignment matrix

    a = (a_ik), i = 1, …, n, k = 1, …, K,   with   a_ik = Pr{ x_i in cluster k } .

Entries are computed as they are in the EM algorithm:

    a_ik = C_k p(x_i | θ_k) / Σ_{l=1}^K C_l p(x_i | θ_l)

In contrast to EM, the values C_k and θ_k are random.
2. For each cluster k, sample a new value for θ_k^j from the conjugate posterior Π(θ_k) under the observations currently assigned to k:

    θ_k^j ~ q( · | λ + Σ_{i=1}^n I{Z_i^j = k},  y + Σ_{i=1}^n I{Z_i^j = k} S(x_i) )

3. Sample new cluster proportions C_{1:K}^j from the Dirichlet posterior (under all x_i):

    C_{1:K}^j ~ Dirichlet( α + n, g_{1:K}^j )   where   g_k^j = ( α g_k + Σ_{i=1}^n I{Z_i^j = k} ) / (α + n)

(the denominator α + n is the normalization).
The BMM Gibbs sampler looks very similar to the EM algorithm, with maximization steps (in
EM) substituted by posterior sampling steps:
                  Representation of assignments        Parameters
  EM              assignment probabilities a_{i,1:K}   a_ik-weighted MLE
  K-means         m_i = arg max_k a_{i,1:K}            MLE for each cluster
  Gibbs for BMM   m_i ~ Multinomial(a_{i,1:K})         sample posterior for each cluster
Dirichlet distribution
The Dirichlet distribution is the distribution on the simplex △_K with density

    q_Dirichlet(c_{1:K} | α, g_{1:K}) := (1/Z(α, g_{1:K})) exp( Σ_{k=1}^K (α g_k − 1) log c_k )

Parameters:
g_{1:K} ∈ △_K: mean parameter, i.e. E[c_{1:K}] = g_{1:K}.
α > 0: concentration.

[Figure: density plots and heat maps on △_3 for α = 0.8, 1, 1.8, 10 — large density values at the extreme points (α = 0.8); the uniform distribution on △_K (α = 1); a density peak around the mean (α = 1.8); the peak sharpens with increasing α.]

Model
The Dirichlet is the natural conjugate prior on the multinomial parameters. If we observe h_k counts in category k, the posterior is

    Π(c_{1:K} | h_1, …, h_K) = q_Dirichlet( c_{1:K} | α + n, ((α g_1 + h_1)/(α + n), …, (α g_K + h_K)/(α + n)) )

where n = Σ_k h_k is the total number of observations.
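The conjugacy can be sketched numerically; in the (concentration, mean) parametrization used here, the posterior mean is (αg + h)/(α + n), and NumPy's `dirichlet` takes the concentration vector αg + h:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dirichlet-multinomial conjugacy: prior concentration alpha, prior mean g;
# after counts h (n = sum(h)), the posterior has concentration alpha + n
# and mean (alpha*g + h) / (alpha + n).
alpha = 6.0
g = np.array([1 / 3, 1 / 3, 1 / 3])
h = np.array([10, 2, 0])
n = h.sum()

post_conc = alpha + n
post_mean = (alpha * g + h) / (alpha + n)
print(post_mean, post_mean.sum())

# Cross-check the posterior mean by sampling.
samples = rng.dirichlet(alpha * g + h, size=100_000)
print(samples.mean(axis=0))
```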
So: Components of the prior couple (1) through normalization and (2) through the latent variable Σ_j X_j.
Topics φ_1, …, φ_K, where φ_kj = Pr{ word j in topic k }.
Problem
Each document is generated by a single topic; that is a very restrictive assumption.
Parameters
Suppose we consider a corpus with K topics and a vocabulary of d words.
K topic proportions (c_k = Pr{ topic k }).
Observation
LDA is almost a Bayesian mixture model: Both use multinomial components and a Dirichlet
prior on the mixture weights. However, they are not identical.
Comparison

  Bayesian MM                                  Admixture (LDA)
  Sample c_{1:K} ~ Dirichlet(α).               Sample c_{1:K} ~ Dirichlet(α).
  Sample topic k ~ Multinomial(c_{1:K}).       For i = 1, …, M:
  For i = 1, …, M:                               Sample topic k_i ~ Multinomial(c_{1:K}).
    Sample word_i ~ Multinomial(φ_k).            Sample word_i ~ Multinomial(φ_{k_i}).

In admixtures: c_{1:K} is generated at random, once for each document.

[Plate diagram: C ~ Dirichlet(α) (topic proportions); Z ~ Multinomial(C) (topic of word); word ~ Multinomial(row Z of φ) (observed word); inner plate over the N words, outer plate over the M documents.]
Bayesian mixture (for M documents of N words each) vs. LDA
[Plate diagrams of both models, with nodes C, Z, word and plates over N words and M documents.]

The parameter matrix φ is of size #topics × |vocabulary|. Meaning: φ_ki = probability that term i is observed under topic k. Note the entries of φ are non-negative and each row sums to 1.

To learn φ along with the other parameters, we add a Dirichlet prior with parameters η; the rows of φ are drawn i.i.d. from this prior. φ is now random.
Variational approximation
We use a factorized variational family

    Q = { q(z, c, φ | γ, ψ) }

where

    q(z, c, φ | γ, ψ) := Π_{k=1}^K p(φ_k | η) · Π_{m=1}^M q(c_m | γ_m) · Π_{m=1}^M Π_{n=1}^N p(z_mn | ψ_mn) ,

with Dirichlet factors q(c_m | γ_m) and multinomial factors p(z_mn | ψ_mn), and solve

    min DKL( q(z, c, φ | γ, ψ) ‖ p(z, c, φ | words) )
    s.t. q ∈ Q

[Plate diagram of the variational family: the factors decouple C, Z and word.]
Algorithmic solution
We solve the minimization problem

    min DKL( q(z, c, φ | γ, ψ) ‖ p(z, c, φ | words) )
    s.t. q ∈ Q

by applying gradient descent to the resulting free energy.

VI algorithm
It can be shown that gradient descent amounts to iterating the following updates:

Local updates (for each document m and word n):

    ψ_mn^{(t+1)} := ψ̃_mn / Σ_n ψ̃_mn
    where   ψ̃_mn = exp( E_q[ log C_m1, …, log C_md | γ^{(t)} ] + E_q[ log φ_{1,w_n}, …, log φ_{d,w_n} | η^{(t)} ] )

    γ_m^{(t+1)} := α + Σ_{n=1}^N ψ_mn^{(t+1)}

Global updates:

    η_k^{(t+1)} = η + Σ_m Σ_n word_mn ψ_mn^{(t+1)}
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. Lincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too.
Figure 8: An example article from the AP corpus. Each color codes a different factor from which the word is putatively generated. From Blei, Ng, Jordan, "Latent Dirichlet Allocation", 2003.
RESTRICTED BOLTZMANN MACHINES
BOLTZMANN MACHINE
Definition
A Markov random field distribution of variables X1 , . . . , Xn with values in {0, 1} is called a
Boltzmann machine if its joint law is
P(x_1, ..., x_n) = 1/Z · exp( Σ_{i<j≤n} w_ij x_i x_j + Σ_{i≤n} c_i x_i ),
where wij are the edge weights of the MRF neighborhood graph, and c1 , . . . , cn are scalar
parameters.
Remarks
The Markov blanket of X_i consists of those X_j with w_ij ≠ 0.
For x ∈ {−1, 1}^n instead: Potts model with external magnetic field.
This is an exponential family; its sufficient statistics are x_i x_j and x_i (with mean parameters E[X_i X_j] and E[X_i]).
As an exponential family, it is also a maximum entropy
model.
P(x_1, ..., x_n) = 1/Z · exp( Σ_{i<j≤n} w_ij x_i x_j + Σ_{i≤n} c_i x_i )
Matrix representation
We collect the parameters in a matrix W := (wij ) and a vector c = (c1 , . . . , cn ), and write
equivalently:
P(x_1, ..., x_n) = exp( x^t W x + c^t x ) / Z(W, c)
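The normalizer Z(W, c) sums over all 2^n binary states, which is what makes the density intractable in general; for small n it can be computed by brute force. A minimal sketch (the function name `boltzmann_prob` is mine; W is stored upper-triangular so each pair i < j is counted exactly once, matching the sum over i < j above):

```python
import itertools
import math

def boltzmann_prob(W, c, x):
    # P(x) = exp(x^t W x + c^t x) / Z(W, c), with W upper triangular
    # (entries w_ij for i < j). Z is computed by enumerating all 2^n
    # binary states, so this is feasible only for small n.
    n = len(c)
    def unnorm(state):
        s = sum(c[i] * state[i] for i in range(n))
        s += sum(W[i][j] * state[i] * state[j]
                 for i in range(n) for j in range(i + 1, n))
        return math.exp(s)
    Z = sum(unnorm(s) for s in itertools.product((0, 1), repeat=n))
    return unnorm(x) / Z
```

For example, with n = 2, w_12 = 1 and c = 0, the four states have unnormalized weights 1, 1, 1, e, so P(1, 1) = e / (3 + e).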
With observations

If some vertices represent observation variables Y_i:

P(x_1, ..., x_n, y_1, ..., y_m) = exp( (x,y)^t W (x,y) + c^t x + c̃^t y ) / Z(W, c, c̃)

[Figure: MRF neighborhood graph with hidden vertices X_1, ..., X_4 and observed vertices Y_1, ..., Y_4.]
Recall our hierarchical design approach

Only permit layered structure.
Obvious grouping: One layer for X, one for Y.
As before: No connections within layers.
Since the graph is undirected, that makes it bipartite.

[Figure: bipartite graph with layers X_1, X_2, X_3 and Y_1, Y_2, Y_3.]

Bipartite graphs

A graph is bipartite if its vertex set V can be subdivided into two sets A and B = V \ A such that every edge has one end in A and one end in B.
Definition
A restricted Boltzmann machine (RBM) is a Boltzmann machine whose neighborhood graph
is bipartite.
That defines two layers. We usually think of one of these layers as observed, and one as
unobserved.
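One payoff of bipartiteness: given the values of one layer, the units of the other layer are conditionally independent, each with a sigmoid activation probability. A minimal sketch for 0/1 units, assuming the joint above with cross-layer weights W and Y-layer biases c̃ (the function names are mine):

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_y_given_x(x, W, c_tilde, rng):
    # Because the RBM graph is bipartite, the Y-units are conditionally
    # independent given x, each Bernoulli with a sigmoid activation:
    #   P(Y_j = 1 | x) = sigmoid(sum_i w_ij x_i + c_tilde_j)
    probs = [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + c_tilde[j])
             for j in range(len(c_tilde))]
    return [1 if rng.random() < p else 0 for p in probs]
```

Sampling X given Y is symmetric (transpose W and use the X-layer biases), which is exactly what makes block Gibbs sampling in an RBM cheap.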
Consequence

[Figure: the two layers drawn as a network — an input layer of units 1, ..., N above a data layer X_1, ..., X_N.]
[Figure: the same two-layer network read as a Bayesian model, with layer θ_1, ..., θ_N above the data X_1, ..., X_N.]

L(θ_{1:N}) = Prior
L(X_{1:N} | θ_{1:N}) = Likelihood

Task: Given X_1, ..., X_N, find L(θ_{1:N} | X_{1:N}).
Problem: Recall explaining away.

[Figure: explaining-away triple — X and Y are both parents of Z — shown next to the layered network with layers θ_1, ..., θ_N and X_1, ..., X_N.]
Idea

Invent a prior such that the dependencies in the prior and likelihood cancel out. Such a prior is called a complementary prior. We will see how to construct a complementary prior below.

[Figure: the layered network with prior layer θ_1, ..., θ_N over the data layer X_1, ..., X_N.]
COMPLEMENTARY PRIOR

Observation

X^(1), X^(2), ..., X^(K) is a Markov chain.

See: G.E. Hinton, S. Osindero and Y.W. Teh, Neural Computation 18(7):1527-1554, 2006
BUILDING A COMPLEMENTARY PRIOR

[Figure: the Markov chain X^(1), X^(2), ... unrolled into the layers of a directed network.]
WHERE DO WE GET THE MARKOV CHAIN?

Start with an RBM with layers X and Y.

[Figure: the RBM unrolled into a deep directed network of alternating X- and Y-layers.]

The Gibbs sampler for the RBM becomes the model for the directed network.
For N → ∞, we have

Y^(∞) =_d Y .

Now suppose we use L(Y^(∞)) as our prior distribution. Then:

Using an infinitely deep graphical model given by the Gibbs sampler for the RBM as a prior is equivalent to using the Y-layer of the RBM as a prior.
First two layers: RBM (undirected)
Remaining layers: Directed

[Figure: hybrid network — the top two layers (θ_1, ..., θ_N and X_1, ..., X_N) form an RBM; the layers below are directed.]
Summary

The RBM consisting of the first two layers is equivalent to an infinitely deep directed network representing the Gibbs sampler (the unrolled Gibbs sampler on the previous slide).

That network is infinitely deep because a draw from the actual RBM distribution corresponds to a Gibbs sampler that has reached its invariant distribution.

When we draw from the RBM, the second layer is distributed according to the invariant distribution of the Markov chain given by the RBM Gibbs sampler.

If the transition from each layer to the next in the directed part is given by the Markov chain's transition kernel p_T, the RBM is a complementary prior.

We can then reverse all edges between the θ-layer and the X-layer.

[Figure: the resulting network with layers θ_1, ..., θ_N and X_1, ..., X_N.]