A Fast Learning Algorithm For Deep Belief Nets-Hinton-2006 PDF
is used to construct deep directed nets is itself an undirected graphical model.

Section 5 shows how the weights produced by the fast greedy algorithm can be fine-tuned using the "up-down" algorithm. This is a contrastive version of the wake-sleep algorithm (Hinton et al., 1995) that does not suffer from the "mode-averaging" problems that can cause the wake-sleep algorithm to learn poor recognition weights.

Section 6 shows the pattern recognition performance of a network with three hidden layers and about 1.7 million weights on the MNIST set of handwritten digits. When no knowledge of geometry is provided and there is no special preprocessing, the generalization performance of the network is 1.25% errors on the 10,000 digit official test set. This beats the 1.5% achieved by the best back-propagation nets when they are not hand-crafted for this particular application. It is also slightly better than the 1.4% errors reported by Decoste and Schoelkopf (2002) for support vector machines on the same task.

Finally, section 7 shows what happens in the mind of the network when it is running without being constrained by visual input. The network has a full generative model, so it is easy to look into its mind – we simply generate an image from its high-level representations.

Throughout the paper, we will consider nets composed of stochastic binary variables but the ideas can be generalized to other models in which the log probability of a variable is an additive function of the states of its directly-connected neighbours (see Appendix A for details).

2 Complementary priors

The phenomenon of explaining away (illustrated in figure 2) makes inference difficult in directed belief nets. In densely connected networks, the posterior distribution over the hidden variables is intractable except in a few special cases such as mixture models or linear models with additive Gaussian noise. Markov Chain Monte Carlo methods (Neal, 1992) can be used to sample from the posterior, but they are typically very time consuming. Variational methods (Neal and Hinton, 1998) approximate the true posterior with a more tractable distribution and they can be used to improve a lower bound on the log probability of the training data. It is comforting that learning is guaranteed to improve a variational bound even when the inference of the hidden states is done incorrectly, but it would be much better to find a way of eliminating explaining away altogether, even in models whose hidden variables have highly correlated effects on the visible variables. It is widely assumed that this is impossible.

A logistic belief net (Neal, 1992) is composed of stochastic binary units. When the net is used to generate data, the
probability of turning on unit i is a logistic function of the states of its immediate ancestors, j, and of the weights, w_ij, on the directed connections from the ancestors:

p(s_i = 1) = 1 / (1 + exp(−b_i − Σ_j s_j w_ij))   (1)

where b_i is the bias of unit i. If a logistic belief net only has one hidden layer, the prior distribution over the hidden variables is factorial because their binary states are chosen independently when the model is used to generate data. The non-independence in the posterior distribution is created by the likelihood term coming from the data. Perhaps we could eliminate explaining away in the first hidden layer by using extra hidden layers to create a "complementary" prior that has exactly the opposite correlations to those in the likelihood term. Then, when the likelihood term is multiplied by the prior, we will get a posterior that is exactly factorial. It is not at all obvious that complementary priors exist, but figure 3 shows a simple example of an infinite logistic belief net with tied weights in which the priors are complementary at every hidden layer (see Appendix A for a more general treatment of the conditions under which complementary priors exist). The use of tied weights to construct complementary priors may seem like a mere trick for making directed models equivalent to undirected ones. As we shall see, however, it leads to a novel and very efficient learning algorithm that works by progressively untying the weights in each layer from the weights in higher layers.

2.1 An infinite directed model with tied weights

We can generate data from the infinite directed net in figure 3 by starting with a random configuration at an infinitely deep hidden layer¹ and then performing a top-down "ancestral" pass in which the binary state of each variable in a layer is chosen from the Bernoulli distribution determined by the top-down input coming from its active parents in the layer above. In this respect, it is just like any other directed acyclic belief net. Unlike other directed nets, however, we can sample from the true posterior distribution over all of the hidden layers by starting with a data vector on the visible units and then using the transposed weight matrices to infer the factorial distributions over each hidden layer in turn. At each hidden layer we sample from the factorial posterior before computing the factorial posterior for the layer above². Appendix A shows that this procedure gives unbiased samples because the complementary prior at each layer ensures that the posterior distribution really is factorial.

¹ The generation process converges to the stationary distribution of the Markov Chain, so we need to start at a layer that is deep compared with the time it takes for the chain to reach equilibrium.

² This is exactly the same as the inference procedure used in the wake-sleep algorithm (Hinton et al., 1995) but for the models described in this paper no variational approximation is required because the inference procedure gives unbiased samples.

Since we can sample from the true posterior, we can compute the derivatives of the log probability of the data. Let us start by computing the derivative for a generative weight, w_ij^00, from a unit j in layer H0 to unit i in layer V0 (see figure 3). In a logistic belief net, the maximum likelihood learning rule for a single data-vector, v^0, is:

∂ log p(v^0) / ∂w_ij^00 = <h_j^0 (v_i^0 − v̂_i^0)>   (2)

where <·> denotes an average over the sampled states and v̂_i^0 is the probability that unit i would be turned on if the visible vector was stochastically reconstructed from the sampled hidden states. Computing the posterior distribution over the second hidden layer, V1, from the sampled binary states in the first hidden layer, H0, is exactly the same process as reconstructing the data, so v_i^1 is a sample from a Bernoulli random variable with probability v̂_i^0. The learning rule can therefore be written as:

∂ log p(v^0) / ∂w_ij^00 = <h_j^0 (v_i^0 − v_i^1)>   (3)

The dependence of v_i^1 on h_j^0 is unproblematic in the derivation of Eq. 3 from Eq. 2 because v̂_i^0 is an expectation that is conditional on h_j^0. Since the weights are replicated, the full derivative for a generative weight is obtained by summing the derivatives of the generative weights between all pairs of layers:

∂ log p(v^0) / ∂w_ij = <h_j^0 (v_i^0 − v_i^1)> + <v_i^1 (h_j^0 − h_j^1)> + <h_j^1 (v_i^1 − v_i^2)> + ...   (4)

All of the vertically aligned terms cancel leaving the Boltzmann machine learning rule of Eq. 5.

3 Restricted Boltzmann machines and contrastive divergence learning

It may not be immediately obvious that the infinite directed net in figure 3 is equivalent to a Restricted Boltzmann Machine (RBM). An RBM has a single layer of hidden units which are not connected to each other and have undirected, symmetrical connections to a layer of visible units. To generate data from an RBM, we can start with a random state in one of the layers and then perform alternating Gibbs sampling: All of the units in one layer are updated in parallel given the current states of the units in the other layer and this is repeated until the system is sampling from its equilibrium distribution. Notice that this is exactly the same process as generating data from the infinite belief net with tied weights. To perform maximum likelihood learning in an RBM, we can use the difference between two correlations. For each weight, w_ij, between a visible unit i and a hidden unit, j, we measure the correlation <v_i^0 h_j^0> when a datavector is clamped on
the visible units and the hidden states are sampled from their conditional distribution, which is factorial. Then, using alternating Gibbs sampling, we run the Markov chain shown in figure 4 until it reaches its stationary distribution and measure the correlation <v_i^∞ h_j^∞>. The gradient of the log probability of the training data is then

∂ log p(v^0) / ∂w_ij = <v_i^0 h_j^0> − <v_i^∞ h_j^∞>   (5)

Figure 3: An infinite logistic belief net with tied weights. The downward arrows represent the generative model. The upward arrows are not part of the model. They represent the parameters that are used to infer samples from the posterior distribution at each hidden layer of the net when a datavector is clamped on V0.

Figure 4: This depicts a Markov chain that uses alternating Gibbs sampling. In one full step of Gibbs sampling, the hidden units in the top layer are all updated in parallel by applying Eq. 1 to the inputs received from the current states of the visible units in the bottom layer, then the visible units are all updated in parallel given the current hidden states. The chain is initialized by setting the binary states of the visible units to be the same as a data-vector. The correlations in the activities of a visible and a hidden unit are measured after the first update of the hidden units and again at the end of the chain. The difference of these two correlations provides the learning signal for updating the weight on the connection.

This learning rule is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights, and each step of Gibbs sampling corresponds to computing the exact posterior distribution in a layer of the infinite logistic belief net.

Maximizing the log probability of the data is exactly the same as minimizing the Kullback-Leibler divergence, KL(P^0 || P_θ^∞), between the distribution of the data, P^0, and the equilibrium distribution defined by the model, P_θ^∞. In contrastive divergence learning (Hinton, 2002), we only run the Markov chain for n full steps³ before measuring the second correlation. This is equivalent to ignoring the derivatives that come from the higher layers of the infinite net. The sum of all these ignored derivatives is the derivative of the log probability of the posterior distribution in layer V_n, which is also the derivative of the Kullback-Leibler divergence between the posterior distribution in layer V_n, P_θ^n, and the equilibrium distribution defined by the model. So contrastive divergence learning minimizes the difference of two Kullback-Leibler divergences:

KL(P^0 || P_θ^∞) − KL(P_θ^n || P_θ^∞)   (6)

³ Each full step consists of updating h given v then updating v given h.

Ignoring sampling noise, this difference is never negative because Gibbs sampling is used to produce P_θ^n from P^0 and Gibbs sampling always reduces the Kullback-Leibler divergence with the equilibrium distribution. It is important to notice that P_θ^n depends on the current model parameters and the way in which P_θ^n changes as the parameters change is being ignored by contrastive divergence learning. This problem does not arise with P^0 because the training data does not depend on the parameters. An empirical investigation of the relationship between the maximum likelihood and the contrastive divergence learning rules can be found in Carreira-Perpinan and Hinton (2005).

Contrastive divergence learning in a restricted Boltzmann machine is efficient enough to be practical (Mayraz and Hinton, 2001). Variations that use real-valued units and different sampling schemes are described in Teh et al. (2003) and have been quite successful for modeling the formation of topographic maps (Welling et al., 2003), for denoising natural images (Roth and Black, 2005) or images of biological cells (Ning et al., 2005). Marks and Movellan (2001) describe a way of using contrastive divergence to perform factor analysis and Welling et al. (2005) show that a network with logistic, binary visible units and linear, Gaussian hidden units can be used for rapid document retrieval. However, it appears that the efficiency has been bought at a high price: When applied in the obvious way, contrastive divergence learning fails for deep, multilayer networks with different weights at each layer because these networks take far too long even to reach conditional equilibrium with a clamped data-vector. We now show that the equivalence between RBM's and infinite directed nets with tied weights suggests an efficient learning algorithm for multilayer networks in which the weights are not tied.
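The CD-1 procedure described here (clamp a data-vector, update the hidden units once, reconstruct the visible units, update the hiddens again, and use the difference of the two correlations as the learning signal) can be sketched in NumPy. This is our own minimal illustration, not the paper's code; the layer sizes, learning rate, and helper names are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, vbias, hbias, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: batch of binary data vectors, shape (batch, n_visible).
    Returns updated (W, vbias, hbias).
    """
    # Positive phase: sample hidden states given the clamped data (Eq. 1).
    ph0 = logistic(v0 @ W + hbias)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One full Gibbs step: reconstruct the visibles, then update the hiddens.
    pv1 = logistic(h0 @ W.T + vbias)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = logistic(v1 @ W + hbias)  # probabilities used for the negative phase
    # Difference of the two correlations (Eq. 5 with the chain cut at n = 1).
    batch = v0.shape[0]
    W = W + lr * (v0.T @ h0 - v1.T @ ph1) / batch
    vbias = vbias + lr * (v0 - v1).mean(axis=0)
    hbias = hbias + lr * (h0 - ph1).mean(axis=0)
    return W, vbias, hbias

# Tiny demo: 6 visible and 4 hidden units, random binary data.
W = 0.01 * rng.standard_normal((6, 4))
vbias = np.zeros(6)
hbias = np.zeros(4)
data = (rng.random((20, 6)) < 0.5).astype(float)
W, vbias, hbias = cd1_update(W, vbias, hbias, data)
print(W.shape)  # (6, 4)
```

Using sampled hidden states for the positive correlation but hidden probabilities for the negative one is a common variance-reduction choice; the paper's rule is stated in terms of sampled correlations.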
Table 1: The error rates of various learning algorithms on the MNIST digit recognition task. See text for details.
Appendix

A Complementary Priors

General complementarity
Consider a joint distribution over observables, x, and hidden variables, y. For a given likelihood function, P(x|y), we define the corresponding family of complementary priors to be those distributions, P(y), for which the joint distribution, P(x, y) = P(x|y)P(y), leads to posteriors, P(y|x), that exactly factorise, i.e. leads to a posterior that can be expressed as P(y|x) = Π_j P(y_j|x).
Not all functional forms of likelihood admit a complementary prior. In this appendix we will show that the following family constitutes all likelihood functions admitting a complementary prior:

P(x|y) = (1/Ω(y)) exp( Σ_j Φ_j(x, y_j) + β(x) )
       = exp( Σ_j Φ_j(x, y_j) + β(x) − log Ω(y) )   (11)

where Ω is the normalisation term. For this assertion to hold we need to assume positivity of distributions: that both P(y) > 0 and P(x|y) > 0 for every value of y and x. The corresponding family of complementary priors then assume the form:

P(y) = (1/C) exp( log Ω(y) + Σ_j α_j(y_j) )   (12)

where C is a constant to ensure normalisation. This combination of functional forms leads to the following expression for the joint:

P(x, y) = (1/C) exp( Σ_j Φ_j(x, y_j) + β(x) + Σ_j α_j(y_j) )   (13)
To prove our assertion, we need to show that every likelihood function of the form in Eq. 11 admits a complementary prior, and also that complementarity implies the functional form in Eq. 11. Firstly, it can be directly verified that Eq. 12 is a complementary prior for the likelihood functions of Eq. 11. To show the converse, let us assume that P(y) is a complementary prior for some likelihood function P(x|y). Notice that the factorial form of the posterior simply means that the joint distribution P(x, y) = P(y)P(x|y) satisfies the following set of conditional independencies: y_j ⊥ y_k | x for every j ≠ k. This set of conditional independencies is exactly those satisfied by an undirected graphical model with an edge between every hidden variable and observed variable and among all observed variables (Pearl, 1988). By the Hammersley-Clifford Theorem, and using our positivity assumption, the joint distribution must be of the form of Eq. 13, and the forms for the likelihood function Eq. 11 and prior Eq. 12 follow from this.
This condition is useful for our construction of the infinite stack of directed graphical models.

Identifying the conditional independencies in Eq. 14 and 15 as those satisfied by a complete bipartite undirected graphical model, and again using the Hammersley-Clifford Theorem (assuming positivity), we see that the following form fully characterises all joint distributions of interest,

P(x, y) = (1/Z) exp( Σ_{i,j} Ψ_{i,j}(x_i, y_j) + Σ_i γ_i(x_i) + Σ_j α_j(y_j) )   (16)
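To make the factorisation claim concrete, here is a small brute-force check (our own construction, not from the paper, using arbitrary bilinear potentials Ψ_{i,j}(x_i, y_j) = Ψ[i,j]·x_i·y_j): enumerate a tiny binary model of the form of Eq. 16 and confirm that the posterior P(y|x) is exactly the product of its per-unit marginals:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nx, ny = 3, 2
Psi = rng.standard_normal((nx, ny))    # pairwise potentials, Psi[i, j] * x_i * y_j
gamma = rng.standard_normal(nx)        # unary potentials gamma_i(x_i) = gamma[i] * x_i
alpha = rng.standard_normal(ny)        # unary potentials alpha_j(y_j) = alpha[j] * y_j

def unnorm(x, y):
    """Unnormalised joint of Eq. 16 (the 1/Z factor cancels in the posterior)."""
    x, y = np.array(x), np.array(y)
    return np.exp(x @ Psi @ y + gamma @ x + alpha @ y)

ys = list(itertools.product([0, 1], repeat=ny))
x0 = (1, 0, 1)                          # any fixed observation

# Exact posterior over y given x0, by enumeration.
post = {y: unnorm(x0, y) for y in ys}
s = sum(post.values())
post = {y: p / s for y, p in post.items()}

# Marginal probability that each y_j is on, under the posterior.
marg = [sum(p for y, p in post.items() if y[j] == 1) for j in range(ny)]

# Complementarity: the joint posterior equals the product of marginals.
for y, p in post.items():
    prod = np.prod([marg[j] if y[j] == 1 else 1 - marg[j] for j in range(ny)])
    assert abs(p - prod) < 1e-12
print("posterior factorises")
```

The check passes for any choice of potentials in this bilinear family, which is exactly what the Hammersley-Clifford argument predicts.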
Although it is not immediately obvious, the marginal distribution over the observables, x, in Eq. 16 can also be expressed
as an infinite directed model in which the parameters defining the conditional distributions between layers are tied together.
An intuitive way of validating this assertion is as follows. Consider one of the methods by which we might draw samples
from the marginal distribution P (x) implied by Eq. 16. Starting from an arbitrary configuration of y we would iteratively
perform Gibbs sampling using, in alternation, the distributions given in Eq. 14 and 15. If we run this Markov chain for long
enough then, since our positivity assumptions ensure that the chain mixes properly, we will eventually obtain unbiased samples
from the joint distribution given in Eq. 16.
Now let us imagine that we unroll this sequence of Gibbs updates in space — such that we consider each parallel update of
the variables to constitute states of a separate layer in a graph. This unrolled sequence of states has a purely directed structure
(with conditional distributions taking the form of Eq. 14 and 15 in alternation). By equivalence to the Gibbs sampling scheme,
after many layers in such an unrolled graph, adjacent pairs of layers will have a joint distribution as given in Eq. 16.
We can formalize this intuition for unrolling the graph as follows. The basic idea is to construct a joint distribution by
unrolling a graph “upwards” (i.e. moving away from the data-layer to successively deeper hidden layers), so that we can put a
well-defined distribution over an infinite stack of variables. Then we verify some simple marginal and conditional properties
of this joint distribution, and show that our construction is the same as that obtained by unrolling the graph downwards from a
very deep layer.
Let x = x^(0), y = y^(0), x^(1), y^(1), x^(2), y^(2), . . . be a sequence (stack) of variables, the first two of which are identified as our original observed and hidden variables, while x^(i) and y^(i) are interpreted as a sequence of successively deeper layers.
First, define the functions

f(x′, y′) = (1/Z) exp( Σ_{i,j} Ψ_{i,j}(x′_i, y′_j) + Σ_i γ_i(x′_i) + Σ_j α_j(y′_j) )   (18)

f_x(x′) = Σ_{y′} f(x′, y′)   (19)

f_y(y′) = Σ_{x′} f(x′, y′)   (20)

g_x(x′|y′) = f(x′, y′) / f_y(y′)   (21)

g_y(y′|x′) = f(x′, y′) / f_x(x′)   (22)
over dummy variables y′, x′. Now define a joint distribution over our sequence of variables (assuming first-order Markovian dependency) as follows:

and similarly for P(y^(i)). Now we see that the following "downward" conditional distributions also hold true:

P(x^(i)|y^(i)) = P(x^(i), y^(i)) / P(y^(i)) = g_x(x^(i)|y^(i))   (29)

P(y^(i)|x^(i+1)) = P(y^(i), x^(i+1)) / P(x^(i+1)) = g_y(y^(i)|x^(i+1))   (30)
So our joint distribution over the stack of variables also gives the unrolled graph in the “downward” direction, since the con-
ditional distributions in Eq. 29 and 30 are the same as those used to generate a sample in a downwards pass and the Markov
chain mixes.
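The definitions in Eqs. 18–22 can be transcribed directly for a small binary model. This is an illustrative sketch with made-up bilinear potentials; the names f, fx, fy, gx, gy follow the text, and the 1/Z factor is omitted since it cancels in both conditionals:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nx, ny = 2, 2
Psi = rng.standard_normal((nx, ny))
gamma = rng.standard_normal(nx)
alpha = rng.standard_normal(ny)

xs = list(itertools.product([0, 1], repeat=nx))
ys = list(itertools.product([0, 1], repeat=ny))

def f(x, y):                 # Eq. 18, up to the constant 1/Z
    x, y = np.array(x), np.array(y)
    return np.exp(x @ Psi @ y + gamma @ x + alpha @ y)

def fx(x):                   # Eq. 19: sum out y'
    return sum(f(x, y) for y in ys)

def fy(y):                   # Eq. 20: sum out x'
    return sum(f(x, y) for x in xs)

def gx(x, y):                # Eq. 21: conditional over x' given y'
    return f(x, y) / fy(y)

def gy(y, x):                # Eq. 22: conditional over y' given x'
    return f(x, y) / fx(x)

# Sanity check: both conditionals normalise, as they must.
for y in ys:
    assert abs(sum(gx(x, y) for x in xs) - 1.0) < 1e-12
for x in xs:
    assert abs(sum(gy(y, x) for y in ys) - 1.0) < 1e-12
print("conditionals normalise")
```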
Inference in this infinite stack of directed graphs is equivalent to inference in the joint distribution over the sequence of variables. In other words, given x^(0) we can simply use the definition of the joint distribution Eqs. 23, 24 and 25 to obtain a sample from the posterior simply by sampling y^(0)|x^(0), x^(1)|y^(0), y^(1)|x^(1), . . .. This directly shows that our inference procedure is exact for the unrolled graph.
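As a numerical sanity check on the unrolling argument (our own toy experiment, not from the paper), we can run the alternating Gibbs chain with the factorial conditionals implied by Eq. 16 for a tiny bilinear model and confirm that the empirical marginal over x matches the exact marginal computed by enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
nx, ny = 2, 2
Psi = rng.standard_normal((nx, ny))
gamma = rng.standard_normal(nx)
alpha = rng.standard_normal(ny)

xs = [np.array(t) for t in itertools.product([0, 1], repeat=nx)]
ys = [np.array(t) for t in itertools.product([0, 1], repeat=ny)]

def f(x, y):
    return np.exp(x @ Psi @ y + gamma @ x + alpha @ y)

Z = sum(f(x, y) for x in xs for y in ys)
exact_px = {tuple(x): sum(f(x, y) for y in ys) / Z for x in xs}

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Alternating Gibbs sampling: for this bilinear model each conditional is a
# product of independent logistic units, i.e. it is factorial.
x = np.zeros(nx)
counts = {tuple(t): 0 for t in itertools.product([0, 1], repeat=nx)}
n_sweeps, burn_in = 30000, 1000
for sweep in range(n_sweeps):
    y = (rng.random(ny) < logistic(x @ Psi + alpha)).astype(float)
    x = (rng.random(nx) < logistic(Psi @ y + gamma)).astype(float)
    if sweep >= burn_in:
        counts[tuple(x)] += 1

total = n_sweeps - burn_in
for state, p in exact_px.items():
    assert abs(counts[state] / total - p) < 0.05
print("Gibbs marginal matches exact marginal")
```

With positivity the chain mixes, so the empirical frequencies converge to the exact marginal; the 0.05 tolerance simply absorbs the remaining Monte Carlo noise at this chain length.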
% UP-DOWN ALGORITHM
%
% the data and all biases are row vectors.
% the generative model is: lab <--> top <--> pen --> hid --> vis
% the number of units in layer foo is numfoo
% weight matrices have names fromlayer_tolayer
% "rec" is for recognition biases and "gen" is for generative biases.
% for simplicity, the same learning rate, r, is used everywhere.

% PREDICTIONS
psleeppenstates = logistic(sleephidstates*hidpen + penrecbiases);
psleephidstates = logistic(sleepvisprobs*vishid + hidrecbiases);
pvisprobs = logistic(wakehidstates*hidvis + visgenbiases);
phidprobs = logistic(wakepenstates*penhid + hidgenbiases);
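For readers who prefer NumPy to MATLAB, the prediction step above translates line for line. The layer sizes and the random stand-ins for the wake/sleep-phase activities below are placeholders of our own, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder layer sizes; numvis, numhid, numpen follow the pseudocode's
# naming convention ("the number of units in layer foo is numfoo").
numvis, numhid, numpen = 784, 500, 500

# Weight matrices named fromlayer_tolayer, as in the pseudocode.
hidpen = 0.01 * rng.standard_normal((numhid, numpen))
vishid = 0.01 * rng.standard_normal((numvis, numhid))
hidvis = 0.01 * rng.standard_normal((numhid, numvis))
penhid = 0.01 * rng.standard_normal((numpen, numhid))
penrecbiases = np.zeros(numpen)
hidrecbiases = np.zeros(numhid)
visgenbiases = np.zeros(numvis)
hidgenbiases = np.zeros(numhid)

# Random stand-ins for the wake- and sleep-phase activities (row vectors).
sleephidstates = (rng.random((1, numhid)) < 0.5).astype(float)
sleepvisprobs = rng.random((1, numvis))
wakehidstates = (rng.random((1, numhid)) < 0.5).astype(float)
wakepenstates = (rng.random((1, numpen)) < 0.5).astype(float)

# PREDICTIONS: direct translation of the MATLAB lines above.
psleeppenstates = logistic(sleephidstates @ hidpen + penrecbiases)
psleephidstates = logistic(sleepvisprobs @ vishid + hidrecbiases)
pvisprobs = logistic(wakehidstates @ hidvis + visgenbiases)
phidprobs = logistic(wakepenstates @ penhid + hidgenbiases)

print(psleeppenstates.shape, pvisprobs.shape)
```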