Slides Lecture 6


Simulated Annealing

input : (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1}; Tstart, Tstop ∈ R
output: w
begin
    Randomly initialize w
    T ← Tstart
    repeat
        ŵ ← N(w)   // neighbour of w, e.g. obtained by adding Gaussian noise N(0, σ)
        if E(ŵ) < E(w) then w ← ŵ
        else if exp(−(E(ŵ) − E(w))/T) > rand[0, 1) then w ← ŵ
        decrease(T)
    until T < Tstop
    return w
end
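
A minimal Python sketch of this procedure, assuming a user-supplied energy function and a Gaussian-noise neighbourhood; the noise scale sigma and the geometric cooling schedule are illustrative choices, not prescribed by the slide.

import numpy as np

def simulated_annealing(energy, w0, T_start=10.0, T_stop=1e-3,
                        sigma=0.1, cooling=0.99, rng=None):
    """Minimize `energy` by simulated annealing (sketch of the pseudocode above)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w0, dtype=float)                         # caller supplies a (random) initial w
    T = T_start
    while T >= T_stop:
        w_hat = w + rng.normal(0.0, sigma, size=w.shape)  # neighbour of w via Gaussian noise N(0, sigma)
        dE = energy(w_hat) - energy(w)
        if dE < 0 or np.exp(-dE / T) > rng.random():      # accept downhill moves, uphill with prob. exp(-dE/T)
            w = w_hat
        T *= cooling                                      # decrease(T): geometric cooling schedule
    return w

# usage (illustrative): w = simulated_annealing(lambda v: np.sum(v**2), w0=np.ones(5))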
Continuous Hopfield Network
Let us consider our previously defined Hopfield network (identical architecture and learning rule), but with the following activity rule:

S_i = \tanh\left( \frac{1}{T} \sum_j w_{ij} S_j \right)

Start with a large (temperature) value of T and decrease it by some amount whenever a unit is updated (deterministic simulated annealing).
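
A minimal Python sketch of this annealed update, assuming asynchronous single-unit updates and a geometric cooling factor; these scheduling details are illustrative, not fixed by the slide.

import numpy as np

def anneal_continuous_hopfield(W, s0, T_start=5.0, T_stop=0.05, cooling=0.95, rng=None):
    """Deterministic annealing with the activity rule S_i = tanh((1/T) * sum_j w_ij S_j)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.array(s0, dtype=float)
    T = T_start
    while T >= T_stop:
        i = rng.integers(len(s))          # pick one unit (asynchronous update)
        s[i] = np.tanh(W[i] @ s / T)      # activity rule from the slide
        T *= cooling                      # lower T a little after every unit update
    return s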

This type of Hopfield network can approximate the probability distribution

P(x \mid W) = \frac{1}{Z(W)} \exp[-E(x)] = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)
Continuous Hopfield Network
Z(W) = \sum_{x'} \exp(-E(x'))   (sum over all possible states)

is the partition function and ensures that P(x \mid W) is a probability distribution.
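
For a handful of ±1 units, Z(W) and P(x|W) can be evaluated by brute-force enumeration. A small Python sketch, using E(x) = −(1/2) x^T W x as above; the function names are illustrative.

import itertools
import numpy as np

def partition_function(W):
    """Z(W) = sum over all x in {-1,+1}^d of exp((1/2) x^T W x); exponential in d, so only for small nets."""
    d = W.shape[0]
    return sum(np.exp(0.5 * np.array(x) @ W @ np.array(x))
               for x in itertools.product([-1.0, 1.0], repeat=d))

def probability(x, W):
    """P(x|W) = exp(-E(x)) / Z(W) with E(x) = -(1/2) x^T W x."""
    x = np.asarray(x, dtype=float)
    return np.exp(0.5 * x @ W @ x) / partition_function(W)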
Idea: construct a stochastic Hopfield network that
implements the probability distribution P (x|W).
• Learn a model that is capable of generating patterns
from that unknown distribution.
• Quantify (classify) seen and unseen patterns by means of probabilities.
• If needed, we can generate more patterns (generative
model).
Boltzmann Machines
Given patterns \{x^{(n)}\}_1^N, we want to learn the weights such that the generative model

P(x \mid W) = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)

is well matched to those patterns. The states are updated according to the stochastic rule:
• set x_i = +1 with probability \frac{1}{1 + \exp(-2 \sum_j w_{ij} x_j)},
• else set x_i = −1.
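
A minimal Python sketch of this stochastic rule as a single asynchronous Gibbs step; the helper name and the usual assumption w_ii = 0 are not stated on the slide.

import numpy as np

def gibbs_step(x, W, rng):
    """Resample one unit of the state x in {-1,+1}^d according to the stochastic rule above."""
    i = rng.integers(len(x))
    a_i = W[i] @ x                                 # sum_j w_ij x_j (assumes w_ii = 0)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * a_i))      # probability of setting x_i = +1
    x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x

# usage (illustrative): repeated calls draw samples from P(x|W)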
Posterior probability of the weights given the data (Bayes' theorem):

P(W \mid \{x^{(n)}\}_1^N) = \frac{\left[ \prod_{n=1}^N P(x^{(n)} \mid W) \right] P(W)}{P(\{x^{(n)}\}_1^N)}
Boltzmann Machines
Apply the maximum likelihood method to the first term in the numerator:

\ln \prod_{n=1}^N P(x^{(n)} \mid W) = \sum_{n=1}^N \left[ \frac{1}{2} x^{(n)T} W x^{(n)} - \ln Z(W) \right]

Taking the derivative of the log-likelihood, note that W is symmetric (w_{ij} = w_{ji}), so that \frac{\partial}{\partial w_{ij}} \frac{1}{2} x^{(n)T} W x^{(n)} = x_i^{(n)} x_j^{(n)}, and

\frac{\partial}{\partial w_{ij}} \ln Z(W) = \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial w_{ij}} \exp\left( \frac{1}{2} x^T W x \right)
  = \frac{1}{Z(W)} \sum_x \exp\left( \frac{1}{2} x^T W x \right) x_i x_j
  = \sum_x x_i x_j P(x \mid W) = \langle x_i x_j \rangle_{P(x \mid W)}
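
For a small network, this identity can be checked numerically by enumerating all states. The following sketch compares a finite-difference estimate of ∂ ln Z/∂w_ij (perturbing the symmetric pair w_ij = w_ji together) with the model correlation ⟨x_i x_j⟩_{P(x|W)}; sizes and names are illustrative.

import itertools
import numpy as np

def all_states(d):
    return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

def log_Z(W):
    xs = all_states(W.shape[0])
    return np.log(np.exp(0.5 * np.einsum('ni,ij,nj->n', xs, W, xs)).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)); W = 0.5 * (W + W.T); np.fill_diagonal(W, 0.0)
xs = all_states(4)
p = np.exp(0.5 * np.einsum('ni,ij,nj->n', xs, W, xs)); p /= p.sum()   # P(x|W)
i, j, eps = 0, 1, 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[[i, j], [j, i]] += eps          # move the symmetric pair w_ij = w_ji together
Wm[[i, j], [j, i]] -= eps
print((log_Z(Wp) - log_Z(Wm)) / (2 * eps),     # finite-difference d ln Z / d w_ij
      (p * xs[:, i] * xs[:, j]).sum())         # <x_i x_j> under P(x|W); the two agree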
Boltzmann Machines (cont.)

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_1^N \mid W) = \sum_{n=1}^N \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(x \mid W)} \right]
  = N \left( \langle x_i x_j \rangle_{\text{Data}} - \langle x_i x_j \rangle_{P(x \mid W)} \right)

Empirical correlation between x_i and x_j:

\langle x_i x_j \rangle_{\text{Data}} \equiv \frac{1}{N} \sum_{n=1}^N x_i^{(n)} x_j^{(n)}

Correlation between x_i and x_j under the current model:

\langle x_i x_j \rangle_{P(x \mid W)} \equiv \sum_x x_i x_j P(x \mid W)

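
The two correlation terms can be assembled into the full gradient. A sketch for a small, fully visible Boltzmann machine, with ⟨x_i x_j⟩_{P(x|W)} computed exactly by enumeration; the data matrix X (rows x^(n)) and all names are illustrative.

import itertools
import numpy as np

def boltzmann_gradient(X, W):
    """Exact gradient N * (<x_i x_j>_Data - <x_i x_j>_P(x|W)) for a small, fully visible machine.
    X has shape (N, d) with rows x^(n) in {-1,+1}^d."""
    N, d = X.shape
    data_corr = X.T @ X / N                                       # empirical correlations
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    p = np.exp(0.5 * np.einsum('ni,ij,nj->n', states, W, states))
    p /= p.sum()                                                  # P(x|W) by enumeration
    model_corr = (states * p[:, None]).T @ states                 # model correlations
    return N * (data_corr - model_corr)

# usage (illustrative): one gradient step on the log-likelihood, W <- W + eta * boltzmann_gradient(X, W)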
Interpretation of Boltzmann Machines Learning
Illustrative description (MacKay’s book, pp. 523):

• Awake state: measure correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.
• Sleep state: dream about the world using the generative model P(x|W) and measure the correlation between x_i and x_j in the model world. Use these correlations to determine a proportional decrease in the weights.
If the correlations in the dream world and in the real world match, the two terms balance and the weights do not change.

Boltzmann Machines with Hidden Units
To model higher-order correlations, hidden units are required.
• x: states of the visible units,
• h: states of the hidden units,
• denote the generic state of a unit (either visible or hidden) by y_i, with y ≡ (x, h),
• the state of the network when the visible units are clamped in state x^{(n)} is y^{(n)} ≡ (x^{(n)}, h).
The likelihood of W given a single pattern x^{(n)} is

P(x^{(n)} \mid W) = \sum_h P(x^{(n)}, h \mid W) = \sum_h \frac{1}{Z(W)} \exp\left( \frac{1}{2} y^{(n)T} W y^{(n)} \right)

where

Z(W) = \sum_{x,h} \exp\left( \frac{1}{2} y^T W y \right)
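
With hidden units, evaluating P(x^(n)|W) requires a sum over h; for a tiny machine this can be done by direct enumeration, as in the sketch below. The ordering of visible before hidden units in y and all names are assumptions made here for illustration.

import itertools
import numpy as np

def visible_likelihood(x, W, n_hidden):
    """P(x|W) = sum_h P(x, h|W) for a tiny Boltzmann machine with y = (x, h), by enumerating h.
    W is the full weight matrix over y; the visible units are assumed to come first."""
    x = np.asarray(x, dtype=float)
    unnorm = lambda y: np.exp(0.5 * y @ W @ y)
    num = sum(unnorm(np.concatenate([x, h]))
              for h in map(np.array, itertools.product([-1.0, 1.0], repeat=n_hidden)))
    d = len(x) + n_hidden
    Z = sum(unnorm(np.array(y)) for y in itertools.product([-1.0, 1.0], repeat=d))
    return num / Z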
Boltzmann Machines with Hidden Units (cont.)
Applying the maximum likelihood method as before, one obtains

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_1^N \mid W) = \sum_n \Big( \underbrace{\langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)}}_{\text{clamped to } x^{(n)}} - \underbrace{\langle y_i y_j \rangle_{P(x, h \mid W)}}_{\text{free}} \Big)

The term \langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^{(n)} and the hidden variables sampling freely from their conditional distribution.

The term \langle y_i y_j \rangle_{P(x, h \mid W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.
Boltzmann Machines with Input-Hidden-Output
The Boltzmann machine considered so far is a powerful stochastic Hopfield network, but it has no ability to perform classification. Let us introduce visible input and output units:
• x ≡ (x_i, x_o)

Note that a pattern x^{(n)} consists of an input part and an output part, that is, x^{(n)} ≡ (x_i^{(n)}, x_o^{(n)}).
 
\sum_n \Big( \underbrace{\langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)}}_{\text{clamped to } (x_i^{(n)}, x_o^{(n)})} - \underbrace{\langle y_i y_j \rangle_{P(h, x_o \mid x_i^{(n)}, W)}}_{\text{clamped to } x_i^{(n)}} \Big)

Boltzmann Machine Weight Updates
Combine gradient descent and simulated annealing to update the weights:

\Delta w_{ij} = \frac{\eta}{T} \Big( \underbrace{\langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)}}_{\text{clamped to } (x_i^{(n)}, x_o^{(n)})} - \underbrace{\langle y_i y_j \rangle_{P(h, x_o \mid x_i^{(n)}, W)}}_{\text{clamped to } x_i^{(n)}} \Big)
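
A schematic Python sketch of one such update, assuming hypothetical sampling routines sample_clamped (input and output clamped) and sample_free (only input clamped) that return annealed equilibrium states y; these helpers and the sample count are assumptions, not defined on the slide.

import numpy as np

def boltzmann_weight_update(W, pattern, eta, T, sample_clamped, sample_free, n_samples=100):
    """One sampled update  W <- W + (eta/T) * (<y y^T>_clamped - <y y^T>_free).
    `sample_clamped(W, pattern)` and `sample_free(W, pattern)` are assumed to return
    annealed equilibrium states y with the indicated units clamped."""
    corr_clamped = np.zeros_like(W)
    corr_free = np.zeros_like(W)
    for _ in range(n_samples):
        y = sample_clamped(W, pattern)      # input and output units clamped
        corr_clamped += np.outer(y, y)
        y = sample_free(W, pattern)         # only input units clamped
        corr_free += np.outer(y, y)
    return W + (eta / T) * (corr_clamped - corr_free) / n_samples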

High computational complexity:
• present each pattern several times
• anneal several times

Mean-field version of Boltzmann learning:
• calculate approximations of the correlations ([y_i y_j]) entering the gradient
Deterministic Boltzmann Learning
input : {x^(n)}_1^N; η, Tstart, Tstop ∈ R
output: W
begin
    T ← Tstart
    repeat
        randomly select a pattern from the sample {x^(n)}_1^N
        randomize states
        anneal the network with input and output clamped
        at the final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
        randomize states
        anneal the network with input clamped but output free
        at the final, low T, calculate [y_i y_j]_{x_i clamped}
        w_ij ← w_ij + η/T ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
    until T < Tstop
    return W
end
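
One way to realize the [y_i y_j] terms is the mean-field approximation [y_i y_j] ≈ ȳ_i ȳ_j, with ȳ obtained from the fixed-point iteration ȳ_i = tanh((1/T) Σ_j w_ij ȳ_j) while clamped units are held at their values. A minimal sketch, with the iteration scheme and names chosen for illustration:

import numpy as np

def mean_field_correlations(W, clamped_idx, clamped_vals, T=0.1, n_iter=200):
    """Approximate [y_i y_j] by ybar_i * ybar_j, where ybar solves the mean-field equations
    ybar_i = tanh((1/T) * sum_j w_ij ybar_j) with the clamped units held at their values."""
    d = W.shape[0]
    y = np.zeros(d)
    y[clamped_idx] = clamped_vals
    free = np.setdiff1d(np.arange(d), clamped_idx)
    for _ in range(n_iter):
        y[free] = np.tanh(W[free] @ y / T)     # fixed-point iteration for the free units
    return np.outer(y, y)                      # the [y_i y_j] entries used in the weight update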
