Slides Lecture 6


Simulated Annealing

input : (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1}; Tstart, Tstop ∈ R
output: w
begin
    Randomly initialize w
    T ← Tstart
    repeat
        ŵ ← N(w)   // neighbour of w, e.g. obtained by adding Gaussian noise N(0, σ)
        if E(ŵ) < E(w) then w ← ŵ
        else if exp(−(E(ŵ) − E(w))/T) > rand[0, 1) then w ← ŵ
        decrease(T)
    until T < Tstop
    return w
end
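
A minimal Python sketch of this procedure, assuming a user-supplied energy function and a Gaussian-noise neighbourhood; the noise scale sigma and the geometric cooling schedule are illustrative choices, not prescribed by the slide.

import numpy as np

def simulated_annealing(energy, w0, T_start=10.0, T_stop=1e-3,
                        sigma=0.1, cooling=0.99, rng=None):
    """Minimize `energy` by simulated annealing (sketch of the pseudocode above)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w0, dtype=float)                         # caller supplies a (random) initial w
    T = T_start
    while T >= T_stop:
        w_hat = w + rng.normal(0.0, sigma, size=w.shape)  # neighbour of w via Gaussian noise N(0, sigma)
        dE = energy(w_hat) - energy(w)
        if dE < 0 or np.exp(-dE / T) > rng.random():      # accept downhill moves, uphill with prob. exp(-dE/T)
            w = w_hat
        T *= cooling                                      # decrease(T): geometric cooling schedule
    return w

# usage (illustrative): w = simulated_annealing(lambda v: np.sum(v**2), w0=np.ones(5))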
Continuous Hopfield Network
Let us consider our previously defined Hopfield network (identical architecture and learning rule), but with the following activity rule:

S_i = \tanh\left( \frac{1}{T} \sum_j w_{ij} S_j \right)

Start with a large (temperature) value of T and decrease it by some amount whenever a unit is updated (deterministic simulated annealing).
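
A minimal Python sketch of this annealed update, assuming asynchronous single-unit updates and a geometric cooling factor; these scheduling details are illustrative, not fixed by the slide.

import numpy as np

def anneal_continuous_hopfield(W, s0, T_start=5.0, T_stop=0.05, cooling=0.95, rng=None):
    """Deterministic annealing with the activity rule S_i = tanh((1/T) * sum_j w_ij S_j)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.array(s0, dtype=float)
    T = T_start
    while T >= T_stop:
        i = rng.integers(len(s))          # pick one unit (asynchronous update)
        s[i] = np.tanh(W[i] @ s / T)      # activity rule from the slide
        T *= cooling                      # lower T a little after every unit update
    return s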

This type of Hopfield network can approximate the probability distribution

P(x \mid W) = \frac{1}{Z(W)} \exp[-E(x)] = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)
Continuous Hopfield Network
Z(W) = \sum_{x'} \exp(-E(x'))   (sum over all possible states)

is the partition function and ensures that P(x \mid W) is a probability distribution.
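
For a handful of ±1 units, Z(W) and P(x|W) can be evaluated by brute-force enumeration. A small Python sketch, using E(x) = −(1/2) x^T W x as above; the function names are illustrative.

import itertools
import numpy as np

def partition_function(W):
    """Z(W) = sum over all x in {-1,+1}^d of exp((1/2) x^T W x); exponential in d, so only for small nets."""
    d = W.shape[0]
    return sum(np.exp(0.5 * np.array(x) @ W @ np.array(x))
               for x in itertools.product([-1.0, 1.0], repeat=d))

def probability(x, W):
    """P(x|W) = exp(-E(x)) / Z(W) with E(x) = -(1/2) x^T W x."""
    x = np.asarray(x, dtype=float)
    return np.exp(0.5 * x @ W @ x) / partition_function(W)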
Idea: construct a stochastic Hopfield network that
implements the probability distribution P (x|W).
• Learn a model that is capable of generating patterns
from that unknown distribution.
• Quantify (classify) seen and unseen patterns by means of probabilities.
• If needed, we can generate more patterns (generative
model).
Boltzmann Machines
Given patterns \{x^{(n)}\}_1^N, we want to learn the weights such that the generative model

P(x \mid W) = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)

is well matched to those patterns. The states are updated according to the stochastic rule:
• set x_i = +1 with probability \frac{1}{1 + \exp(-2 \sum_j w_{ij} x_j)},
• else set x_i = −1.
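
A minimal Python sketch of this stochastic rule as a single asynchronous Gibbs step; the helper name and the usual assumption w_ii = 0 are not stated on the slide.

import numpy as np

def gibbs_step(x, W, rng):
    """Resample one unit of the state x in {-1,+1}^d according to the stochastic rule above."""
    i = rng.integers(len(x))
    a_i = W[i] @ x                                 # sum_j w_ij x_j (assumes w_ii = 0)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * a_i))      # probability of setting x_i = +1
    x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x

# usage (illustrative): repeated calls draw samples from P(x|W)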
Posterior probability of the weights given the data (Bayes' theorem):

P(W \mid \{x^{(n)}\}_1^N) = \frac{\left[ \prod_{n=1}^N P(x^{(n)} \mid W) \right] P(W)}{P(\{x^{(n)}\}_1^N)}
Boltzmann Machines
Apply the maximum likelihood method to the first term in the numerator:

\ln \prod_{n=1}^N P(x^{(n)} \mid W) = \sum_{n=1}^N \left[ \frac{1}{2} x^{(n)T} W x^{(n)} - \ln Z(W) \right]

Taking the derivative of the log-likelihood, note that W is symmetric (w_{ij} = w_{ji}), so that \frac{\partial}{\partial w_{ij}} \frac{1}{2} x^{(n)T} W x^{(n)} = x_i^{(n)} x_j^{(n)}, and

\frac{\partial}{\partial w_{ij}} \ln Z(W) = \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial w_{ij}} \exp\left( \frac{1}{2} x^T W x \right)
  = \frac{1}{Z(W)} \sum_x \exp\left( \frac{1}{2} x^T W x \right) x_i x_j
  = \sum_x x_i x_j P(x \mid W) = \langle x_i x_j \rangle_{P(x \mid W)}
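
For a small network, this identity can be checked numerically by enumerating all states. The following sketch compares a finite-difference estimate of ∂ ln Z/∂w_ij (perturbing the symmetric pair w_ij = w_ji together) with the model correlation ⟨x_i x_j⟩_{P(x|W)}; sizes and names are illustrative.

import itertools
import numpy as np

def all_states(d):
    return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

def log_Z(W):
    xs = all_states(W.shape[0])
    return np.log(np.exp(0.5 * np.einsum('ni,ij,nj->n', xs, W, xs)).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)); W = 0.5 * (W + W.T); np.fill_diagonal(W, 0.0)
xs = all_states(4)
p = np.exp(0.5 * np.einsum('ni,ij,nj->n', xs, W, xs)); p /= p.sum()   # P(x|W)
i, j, eps = 0, 1, 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[[i, j], [j, i]] += eps          # move the symmetric pair w_ij = w_ji together
Wm[[i, j], [j, i]] -= eps
print((log_Z(Wp) - log_Z(Wm)) / (2 * eps),     # finite-difference d ln Z / d w_ij
      (p * xs[:, i] * xs[:, j]).sum())         # <x_i x_j> under P(x|W); the two agree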
Boltzmann Machines (cont.)

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_1^N \mid W) = \sum_{n=1}^N \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(x \mid W)} \right]
  = N \left( \langle x_i x_j \rangle_{\text{Data}} - \langle x_i x_j \rangle_{P(x \mid W)} \right)

Empirical correlation between x_i and x_j:

\langle x_i x_j \rangle_{\text{Data}} \equiv \frac{1}{N} \sum_{n=1}^N x_i^{(n)} x_j^{(n)}

Correlation between x_i and x_j under the current model:

\langle x_i x_j \rangle_{P(x \mid W)} \equiv \sum_x x_i x_j P(x \mid W)

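
The two correlation terms can be assembled into the full gradient. A sketch for a small, fully visible Boltzmann machine, with ⟨x_i x_j⟩_{P(x|W)} computed exactly by enumeration; the data matrix X (rows x^(n)) and all names are illustrative.

import itertools
import numpy as np

def boltzmann_gradient(X, W):
    """Exact gradient N * (<x_i x_j>_Data - <x_i x_j>_P(x|W)) for a small, fully visible machine.
    X has shape (N, d) with rows x^(n) in {-1,+1}^d."""
    N, d = X.shape
    data_corr = X.T @ X / N                                       # empirical correlations
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    p = np.exp(0.5 * np.einsum('ni,ij,nj->n', states, W, states))
    p /= p.sum()                                                  # P(x|W) by enumeration
    model_corr = (states * p[:, None]).T @ states                 # model correlations
    return N * (data_corr - model_corr)

# usage (illustrative): one gradient step on the log-likelihood, W <- W + eta * boltzmann_gradient(X, W)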
Interpretation of Boltzmann Machines Learning
Illustrative description (MacKay’s book, pp. 523):

• Awake state: measure correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.
• Sleep state: dream about the world using the generative model P(x|W) and measure the correlation between x_i and x_j in the model world. Use these correlations to determine a proportional decrease in the weights.
If the correlations in the dream world and in the real world match, the two terms balance and the weights do not change.

Boltzmann Machines with Hidden Units
To model higher-order correlations, hidden units are required.
• x: states of the visible units,
• h: states of the hidden units,
• denote the generic state of a unit (either visible or hidden) by y_i, with y ≡ (x, h),
• the state of the network when the visible units are clamped in state x^{(n)} is y^{(n)} ≡ (x^{(n)}, h).
The likelihood of W given a single pattern x^{(n)} is

P(x^{(n)} \mid W) = \sum_h P(x^{(n)}, h \mid W) = \sum_h \frac{1}{Z(W)} \exp\left( \frac{1}{2} y^{(n)T} W y^{(n)} \right)

where

Z(W) = \sum_{x,h} \exp\left( \frac{1}{2} y^T W y \right)
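
With hidden units, evaluating P(x^(n)|W) requires a sum over h; for a tiny machine this can be done by direct enumeration, as in the sketch below. The ordering of visible before hidden units in y and all names are assumptions made here for illustration.

import itertools
import numpy as np

def visible_likelihood(x, W, n_hidden):
    """P(x|W) = sum_h P(x, h|W) for a tiny Boltzmann machine with y = (x, h), by enumerating h.
    W is the full weight matrix over y; the visible units are assumed to come first."""
    x = np.asarray(x, dtype=float)
    unnorm = lambda y: np.exp(0.5 * y @ W @ y)
    num = sum(unnorm(np.concatenate([x, h]))
              for h in map(np.array, itertools.product([-1.0, 1.0], repeat=n_hidden)))
    d = len(x) + n_hidden
    Z = sum(unnorm(np.array(y)) for y in itertools.product([-1.0, 1.0], repeat=d))
    return num / Z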
Boltzmann Machines with Hidden Units (cont.)
Applying the maximum likelihood method as before, one obtains

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_1^N \mid W) = \sum_n \Big( \underbrace{\langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)}}_{\text{clamped to } x^{(n)}} - \underbrace{\langle y_i y_j \rangle_{P(x, h \mid W)}}_{\text{free}} \Big)

The term \langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^{(n)} and the hidden variables sampling freely from their conditional distribution.

The term \langle y_i y_j \rangle_{P(x, h \mid W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.
Boltzmann Machines with Input-Hidden-Output
The Boltzmann machine considered so far is a powerful stochastic Hopfield network, but it has no ability to perform classification. Let us introduce visible input and output units:
• x ≡ (x_i, x_o)

Note that a pattern x^{(n)} consists of an input part and an output part, that is, x^{(n)} ≡ (x_i^{(n)}, x_o^{(n)}).
 
\sum_n \Big( \underbrace{\langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)}}_{\text{clamped to } (x_i^{(n)}, x_o^{(n)})} - \underbrace{\langle y_i y_j \rangle_{P(h, x_o \mid x_i^{(n)}, W)}}_{\text{clamped to } x_i^{(n)}} \Big)

Boltzmann Machine Weight Updates
Combine gradient descent and simulated annealing to update the weights:

\Delta w_{ij} = \frac{\eta}{T} \Big( \underbrace{\langle y_i y_j \rangle_{P(h \mid x^{(n)}, W)}}_{\text{clamped to } (x_i^{(n)}, x_o^{(n)})} - \underbrace{\langle y_i y_j \rangle_{P(h, x_o \mid x_i^{(n)}, W)}}_{\text{clamped to } x_i^{(n)}} \Big)
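
A schematic Python sketch of one such update, assuming hypothetical sampling routines sample_clamped (input and output clamped) and sample_free (only input clamped) that return annealed equilibrium states y; these helpers and the sample count are assumptions, not defined on the slide.

import numpy as np

def boltzmann_weight_update(W, pattern, eta, T, sample_clamped, sample_free, n_samples=100):
    """One sampled update  W <- W + (eta/T) * (<y y^T>_clamped - <y y^T>_free).
    `sample_clamped(W, pattern)` and `sample_free(W, pattern)` are assumed to return
    annealed equilibrium states y with the indicated units clamped."""
    corr_clamped = np.zeros_like(W)
    corr_free = np.zeros_like(W)
    for _ in range(n_samples):
        y = sample_clamped(W, pattern)      # input and output units clamped
        corr_clamped += np.outer(y, y)
        y = sample_free(W, pattern)         # only input units clamped
        corr_free += np.outer(y, y)
    return W + (eta / T) * (corr_clamped - corr_free) / n_samples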

High computational complexity:
• present each pattern several times
• anneal several times

Mean-field version of Boltzmann learning:
• calculate approximations of the correlations ([y_i y_j]) entering the gradient
Deterministic Boltzmann Learning
input : {x^(n)}_1^N; η, Tstart, Tstop ∈ R
output: W
begin
    T ← Tstart
    repeat
        randomly select a pattern from the sample {x^(n)}_1^N
        randomize states
        anneal the network with input and output clamped
        at the final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
        randomize states
        anneal the network with input clamped but output free
        at the final, low T, calculate [y_i y_j]_{x_i clamped}
        w_ij ← w_ij + η/T ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
    until T < Tstop
    return W
end
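
One way to realize the [y_i y_j] terms is the mean-field approximation [y_i y_j] ≈ ȳ_i ȳ_j, with ȳ obtained from the fixed-point iteration ȳ_i = tanh((1/T) Σ_j w_ij ȳ_j) while clamped units are held at their values. A minimal sketch, with the iteration scheme and names chosen for illustration:

import numpy as np

def mean_field_correlations(W, clamped_idx, clamped_vals, T=0.1, n_iter=200):
    """Approximate [y_i y_j] by ybar_i * ybar_j, where ybar solves the mean-field equations
    ybar_i = tanh((1/T) * sum_j w_ij ybar_j) with the clamped units held at their values."""
    d = W.shape[0]
    y = np.zeros(d)
    y[clamped_idx] = clamped_vals
    free = np.setdiff1d(np.arange(d), clamped_idx)
    for _ in range(n_iter):
        y[free] = np.tanh(W[free] @ y / T)     # fixed-point iteration for the free units
    return np.outer(y, y)                      # the [y_i y_j] entries used in the weight update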
