B. MEHLIG
Department of Physics
University of Gothenburg
Göteborg, Sweden 2019
PREFACE
These are lecture notes for my course on Artificial Neural Networks that I have given
at Chalmers (FFR135) and Gothenburg University (FIM720). This course describes
the use of neural networks in machine learning: deep learning, recurrent networks,
reinforcement learning, and other supervised and unsupervised machine-learning
algorithms.
When I first developed my lectures, my main source was the book by Hertz, Krogh,
and Palmer [1]. Other sources were the book by Haykin [2], as well as the lecture
notes of Horner [3]. My main sources for the Chapter on deep learning were the
deep-learning book by Goodfellow, Bengio & Courville [4], and the online-book by
Nielsen [5].
I am grateful to Martin Čejka, who typed the first version of my hand-written lecture
notes and made most of the Figures, and to Erik Werner and Hampus Linander
for their interest and their help in preparing Chapter 7. I would also like to thank
Johan Fries and Oleksandr Balabanov for implementing the algorithms described
in Section 7.4. Johan Fries and Marina Rafajlovic made most of the exam questions.
Finally, many students – past and present – pointed out misprints and errors and
suggested improvements. I thank them all.
CONTENTS
Preface
Contents
1 Introduction
1.1 Neural networks
1.2 McCulloch-Pitts neurons
1.3 Other models for neural computation
1.4 Summary
I Hopfield networks
2 Deterministic Hopfield networks
2.1 Associative memory problem
2.2 Hopfield network
2.3 Energy function
2.4 Spurious states
2.5 Summary
2.6 Exercises
2.7 Exam questions
4 Stochastic optimisation
4.1 Combinatorial optimisation problems
4.2 Energy functions
4.3 Simulated annealing
4.4 Monte-Carlo simulation
4.5 Summary
4.6 Further reading
4.7 Exercises
II Supervised learning
5 Perceptrons
5.1 A classification task
5.2 Iterative learning algorithm
5.3 Gradient-descent learning
5.4 Multi-layer perceptrons
5.5 Summary
5.6 Further reading
5.7 Exercises
5.8 Exam questions
1 Introduction
The term neural networks refers to networks of neurons in the mammalian brain.
Neurons are its fundamental units of computation. In the brain they are connected
together in networks to process data. This can be a very complex task, and the
dynamics of neural networks in the mammalian brain in response to external stimuli
can therefore be quite intricate. Inputs and outputs of each neuron vary as functions
of time, in the form of so-called spike trains, but also the network itself changes. We
learn and improve our data-processing capacities by establishing reconnections
between neurons.
Neural-network algorithms are inspired by the architecture and the dynamics
of networks of neurons in the brain. Yet the algorithms use neuron models that
are highly simplified, compared with real neurons. Nevertheless, the fundamental
principle is the same: artificial neural networks learn by reconnection. Such net-
works can perform a multitude of information-processing tasks. They can learn to
recognise structures in a set of training data and generalise what they have learnt to
other data sets (supervised learning). A training set contains a list of input data sets,
together with a list of the corresponding target values that encode the properties of
the input data that the network is supposed to learn. Artificial neural networks can
solve such association tasks well, provided that the new data sets are governed
by the same principles that gave rise to the training data.
A prime example for a problem of this type is object recognition in images, for
instance in the sequence of camera images of a self-driving car. Recently the use
of neural networks for object recognition has exploded. There are several reasons
for this strong interest. It is driven, first, by the acute need for such algorithms in
industry. Second, there are now much better image databases available for training
the networks. Third, there is better hardware so that networks with many layers
containing many neurons can be efficiently trained (deep learning) [4, 6].
Another task where neural networks excel is machine translation. These networks
are dynamical (recurrent). They take an input sequence of words or sometimes
single letters. As one feeds the inputs word by word, the network outputs the words
in the translated sentence. Recurrent networks can be efficiently trained on large
training sets of input sentences and their translations. Google Translate works in
this way [7].
Artificial neural networks are good at analysing large sets of high-dimensional
data where it may be difficult to determine a priori which properties are of interest.
In this case one often relies on unsupervised learning algorithms, where the network
learns without a training set. Instead, it determines by itself the categories in terms
of which the data can be analysed. In this way, artificial neural networks can detect familiarity
Figure 1.1: Neurons in the cerebral cortex (outer layer of the cerebrum, the largest and
best developed part of the mammalian brain) of a macaque, an Asian monkey. Repro-
duced by permission of brainmaps.org [9] under the Creative Commons Attribution 3.0
License. The labels were added.
(which input patterns occur most often), clusters, and other structures in the input
data. Unsupervised-learning algorithms work well when there is redundancy in the
input data that is not immediately obvious because the data is high dimensional.
The different algorithms have much in common. They share the same building
blocks: the neurons are modeled as linear threshold units (McCulloch-Pitts neurons),
and the learning rules are similar (Hebb’s rule). Closely related questions arise also
regarding the network dynamics. A little bit of noise (not too much!) can improve the
performance, and ensures that the long-time dynamics approaches a steady state.
This makes it possible to analyse the convergence of the algorithms using the
central-limit theorem.
Figure 1.3: Spike train in an electrosensory pyramidal neuron of a fish (Eigenmannia).
Time series from Ref. [10]. Reproduced by permission of the publisher.
Periods without spikes occur when the neuron is inactive; during spike-rich periods
the neuron is active.
As written, this computation is performed for all neurons i in parallel, and the
outputs ni are the inputs to all neurons at the next time step, therefore the outputs
have the time argument t + 1. These steps are repeated many times, resulting in
Figure 1.4: Schematic diagram of a McCulloch-Pitts neuron, which computes
n_i(t + 1) = θ_H( Σ_{j=1}^N w_{ij} n_j(t) − µ_i ) .
The index of the neuron is i; it receives inputs from N other neurons. The strength
of the connection from neuron j to neuron i is denoted by w_{ij}. The activation
function θ_H(b) is the Heaviside function. It is equal to zero for b < 0 and equal
to unity for b > 0. The threshold value for neuron i is denoted by µ_i. The index
t = 0, 1, 2, 3, . . . labels the discrete time sequence of computation steps.
time series of the activity levels of all neurons in the network, referred to as neural
dynamics.
Now consider the details of the computation step, Equation (1.2). The function
θ_H(b) is the activation function. Its argument is often referred to as the local field,
b_i(t) = Σ_j w_{ij} n_j(t) − µ_i. Since the neurons can only assume the states 0/1, the
activation function is taken to be the Heaviside function, θ_H(b) = 0 if b < 0 and
θ_H(b) = 1 if b > 0 (Figure 1.5). The Heaviside function is not defined at b = 0. To
avoid problems in our computer algorithms, we usually take θ_H(0) = 1.
Equation (1.2) shows that the neuron performs a weighted linear average of the
inputs n_j(t). The weights w_{ij} are called synaptic weights. Here the first index, i,
refers to the neuron that does the computation, and j labels all neurons that connect
to neuron i. The connection strengths between different pairs of neurons are in
general different, reflecting different strengths of the synaptic couplings. When the
value of w_{ij} is positive, the coupling is excitatory; when it is negative, the coupling
is inhibitory.
Figure 1.5: Heaviside function θ_H(b).
Figure 1.6: Continuous activation function g(b).
Different schemes for choosing neurons are used. One possibility is to arrange the
neurons into an array and to update them one by one, in a certain order (typewriter
scheme). A second possibility is to choose randomly which neuron to update.
This introduces stochasticity into the neural dynamics. This is very important, and
we will see that there are different ways of introducing stochasticity. Random asyn-
chronous updating is one example. In many scientific problems it is advantageous
to avoid stochasticity, when randomness is due to errors (multiplicative or additive
noise) that diminish the performance of the system. In neural-network dynamics,
by contrast, stochasticity is often helpful, as we shall see below.
1.4 Summary
Artificial neural networks use a highly simplified model for the fundamental
computation unit, the neuron. In its simplest form, the model is just a binary
threshold unit. The units are linked together by weights w_{ij}, and each unit
computes a weighted average of its inputs. The network performs these computations
in sequence. Usually
one considers discrete sequences of computation time steps, t = 0, 1, 2, 3, . . .. Either
all neurons are updated simultaneously in one time step (synchronous updating), or
only one chosen neuron is updated (asynchronous updating). Most neural-network
algorithms are built using the model described in this Chapter.
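The model of this Chapter can be sketched in a few lines of code. The following is a minimal illustration (Python with NumPy; the weights here are arbitrary numbers, not the result of any learning rule) of synchronous and random asynchronous updating of McCulloch-Pitts neurons, with the convention θ_H(0) = 1:

```python
import numpy as np

def theta_H(b):
    """Heaviside activation: 0 for b < 0, and 1 for b >= 0
    (the convention theta_H(0) = 1 adopted in the text)."""
    return np.where(b >= 0, 1, 0)

def synchronous_step(n, w, mu):
    """All neurons in parallel: n_i(t+1) = theta_H(sum_j w_ij n_j(t) - mu_i)."""
    return theta_H(w @ n - mu)

def asynchronous_step(n, w, mu, rng):
    """Update a single, randomly chosen neuron (random asynchronous updating)."""
    i = rng.integers(len(n))
    n = n.copy()
    n[i] = theta_H(w[i] @ n - mu[i])
    return n

rng = np.random.default_rng(0)
N = 5
w = rng.normal(size=(N, N))      # synaptic weights w_ij (arbitrary here)
mu = np.zeros(N)                 # thresholds mu_i
n = rng.integers(0, 2, size=N)   # initial 0/1 states
n = synchronous_step(n, w, mu)   # one time step of the network dynamics
```

Iterating `synchronous_step` produces the time series of activity levels referred to above as the neural dynamics.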
PART I
HOPFIELD NETWORKS
2 Deterministic Hopfield networks
The Hopfield network [14] is an artificial neural network that can recognise or
reconstruct images. Consider for example the binary images of digits in Figure 2.1.
These images can be stored in the artificial neural network by assigning the weights
wi j in a certain way (called Hebb’s rule). Then one feeds a distorted image of one of
the digits (Figure 2.2) to the network by assigning the initial states of the neurons in
the network to the bits in the distorted image. The idea is that the neural-network
dynamics converges to the correct undistorted digit. In this way the network can
recognise the input as a distorted image of the correct digit (retrieve this digit). The
point is that the network may recognise patterns with many bits very efficiently.
This idea is quite old though. In the past such networks were used to perform
pattern recognition tasks. Today there are more efficient algorithms for this purpose
(Chapter 7).
Yet the first part of these lectures deals with Hopfield networks, for several reasons.
First, Hopfield nets form the basis for more recent algorithms such as Boltzmann
machines [2] and deep-belief networks [2]. Second, all other neural-network algo-
rithms discussed in these lectures are built from the same building blocks and
use learning rules that are closely related to Hebb’s rule. Third, Hopfield networks
can solve optimisation problems, and the resulting algorithm is closely related to
Markov-chain Monte-Carlo algorithms which are much used for a wide range of
problems in Physics and Mathematical Statistics. Fourth, and most importantly, a
certain degree of noise (not too much) can substantially improve the performance
of Hopfield networks, and it is understood in detail why. The reason that so much is
known about the role of noise in Hopfield networks is that they are closely related to
stochastic systems studied in Physics, namely random magnets and spin glasses. The
point is: understanding the effect of noise on the dynamics of Hopfield networks
helps to analyse the performance of other neural-network models.
Figure 2.1: Binary representation of the digits 0 to 4. Each digit has 16 × 10 pixels.
to random patterns where each bit x_i^{(µ)} is chosen randomly. Each bit takes either
value with probability 1/2, and different bits (in the same and in different patterns)
are independent. It is convenient to gather the bits of a pattern in a column vector
x^{(µ)} = [x_1^{(µ)}, x_2^{(µ)}, . . . , x_N^{(µ)}]^T .   (2.1)
The Hamming distance equals the number of bits by which the patterns differ.
Two patterns are identical if they have Hamming distance zero. For 0/1 patterns,
Equation (2.2) is equivalent to:
h_µ / N = (1/N) Σ_{i=1}^N ( x_i^{(µ)} − x_i )² .   (2.3)
This means that the Hamming distance is given by the mean-squared error, summed
over all bits. Note that the Hamming distance does not refer to distortions by trans-
lations, rotations, or shearing. An improved version of the distance involves taking
Figure 2.2: Binary image (N = 160) of the digit 0, and a distorted version of the same
image.
the minimum distance between the patterns subject to all possible translations,
rotations, and so forth.
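Equation (2.3) translates directly into code. A minimal sketch (Python; the function name is illustrative):

```python
import numpy as np

def hamming_distance(x, y):
    """Number of bits by which two 0/1 patterns differ. For 0/1 bits,
    (x_i - y_i)^2 equals 1 exactly when the bits differ, so the
    squared-error sum of Equation (2.3) counts the differing bits."""
    x, y = np.asarray(x), np.asarray(y)
    return int(np.sum((x - y) ** 2))

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 1, 0, 0])
print(hamming_distance(x, y))  # → 2 (the patterns differ in two bits)
```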
In summary, the association task is to find the index ν for which the Hamming
distance hν is minimal, hν ≤ hµ for all µ = 1, . . . , p . How can one solve this task using
a neural network? One feeds the distorted pattern x with bits xi into the network by
assigning ni (t = 0) = xi . Assume that x is a distorted version of x (ν) . Now the idea is
to find a set of weights wi j so that the network dynamics converges to the correct
stored pattern:
n_i(t) → x_i^{(ν)}   as t → ∞ .   (2.4)
Which weights to choose depends on the patterns x (µ) , so the weights must be
functions of x (µ) . We say that we store these patterns in the network by choosing
the appropriate weights. If the network converges as in Equation (2.4), the pattern
x (ν) is said to be an attractor in the space of all possible states of this network, the
so-called configuration space or state space.
instead of 0/1, because this simplifies the mathematical analysis as we shall see.
The transformation from ni ∈ {0, 1} to Si ∈ {−1, 1} is straightforward:
Si = 2ni − 1 . (2.6)
As before i is the index of the chosen neuron. The arrow indicates that Si (t + 1) is
assigned the r.h.s of this equation evaluated at S j (t ). The synchronous update rule
reads
S_i ← sgn( Σ_j w_{ij} S_j − θ_i ) ≡ sgn(b_i) ,   (2.9)
where all bits are updated in parallel. In Eqs. (2.8) and (2.9) the argument of the
activation function is denoted by bi , sometimes called the local field.
Now we need a strategy for choosing the weights w_{ij}, so that the patterns x^{(µ)} are
attractors. If one feeds a pattern x close to x^{(ν)} to the network, we want the network
to converge to x^{(ν)}.
This means that the network succeeds in correcting a small number of errors. If the
number of errors is too large, the network may converge to another pattern. The
region in configuration space around pattern x (ν) in which all patterns converge to
x (ν) is called the region of attraction of x (ν) .
However, we shall see that it is in general very difficult to prove convergence
according to Eq. (2.10). Therefore we try to answer a different question first: if one
feeds one of the undistorted patterns x (ν) , does the network recognise that it is one
of the stored, undistorted patterns? The network should not make any changes to
x (ν) because all bits are correct:
w_{ij} = (1/N) x_i^{(1)} x_j^{(1)}   and   θ_i = 0 .   (2.12)
We say that the pattern x (1) is stored in the network by assigning the weights wi j using
the rule (2.12). Note that the weights are symmetric, w_{ij} = w_{ji}. To check that the
rule (2.12) does the trick, feed the pattern to the network by assigning S_j(t = 0) = x_j^{(1)},
and evaluate Equation (2.8):
Σ_{j=1}^N w_{ij} x_j^{(1)} = (1/N) Σ_{j=1}^N x_i^{(1)} x_j^{(1)} x_j^{(1)} = x_i^{(1)} .   (2.13)
The last equality follows because x_j^{(1)} can only take the values ±1, so that
[x_j^{(1)}]² = 1 and the sum over j evaluates to N. Applying the signum function,
sgn( Σ_{j=1}^N w_{ij} x_j^{(1)} ) = x_i^{(1)} .   (2.14)
Figure 2.4: Hopfield network with N = 4 neurons. (a) Network layout. The network
has four neurons, ◦, and the arrows indicate symmetric connections. (b) Pattern
x^{(1)T} = [1, −1, −1, 1], i.e. x_1^{(1)} = x_4^{(1)} = 1 and x_2^{(1)} = x_3^{(1)} = −1.
Here T denotes the transpose of the column vector x^{(1)}.
The factor 1/4 takes into account that the patterns take the values ±1 and not 0/1 as in
Section 2.1. To feed the pattern to the network, one sets Si (t = 0) = xi . Now iterate the
dynamics using synchronous updating (2.9). Results for different distorted patterns
are shown in Figure 2.5. We see that the first two distorted patterns (distance 1)
converge to the stored pattern, cases (a) and (b). But the third distorted pattern
does not [case (c)].
To understand this behaviour it is most convenient to analyse the synchronous
dynamics using the weight matrix
W = (1/N) x^{(1)} x^{(1)T} .   (2.16)
Here x (1)T denotes the transpose of the column vector x (1) , so that x (1)T is a row vector.
The standard rules for matrix multiplication apply also to column and row vectors,
they are just N × 1 and 1 × N matrices. This means that the product on the r.h.s. of
Equation (2.16) is an N × N matrix. In the following, matrices with elements A i j or
Bi j are written as A, B, and so forth. The product in Equation (2.16) is also referred
to as an outer product. The product
x^{(1)T} x^{(1)} = Σ_{j=1}^N [x_j^{(1)}]² = N ,   (2.17)
Figure 2.5: Distorted patterns (a), (b), and (c) fed to the network, and the patterns
that the synchronous dynamics converges to.
by contrast, is just a number (equal to N ). The product (2.17) is also called scalar
product. It is denoted by x (1) · x (1) = x (1)T x (1) .
Using Equation (2.17) we see that W projects onto the vector x^{(1)},
W x^{(1)} = x^{(1)} ,   (2.18)
and that W is idempotent,
W^n = W   for n = 1, 2, 3, . . . .   (2.19)
Equations (2.18) and (2.19) mean that the network recognises the pattern x (1) as
the stored one. The pattern is not updated [Eq. (2.11)]. This example illustrates the
general proof, Equations (2.13) and (2.14).
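The one-pattern example can be checked numerically. The sketch below (Python with NumPy, for illustration) builds the weight matrix of Equation (2.16) for the pattern of Figure 2.4, verifies that the stored pattern and its inverse are fixed points, and corrects the one-bit distortion (a) in a single synchronous step:

```python
import numpy as np

def sgn(b):
    """Signum with the convention sgn(0) = 1."""
    return np.where(b >= 0, 1, -1)

N = 4
x1 = np.array([1, -1, -1, 1])   # stored pattern of Figure 2.4
W = np.outer(x1, x1) / N        # weight matrix (2.16), w_ij = x_i^(1) x_j^(1) / N

# The stored pattern is a fixed point of the dynamics...
assert np.array_equal(sgn(W @ x1), x1)
# ...and so is the inverted pattern -x^(1).
assert np.array_equal(sgn(W @ -x1), -x1)

# Distorted pattern (a): one wrong bit. W S(0) = (1/2) x^(1), so one
# synchronous update restores the stored pattern.
S0 = np.array([-1, -1, -1, 1])
print(sgn(W @ S0))              # → [ 1 -1 -1  1]
```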
Now consider the distorted pattern (a) in Figure 2.5. We feed this pattern to the
network by assigning
S(t = 0) = [−1, −1, −1, 1]^T .   (2.20)
Then
x^{(1)T} S(t = 0) = [1, −1, −1, 1] [−1, −1, −1, 1]^T = 2 ,   (2.21)
so that
W S(t = 0) = (1/2) x^{(1)} .   (2.22)
This means that the state of the network converges to the stored pattern, in one
synchronous update. Since W is idempotent, the network stays there: the pattern
x (1) is an attractor. Case (b) in Figure 2.5 works in a similar way.
Now look at case (c), where the network fails to converge to the stored pattern.
We feed this pattern to the network by assigning S (t = 0) = [−1, 1, −1, −1]T . For one
iteration of the synchronous dynamics we first evaluate
x^{(1)T} S(0) = [1, −1, −1, 1] [−1, 1, −1, −1]^T = −2 .   (2.24)
It follows that
W S(t = 0) = −(1/2) x^{(1)} .   (2.25)
Using the update rule (2.9) we find
S(t = 1) = −x^{(1)} .   (2.26)
Thus the network shown in Figure 2.4 has two attractors, the pattern x (1) as well as
the inverted pattern −x (1) . This is a general property of McCulloch-Pitts dynamics
with Hebb’s rule: if x^{(1)} is an attractor, then the pattern −x^{(1)} is an attractor too. But
one ends up in the correct pattern x^{(1)} when more than half of the bits in S(t = 0) are
correct.
In summary we have shown that Hebb’s rule (2.12) allows the Hopfield network
to recognise a stored pattern: if we feed the stored pattern without any distortions
to the network, then it does not change the bits. This does not mean, however, that
the network recognises distorted patterns. It may or may not converge to the correct
pattern. We expect that convergence is more likely when the number of wrong bits
is small. If all distorted patterns near the stored pattern x (1) converge to x (1) then we
say that x (1) is an attractor. If x (1) is an attractor, then −x (1) is too.
When there is more than one pattern, Hebb’s rule (2.12) must be generalised. A
guess is to simply sum Equation (2.12) over the stored patterns:
w_{ij} = (1/N) Σ_{µ=1}^p x_i^{(µ)} x_j^{(µ)}   and   θ_i = 0   (2.28)
(Hebb’s rule for p > 1 patterns). As for p = 1, the weight matrix is symmetric, W = W^T,
so that w_{ij} = w_{ji}. The diagonal weights are not zero in general. An alternative
version of Hebb’s rule [2] sets the diagonal weights to zero:
w_{ij} = (1/N) Σ_{µ=1}^p x_i^{(µ)} x_j^{(µ)}   for i ≠ j ,   w_{ii} = 0 ,   and   θ_i = 0 .   (2.29)
If we store only one pattern, p = 1, this modified Hebb’s rule (2.29) satisfies
Equation (2.11). In this Section we use Equation (2.29).
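Hebb’s rule (2.29) for several patterns can be sketched as follows (Python; the helper name and the sizes p = 2, N = 100 are illustrative):

```python
import numpy as np

def hebb_weights(patterns):
    """Hebb's rule (2.29): w_ij = (1/N) sum_mu x_i^(mu) x_j^(mu) for i != j,
    with zero diagonal weights, w_ii = 0."""
    patterns = np.asarray(patterns)   # shape (p, N), entries +/-1
    _, N = patterns.shape
    W = patterns.T @ patterns / N     # (1/N) sum over the p stored patterns
    np.fill_diagonal(W, 0.0)          # the modified rule sets w_ii = 0
    return W

rng = np.random.default_rng(1)
p, N = 2, 100                         # illustrative sizes, so p/N = 0.02 is small
patterns = rng.choice([-1, 1], size=(p, N))
W = hebb_weights(patterns)

# The weight matrix is symmetric with zero diagonal; for such a small p/N
# the stored patterns are reproduced by a single update (Section 2.2).
S = np.where(W @ patterns[0] >= 0, 1, -1)
print(np.array_equal(S, patterns[0]))
```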
If we assign the weights according to Equation (2.29), does the network recognise
distorted patterns? We saw in the previous Section that this question is difficult
to answer in general, even for p = 1. Therefore we ask, first, whether the network
recognises the stored pattern x (ν) . The question is whether
sgn( (1/N) Σ_{j≠i} Σ_µ x_i^{(µ)} x_j^{(µ)} x_j^{(ν)} ) = x_i^{(ν)} ,   (2.30)
where the argument of the signum function is denoted by b_i^{(ν)}.
To check whether Equation (2.30) holds, we must repeat the calculation described
on page 14. As a first step we evaluate the argument of the signum function,
b_i^{(ν)} = (1 − 1/N) x_i^{(ν)} + (1/N) Σ_{j≠i} Σ_{µ≠ν} x_i^{(µ)} x_j^{(µ)} x_j^{(ν)} .   (2.31)
Here we have split the sum over the patterns into two contributions. The first term
corresponds to µ = ν, where ν refers to the pattern that was fed to the network, the
one that we want the network to recognise. The second term in Equation (2.31)
contains the sum over the remaining patterns. For large N we can approximate
(1 − 1/N) ≈ 1. It follows that condition (2.30) is satisfied if the second term in (2.31)
does not affect the sign of the r.h.s. of this Equation. This second term is called the
cross-talk term.
Whether adding the cross-talk term to x_i^{(ν)} changes the sign of the r.h.s. of
Equation (2.31) or not depends on the stored patterns. Since the cross-talk term
contains a sum over µ we may expect that the cross-talk term does not matter if p is
small enough. If this is true for all i and ν then all p stored patterns are recognised.
Furthermore, by analogy with the example described in the previous Section, we
may expect that the stored patterns are then also attractors, so that patterns close
to x^{(ν)} converge to x^{(ν)} under the network dynamics (but this is not guaranteed).
For a more quantitative analysis of the effect of the cross-talk term we store
patterns with random bits (random patterns),
Prob(x_i^{(ν)} = ±1) = 1/2 .   (2.32)
If C_i^{(ν)} < 0 then the cross-talk term has the same sign as x_i^{(ν)}, so that it does
not matter. If 0 < C_i^{(ν)} < 1 it does not matter either; it matters only when
C_i^{(ν)} > 1. The network produces an error in updating neuron i if C_i^{(ν)} > 1 for
particular values i and ν: if initially S_i(0) = x_i^{(ν)}, the sign of the bit changes under
the update although it should not – so that an error results.
How frequently does this happen? For random patterns we can answer this
question by computing the one-step (t = 1) error probability:
t =1 (ν)
Perror = Prob(Ci > 1) . (2.35)
(ν)
Since patterns and bits are identically distributed, Prob(Ci > 1) does not depend
t =1
on i or ν. Therefore Perror does not carry any indices.
How does P_error^{t=1} depend on the parameters of the problem, p and N? When both p
and N are large we can use the central-limit theorem to answer this question. Since
different bits/patterns are independent, we can think of C_i^{(ν)} as a sum of independent
random numbers c_m that take the values −1 and +1 with equal probabilities,
C_i^{(ν)} = −(1/N) Σ_{j≠i} Σ_{µ≠ν} x_i^{(µ)} x_j^{(µ)} x_j^{(ν)} x_i^{(ν)} = −(1/N) Σ_{m=1}^{(N−1)(p−1)} c_m .   (2.36)
There are M = (N − 1)(p − 1) terms in the sum on the r.h.s. because terms with µ = ν
are excluded, and also those with j = i [Equation (2.29)]. If we use Equation (2.28)
instead, then there is a correction to Equation (2.36) from the diagonal weights. For
p ≪ N this correction is small.
When p and N are large, the sum Σ_m c_m contains a large number of independently,
identically distributed random numbers with mean zero and variance unity. It
follows from the central-limit theorem that (1/N) Σ_m c_m is Gaussian distributed
with mean zero, and with variance
σ_C² = (1/N²) ⟨( Σ_{m=1}^M c_m )²⟩ = (1/N²) Σ_{n=1}^M Σ_{m=1}^M ⟨c_n c_m⟩ .   (2.37)
Figure 2.6: Gaussian distribution of C_i^{(ν)}. The hashed area, where C_i^{(ν)} > 1,
equals the error probability P_error^{t=1}.
Here
(M choose k) = M! / [k! (M − k)!]   (2.40)
is the number of ways in which k occurrences of +1 can be distributed over M
places. We expect that the quantity 2k − M is Gaussian distributed with mean
zero and variance M. To demonstrate this, it is convenient to use the variable
z = (2k − M)/√M, which is Gaussian with mean zero and unit variance. Therefore
we substitute k = M/2 + (√M/2) z into Equation (2.39) and take the limit of large M using
Stirling’s approximation
n! ≈ e^{n log n − n + (1/2) log 2πn} .   (2.41)
Expanding P_{k,M} to leading order in M^{−1}, assuming that z remains of order unity,
gives P_{k,M} = √(2/(πM)) exp(−z²/2). From P(z) dz = P(k) dk it follows that P(z) =
Figure 2.7: One-step error probability P_error^{t=1} as a function of the storage capacity α.
other hand, it is easier to distinguish between stored patterns if they have more bits.
We also see that the one-step error probability depends on p and N only through
the combination
α ≡ p/N .   (2.45)
The parameter α is called the storage capacity of the network. Figure 2.7 shows
how P_error^{t=1} depends on the storage capacity. Take α = 0.185 for example. Then the
one-step error probability (the probability of an error in one asynchronous attempt
to update a bit) is about 1%.
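The closed-form expression for P_error^{t=1} falls outside this excerpt, but it follows from the Gaussian distribution of Figure 2.6: assuming that C_i^{(ν)} is Gaussian with mean zero and variance σ_C² ≈ p/N = α (the outcome of the variance calculation sketched above), the error probability is the Gaussian tail above 1, P_error^{t=1} = (1/2) erfc(1/√(2α)). A quick numerical check of the ~1% figure quoted for α = 0.185:

```python
import math

def one_step_error_probability(alpha):
    """Gaussian-tail estimate of the one-step error probability:
    Prob(C > 1) for C Gaussian with mean zero and variance alpha.
    Assumes sigma_C^2 = p/N = alpha, as the central-limit argument suggests."""
    return 0.5 * math.erfc(1.0 / math.sqrt(2.0 * alpha))

print(round(one_step_error_probability(0.185), 3))  # → 0.01, the ~1% quoted above
```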
The error probability defined in this Section refers only to the initial update,
the first iteration. What happens in the next iteration, and after many iterations?
Numerical experiments show that the error probability can be much higher in later
iterations, because more errors tend to increase the probability of making another
error. The estimate P_error^{t=1} is a lower bound.
Also, realistic patterns are not random with independent bits. We nevertheless
expect that P_error^{t=1} describes the typical one-step error probability of the Hopfield
network when p and N are large. However, it is straightforward to construct counter-
examples. Consider for example orthogonal patterns. The cross-talk term vanishes
in this case, so that P_error^{t=1} = 0.
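The one-step error probability can also be measured directly. The sketch below (an illustrative Monte Carlo experiment, not part of the notes) stores p random patterns with the rule (2.29), feeds each stored pattern to the network, and counts how often a single update flips a bit:

```python
import numpy as np

def estimate_one_step_error(p, N, trials=100, seed=0):
    """Monte Carlo estimate of the one-step error probability: store p random
    patterns with Hebb's rule (2.29) (zero diagonal), feed each stored pattern
    to the network, and count the fraction of bits that flip in one update."""
    rng = np.random.default_rng(seed)
    errors, total = 0, 0
    for _ in range(trials):
        x = rng.choice([-1, 1], size=(p, N))
        W = x.T @ x / N
        np.fill_diagonal(W, 0.0)
        b = x @ W                      # local fields b_i^(nu); W is symmetric
        S = np.where(b >= 0, 1, -1)    # one update with the convention sgn(0) = 1
        errors += np.sum(S != x)       # bits that changed although they should not
        total += p * N
    return errors / total

# For alpha = p/N = 0.2 the estimate is of order 1e-2, consistent with Figure 2.7.
print(estimate_one_step_error(p=20, N=100))
```

For p = 1 the estimate is exactly zero, in line with Equation (2.14).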
H = −(1/2) Σ_{ij} w_{ij} S_i S_j .   (2.47)
The name comes from an analogy to spin systems in Physics. An alternative name for
H is Hamiltonian. The energy function (2.47) allows us to analyse the convergence of
the dynamics of the Hopfield model. More generally, energy functions are important
tools in analysing the convergence of different kinds of neural networks. A second
reason for considering the energy function is that it allows us to derive Hebb’s rule
in a different way.
We can write the energy function as
H = −(1/2) Σ_{⟨ij⟩} (w_{ij} + w_{ji}) S_i S_j + const.   (2.48)
There are two possibilities, either S_i' = S_i or S_i' = −S_i. In the first case H remains
unchanged, H' = H. Here H' refers to the value of the energy function after the
update (2.49). The other case is S_i' = −S_i. Then
H' − H = −(1/2) Σ_j (w_{ij} + w_{ji})(S_i' S_j − S_i S_j) = Σ_j (w_{ij} + w_{ji}) S_i S_j .   (2.50)
The sum goes over all neurons j that are connected to the neuron i that is updated
in Equation (2.49). Now if the weights are symmetric, H' − H equals
H' − H = 2 Σ_j w_{ij} S_i S_j .   (2.51)
Since the sign of Σ_j w_{ij} S_j is that of S_i', and since the sign of S_i' differs from that of S_i,
it follows from Equation (2.51) that
H' − H < 0 .   (2.52)
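The monotonic decrease of H can be checked numerically. The following sketch (illustrative; weights from Hebb's rule with zeroed diagonal, and the convention sgn(0) = 1) asserts that H never increases along a random asynchronous trajectory:

```python
import numpy as np

def energy(S, W):
    """Energy function (2.47): H = -1/2 sum_ij w_ij S_i S_j."""
    return -0.5 * S @ W @ S

rng = np.random.default_rng(2)
N, p = 50, 3
x = rng.choice([-1, 1], size=(p, N))
W = x.T @ x / N                  # symmetric Hebb weights, Equation (2.28)...
np.fill_diagonal(W, 0.0)         # ...with zero diagonal, Equation (2.29)

S = rng.choice([-1, 1], size=N)  # random initial state
H = energy(S, W)
for _ in range(1000):            # random asynchronous updating
    i = rng.integers(N)
    S[i] = 1 if W[i] @ S >= 0 else -1
    H_new = energy(S, W)
    assert H_new <= H + 1e-12    # H never increases for symmetric weights
    H = H_new
print(H)                         # energy of the final state
```

The assertion relies on the symmetry of the weights, as in Equation (2.52); with asymmetric weights it may fail.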
Figure 2.8: Minima in the energy function are attractors in state space. Not all minima
correspond to stored patterns, and stored patterns need not correspond to minima.
The configuration space – the set of all states (S_1, . . . , S_N) – is illustrated
schematically by a single axis, the x-axis. But when N is
large, the state space is really very high dimensional.
Not all stored patterns are attractors. Our analysis of the cross-talk term showed
this. If the cross-talk term causes errors for a certain stored pattern that is fed into
the network, then this pattern is not located at a minimum of the energy function.
Conversely there may be minima that do not correspond to stored patterns. Such
states are referred to as spurious states. The network may converge to spurious
states; this is undesirable but inevitable.
We now turn to an alternative derivation of Hebb’s rule that uses the energy
function. To keep things simple we assume p = 1 at first. The idea is to write down
an energy function that assumes a minimum at the stored pattern x (1) :
H = −(1/2N) ( Σ_{i=1}^N S_i x_i^{(1)} )² .   (2.53)
This function is minimal when S_i = x_i^{(1)} for all i (and also when S_i = −x_i^{(1)}). You will
see in a moment why the factor 1/(2N) is inserted. Now we evaluate the expression
on the r.h.s.:
H = −(1/2N) Σ_i S_i x_i^{(1)} Σ_j S_j x_j^{(1)} = −(1/2) Σ_{ij} (1/N) x_i^{(1)} x_j^{(1)} S_i S_j ,   (2.54)
where the factor (1/N) x_i^{(1)} x_j^{(1)} is recognised as the weight w_{ij}.
This shows that the function H has the same form as the energy function (2.47) for
the Hopfield model, if we assign the weights wi j according to Hebb’s rule (2.12).
Thus this argument provides an alternative motivation for this rule: we write down
an energy function that has a minimum at the stored pattern x (1) . This ensures that
H = −(1/2N) Σ_{µ=1}^p ( Σ_{i=1}^N S_i x_i^{(µ)} )² .   (2.55)
Here the patterns x^{(ν)} are not necessarily minima of H, because a maximal value
of ( Σ_{i=1}^N S_i x_i^{(ν)} )² may be compensated by terms stemming from other patterns. But
one can hope that this happens rarely when p is small (Section 2.2).
Mixed states come in large numbers, 2^{2n+1} (p choose 2n+1), the more the larger n.
One may ask whether such a mixed state is reproduced by the network,
sgn( (1/N) Σ_{µ=1}^p Σ_{j=1}^N x_i^{(µ)} x_j^{(µ)} x_j^{(mix)} ) = x_i^{(mix)} ,   (2.58)
under the update (2.8) using Hebb’s rule (2.28). To this end we split the sum in the
usual fashion
p N 3 N
1 X X (µ) (µ) (mix) X (µ) 1 X (µ) (mix)
x x x = xi x x + cross-talk term . (2.59)
N µ=1 j =1 i j j µ=1
N j =1 j j
Let us ignore the cross-talk term for the moment and check whether the first term
(mix)
reproduces xi . To make progress we assume random patterns [Equation (2.32)],
(mix)
and compute the probability that the sum on the r.h.s of Equation (2.59) yields xi .
Patterns i and j are uncorrelated, and the sum over j on the r.h.s. of Equation (2.59)
(µ) (mix)
is an average over sµ = x j x j . Table 2.1 lists all possible combinations of bits of
pattern j and the corresponding values of sµ . We see that on average 〈sµ 〉 = 12 , so
that
p N 3
1 X X (µ) (µ) (mix) 1 X (µ)
x x x = x + cross-talk term . (2.60)
N µ=1 j =1 i j j 2 µ=1 i
Neglecting the cross-talk term and taking the sgn-function we see that x (mix) is
reproduced. So mixed states such as (2.57) are recognised, at least for small α, and
it may happen that the network converges to these states, loosely referred to as
superpositions of odd numbers of patterns.
Finally, for large values of p there are local minima of H that are not correlated
(µ)
with any number of the stored patterns x j . Such spin-glass states are discussed
further in the book by Hertz, Krogh and Palmer [1].
SUMMARY 27
2.5 Summary
We have analysed the dynamics of the Hopfield network as a means of solving the
associative memory problem (Algorithm 1). The Hopfield network is a network of
McCulloch-Pitts neurons. Its layout is defined by connection strengths wi j , chosen
according to Hebb’s rule. These weights wi j are symmetric, and the network is in
general fully connected. Hebb’s rule ensures that stored patterns x (µ) are recognised,
at least most of the time if the storage capacity is not too large. A single-step estimate
for the error probability was given in Section 2.2. If one iterates several steps, the
error probability is generally much larger, and it is difficult to evaluate it. It turns out
that it is much simpler to compute the error probability when noise is introduced
into the network dynamics (not just random patterns).
2.6 Exercises
Modified Hebb’s rule. Show that the modified rule Hebb’s rule (2.29) satisfies Equa-
tion (2.11) if we store only one pattern, for p = 1.
Orthogonal patterns. Show that the cross-talk term vanishes for orthogonal pat-
t =1
terns, so that Perror = 0.
Mixed states. Explain why there are no mixed states that are superpositions of an
even number of stored patterns.
28 DETERMINISTIC HOPFIELD NETWORKS
w12
1 2
w21
One-step error probability for mixed states. Write a computer program imple-
menting the asynchronous deterministic dynamics of a Hopfield network to deter-
mine the one-step error probability for the mixed state (2.57). Plot how the one-step
error probability depends on α for N = 50 and N = 100. Repeat this exercise for
mixed patterns that are superpositions of the bits of 5 and 7 patterns.
Energy function. For the Hopfield network with two neurons shown in Figure
2.9 demonstrate that the energy function cannot increase under the determinstic
dynamics. Write the energy function as H = − w12 +w 2
21
S1S2 and use the update rule
S1 = sgn(w12S2 ). In which step do you need to assume that the weights are symmetric,
0
i=1 i=2
i=3 i=4
(1) (1)
Figure 2.10: The pattern x (1) has N = 4 bits, x1 = 1, and xi = −1 for i = 2, 3, 4.
Question 2.7.2.
Here N is the number of neurons, wi j are the weights, and the state Si of neuron i
is equal to ±1. The update rule is
N
X
Si ← sgn wi j S j . (2.62)
j =1
Pp (µ) (µ)
(a) Use Hebb’s rule: wi j = N1 µ=1 xi x j for i = 6 j , and wi i = 0 . Show that H
either decreases or stays constant after a single asynchronous update (2.62). Which
property of weights assures that this is the case? (1p).
Pp (µ) (µ)
(b) Assume that the weights are wi j = N1 µ=1 xi x j for all i , j . In this case, how
does H change after a single asynchronous update according to Eq. (2.62)? Compare
to the result of (a). Discuss. (1p).
N
X
Si ← sgn(bi ) with bi = wi j S j . (2.65)
j =1
EXAM QUESTIONS 31
1 PN
K K x S j c ≈ mν for K 1. Finally, on the right-hand-side of the resulting
j =1 i j j
PN (ν)
expression, approximate N1 i =1 by an integral over the distribution P (〈Ci 〉c ) you
obtained in (b). Evaluate the integral to find an approximate expression for mν .
(1.5p)
(c) Rewrite the expression you derived in (a) as a sum of two terms. The first term
is a sum over µ = 1, 2, 3. The second term is the cross-talk term, a sum over the
remaining values of µ. Explain why the cross-talk term can be neglected in the limit
stated above. (0.5p).
(d) Combine the results of (a), (b) and (c) to show that the network recognises the
mixed state (2.67). (0.5p).
33
Si ← sgn(bi ) , (3.1)
where bi is the local field. This rule is called deterministic, because a given set of
states Si determines the outcome of the update. To introduce noise, one replaces
the rule (3.1) by the stochastic rule
¨
+1 with probability P (bi ) ,
Si = (3.2)
−1 with probability 1 − P (bi ) ,
P (b )
β = 10
1
β =0
-4 -2 2 4 b
Figure 3.1: Probability function (3.3) used in the definition of the stochastic rule (3.2),
plotted for β = 10 and β = 0.
eter β is the noise parameter. Unfortunately it is defined the wrong way around.
When β is large the noise level is small. In particular one obtains the deterministic
dynamics as β tends to infinity (3.1):
mµ ≡ lim mµ (T ) . (3.5)
T →∞
Here
T N
1X 1 X
(µ)
mµ (T ) = Si (t )xi (3.6)
T t =1 N i =1
PN (µ)
is the average of 1
N S (t )xi
i =1 i
over the noisy dynamics of the network, for given
(µ) (1)
bits xi . Since we feed pattern x (1) to the network, we have initially Si (t = 0) = xi
MEAN-FIELD THEORY 35
m1 (T )
1
〈m1 〉
T
Figure 3.2: Illustrates how the average m1 (T ) depends upon the total iteration time T .
The light grey lines show different realisations of m1 (T ) for different realisations of the
stored patterns, at a large but finite value of N . The thick red line is the average over
the different realisations of patterns.
and thus
N
1 X (1)
Si (t = 0)xi = 1 . (3.7)
N i =1
PN (1)
After a transient, the quantity N1 i =1 Si (t )xi settles into a steady state, where it
fluctuates around a mean value with a definite distribution that is independent of
(1)
time t . If the network works well, we expect that Si (t ) remains close to xi , so that
m1 converges to a value of order unity as T → ∞. Since there is noise, this mean
value is usually smaller than unity.
Figure 3.2 illustrates how the average (3.6) converges to a definite value when T
becomes large. For finite values of N the mean m1 depends upon the stored patterns.
In this case it is useful to average m1 over different realisations of stored patterns
(thick red line in Figure 3.2). In the limit of N → ∞, the mean m1 is independent of
the stored patterns, we say that the system is self averaging.
The other order parameters are expected to be small, because the bits of the
patterns x (2) to x (p ) are independent from those of x (1) . As a consequence the indi-
(1)
vidual terms in the sum in Equation (3.6) cancel upon summation, if Si (t ) ≈ xi . In
summary, we expect that
¨
1 if µ = 1,
mµ ≈ (3.8)
0 otherwise.
〈S1 〉
β = 10
1
β =0
-4 -2 2 4
b1
-1
Figure 3.3: Average 〈S1 〉 under the dynamics (3.2), as a function of b1 for different noise
levels.
Note that this is a different average from the one defined on page 19. The average
there is over random bits, here it is over the stochastic network dynamics (3.2). We
can evaluate the time average 〈S1 〉 using Equation (3.2):
e β b1 e −β b1
〈S1 〉 = − = tanh(β b1 ) . (3.11)
e β b1 + e −β b1 e β b1 + e −β b1
Figure 3.3 illustrates how 〈S1 〉 depends upon b1 . For weak noise levels (large β ), the
threshold b1 acts as a bias. When b1 is negative then 〈S1 〉 ≈ −1, while 〈S1 〉 ≈ 1 when
b1 is positive. So the state S1 reflects the bias most of the time. When however the
noise is large (small β ), then 〈S1 〉 ≈ 0. In this case the state variable S1 is essentially
unaffected by the bias, S1 is equally likely to be negative as positive so that its average
evaluates to zero.
How can we generalise this calculation to a Hopfield model with many neurons?
It can be done approximately at least, in the limit of large values of N . Consider
MEAN-FIELD THEORY 37
This makes the problem challenging. But when N is large we may assume that bi (t )
remains essentially constant in the steady state, independent of t . The argument
is that fluctuations of S j (t ) average out when summing over j , at least when N is
large. In this case bi (t ) is approximately given by its time average:
Here the average local field 〈bi 〉 is called the mean field, and theories that neglect the
small corrections in Equation (3.13) are called mean-field theories. Let us assume
that the fluctuations in Equation (3.13) are negligible. Then the local field for neuron
i is approximately constant
bi (t ) ≈ 〈bi 〉 , (3.14)
so that the problem of evaluating the average of Si (t ) reduces to the case discussed
above, N = 1. From Equations (2.28) and (3.11) one deduces that
1 X X (µ) (µ)
〈Si 〉 = tanh(β 〈bi 〉) with 〈bi 〉 = x x 〈S j 〉 . (3.15)
N µ j 6=i i j
These mean-field equations are a set of N non-linear equations for 〈Si 〉, for given
(µ)
fixed patterns xi . The aim is to solve these equations to determine the time aver-
ages 〈Si 〉, and then the order parameters from
N
1 X (1)
mµ = 〈S j 〉x j . (3.16)
N j =1
The last equality is only approximate because the j -sum in the definition of mµ
contains the term j = i . Whether or not to include this term makes only a small
difference to mµ , in the limit of large N .
38 STOCHASTIC HOPFIELD NETWORKS
Now assume that m1 ≈ m with m of order unity, and mµ ≈ 0 for µ 6= 1. Then the
first term in the sum over µ dominates, provided that the small terms do not add up
to a contribution of order unity. This is the case if
p
α= (3.18)
N
is small enough. Roughly speaking this works if α is of order N −1 (a more precise
estimate is that α is at most (log N )/N [15]). Now we use the mean-field equations
(3.15) to approximate
(1)
〈Si 〉 = tanh(β 〈bi 〉) ≈ tanh(β m1 xi ) . (3.19)
we find
N
1 X (1)
(1)
m1 = tanh β m1 xi xi (3.20)
N i =1
(µ)
Using that tanh(z ) = − tanh(−z ) as well as the fact that the bits xi can only assume
the values ±1, we get
m1 = tanh(β m1 ) . (3.21)
This is a self-consistent equation for m1 . We assumed that m1 is of order unity. So
the question is: does this equation admit such solutions? For β → 0 there is one
solution, m1 = 0. This is not the desired one. For β → ∞, by contrast, there are
three solutions, m1 = 0, ±1. Figure 3.4 shows results of the numerical evaluation of
Equation (3.21) for intermediate values of β . Below a critical noise level there are
three solutions, namely for β larger than
βc = 1 . (3.22)
For β > βc , the solution m1 = 0 is unstable (this can be shown by computing the
derivatives of the free energy of the Hopfield network [1]). Even if we were to start
with an initial condition that corresponds to m1 = 0, the network would not stay
there. The other two solutions are stable: when the network is initialised close to
x (1) , then it converges to m1 = +m (with m > 0).
The symmetry of the problem dictates that there must also be a solution with
m1 = −m at small noise levels. This solution corresponds to the inverted pattern
−x (1) (Section 2.4). If we start in the vicinity of x (1) , then the network is unlikely to
MEAN-FIELD THEORY 39
m1
1
β −1
1
-1
Figure 3.4: Solutions of the mean-field equation (3.21). The critical noise level is βc = 1.
The dashed line corresponds to an unstable solution.
converge to −x (1) , provided that N is large enough. The probability of x (1) → −x (1)
vanishes very rapidly as N increases and as the noise level decreases. If this transition
were to happen in a simulation, the network would then stay near −x (1) for a very
long time. Consider the limit where T tends to ∞ at a finite but possibly large value
of N . Then the network would (at a very small rate) jump back and forth between
x (1) and −x (1) , so that the order parameter would average to zero. This shows that
the limits of large N and large T do not commute
In practice the interesting limit is the left one, that of a large network run for a time
T much longer than the initial transient, but not infinite. This is precisely where the
mean-field theory applies. It correponds to taking the limit N → ∞ first, at finite
but large T . This describes simulations where the transition x (1) → −x (1) does not
occur.
In summary, Equation (3.21) predicts that the order parameter converges to a
definite value, m1 , independent of the stored patterns when N is large enough.
Figure 3.2 shows that the order parameter converges for large but finite values of N .
However, the limiting value does depend on the stored patterns, as mentioned above.
The system is not self averaging. When N is finite we should therefore average the
result over different realisations of the stored patterns.
The value of 〈m1 〉 determines the average number of correctly retrieved bits in
the steady state:
N
1X (1) N
〈Ncorrect 〉 = 1 + 〈Si 〉xi = (1 + 〈m1 〉) (3.24)
2 i =1 2
The outer average is over different realisations of random patterns (the inner average
is over the network dynamics). Since m1 → 1 as β → ∞ we see that the number of
40 STOCHASTIC HOPFIELD NETWORKS
〈Ncorrect 〉 → N . (3.25)
This is expected since the stored patterns x (µ) are recognised for small enough values
of α in the deterministic limit, because the cross-talk term is negligible. But it is
important to know that the stochastic dynamics slows down as the noise level tends
to zero. The lower the noise level, the longer the network remains stuck in local
minima, so that it takes longer time to reach the steady state, and to sample the
steady-state statistics of H . Conversely when the noise is strong, then
〈Ncorrect 〉 → N2 . (3.26)
In this limit the stochastic network ceases to function. If one assigns N bits entirely
randomly, then half of them are correct, on average.
We define the error probability in the steady state as
t =∞ N − 〈Ncorrect 〉
Perror = . (3.27)
N
From Equation (3.24) we find
t =∞
Perror = 21 (1 − 〈m1 〉) . (3.28)
In the limit of β → ∞, the 〈Si 〉 converge to the mixed states (2.57) when m is given
by Equation (3.29). Using the definition of mµ and averaging over the bits of the
random patterns one finds:
p
(µ) (ν)
¬ X ¶
mµ = xi tanh β mν xi . (3.31)
ν=1
The relevant calculation is summarised in Section 2.5 of Hertz, Krogh and Palmer [1],
and the remainder of this Section follows that outline quite closely. One starts by
rewriting the mean-field equations (3.15) in terms of the order parameters mµ . Using
(µ)
X
〈Si 〉 = tanh(β xi mµ ) (3.32)
µ
we find
1 X (ν) 1 X (ν)
X
(µ)
mν = x 〈Si 〉 = x tanh β xi mµ . (3.33)
N i i N i i µ
1 X (µ)
mµ = x 〈S j 〉 for µ 6= 1
N j j
remain finite (yet small). The trick to evaluate these order parameters is to assume
random patterns, so that the mµ become random numbers that fluctuate around
zero with variance 〈mµ2 〉 (this average is an average over the ensemble of random
patterns). We use Equation (3.33) to compute the variance approximately. In the µ-
sum on the r.h.s of Equation (3.33) we must treat the term µ = ν separately (because
the index ν appears also on the l.h.s. of this equation). Also the term µ = 1 must be
treated separately, as before, because µ = 1 is the index of the pattern that was fed
to the network. Therefore it is necessary to distinguish between ν = 1 and ν =6 1. We
start with the second case and write
1 X (ν)
(1) (ν) (µ)
X
mν = x tanh β xi m1 + β xi mν + β xi mµ
N i i µ6=1
µ6=ν
1 X (ν) (1)
(1) (ν) (µ) (1)
X
= xi xi tanh β m1 + β xi xi mν + β xi xi mµ (3.34)
N i |{z} | {z } µ6=1
1
2 µ6=ν
| {z }
3
Using d
dx tanh(x ) = 1 − tanh2 (x ) one gets
1 X (ν) (1)
(µ) (1)
X
mν = xi xi tanh β m1 + β xi xi mµ
N |{z} µ6=1
1 µ6=ν
| {z }
3 (3.36)
1 X (ν) (1)
(1) (ν) (µ) (1)
X
2
+ x x β x x mν 1 − tanh β m1 + β xi xi mµ .
N i i i | i {zi } µ6=1
2 µ6=ν
(µ)
Using the fact that x (µ) = ±1 and thus [xi ]2 = 1, this expression simplifies somewhat:
1 X (ν) (1)
(µ) (1)
X
mν = xi xi tanh β m1 + β xi xi mµ +
N i µ6=1
µ6=ν
(3.37)
1 X
(µ) (1)
X
2
+β mν 1 − tanh β m1 + β xi xi mµ .
N i µ6=1
µ6=ν
The situation is similar to our discussion of the sum defining the cross-talk term in
Section 2.2. The sums in Equation (3.37) depend on the stored patterns, through
(µ)
the xi . To estimate the order parameters mµ we proceed as in Section 2.2 and
(µ)
assume that the pattern bits xi are independently randomly distributed. Since the
sums in Equation (3.37) contain many terms, we can estimate the sums using the
central-limit theorem. If the patterns are random, then
(µ) (1)
X
z≡ xi xi mµ (3.38)
µ6=1
µ6=ν
Now return to Equation (3.37). The sum N1 i in the second line can be approxi-
P
This is a self-consistent equation for m1 . In summary there are three coupled equa-
tions, for m1 , q , and σz . Equations (3.41a), (3.45), and (3.46). They must be solved
together to determine how m1 depends on β and α.
To compare with the results described in Section 2.2 we must take the determin-
istic limit, β → ∞. In this limit the coupled equations simplify [1]:
v
m2
t 2 − 2σ12z
u
β (1 − q ) = e , (3.47a)
πσ2z
α
σ2z = , (3.47b)
[1 − β (1 − q )]2
m1
m1 = erf p . (3.47c)
2σ2z
Compare this with Equation (2.43) for the one-step error probability in the deter-
ministic limit. That equation was derived for only one step in the dynamics of the
network, while Equation (3.48) describes the long-time limit. Yet it turns out that
Equation (3.48) reduces to (2.43) in the limit of α → 0. To see this one solves Equation
(3.47) by introducing the variable y = m1 / 2σ2z [1]. One obtains a one-dimensional
p
1 1
t =1
Perror = 1 − erf p
2 2α
t =∞ t =1
derived in Section 2.2. You see that Perror approaches Perror for small α. This means
that the error probability does not increase significantly as one iterates the network,
at least for small α. In this case errors in earlier iterations have little effect on the
probability that later errors occur. But the situation is different at larger values of
t =1
α. In that case Perror significantly underestimates the steady-state error probability.
46 STOCHASTIC HOPFIELD NETWORKS
Figure 3.5: Error probability as a function of the storage capacity α in the deterministic
t =1
limit. The one-step error probability Perror [Equation (2.43)] is shown as a black line,
t =∞
the steady-state error probability Perror [Equation (3.27)] is shown as a red line. In the
hashed region error avalanches increase the error probability. Similar to Figure 1 in
Ref. [17].
In the hashed region, errors in the dynamics increase the probability of errors in
subsequent steps, giving rise to error avalanches.
Figure 3.5 illustrates that there is a critical value αc where the steady-state error
probability tends to 21 . Solution of the mean-field Equations gives
αc ≈ 0.1379 . (3.50)
When α > αc the steady-state error probability equals 12 , in this region the network
produces just noise. When α is small, the error probability is small, here the network
works well. Figure 3.5 shows that the steady-state error probability changes very
abruptly near αc . Assume you store 137 patterns with 1000 bits in a Hopfield network.
Figure 3.5 demonstrates that the network can reliably retrieve the patterns. However,
if you try to store one or two more patterns, the network fails to produce any output
meaningfully related to the stored patterns. This rapid change is an example of a
phase transition. In many physical systems one observes similar transitions between
ordered and disordered phases.
What happens at higher noise levels? The numerical solution of Equations (3.41a),
(3.46), and (3.45) shows that the critical storage capacity αc decreases as the noise
level increases (smaller values of β ). This is shown schematically in Figure 3.6. In-
side the hashed region the error probability is smaller than 12 so that the network
operates reliably (although less so as one approaches the phase-transition bound-
ary). Outside this region the the error probability equals 12 . In this region the network
fails. In the limit of small α the critical noise level is βc = 1. In this limit the network
is described by the theory explained in Section 3.3, Equation (3.21).
Often these two different phases of the Hopfield network are characterised in
terms of the order parameter m1 . We see that m1 > 0 in the hashed region, while
BEYOND MEAN-FIELD THEORY 47
β −1
1
t =∞
Perror = 21
0.5
t =∞
Perror < 21
0.05 0.1
αc0.15 α
Figure 3.6: Phase diagram of the Hopfield network in the limit of large N (schematic).
t =∞ t =∞
The region with Perror < 12 is the ordered phase, the region with Perror = 12 is the disor-
dered phase. After Figure 2 in Ref. [17].
m1 = 0 outside.
e−β H (s ) .
X
Z= (3.52)
s
One can compute the order parameter by adding a threshold term to the energy
function (2.47)
1X X
H =− wi j Si S j − λSi . (3.53)
2 ij i
48 STOCHASTIC HOPFIELD NETWORKS
The outer average is over different realisations of random patterns. Since the log-
arithm of Z is difficult to average, one resorts to the replica trick. The idea is to
represent the average of the logarithm as
The function Z n looks like the partition function of n copies of the system, thus the
name replica trick. It is still debated when the replica trick works and when not [18].
Nevertheless, the most accurate theoretical result for the critical storage capacity is
obtained in this way [19]
αc = 0.138187 . (3.56)
The mean-field result (3.50) is different from (3.56), but it is very close. Most of the
time, mean-field theories do not give such good results, they are usually used to gain
merely a qualitative understanding of phase transitions. In the Hopfield model the
mean-field theory works so well because the connections are global: every neuron
is connected with every other neuron. This helps to average out the fluctuations in
Equation (3.14). The most precise Monte-Carlo simulations (Section 4.4) for finite
values of N [20] yield upon extrapolation to N = ∞
This is close to, yet significantly different from the best theoretical estimate, Equation
(3.56), and also different from the mean-field result (3.50).
3.6 Summary
In this Chapter the use of Hopfield networks for pattern recognition was discussed.
Hopfield networks share many properties with the networks discussed later on in
these lectures. The most important point is perhaps that introducing noise in the
dynamics allows to study the convergence and performance of the network: in the
presence of noise there is a well-defined steady state that can be analysed. Without
noise, in the deterministic limit, the network dynamics can get stuck in local minima
of the energy function, and may not reach the stored patterns. Naturally the noise
must be small enough for the network to function reliably. Apart from the noise level
there is a second significant parameter, the storage capacity α, equal to the ratio
FURTHER READING 49
of the number of patterns to the number of bits per pattern. When α is small the
network is reliable. A mean-field analysis of the N → ∞-limit shows that there is a
phase transition in the parameter plane (phase diagram) of the Hopfield network,
Figure 3.6.
The building blocks of Hopfield networks are McCulloch-Pitts neurons and Hebb’s
rule for the weights. These elements are fundamental to the networks discussed in
the coming Chapters.
3.8 Exercises
Mixed states. Write a computer program that implements the stochastic dynamics
of a Hopfield model. Compute how the order parameter for mixed states that are
superpositions of the bits of three stored patterns depends on the noise level for
0.5 ≤ β ≤ 2.5. Compare your numerical results with the predictions in Section 3.3.
Repeat the exercise for mixed states that consist of superpositions of the bits of five
stored patterns. To this end, first derive the equation for the order parameter and
solve this equation numerically. Second, perform your computer simulations and
compare.
Phase diagram of the Hopfield network. Derive Equation (3.49) from Equation
(3.47). Numerically solve (3.49) to find the critical storage capacity αc in the deter-
ministic limit. Quote your result with three-digit accuracy. To determine how the
critical storate capacity depends on the noise level, numerically solve the three cou-
pled Equations (3.46), (3.41b), and (3.45). Compare your result with the schematic
Figure 3.6.
50 STOCHASTIC HOPFIELD NETWORKS
N
X 1
bi = wi j S j , P (bi ) = , (3.59)
j =1
1 + e−2β bi
where β is the noise parameter. The weights wi j are given by Hebb’s rule wi j =
Pp (µ) (µ)
N
1
x x j for i 6= j , and wi i = 0 . Whether the stochastic Hopfield network can
µ=1 i
reliably retrieve stored patterns depends upon
Explain and discuss how each of these factors influences the reliability of retrieval
of stored patterns in the stochastic Hopfield model. In the discussion, refer to and
explain the following terms: “storage capacity", “order parameter", “phase diagram",
“local minima", “attractors", “spurious states". Your answer must not be longer than
one A4 page. (1p).
51
4 Stochastic optimisation
Hopfield networks were introduced as a means of solving the associative mem-
ory problem. In Section 2.3 it was shown that this corresponds to minimizing an
energy function H . In this Chapter we see how this can be used to solve combinato-
rial optimisation problems using neural networks. Such problems admit 2k or k !
configurations - too many to list and check in a serial approach when k is large.
The idea is to write down an energy function that is at most quadratic in the
state variables, like Equation (2.47). Then one can use the Hopfield dynamics to
minimize H . The problem is of course that the deterministic network dynamics
gets stuck in first local minimum encountered, usually not the desired optimum
(Figure 4.1). Therefore it is important to introduce a little bit of noise. As discussed
in Chapter 3, the noise helps the network to escape local minima.
A common strategy is to lower the noise level on the fly. In the beginning of the
simulation the noise level is high, so that the network explores the rough features of
the energy landscape. When the noise level is lowered, the network can see finer
and finer features, and the hope is that it ends up in the global minimum. This
method is called simulated annealing.
Figure 4.1: Spurious local minima in the energy function H . The global minimum
corresponds to one of the stored patterns, x (µ) .
52 STOCHASTIC OPTIMISATION
(a) (b)
city coordinates
A ( 0.1,0.15 )
B ( 0.4 0.2 )
C ( 0.5 0.7 )
D ( 0.2 0.1 )
E ( 0.1 0.8 )
F ( 0.8 0.9 )
G ( 0.9 0.3 )
1 1
1 1
1 1
M=
1
M= 1
1 1
1 1
1 1
Figure 4.2: Traveling-salesman problem for k = 7. Given are the coordinates of k cities
as points in the unit square. The problem is to find the shortest connected path that joins
all cities, visits each city exactly once, and returns to the starting point. (a) best solution,
(b) a second solution with a longer path. Also given are the matrix representations of
these solutions.
of seven cities A,. . . ,F are given in Figure 4.2 (this Figure illustrates the problem for
k = 7). The Figure shows two different solutions. Denoting the distance between
city A and B by d AB and so forth, the length of the path in panel (a) is
L = d AD + d D B + d B G + d G F + d F C + d C E + d E A . (4.1)
The Figure also demonstrates how paths are represented in terms of k × k matrices.
Path (a) corresponds to
1
1
1
M= 1 . (4.2)
1
1
Each row corresponds to a city, and the j -th element in this row has the entry 1
if the city is the j -th stop in the tour. The other entries are 0. Since each city is
visited only once, there can be only one 1 in each row. Since each visit corresponds
to exactly one city, there can be only one 1 in each column. Any permutation of
the elements that satisfies these constraints is an allowed path. There are k ! such
permutations. They are 2k -fold degenerate (there are k paths of the same length
COMBINATORIAL OPTIMISATION PROBLEMS 53
that differ by which city is visited first, and each path can be traveled clockwise or
anti-clockwise). Therefore there are k !/(2k ) possible paths to consider in trying
to determine the shortest one. This makes the problem hard. Note that integer
linear-programming methods for solving the travelling salesman problem usually
use a different representation of the paths [21].
The k -queens problem is derived from the game of chess. The question is how to
arrange k queens on a k × k chess board so that they cannot take each other out.
This means that each row and column as well as each diagonal can have only one
queen. The problem is illustrated in Figure 4.3. This Figure shows one solution
for k = 8. The task is to find all solutions. Each arrangement of queens can be
represented as a matrix M, where one sets M i j = 1 if there is a queen on site (i , j ).
All other elements are set to zero. To represent valid solutions, M must satisfy the
following constraints [22]
k X
X k
Mi j = k , (4.3)
i =1 j =1
The double-digest problem. The Human Genome sequence was first assembled
by piecing together overlapping DNA segments in the right order by making sure
that overlapping segments share the same DNA sequence. To this end it is necessary
to uniquely identify the DNA segments. The actual DNA sequence of a segment is a
unique identifier. But it is sufficient and more efficient to identify a DNA segment
by a fingerprint, for example the sequence of restriction sites. These are short subse-
quences (four or six base pairs long) that are recognised by enzymes that cut (digest)
the DNA strand precisely at these sites. A DNA segment is identified by the types
and locations of restriction sites that it contains, the so-called restriction map.
54 STOCHASTIC OPTIMISATION
When a DNA segment is cut by two different enzymes one can experimentally
determine the lengths of the resulting fragments. Is it possible to determine how
the cuts were ordered in the DNA sequence of the segment from the fragment
lengths? This is the double-digest problem [23]. The order of cut sites is precisely
the restriction map. In a double-digest experiment, a given DNA sequence is first
digested by one enzyme (A say). Assume that this results in n fragments with lengths
a i (i = 1, . . . , n ). Second, the DNA sequence is digested by another enzyme, B . In this
case m fragments are found, with lengths b1 , b2 , . . . , bm . Third, the DNA sequence
is digested with both enzymes A and B , yielding l fragments with lengths c1 , . . . , cl .
The question is now to determine all possible ordering of the a - and b -cuts that
result in l fragments with lengths c1 , c2 , . . . , cl ?
are incorporated using Lagrange multipliers A and B (both positive), so that the
function to be minimised becomes
X X 2 X X 2
H = L + A2 1− M m j + B2 1− Mm j . (4.7)
m j j m
When the constraints (4.6) are satisfied, their contributions to H vanish, otherwise
they are positive. We conclude that H has a global minimum at the desired solution.
If we use a stochastic method to minimise H , it is not guaranteed that the algorithm
SIMULATED ANNEALING 55
finds the global minimum, either because the constraints are not exactly satisfied,
or because the path found is not the shortest one. The magnitude of the Lagrange
multipliers A and B determines how strongly the constraints are enforced, during
the search and in sub-optimal solutions.
The expression (4.7) is a quadratic function of M m j . This suggests that we can
write H as the energy function of a Hopfield model with 0/1 neurons ni j :
X X
H = − 21 wi j k l ni j nk l + µi j ni j + const. (4.8)
i jkl ij
where
1X
bi j = w i j k l n k l − µi j (4.9b)
2 kl
are local fields including thresholds.
Here we take P(b) = 1/(1 + e^{−βb}). This form differs slightly from the definition (3.3). For
β → ∞ the stochastic rule reduces to the deterministic dynamics, where a move is made only if H(n′_{ij}) − H(n_{ij}) ≤ 0.
Now consider large but finite values of β , corresponding to small noise levels. In
this case the energy function may increase in an update, yet down-moves are much
more likely than up-moves. This becomes clear by considering an alternative (yet
equivalent) formulation of the network dynamics:
Prob(n_{ij} → n′_{ij}) = 1/(1 + e^{β ΔH_{ij}})   (4.12a)

with

ΔH_{ij} = H(n′_{ij}) − H(n_{ij}) = −b_{ij} (n′_{ij} − n_{ij}) .   (4.12b)
Equation (4.12) shows that moves with ∆Hi j > 0 are less likely when β is large. To
demonstrate that this scheme is equivalent to stochastic dynamics (4.11) we break
the prescription (4.12) up into different cases. Changes occur with the following
probabilities:
if n_{ij} = 0, obtain n′_{ij} = 1 with prob. 1/(1 + e^{−β b_{ij}}) = P(b_{ij}) ,   (4.13a)

if n_{ij} = 1, obtain n′_{ij} = 0 with prob. 1/(1 + e^{β b_{ij}}) = 1 − P(b_{ij}) .   (4.13b)

In the second row we used that 1 − P(b) = 1 − 1/(1 + e^{−βb}) = e^{−βb}/(1 + e^{−βb}) = 1/(1 + e^{βb}). The state
remains unchanged with the complementary probabilities:

if n_{ij} = 0, keep n_{ij} = 0 with prob. 1/(1 + e^{β b_{ij}}) = 1 − P(b_{ij}) ,   (4.13c)

if n_{ij} = 1, keep n_{ij} = 1 with prob. 1/(1 + e^{−β b_{ij}}) = P(b_{ij}) .   (4.13d)
Comparing with Equation (4.11) we see that the two schemes are equivalent.
How does the network find a solution of the optimisation problem? We let it run
with stochastic dynamics and compute 〈n_{ij}〉. If 〈n_{ij}〉 ≈ 1, we set M_{ij} = 1, otherwise
M_{ij} = 0. When the noise is weak, we expect all 〈n_{ij}〉 to be close to either zero or
one. One strategy is to change the noise level as the simulation proceeds. One starts
with larger noise, so that the network explores first the rough features of the energy
landscape. As the simulation proceeds, one reduces the noise level, so that the
network can learn finer features of the landscape. This is called simulated annealing.
See Section 10.9 in Numerical Recipes [25].
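A minimal sketch of such an annealing loop, using the acceptance rule (4.12). The cooling schedule, the toy energy function, and all parameter values are illustrative choices of ours:

```python
import math
import random

def anneal(H, n, beta0=0.5, factor=1.01, steps=10000, rng=None):
    """Minimise H over 0/1 state vectors with the stochastic rule (4.12):
    a randomly chosen neuron is flipped with probability 1/(1 + exp(beta*dH)).
    Slowly increasing beta (lowering the noise level) implements the annealing."""
    rng = rng or random.Random()
    beta = beta0
    for _ in range(steps):
        i = rng.randrange(len(n))            # local move: flip one neuron
        trial = n.copy()
        trial[i] = 1 - trial[i]
        x = beta * (H(trial) - H(n))
        p = 1.0 / (1.0 + math.exp(x)) if x < 700 else 0.0
        if rng.random() < p:
            n = trial                        # down-moves are more likely than up-moves
        beta *= factor                       # cooling schedule
    return n

# toy energy: lowest when each half of the state contains exactly one 1
def H(n):
    return (1 - sum(n[:3]))**2 + (1 - sum(n[3:]))**2

state = anneal(H, [0, 0, 0, 0, 0, 0], rng=random.Random(0))
```

At large β the loop only accepts down-moves, so the network settles into a minimum of H; whether it is the global one depends on the schedule.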
4.4 Monte-Carlo simulation
n represents the state vector of the system, as in the Hopfield network. The distri-
bution (4.14) plays an important role in the equilibrium statistical mechanics of
systems with energy function (Hamiltonian) H . In this context β −1 = kB T where kB
is the Boltzmann constant and T is the temperature of the system.
For systems with a large number of degrees of freedom, the function Pβ (n ) can be
very expensive to compute, because of the sum over n in the normalisation factor
Z . Is there a way of sampling this distribution without actually evaluating it? The
answer is yes, by constructing a Markov chain [26] of states n that are distributed
according to Pβ (n ), after an initial transient. A Markov chain is a memoryless random
sequence of states defined by transition probabilities pl →k from state n l to n k . The
transition probability pl →k connects arbitrary states, allowing for local moves (as in
the previous Section where only one element of n was changed) or global moves.
The Monte-Carlo algorithm proceeds in two steps. First, given the state n l a new
state n k is suggested, with probability pls→k . Second, n k is accepted with probability
pla→k , so that the transition probability is given by pl →k = pls→k pla→k . These steps
are repeated many times, creating a sequence of states. If the process satisfies the
detailed-balance condition
Pβ (n l )pl →k = Pβ (n k )pk →l , (4.15)
then the Markov chain of states n l has the steady-state distribution Pβ (n ). Usually
this means that the distribution of states generated in this way converges to Pβ (n )
(see Ref. [26] for details).
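The two steps can be sketched as follows. Note that the Metropolis acceptance rate min(1, e^{−βΔH}) used here differs from the rate (4.16) of the network dynamics, but it also satisfies the detailed-balance condition (4.15); the two-state example is ours:

```python
import math
import random
from collections import Counter

def metropolis(H, states, beta, steps, rng):
    """Markov-chain Monte-Carlo sampling of P_beta(n) ~ exp(-beta H(n)).
    Suggestion step: pick a state uniformly (so p^s is symmetric).
    Acceptance step: Metropolis rate min(1, exp(-beta*dH)), which
    satisfies the detailed-balance condition (4.15)."""
    n = states[0]
    visits = Counter()
    for _ in range(steps):
        k = rng.choice(states)                  # suggested state
        dH = H(k) - H(n)
        if rng.random() < min(1.0, math.exp(-beta * dH)):
            n = k                               # accepted
        visits[n] += 1                          # record the current state
    return visits

# two-state example: H = 0 and H = ln 2, so P(0)/P(1) = 2 at beta = 1
H = lambda s: 0.0 if s == 0 else math.log(2.0)
visits = metropolis(H, [0, 1], beta=1.0, steps=50000, rng=random.Random(1))
```

After a transient, the empirical frequencies of the two states approach the ratio 2:1 dictated by the Boltzmann weights.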
Now consider the formulation (4.12) of the network dynamics. The dynamics
is local: a single neuron is picked randomly, so that p^s is a constant, independent of
l or k. The move is then accepted with probability

p^a_{l→k} = 1/(1 + e^{β ΔH}) ,   (4.16)
The fact that the algorithm produces states distributed according to Equation (4.14)
offers a different perspective on the idea of simulated annealing. Slowly lowering
the temperature through the simulation mimics slow cooling of a physical system.
It passes through a sequence of quasi-equilibrium Boltzmann distributions with
lower and lower temperatures, until the system finds the global minimum Hmin of
the energy function at zero temperature, where Pβ (n ) = 0 when H (n ) > Hmin , but
Pβ (n ) > 0 when H (n ) = Hmin .
4.5 Summary
In this Chapter it was shown how Hopfield networks can perform optimisation tasks,
exploring the energy function with stochastic Hopfield dynamics. This approach
is equivalent to the Markov-chain Monte-Carlo algorithm. In simulated annealing
one gradually reduces the noise level as the simulation proceeds. This mimics the
slow cooling of a physical system, an efficient way of bringing the system into its
global optimum.
4.7 Exercises
Travelling-salesman problem. Derive Equation (4.5) for the path length in the
travelling-salesman problem.
Double-digest problem. Check that the scheme of suggesting new states is symmetric. This is necessary
for the algorithm to converge. The solutions of the double-digest problem are
degenerate. Determine the degeneracy of the solutions for the fragment sets shown
in Table 4.1.
L = 10000
a = [5976, 1543, 1319, 1120, 42]
b = [4513, 2823, 2057, 607]
c = [4513, 1543, 1319, 1120, 607, 514, 342, 42]
L = 20000
a = [8479, 4868, 3696, 2646, 169, 142]
b = [11968, 5026, 1081, 1050, 691, 184]
c = [8479, 4167, 2646, 1081, 881, 859, 701, 691, 184, 169, 142]
L = 40000
a = [9979, 9348, 8022, 4020, 2693, 1892, 1714, 1371, 510, 451]
b = [9492, 8453, 7749, 7365, 2292, 2180, 1023, 959, 278, 124, 85]
c = [7042, 5608, 5464, 4371, 3884, 3121, 1901, 1768, 1590, 959, 899, 707, 702, 510, 451, 412,
Table 4.1: Example configurations for the double-digest problem for three different
chromosome lengths L . For each example, three ordered fragment sets are given,
corresponding to the result of digestion with A, with B, and with both A and B.
PART II
SUPERVISED LEARNING
A related type of problem that is very common is the classification task. The
machine-learning repository [31] at the University of California Irvine contains a
large number of such problems. A well-known example is the iris data set. It lists
attributes of 150 iris plants. The data set was described by the geneticist R. A. Fisher
[32]. For each plant four attributes are given (Figure 5.1): its sepal length, sepal width,
petal length, and petal width. Also, each plant is classified into one of three classes:
iris setosa, iris versicolor, or iris virginica. The task is to program a neural network
that determines the class of a plant from its attributes. To each input (attributes of
an iris plant) the network should associate the correct output, the class of the plant.
The correct output is referred to as the target.
In supervised learning one uses a training data set of correct input/output pairs.
One feeds an input from the training data into the input terminals of the network
and compares the states of the output neurons to the target values. The weights and
thresholds are changed to minimise the differences between network outputs and
targets for all input patterns in the training set. In this way the network learns to
associate input patterns in the training set with the correct target values. A crucial
question is whether the trained network can generalise: does it find the correct
targets for input patterns that were not in the training set?
The networks used for supervised learning are called perceptrons [12, 13]. They
consist of layers of McCulloch-Pitts neurons: an input layer, a number of hidden
layers, and an output layer. The layers are usually arranged from the left (input)
to the right (output). All connections are one-way, from neurons in one layer to
neurons in the layer immediately to the right. There are no connections between
neurons in a given layer, or back to layers on the left. This arrangement ensures
convergence of the training algorithm (stochastic gradient descent). During training
with this algorithm the weights are updated iteratively. In each step, an input is
applied and the weights of the network are updated to reduce the error in the output.
In a sense each step corresponds to adding a little bit of Hebb’s rule to the weights.
This is repeated until the network classifies the training set correctly.
Figure 5.1: Left: petals and sepals of the iris flower. Right: six entries of the iris data set
[31]. All lengths in cm. The whole data set contains 150 entries.
5 Perceptrons
Perceptrons [12, 13] are trained by iteratively updating their weights and thresholds.
In the Hopfield networks described in Part I, by contrast, the weights were always
assigned using Hebb’s rule. To motivate the idea of updating weights iteratively,
consider Hebb’s rule, Equation (2.29). We estimated in Section 2.2 how frequently
neurons are erroneously updated because the cross-talk term in Equation (2.31)
changed the sign of the bit in question. To this end we assumed that all bits of
all patterns were independently identically randomly distributed, and we used
the central-limit theorem. For correlated patterns the effect of the cross-talk term
is different from the results calculated in Chapter 3. It has been argued that the
storage capacity increases when the patterns are more strongly correlated, others
have claimed that the capacity decreases in this limit (see Ref. [33] for a discussion).
When we must deal with a definite set of patterns (no randomness to average
over), the situation seems to be even more challenging. Is there a way of modifying
Hebb’s rule to deal with this problem? Yes there is! We simply incorporate the
overlaps
Q_{μν} = (1/N) x^{(μ)} · x^{(ν)}   (5.1)
into Hebb’s rule. To this end, define the p × p overlap matrix Q with elements Qµν .
The modified Hebb’s rule reads:
w_{ij} = (1/N) Σ_{μν} x_i^{(μ)} (Q^{−1})_{μν} x_j^{(ν)} .   (5.2)
For orthogonal patterns (Qµν = δµν ) this rule is identical to Equation (2.28). For
non-orthogonal patterns, the rule (5.2) ensures that all patterns are recognised,
provided that the matrix Q is invertible. In this case one can find the weights w_{ij} from Equation (5.2).
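The rule (5.2) takes only a few lines of numpy; the two stored patterns below are illustrative:

```python
import numpy as np

def modified_hebb(patterns):
    """Weights of Equation (5.2): w = (1/N) X^T Q^{-1} X, where the rows of
    the p x N array X are the stored patterns x^(mu) and Q is the overlap
    matrix (5.1). Requires Q to be invertible (so p <= N)."""
    x = np.asarray(patterns, dtype=float)
    p, N = x.shape
    Q = x @ x.T / N                      # p x p overlap matrix Q_{mu nu}
    return x.T @ np.linalg.inv(Q) @ x / N

# each stored pattern is reproduced: w x^(nu) = x^(nu), so sgn(w x) = x
x = np.array([[1, 1, -1, -1, 1], [1, -1, 1, -1, -1]])
w = modified_hebb(x)
ok = np.array_equal(np.sign(w @ x.T).T, x)
```

Since w X^T = X^T Q^{−1} Q = X^T, every stored pattern is mapped exactly onto itself, even though the patterns are not orthogonal.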
The index µ labels the different input patterns in the training set. It ranges from 1 to
p . All neurons are McCulloch-Pitts neurons. The output neurons in the network on
the left of Figure 5.2, for example, perform the computation:
O_i = g(B_i) with B_i = Σ_j W_{ij} x_j − Θ_i .   (5.4)
The index i labels the output neurons; it ranges from 1 to M . Each output neuron
has a threshold, Θi . In the literature on deep learning the thresholds are sometimes
referred to as biases, defined as −Θi . The function g is an activation function as
Figure 5.2: Feed-forward network without hidden layer (left), and with one hidden
layer (right). The input terminals are coloured black.
Figure 5.3: Left: classification problem with two-dimensional real-valued inputs and
target values equal to ±1. The red line is the decision boundary (see text). Right:
corresponding perceptron.
described in Section 1.2. Now consider the network on the right of Figure 5.2. The
states of the neurons in the hidden layer are denoted by V j , with thresholds θ j and
weights w j k . In summary:
V_j = θ_H(b_j) with b_j = Σ_k w_{jk} x_k − θ_j ,   (5.5a)

O_i = θ_H(B_i) with B_i = Σ_j W_{ij} V_j − Θ_i .   (5.5b)
A classification task is given by a training set of input patterns x (µ) and corresponding
target values

t^{(μ)} = (t_1^{(μ)}, t_2^{(μ)}, . . . , t_M^{(μ)})^T .   (5.6)
The perceptron is trained by choosing its weights and thresholds so that the network
produces the desired output:

O_i^{(μ)} = t_i^{(μ)} for all i and μ .   (5.7)
Remark: if we take t_i^{(μ)} = x_i^{(μ)} for i = 1, . . . , N, the task is the associative memory
problem discussed in Part I.
Figure 5.4: The perceptron classifies the patterns correctly for the weight vector w
shown, orthogonal to the decision boundary.
x^{(μ)} = (x_1^{(μ)}, x_2^{(μ)})^T .   (5.8)
In Figure 5.3 the patterns are drawn as points in the x1 -x2 plane, the input plane.
There are two classes of patterns, with targets ±1.
The activation function consistent with the possible target values is the signum
function, g (b ) = sgn(b ). The perceptron has two input terminals connected to a
single output neuron. Since there is only one neuron, we can arrange the weights
into a weight vector
w = (w_1, w_2)^T .   (5.10)
The network performs the computation
O = sgn(w · x ) . (5.12)
Figure 5.5: Decision boundaries without and with threshold.
Therefore the threshold determines the intersection of the decision boundary with
the x2 -axis (equal to θ /w2 ). This is illustrated in Figure 5.5.
The decision boundary – the straight line orthogonal to w – should divide inputs
with positive and negative targets. If such a line exists, the problem is linearly
separable and can be solved by a single neuron. If no such line can be found, the
problem is not linearly separable; this can occur only when p > N . Examples of
problems that are linearly separable and not linearly separable are shown in Figure 5.6.
x1  x2 | t
 0   0 | −1
 0   1 | −1
 1   0 | −1
 1   1 | +1

Decision boundary: w · x^{(μ)} = θ with weights w1 = w2 = 1 and threshold θ = 3/2.
Figure 5.7: Boolean AND function: value table, geometrical representation, and network layout. The weight values are written next to the connections.
Other examples are Boolean functions. A Boolean function takes N binary inputs
and has one binary output. The Boolean AND function (two inputs) is illustrated
in Figure 5.7. The value table of the function is shown on the left, and the graphical
representation in the centre of the Figure, with different symbols for t = −1 and
t = +1. Also shown are the decision boundary, the weight vector w , and the
network layout with the corresponding values of the weights and the threshold. Note
that the decision boundary is not unique, and neither are the weight and threshold
values that solve the problem: the norm of the weight vector is arbitrary, and its
direction is not uniquely specified either.
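The AND network of Figure 5.7 is easily coded up; the function names below are ours:

```python
def mcculloch_pitts(x, w, theta):
    """A single McCulloch-Pitts neuron: O = sgn(w . x - theta),
    with the convention sgn(b) = -1 for b <= 0."""
    b = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if b > 0 else -1

# Boolean AND with w = (1, 1) and threshold theta = 3/2
AND = [mcculloch_pitts(x, (1, 1), 1.5) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# No choice of w and theta yields the XOR targets (-1, +1, +1, -1):
# no straight line separates the two XOR classes (Figure 5.8).
```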
Figure 5.8 shows that the Boolean XOR function is not linearly separable [34].
There are 16 different Boolean functions of two variables. Only two are not linearly
separable, the XOR and the NOT XOR function.
Up to now we discussed only one output unit. If the classification task requires
several output units, each has its own weight vector w i and threshold θi . We can
group the weight vectors into a weight matrix as in Part I, so that the w i are the rows
of W.
x1  x2 | t
 0   0 | −1
 0   1 | +1
 1   0 | +1
 1   1 | −1
Figure 5.8: The Boolean XOR function is not linearly separable.
Figure 5.9: Illustration of the learning algorithm. In panel (a) the t = 1 pattern x (8) is on
the wrong side of the decision boundary. To correct the error the weight must be rotated
anti-clockwise [panel (b)]. In panel (c) the t = −1 pattern x (4) is on the wrong side of
the decision boundary. To correct the error the weight must be rotated anti-clockwise
[panel (d)].
The parameter η > 0 is called the learning rate. It must be small, so that the decision
boundary is not rotated too far. The result is shown in panel (b). Panel (c) shows
another case, where pattern x (4) (t (4) = −1) is on the wrong side of the decision
boundary. In order to turn the decision boundary in the right way, anti-clockwise,
one subtracts a small multiple of x (4) :
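In essence the rule adds η t^{(μ)} x^{(μ)} to the weight vector whenever pattern μ is misclassified. A sketch (data set, learning rate, and initial weights are illustrative choices of ours):

```python
def train_perceptron(patterns, targets, eta=0.05, epochs=100):
    """Iterative learning rule: for each misclassified pattern, add
    eta * t * x to the weight vector. For t = +1 this rotates w towards x
    (anti-clockwise in Figure 5.9), for t = -1 away from it."""
    w = [0.1, 0.0]                        # initial guess for the weight vector
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            O = 1 if w[0] * x[0] + w[1] * x[1] > 0 else -1
            if O != t:                    # only misclassified patterns rotate w
                w = [w[0] + eta * t * x[0], w[1] + eta * t * x[1]]
    return w

# a linearly separable toy problem (decision boundary through the origin)
xs = [(2.0, 1.0), (-1.0, 2.0), (1.0, -2.0), (-2.0, -1.0)]
ts = [1, 1, -1, -1]
w = train_perceptron(xs, ts)
```

Because the data is linearly separable, the loop stops changing w once every pattern lies on the correct side of the decision boundary.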
as you can verify by inserting Equation (5.23) into (5.21). Here Q is the overlap matrix
with elements
Q_{μν} = (1/N) x^{(μ)} · x^{(ν)}   (5.24)
(page 63). For the solution (5.23) to exist, the matrix Q must be invertible. This
requires that p ≤ N , because otherwise the pattern vectors are linearly dependent,
and thus also the columns (or rows) of Q. If the matrix Q has linearly dependent
columns or rows it cannot be inverted.
Let us assume that the patterns are linearly independent, so that the solution
(5.23) exists. In this case we can find the solution iteratively. To this end one defines
the energy function
H({w_{ij}}) = (1/2) Σ_{iμ} (t_i^{(μ)} − O_i^{(μ)})² = (1/2) Σ_{iμ} (t_i^{(μ)} − Σ_j w_{ij} x_j^{(μ)})² .   (5.25)
The weights are updated iteratively:

w′_{mn} = w_{mn} + δw_{mn} with δw_{mn} = −η ∂H/∂w_{mn} .   (5.26)
The small parameter η > 0 is the learning rate. The negative gradient points in the
direction of steepest descent of H . The idea is to take many downhill steps until
one hopefully (but not necessarily) reaches the global minimum.
To evaluate the derivatives one uses the chain rule together with
∂w_{ij}/∂w_{mn} = δ_{im} δ_{jn} .   (5.27)
This learning rule is very similar to Equation (5.20). The difference is that Equa-
tion (5.28) contains a sum over all patterns (batch training). An advantage of the
rule (5.28) is that it is derived from an energy function. This allows one to analyse the
convergence of the algorithm.
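Equation (5.28) is not reproduced above; for a single linear unit the batch rule derived from (5.25)-(5.27) amounts to δw_n = η Σ_μ (t^{(μ)} − O^{(μ)}) x_n^{(μ)}. A sketch with illustrative toy data:

```python
import numpy as np

# Gradient descent on the energy function (5.25) for a single linear unit,
# O^(mu) = sum_j w_j x_j^(mu), with the batch rule summed over all patterns.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # p = 4 patterns with N = 3 components
w_true = np.array([1.0, -2.0, 0.5])
t = x @ w_true                       # realisable targets

def energy(w):
    return 0.5 * np.sum((t - x @ w) ** 2)   # Equation (5.25)

w = np.zeros(3)
eta = 0.01                           # learning rate
H0 = energy(w)
for _ in range(200):
    O = x @ w
    w = w + eta * (t - O) @ x        # batch increment, summed over patterns
H1 = energy(w)
```

Each step moves downhill on H, so the energy decreases monotonically for a sufficiently small learning rate.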
Linear units [Equation (5.21)] are special. You cannot solve the Boolean AND
problem (Figure 5.7) with a linear unit – although the problem is linearly separable
– because the pattern vectors x (µ) are linearly dependent. This means that the
solution (5.23) does not exist. Shifting the patterns or introducing a threshold does
not change this fact. Linear separability does not imply linear independence (but
the converse is true).
(a) Linearly separable problem. (b) Problem not linearly separable.
Figure 5.10: Problems that are not linearly separable can be solved by a piecewise
linear decision boundary.
Here θH (b ) is the Heaviside function. Each of the three neurons in the hidden layer
has its own decision boundary. The idea is to choose weights and thresholds in such
a way that the three decision boundaries divide the input plane into distinct regions,
threshold 1/2:

O1 = sgn(V1 − V2 − 1/2) .   (5.31)
Minsky and Papert [34] proved in 1969 that all Boolean functions can be rep-
resented by multilayer perceptrons, but that at least one hidden neuron must be
connected to all input terminals. This means that not all neurons in the network are
locally connected (have only a few incoming weights). Since fully connected networks are much harder to train, Minsky and Papert offered a somewhat pessimistic view of perceptrons.
Figure 5.11: Hidden-layer perceptron to solve the problem shown in Figure 5.10 (b).
The three hidden neurons are 0/1 neurons, the output neuron produces ±1.
V1  V2  V3 | target
 0   0   0 |  −1
 0   1   0 |  −1
 0   0   1 |  −1
 0   1   1 |  +1
 1   1   0 |  +1
 1   0   1 |  +1
 1   1   1 |  +1
 1   0   0 |  (region does not exist)
Figure 5.12: Left: decision boundaries and regions. Right: encoding of the regions and
corresponding targets. The region 100 does not exist.
Figure 5.13: Graphical representation of the output problem for the classification
problem shown in Figure 5.12.
V1  V2 | t
 0   0 | −1
 1   0 | +1
 0   1 | +1
 1   1 | −1

The network layout (right panel of the figure) has hidden thresholds 0.5 and 1.5, output weights +1 and −1, and output threshold 0.5.
Figure 5.14: Boolean XOR function: value table, geometrical representation, and network layout. The two hidden neurons are 0/1 neurons, the output produces ±1.
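The XOR network of Figure 5.14 can be coded directly. The particular hidden weights w_{jk} = 1 below are our choice; they are consistent with the thresholds 0.5 and 1.5 appearing in the figure:

```python
def theta_H(b):
    """Heaviside step function (0/1 neurons)."""
    return 1 if b > 0 else 0

def xor_network(x1, x2):
    """Two-layer solution of XOR: the first hidden neuron computes OR
    (threshold 0.5), the second AND (threshold 1.5), and the output is
    O = sgn(V1 - V2 - 0.5)."""
    V1 = theta_H(x1 + x2 - 0.5)      # fires if at least one input is 1
    V2 = theta_H(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return 1 if V1 - V2 - 0.5 > 0 else -1
```

The hidden layer maps the four input patterns into V-space, where a single straight decision boundary suffices.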
Figure 5.15: (a) Result of training the network on a training set. (b) Validation by
feeding the patterns of a validation set.
network is applied to patterns in a validation set. We see that most patterns are
correctly classified, save for one error. This means that the energy function (5.25)
is not exactly zero for the validation set. Nevertheless, the network does quite a
good job. Usually it is not a good idea to try to precisely classify all patterns near the
decision boundary, because real-world data sets are subject to noise. It is a futile
effort to try to learn and predict noise.
5.5 Summary
Perceptrons are layered feed-forward networks that can learn to classify data in
a training set (x (µ) , t (µ) ). For each input pattern x (µ) the network finds the correct
targets t (µ) . We discussed the learning algorithm for a simple example: real-valued
patterns with just two components, and one binary target. This allowed us to
represent the classification problem graphically. There are three different ways of
understanding how the perceptron learns. First, geometrically, to learn means to
move the decision boundaries into the right places. Second, this can be achieved by
repeatedly adding a little bit of Hebb’s rule. Third, this algorithm corresponds to
gradient descent on the energy function (5.25).
Figure 5.16: Alternative solution of the classification problem shown in Figure 5.12.
5.7 Exercises
Non-orthogonal patterns. Show that the rule (5.2) ensures that all patterns are
recognised, for any set of non-orthogonal patterns that gives rise to an invertible
matrix Q. Demonstrate this by showing that the cross-talk term evaluates to zero,
assuming that Q−1 exists.
Piecewise linear decision boundary. Find an alternative solution for the classifica-
tion problem shown in Figure 5.12, where the weight vectors are chosen as depicted
in Figure 5.16.
(b) Now consider Boolean functions with three-dimensional inputs. How many of
the 256 Boolean functions with three-dimensional inputs can be solved by a per-
ceptron with three input terminals and one output unit (no hidden layer)? Describe
how you arrive at the answer. Hint: visualise the functions by colouring the corners
of a cube. Use symmetries to reduce the number of cases. (1.5p).
x1 x2 t
0 0 −1
0 1 +1
1 0 +1
1 1 −1
Table 5.1: Value table for the XOR problem. Question 5.8.1.
V_j^{(μ)} = 1 if −θ_j + Σ_k w_{jk} x_k^{(μ)} > 0 ,
V_j^{(μ)} = 0 if −θ_j + Σ_k w_{jk} x_k^{(μ)} ≤ 0 ,   (5.32)

where the weights and thresholds are given by w_{jk} = x_k^{(j)} and θ_j = 2. The output
is given by O^{(μ)} = Σ_j W_j V_j^{(μ)} .
Figure 5.17: The three-dimensional parity problem. A white ball indicates O (µ) = −1,
and a black ball indicates O (µ) = +1. Question 5.8.2.
Figure 5.18: Classification problem. Input space is the x1 -x2 -plane. Question 5.8.3.
Here w j k and Wj are weights, θ j and Θ are thresholds, and θH is the Heaviside
function:
θ_H(b) = 1 for b > 0 ,
θ_H(b) = 0 for b ≤ 0 .   (5.35)
Find weights and thresholds that solve the classification problem. (1 p).
where w j k and θ j are weights and thresholds of the hidden neurons. The output is
given by
O^{(μ)} = 0 if −Θ + Σ_j W_j V_j^{(μ)} ≤ 0 ,
O^{(μ)} = 1 if −Θ + Σ_j W_j V_j^{(μ)} > 0 .   (5.37)
W1 = W3 = W4 = 1 , W2 = −1 , and Θ = 1/2 .   (5.38)
(a) Figure 5.19(left) shows how input space is mapped to the hidden neurons.
Draw the decision boundary of the network, given the weights and thresholds in
Equation (5.38). (0.5p).
(b) Show that one cannot map the input space to the space of hidden neurons as in
Figure 5.19(right). (0.5 p).
(c) Give values of wi j and θi that yield the pattern in Figure 5.19(left). (1 p).
Figure 5.19: Left: input space with decision boundaries of the hidden neurons V j
(black lines). These decision boundaries divide input space into nine zones, each with a
certain coordinate V = (V1 , V2 , V3 , V4 )T in the space of the hidden neurons. Right: same,
but here the indicated mapping to the space of the hidden neurons is not possible.
Question 5.8.4.
(a) Show that this problem is not linearly separable by constructing a counter-
example using four input patterns. (0.5p).
(b) The problem can be solved by a perceptron with one hidden layer with three
neurons ( j = 1 , 2 , 3)
V_j^{(μ)} = sgn(−θ_j + Σ_{k=1}^{2} w_{jk} x_k^{(μ)})   (5.39)

and output

O^{(μ)} = sgn(−Θ + Σ_{j=1}^{3} W_j V_j^{(μ)}) .   (5.40)
Here w j k and Wj are weights and θ j and Θ are thresholds. The weights w j k are
subject to the following constraints. First, the three weights w j 1 are all equal to
one, w11 = w21 = w31 = 1. Second, the three weights w j 2 are such that x (µ) = (-4,-1)T
maps to V (µ) = (1,-1,-1)T and x (µ) = (-1,5)T maps to V (µ) = (-1,-1,1)T . Given these
constraints, find values of w j k , Wj , θ j and Θ that solve the classification problem.
Hint: The constraints uniquely determine the hidden thresholds, the orientations
of the hidden weight vectors, and their order in the weight matrix. (1p).
Figure 5.20: Classification problem. Input space is the x1 -x2 -plane. Question 5.8.5.
6 Stochastic gradient descent
O^{(μ)} = (O_1^{(μ)}, O_2^{(μ)}, . . . , O_M^{(μ)})^T ,   (6.1)
to be matched to the targets t (µ) . The activation functions must be differentiable (or
at least piecewise differentiable), but apart from that there is no need to specify them
further at this point. The network shown in Figure 6.1 performs the computation
O_i^{(μ)} = g( Σ_j W_{ij} g( Σ_k w_{jk} x_k^{(μ)} − θ_j ) − Θ_i ) ,   (6.3)

where the argument of the inner activation function defines the hidden states V_j^{(μ)} = g(Σ_k w_{jk} x_k^{(μ)} − θ_j).
This is a consequence of the network layout of the perceptron: all incoming connections to a given neuron are from the layer immediately to the left, all outgoing
connections to the layer immediately to the right. The more hidden layers a network
has, the deeper is the nesting of the activation functions.
Figure 6.1: Neural network with one hidden layer. Illustrates the notation used in
Section 6.1.
the hidden layer do not depend on Wmn . The neurons V j do not have incoming
connections with these weights, a consequence of the feed-forward layout of the
network. In summary we obtain for the increments of the weights connecting to
the output layer:
δW_{mn} = −η ∂H/∂W_{mn} = η Σ_{μ=1}^{p} (t_m^{(μ)} − O_m^{(μ)}) g′(B_m^{(μ)}) V_n^{(μ)} ,   (6.7)

where the output errors are defined as Δ_m^{(μ)} ≡ (t_m^{(μ)} − O_m^{(μ)}) g′(B_m^{(μ)}).
The quantities δ_m^{(μ)} are errors associated with the hidden layer (they vanish when
the output errors Δ_i^{(μ)} are zero). Equation (6.9) shows that the errors are determined
recursively, in terms of the errors in the layer to the right:

δ_m^{(μ)} = Σ_i Δ_i^{(μ)} W_{im} g′(b_m^{(μ)}) .   (6.10)
In other words, the error δ_m^{(μ)} for the hidden layer is computed in terms of the output
errors Δ_i^{(μ)}. Equations (6.7) and (6.9) show that the weight increments have the same
form:

δW_{mn} = η Σ_{μ=1}^{p} Δ_m^{(μ)} V_n^{(μ)} and δw_{mn} = η Σ_{μ=1}^{p} δ_m^{(μ)} x_n^{(μ)} .   (6.11)
Figure 6.2: Backpropagation algorithm: the states of the neurons are updated forward
(from left to right) while errors are updated backward (right to left).
The rule (6.11) is sometimes referred to as the δ-rule. It is local: the increments of
the weights feeding into a certain layer are determined by the errors associated with
that layer, and by the states of the neurons in the layer immediately to the left.
If the network has more hidden layers, then their errors are computed recursively
using Equation (6.10), and the formula for the weight increments has the same form
as Equation (6.11) (Algorithm 2). Figure 6.2 illustrates the different ways in which
neurons and errors are updated. The feed-forward structure of the layered network
means that the neurons are updated from left to right (blue arrows). Equation (6.10),
by contrast, implies that the errors are updated from the right to the left (red arrows),
from the output layer to the hidden layer. The term backpropagation refers to this
difference: the neurons are updated forward, the errors are updated backward.
The thresholds are updated in a similar way:
δΘ_m = −η ∂H/∂Θ_m = η Σ_μ (t_m^{(μ)} − O_m^{(μ)}) [−g′(B_m^{(μ)})] = −η Σ_μ Δ_m^{(μ)} ,   (6.12a)

δθ_m = −η ∂H/∂θ_m = η Σ_μ Σ_i Δ_i^{(μ)} W_{im} [−g′(b_m^{(μ)})] = −η Σ_μ δ_m^{(μ)} .   (6.12b)
The general form for the threshold increments looks like Equation (6.11),

δΘ_m = −η Σ_μ Δ_m^{(μ)} and δθ_m = −η Σ_μ δ_m^{(μ)} ,   (6.13)
but without the state variables of the neurons (or the inputs), as expected. A way
to remember the difference between Equations (6.11) and (6.13) is to note that the
formula for the threshold increments looks like the one for the weight increments if
one sets the values of the neurons to −1.
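The update rules (6.7) and (6.10)-(6.13) can be collected into a single training step for one hidden layer with g = tanh. This sketch is ours; network sizes, data, and initialisation are illustrative:

```python
import numpy as np

def backprop_step(x, t, w, theta, W, Theta, eta):
    """One pattern of sequential training for a one-hidden-layer network
    with g = tanh: forward pass, backward pass, weight/threshold updates
    following Equations (6.7) and (6.10)-(6.13)."""
    b = w @ x - theta                  # local fields of the hidden layer
    V = np.tanh(b)
    B = W @ V - Theta                  # local fields of the output layer
    O = np.tanh(B)
    Delta = (t - O) * (1 - O**2)       # output errors, g'(B) = 1 - tanh^2(B)
    delta = (Delta @ W) * (1 - V**2)   # hidden errors, backpropagated (6.10)
    W += eta * np.outer(Delta, V)      # weight increments (6.11)
    Theta += -eta * Delta              # threshold increments (6.13)
    w += eta * np.outer(delta, x)
    theta += -eta * delta
    return 0.5 * np.sum((t - O)**2)    # energy of this pattern before the update
```

One would call backprop_step repeatedly, for randomly chosen patterns (sequential training), or sum the increments over a mini batch before applying them.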
The backpropagation rules (6.7), (6.9), and (6.12) contain sums over patterns. This
corresponds to feeding all patterns at the same time to compute the increments
of weights and thresholds (batch training). Alternatively one may choose a single
pattern, update the weights by backpropagation, and then continue to iterate these
training steps many times. This is called sequential training. One iteration corresponds to feeding a single pattern; p iterations are called one epoch (in batch
training, one iteration corresponds to one epoch). If one chooses the patterns randomly, then sequential training results in stochastic gradient descent. Since the sum
over patterns is absent, the steps do not necessarily point downhill; their directions
fluctuate. This yields a stochastic path through weight and threshold space, less
prone to getting stuck in local minima (Chapters 3 and 7).
The stochastic gradient-descent algorithm is summarised in Section 6.2. It ap-
plies to networks with feed-forward layout, where neurons in a given layer take input
only from the neurons in the layer immediately to the left.
Mini batches
δW_{mn} = η Σ_{μ=1}^{m_B} Δ_m^{(μ)} V_n^{(μ)} and δΘ_m = −η Σ_{μ=1}^{m_B} Δ_m^{(μ)} ,   (6.14)

δw_{mn} = η Σ_{μ=1}^{m_B} δ_m^{(μ)} x_n^{(μ)} and δθ_m = −η Σ_{μ=1}^{m_B} δ_m^{(μ)} .
Sometimes the mini-batch rule is quoted with prefactors of m_B^{−1} before the sums.
This does not make any fundamental difference; the factors m_B^{−1} can simply be absorbed in the learning rate. But when you compare learning rates for different
implementations, it is important to check whether or not there are factors of m_B^{−1} in
front of the sums in Equation (6.14).
How does one assign inputs to mini batches? This is discussed in Section 6.3.1: at
the beginning of each epoch, one should randomly shuffle the sequence of the input
patterns in the training set. Then the first mini batch contains patterns µ = 1, . . . , m B ,
and so forth.
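The shuffling-and-slicing scheme just described can be sketched as follows (function name and toy data are ours):

```python
import random

def minibatches(patterns, m_B, rng):
    """Shuffle the training set at the start of an epoch, then slice it
    into consecutive mini batches of size m_B (the last batch may be
    smaller if p is not a multiple of m_B)."""
    order = list(range(len(patterns)))
    rng.shuffle(order)                      # random sequence of pattern indices
    for start in range(0, len(order), m_B):
        yield [patterns[mu] for mu in order[start:start + m_B]]

batches = list(minibatches(list(range(10)), 3, random.Random(0)))
```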
Figure 6.3: Saturation of the activation functions (6.15): the derivative g 0 (b ) tends to
zero for large values of |b |.
Activation functions
Common choices for g(b) are the sigmoid function or tanh:

g(b) = 1/(1 + e^{−b}) ≡ σ(b) ,   (6.15a)

g(b) = tanh(b) .   (6.15b)

Their derivatives can be expressed in terms of the function itself:

dσ(b)/db = σ(b)[1 − σ(b)] ,   (6.16a)

d tanh(b)/db = 1 − tanh²(b) .   (6.16b)
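The identities (6.16) can be checked numerically against a finite-difference derivative; a small sketch (function names are ours):

```python
import math

def sigmoid(b):
    """The sigmoid function sigma(b) of Equation (6.15a)."""
    return 1.0 / (1.0 + math.exp(-b))

def numerical_derivative(g, b, h=1e-6):
    """Central finite difference, accurate to O(h^2)."""
    return (g(b + h) - g(b - h)) / (2 * h)

b = 0.7
sig_identity = sigmoid(b) * (1 - sigmoid(b))   # right-hand side of (6.16a)
tanh_identity = 1 - math.tanh(b) ** 2          # right-hand side of (6.16b)
```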
Other activation functions are discussed in Chapter 7.
As illustrated in Figure 6.3, the activation functions (6.15) saturate at large values
of |b |, so that the derivative g 0 (b ) tends to zero. Since the backpropagation rules
(6.7), (6.9), and (6.12) contain factors of g 0 (b ), this implies that the algorithm slows
down. It is a good idea to monitor the values of the local fields during training, to
check that they do not become too large.
Training
The training continues until the global minimum of H has been reached, or until H
is deemed sufficiently small. The resulting weights and thresholds are not unique.
88 STOCHASTIC GRADIENT DESCENT
In Figure 5.14 all weights for the Boolean XOR function are equal to ±1. But the
training algorithm (6.7), (6.9), and (6.12) corresponds to repeatedly adding weight
increments. This may cause the weights to grow.
This is illustrated in Figure 6.4. The algorithm is summarised below (Algorithm 2).
x1   x2 | t
100  101 | −1
100   99 | +1
Figure 6.5: Illustrates the effect of non-zero input mean upon the energy function, for one output neuron with tanh activation function and two input terminals. The graph plots the contours of H for θ = 0, for the training set shown on the left. The plot illustrates that H is close to zero only at the bottom of a very narrow trough (hatched region) with steep sides.
There are several reasons for this. The first one is illustrated in Figure 6.5. The Figure shows the energy function for a single output neuron with tanh activation function and two input terminals. The classification problem is given in the Table. The input
data has large mean values in both components, x1 and x2 . Since it is difficult to
visualise the dependence of H on both weights and threshold, the graph on the
right shows how the energy function H depends on the weights for zero threshold.
The large mean values of the inputs cause steep cliffs in the energy function that are difficult to navigate by gradient descent. Different input-data variances in different directions have a similar effect. Therefore one usually scales the inputs so
that the input-data distribution has the same variance in all directions (Figure 6.6),
equal to unity for instance:
σ_k² = (1/p) Σ_{µ=1}^p ( x_k^(µ) − 〈x_k〉 )² = 1 .   (6.18)
RECIPES FOR IMPROVING THE PERFORMANCE 91
x2 x2 x2
shift scale
x1 x1 x1
Figure 6.6: Shift and scale the input data to achieve zero mean and unit variance.
Second, to avoid saturation of the neurons connected to the inputs, their local
fields must not be too large. If one initialises the weights in the above example to
Gaussian random numbers with mean zero and unit variance, large activations are
quite likely.
Third, enforcing zero input mean by shifting the input data avoids that the weights of the neurons in the first hidden layer must decrease or increase together [36]. Equation (6.14) shows that the components of the weight increment δw_m for neuron m are likely to have the same signs if the inputs have large mean values. Then the weights can only grow or shrink together, which makes it difficult for the network to learn to differentiate.
In summary, one usually shifts and scales the input-data distribution so that it
has mean zero and unit variance. This is illustrated in Figure 6.6. The same trans-
formation (using the mean values and scaling factors determined for the training
set) should be applied to any new data set that the network is supposed to classify
after it has been trained on the training set.
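The shift-and-scale preprocessing can be sketched as follows; this is our own minimal illustration, not code from the text. The training-set means and standard deviations are stored and reused for any new data:

```python
import statistics

def fit_shift_scale(patterns):
    """Determine per-component mean and standard deviation from the training set."""
    dims = range(len(patterns[0]))
    means = [statistics.fmean(x[k] for x in patterns) for k in dims]
    stds = [statistics.pstdev([x[k] for x in patterns]) for k in dims]
    return means, stds

def apply_shift_scale(patterns, means, stds):
    """Shift to zero mean and scale to unit variance, Eq. (6.18)."""
    return [[(x[k] - means[k]) / stds[k] for k in range(len(x))] for x in patterns]

# Toy training set with large mean values, as in the Table for Figure 6.5.
train = [[100.0, 101.0], [100.0, 99.0], [102.0, 100.0], [98.0, 100.0]]
means, stds = fit_shift_scale(train)
train_std = apply_shift_scale(train, means, stds)
# New data must be transformed with the *training-set* means and scales.
new_data = apply_shift_scale([[101.0, 100.0]], means, stds)
```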
Figure 6.7 shows a distribution of inputs that falls into two distinct clusters. The
difference between the clusters is sometimes called covariate shift, here covariate is
just another term for input. Imagine feeding first just inputs from one of the clusters
to the network. It will learn local properties of the decision boundary, instead of
its global features. Such global properties are efficiently learned if the network is
more frequently confronted with unfamiliar data. For sequential training (stochastic
gradient descent) this is not a problem, because the sequence of input patterns
presented to the network is random. However, if one trains with mini batches, the
Figure 6.7: When the input data falls into clusters as shown in this Figure, one should
randomly pick data from either cluster, to avoid that patterns become too familiar. The
decision boundary is shown in red.
Figure 6.8: the input patterns fall onto a line with slope ½ through the origin, the principal direction u₁; u₂ denotes the orthogonal direction in the x₁-x₂ plane.
mini batches should contain randomly chosen patterns in order to avoid covariate
shifts. To this end one randomly shuffles the sequence of the input patterns in the
training set, at the beginning of each epoch.
It is sometimes recommended [36] to observe the output errors during training. If the errors are similar for a number of subsequent learning steps, the corresponding inputs appear familiar to the network. Larger errors correspond to unfamiliar inputs, and Ref. [36] suggests feeding such inputs more often.
Often the input data is very high dimensional, requiring many input terminals.
This usually means that there are many neurons in the hidden layers, and the large
number of neurons makes the training computationally very expensive. To avoid this
as far as possible, one can reduce the dimensionality of the input data by principal-component analysis. This method allows one to project high-dimensional data onto a lower-dimensional subspace. How this can work is illustrated for a simple example in Figure 6.8. You see that all data points fall onto a one-dimensional subspace, the solid line with slope ½ (the principal direction). The coordinate orthogonal to the principal direction is not useful in classifying the data, for the example shown. This
coordinate can be removed in the following way. One uses the fact that the principal
direction points in the direction of the leading eigenvector of the data-covariance
matrix C, that is to the eigenvector with the largest eigenvalue. The data-covariance
matrix has elements
C_ij = (1/p) Σ_{µ=1}^p ( x_i^(µ) − 〈x_i〉 )( x_j^(µ) − 〈x_j〉 )   with   〈x_i〉 = (1/p) Σ_{µ=1}^p x_i^(µ) .   (6.19)
For the example shown in Figure 6.8, the data-covariance matrix reads
C = (1/4) [ 10    5
             5  5/2 ] .   (6.20)
Figure 6.9: Overfitting. Left: accurate representation of the decision boundary in the
training set, for a network with 15 neurons in the hidden layer. Right: this new data
set differs from the first one just by a little bit of noise. The points in the vicinity of the
decision boundary are not correctly classified.
We see that the leading eigenvector u₁ defines the principal direction. Figure 6.8 is an extreme example. Usually there is noise, causing the data to scatter around the principal direction. This does not change much. The result is that the smaller
eigenvalue is no longer equal to zero, but still small if the data does not scatter
too much about the principal direction. When there are many dimensions, we
inspect the ordered sequence of eigenvalues. Often there is a gap between the small
eigenvalues (close to zero), and larger ones. Then one can safely throw away the
small eigenvalues. If there is no gap, it is less clear what to do.
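For the 2×2 covariance matrix of Equation (6.20), the eigenvalues and the principal direction can be computed in closed form. A minimal sketch (the helper `eig2x2_sym` is ours):

```python
import math

def eig2x2_sym(a, b, d):
    """Eigenvalues and leading eigenvector of the symmetric matrix [[a, b], [b, d]]."""
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr / 4 - det)
    lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # lam1 >= lam2
    # Eigenvector for lam1 solves (a - lam1) v1 + b v2 = 0.
    v = (b, lam1 - a) if abs(b) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(*v)
    return (lam1, lam2), (v[0] / norm, v[1] / norm)

# Covariance matrix from Eq. (6.20): C = (1/4) [[10, 5], [5, 5/2]].
(lam1, lam2), u1 = eig2x2_sym(10 / 4, 5 / 4, 5 / 8)
# The smaller eigenvalue vanishes, and the leading eigenvector has
# slope u1[1]/u1[0] = 1/2: the principal direction of Figure 6.8.
```

Projecting the data onto u₁ then removes the uninformative orthogonal coordinate.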
6.3.2 Overfitting
The goal of supervised learning is to generalise from a training set to new data. Only
general properties of the training set can be generalised, not specific ones that are
particular to the training set and that could be very different in new data. A network
with more neurons may classify the training data better, because it accurately repre-
sents all specific features of the data. But those specific properties could look quite
different in new data (Figure 6.9). As a consequence, we must look for a compromise between accurate classification of the training set and the ability of the network to generalise. The problem illustrated in Figure 6.9 is also referred to as overfitting: the network fits overly fine details (for instance noise in the training set) that have no general meaning. The tendency to overfit is larger for networks with more neurons.
One way of avoiding overfitting is to use cross validation and early stopping. One
splits the training data into two sets: a training set and a validation set. The idea
is that these sets share the general features to be learnt. But although training and
validation data are drawn from the same distribution, they differ in details that
are not of interest. The network is trained on the training set. During training
one monitors not only the energy function for the training set, but also the energy
Figure 6.10: Progress of training and validation errors. The plot is schematic, and the data is smoothed. Shown is the natural logarithm of the energy functions for the training set (solid line) and the validation set (dashed line) as a function of the number of training iterations. The training is stopped when the validation energy begins to increase. A precise criterion for this early stopping, one that works for fluctuating data, is introduced later.
function evaluated on the validation data. As long as the network learns general
features of the input distribution, both training and validation energies decrease.
But when the network starts to learn specific features of the training set, then the
validation energy saturates, or may start to increase. At this point the training should
be stopped. This scheme is illustrated in Figure 6.10.
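A simple early-stopping rule can be sketched as follows. The "patience" criterion used here is a common choice, but it is our own assumption, not the precise criterion referred to in the text:

```python
def early_stopping(validation_energies, patience=3):
    """Return the iteration at which to stop: when the validation energy
    has not improved for `patience` consecutive evaluations."""
    best, best_it, waited = float("inf"), 0, 0
    for it, h in enumerate(validation_energies):
        if h < best:
            best, best_it, waited = h, it, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_it

# Validation energy decreases, then increases (schematic, as in Figure 6.10).
H_val = [1.0, 0.6, 0.4, 0.35, 0.36, 0.40, 0.45, 0.52]
stop = early_stopping(H_val)   # stops at the minimum, iteration 3
```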
Often the possible values of the output neurons are continuous while the targets
assume only discrete values. Then it is important to also monitor the classification
error of the validation set. The definition of the classification error depends on the
type of the classification problem. Assume first that there is one output unit, and
that the targets take the values t = 0/1. Then the classification error can be defined
as
p
1 X (µ)
t − θH (O (µ) − 12 ) .
C= (6.22a)
p µ=1
If, by contrast, the targets take the values t = ±1, then the classification error is
defined as
C = (1/(2p)) Σ_{µ=1}^p | t^(µ) − sgn( O^(µ) ) | .   (6.22b)
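Equation (6.22b) can be evaluated directly; a minimal sketch with made-up outputs:

```python
def classification_error_pm1(targets, outputs):
    """Classification error (6.22b) for targets t = +/-1 with sign-thresholded outputs."""
    def sgn(x):
        return 1 if x >= 0 else -1
    p = len(targets)
    return sum(abs(t - sgn(o)) for t, o in zip(targets, outputs)) / (2 * p)

targets = [1, -1, 1, 1]
outputs = [0.8, -0.3, -0.1, 0.6]   # the third pattern is misclassified
err = classification_error_pm1(targets, outputs)   # 1 error out of 4 -> 0.25
```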
Table 6.1: Illustrates the difference between energy function and classification error. Each table shows network outputs for three different inputs from the iris data set, as well as the correct classifications.
problematic, if it means that the local fields become too large, so that the algo-
rithm reaches the plateau of the activation function. Then training slows down, as
explained in Section 6.1.
One solution to this problem is to reduce the weights by some factor during training, either at each iteration or at regular intervals, w_ij → (1 − ε)w_ij for 0 < ε < 1, or

δw_mn = −εw_mn   for 0 < ε < 1 .   (6.24)
This is achieved by adding a term to the energy function:

H = (1/2) Σ_{iµ} ( t_i^(µ) − O_i^(µ) )² + (γ/2) Σ_{ij} w_ij² ,   (6.25)

where the first term is denoted H₀.
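The effect of the weight-decay term in Equation (6.25) on a single gradient step can be sketched as follows (our own toy illustration):

```python
def weight_decay_step(w, grad_H0, eta, gamma):
    """One gradient step on H = H0 + (gamma/2) * sum_ij w_ij^2, Eq. (6.25).
    The decay term shrinks every weight by a factor (1 - eta*gamma)."""
    return [wi - eta * (gi + gamma * wi) for wi, gi in zip(w, grad_H0)]

w = [2.0, -1.0]
# With a vanishing gradient of H0, the weights are simply multiplied
# by 1 - eta*gamma = 0.95, which is the decay rule (6.24) with eps = eta*gamma.
w_new = weight_decay_step(w, grad_H0=[0.0, 0.0], eta=0.1, gamma=0.5)
```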
This Equation shows that δw_ij^(n) is a weighted average of the gradients encountered
during training. Now assume that the training is stuck in a shallow minimum. Then
the gradient ∂H/∂w_ij^(n) remains roughly constant through many time steps, so that
we can write
δw_ij^(n) ≈ −η (∂H/∂w_ij^(n)) Σ_{t=0}^n α^{n−t} = −η (α^{n+1} − 1)/(α − 1) (∂H/∂w_ij^(n)) ,   (6.31)
For 0 < α < 1 the prefactor approaches 1/(1 − α) for large n, so that the increments are larger than for a single gradient step. This shows that introducing inertia can substantially accelerate convergence without sacrificing accuracy. The disadvantage is, of course, that there is yet another parameter to choose, the momentum constant α.
Figure 6.11: Left: momentum method (6.29). Right: Nesterov's accelerated gradient method (6.33).
δw_ij^(t) = −η (∂H/∂w_ij)|_{w_ij^(t) + α_{t−1} δw_ij^(t−1)} + α_{t−1} δw_ij^(t−1) .   (6.33)
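The momentum update (6.29) and Nesterov's variant (6.33) can be sketched as follows; the function names and the quadratic toy energy are our own illustration:

```python
def momentum_step(w, prev_dw, grad, eta, alpha):
    """Momentum update: new increment = -eta*gradient + alpha*previous increment."""
    dw = [-eta * g + alpha * d for g, d in zip(grad, prev_dw)]
    return [wi + di for wi, di in zip(w, dw)], dw

def nesterov_step(w, prev_dw, grad_at, eta, alpha):
    """Nesterov's accelerated gradient (6.33): the gradient is evaluated at the
    look-ahead point w + alpha*prev_dw instead of at w."""
    look_ahead = [wi + alpha * di for wi, di in zip(w, prev_dw)]
    g = grad_at(look_ahead)
    dw = [-eta * gi + alpha * di for gi, di in zip(g, prev_dw)]
    return [wi + di for wi, di in zip(w, dw)], dw

# Quadratic toy energy H = w^2/2, so the gradient is simply w.
grad = lambda w: w
w, dw = [1.0], [0.0]
for _ in range(100):
    w, dw = nesterov_step(w, dw, grad, eta=0.1, alpha=0.9)
# w approaches the minimum at 0
```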
6.3.5 Pruning
The term pruning refers to removing unnecessary weights or neurons from the
network, to improve its efficiency. The simplest approach is weight elimination by
weight decay [39]. Weights that tend to remain very close to zero during training are
removed by setting them to zero and not updating them anymore. Neurons that
have zero weights for all incoming connections are effectively removed (pruned). It
has been shown that this method can help the network to generalise [40].
An efficient pruning algorithm is based on the idea to remove the weights that
have least effect upon the energy function (that minimise the increase in H upon
removing them) [41]. Assume that the network was trained, so that the network
reached a (local) minimum of the energy function. One starts by expanding the
energy function around this minimum. To write this expansion down in a convenient
form, one groups all weights in the network in a long weight vector w (as opposed to
grouping them in a weight matrix W as we did in Chapter 2). A particular component
wq is extracted from the vector w as follows:
w_q = ê_q · w ,   (6.34)

where ê_q is the Cartesian unit vector in the direction q, with components (ê_q)_j = δ_qj.
The expansion of H reads

H ≈ H_min + (1/2) δw · M δw ,   (6.35)

up to higher orders in δw. The term linear in δw vanishes because we expand around a local minimum. The matrix M is the Hessian, the matrix of second derivatives of the energy function.
Eliminating the weight w_q amounts to setting

δw_q + w_q = 0 .   (6.36)
The idea is to minimise the damage to the network by eliminating the weight that has the least effect upon H.
I omitted the constant term Hmin because it does not matter. Now we first minimise
H w.r.t. δw , for a given value of q . The linear constraint is incorporated using a
Lagrange multiplier as in Chapter 4, to form the Lagrangian
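The resulting saliency of weight q, the increase in H upon optimally eliminating it, takes the standard form w_q² / (2 [M⁻¹]_qq) [41]. A minimal sketch, assuming the diagonal of the inverse Hessian is available:

```python
def pruning_saliency(w, M_inv_diag):
    """Saliency of each weight: increase in H upon optimally eliminating it,
    saliency_q = w_q^2 / (2 [M^-1]_qq).  For simplicity we pass the diagonal
    of the inverse Hessian directly (an assumption of this sketch)."""
    return [wq * wq / (2 * m) for wq, m in zip(w, M_inv_diag)]

w = [1.5, 0.01, -0.8]
M_inv_diag = [0.5, 2.0, 0.5]
s = pruning_saliency(w, M_inv_diag)
q = min(range(len(s)), key=s.__getitem__)   # weight with least effect on H
```

Here the small weight w = 0.01 is correctly identified as the best candidate for removal.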
6.4 Summary
Backpropagation is an efficient algorithm for stochastic gradient-descent on the
energy function in weight space, because it refers only to quantities that are local to
the weight to be updated. It is sometimes argued that biological neural networks
are local in this way [2].
FURTHER READING 101
Figure 6.12: Patterns detected by the convolutional net of Ref. [43]. After Fig. 13 in
Ref. [43].
6.6 Exercises
Covariance matrix. Show that the covariance matrix C defined in Equation (6.19)
is positive semidefinite.
Pruning. Show that the expression (6.42) for the weight increment δw ∗ minimises
the Lagrangian (6.38) subject to the constraint.
Skipping layers. Show how the backpropagation algorithm can be generalised for
feed-forward networks that allow for connections from the two nearest layers to the
left, not only from the nearest layer to the left.
is the Heaviside function, and w_jk and W_j are the weights for the hidden and the output layer, respectively. Finally, θ_j and Θ are the thresholds assigned to the hidden units and to the output unit. One way of solving the classification problem is illustrated in
Fig. 6.13 where the three lines (solid, dashed, and dash-dotted line) are determined by weights w_jk and thresholds θ_j assigned to the three hidden units in the hidden layer. Compute w_jk and θ_j corresponding to the lines shown in Fig. 6.13. Note that the point where the dashed and dash-dotted lines intersect has coordinates (0.5, 0.8)^T. (0.5p)
(c) For each pattern x^(µ), write its coordinates V_j^(µ) in the transformed (hidden) space. (0.5p)
(d) Graphically illustrate the problem in this transformed space. Is the problem linearly separable in the transformed space, or not? If yes, illustrate a possible solution to the problem in this space. (0.5p)
(e) Compute the corresponding weights Wj and the threshold Θ corresponding to
the solution you illustrated in (d). (0.5p).
EXAM QUESTIONS 103
µ     x1     x2     t^(µ)
8     0.8    0.25   0
9     0.1    0.65   1
10    0.2    0.75   1
11    0.7    0.2    1
Figure 6.13: Left: Inputs and target values for a classification problem. The target
output for each pattern µ is either t (µ) = 0 (white circles) or t (µ) = 1 (black circles). Right:
the three lines illustrate a solution to the problem by a multilayer perceptron. Question
6.7.1.
6.7.2 Backpropagation
To train a multi-layer perceptron using stochastic gradient descent one needs up-
date formulae for the weights and thresholds in the network. Derive these update
formulae for sequential training using backpropagation for the network shown in
Fig. 6.14. The weights for the first and second hidden layer, and for the output layer are denoted by w_jk^(1), w_mj^(2), and W_1m. The corresponding thresholds are denoted by θ_j^(1), θ_m^(2), and Θ_1, and the activation function by g(···). The target value for input pattern x^(µ) is t_1^(µ), and the pattern index µ ranges from 1 to p. The energy function is H = ½ Σ_{µ=1}^p ( t_1^(µ) − O_1^(µ) )². (2p)
Figure 6.14: Multi-layer perceptron with three input terminals, two hidden layers, and
one output unit. Question 6.7.3.
6.7.4 Backpropagation
A multilayer perceptron has L − 1 hidden layers (ℓ = 1, . . . , L − 1) and one output layer (ℓ = L). The state of a neuron V_j^(ℓ,µ) in layer ℓ is given by

V_j^(ℓ,µ) = g( Σ_k w_jk^(ℓ) V_k^(ℓ−1,µ) − θ_j^(ℓ) ) ,

and the energy function is

H = (1/2) Σ_{iµ} ( t_i^(µ) − O_i^(µ) )² ,   (6.46)

where t_i^(µ) is the target value for input pattern x^(µ). Find the batch-mode update rule for the weight w_mn^(L−2) with learning rate η. No momentum, no regularisation. (1p)
6.7.5 Backpropagation
To train a multi-layer perceptron with stochastic gradient descent one needs update
formulae for the weights and thresholds in the network. Derive these update formu-
lae for the network shown in Fig. 6.15 using the gradient-descent algorithm with
constant learning rate, no momentum and no weight decay. The weights for the hidden layer and for the output layer are denoted by w_jk and W_1j, respectively. The corresponding thresholds are denoted by θ_j and Θ_1, and the activation function by g(···). The target value for input pattern x^(µ) is t_1^(µ), and the network output is O_1^(µ). The energy function is H = ½ Σ_{µ=1}^p ( t_1^(µ) − O_1^(µ) )². (2p)
Figure 6.15: Multi-layer perceptron. Question 6.7.5.
10. There are 24 Boolean functions with three input units (and output 0/1) where
exactly three input patterns map to 0.
11. When solving a t = ±1-problem in two dimensions using a decision boundary,
the resulting output problem may sometimes not be linearly separable.
12. The training time for stochastic gradient descent may depend on how the
weights are initialised.
13. The number of neurons in the input layer of a perceptron is equal to the
number of input patterns.
Figure 7.1: Images of iris flowers. From left to right: iris setosa (copyright T. Monto), iris versicolor (copyright R. A. Nonemacher), and iris virginica (copyright A. Westermoreland). All images are copyrighted under the Creative Commons license.
7 Deep learning
7.1 How many hidden layers?
In Chapter 5 we saw why it is sometimes necessary to have a hidden layer: this makes it possible to solve problems that are not linearly separable. Under which circumstances is one hidden layer sufficient? Are there problems that require more than one hidden layer? Even if not necessary, may additional hidden layers improve the performance of the network?
The second question is more difficult to answer than the first, so we start with
the first question. To understand how many hidden layers are necessary it is useful
to view the classification problem as an approximation problem [44]. Consider
the classification problem (x (µ) , t (µ) ) for µ = 1, . . . , p . This problem defines a target
function t (x ). Training a network to solve this task corresponds to approximating the
target function t (x ) by the output function O (x ) of the network, from N -dimensional
input space to one-dimensional output space.
How many hidden layers are necessary or sufficient to approximate a given set of functions?
Figure 7.2: network with two hidden layers, V_l^(1) and V_j^(2), with weights w_lk^(1) and w_jl^(2), and output weights W_j. The hidden neurons have sigmoid activation functions, g(b) = σ(b), while the output unit is linear, g(b) = b.
Figure 7.3: The neural-network output O (x ) approximates the target function t (x ).
In the simplest case the inputs are one-dimensional (Figure 7.3). The training
set consists of pairs (x (µ) , t (µ) ). The task is then to approximate the corresponding
target function t (x ) by the network output O (x ):
O (x ) ≈ t (x ) . (7.2)
Comparison with Equation (7.1) shows that one hidden layer is sufficient to con-
struct the function O (x ) in this way.
Now consider two-dimensional inputs. In this case, suitable basis functions are
Figure 7.4: (a) basis function. (b) linear combination of two sigmoid functions for a
large value of a 1 = a 2 .
So for two input dimensions two hidden layers are sufficient, with four neurons in the first layer and one neuron per basis function in the second hidden layer. In general, for N inputs, two hidden layers are sufficient, with 2N units in the first layer and one unit per basis function in the second layer.
Yet it is not always necessary to use two layers for real-valued functions. For continuous functions, one hidden layer is sufficient. This is ensured by the universal approximation theorem [2]. This theorem says that any continuous function can be approximated to arbitrary accuracy by a network with a single hidden layer, for sufficiently many neurons in the hidden layer.
In Chapter 5 we considered discrete Boolean functions. It turns out that any Boolean function with N-dimensional inputs can be represented by a network with one hidden layer, using 2^N neurons in the hidden layer.
A difference compared with the Boolean networks in Section 5.4 is that here the
inputs take the values ±1. The reason is that this simplifies the proof.
This proof goes by construction [1]. For each hidden neuron one assigns the
weights as follows
w_jk = δ if the k-th digit of the binary representation of j is 1, and w_jk = −δ otherwise,   (7.6)
with δ > 1 (see below). The thresholds θ_j of all hidden neurons are the same, equal to N(δ − 1). The idea is that each input pattern turns on exactly one neuron in the hidden layer (called the winning unit). This requires that δ is large enough, as we shall see. The weights feeding into the output neuron are assigned as follows: if the winning neuron for a pattern has target 1, the corresponding output weight is set to γ, otherwise to −γ (Figure 7.5).
Figure 7.5: Boolean XOR function. (a) value table. (b) network layout. For the weights feeding into the hidden layer, dashed lines correspond to w_jk = −δ, solid lines to w_jk = δ. For the weights feeding into the output neuron, dashed lines correspond to W_1j = −γ, and solid lines to W_1j = γ. (c) construction principle for the weights of the hidden layer.
If we choose δ > 1, then the first hidden neuron gives a positive output (V₀ > 0), while the other neurons produce negative outputs, V_j < 0 for j = 1, 2, 3. Now consider x^(3) = [−1, +1]^T. In this case
So in this case the third hidden neuron gives a positive output, while the others
yield negative outputs. It works in the same way for the other two patterns, x (2) and
x (4) . This confirms that there is a unique winning neuron for each pattern. That
pattern µ = k gives the winning neuron j = k − 1 is of no importance, it is just a
Figure 7.6: Solution of the parity problem for N -dimensional inputs. The network is
built from XOR units (Figure 5.14). Each XOR unit has a hidden layer with two neurons.
Above only the states of the inputs and outputs of the XOR units are shown, not those
of the hidden neurons. In total, the whole network has O (N ) neurons.
consequence of how the patterns are ordered in the value table in Figure 7.5. Second, the output neuron computes O₁ from the states of the hidden neurons, with γ > 0 and Θ = Σ_j W_1j = 0. For x^(1) and x^(4) we find the correct result O₁ = −1. The same is true for x^(2) and x^(3), where we obtain O₁ = 1. In summary, this example
illustrates how an N-dimensional Boolean function is represented by a network with one hidden layer, with 2^N neurons. The problem is of course that this network is expensive to train for large N because the number of hidden neurons is very large.
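The winner-take-all construction of Equation (7.6) is easy to verify numerically for N = 2. A minimal sketch in our own notation, checking that each ±1 input pattern turns on exactly one hidden neuron:

```python
# One-hidden-layer construction for Boolean functions: weights as in
# Eq. (7.6), all thresholds equal to N*(delta - 1).
N, delta = 2, 2.0
theta = N * (delta - 1)

def hidden_fields(x):
    """Local fields of the 2^N hidden neurons for an input x with entries +/-1."""
    fields = []
    for j in range(2 ** N):
        digits = [(j >> k) & 1 for k in range(N)]          # binary rep. of j
        w = [delta if d == 1 else -delta for d in digits]  # Eq. (7.6)
        fields.append(sum(wk * xk for wk, xk in zip(w, x)) - theta)
    return fields

# Each of the four +/-1 input patterns turns on exactly one hidden neuron
# (positive local field): the winning unit.
for j in range(2 ** N):
    x = [1 if (j >> k) & 1 else -1 for k in range(N)]
    winners = [i for i, b in enumerate(hidden_fields(x)) if b > 0]
    assert winners == [j]
```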
There are more efficient layouts if one uses more than one hidden layer. As
an example, consider the parity function for N binary inputs equal to 0 or 1. The
function measures the parity of the input sequence. It gives 1 if there is an odd
number of ones in the input, otherwise 0. A construction similar to the above yields a network layout with 2^N neurons in the hidden layer. If one instead wires together the XOR networks shown in Figure 5.14, one can solve the parity problem with O(N) neurons, as Figure 7.6 demonstrates. When N is a power of two, this network has 3(N − 1) neurons. To see this, set the number of inputs to N = 2^k. Figure 7.6 shows that the number N_k of neurons satisfies the recursion N_{k+1} = 2N_k + 3 with N_1 = 3. The solution of this recursion is N_k = 3(2^k − 1).
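The XOR-tree construction of Figure 7.6 can be sketched as follows (our own illustration, counting XOR units rather than individual neurons):

```python
def parity(bits):
    """Parity computed by wiring XOR units in a binary tree (Figure 7.6).
    Returns the parity bit and the number of XOR units used."""
    units_used = 0
    layer = list(bits)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])   # one XOR unit
            units_used += 1
        if len(layer) % 2:          # odd element carried to the next level
            nxt.append(layer[-1])
        layer = nxt
    return layer[0], units_used

out, n_xor = parity([1, 0, 1, 1])   # odd number of ones -> parity 1
# For N = 2^k inputs the tree uses N - 1 XOR units; each XOR unit contains
# 3 neurons (2 hidden + 1 output), giving 3(N - 1) neurons in total.
```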
This example also illustrates a second reason why it may be useful to have more
than one hidden layer. To design a network for a certain task it is often convenient
Figure 7.7: Fully connected deep network with five hidden layers. How deep is deep?
Usually one says: deep networks have two or more hidden layers.
to build the network from building blocks. One wires them together, often in a
hierarchical fashion. In Figure 7.6 there is only one building block, the XOR network
from Figure 5.14.
Another example are convolutional networks for image analysis (Section 7.3).
Here the fundamental building blocks are feature maps, they recognise different
geometrical features in the image, such as edges or corners.
Figure 7.8: Shows how the norm of the gradient of H w.r.t. the thresholds θ_j^(ℓ) in layer ℓ depends on the number of training epochs. Gradient in hidden layer ℓ = 1 (circles), ℓ = 2 (squares), ℓ = 3 (diamonds), ℓ = 4 (triangles). The data was obtained by training a network with four fully connected hidden layers with N = 30 neurons each on the MNIST data set (Section 7.4). The output layer has 10 neurons. Sigmoid activation functions (6.15a), quadratic energy function (6.4), learning rate η = 3 × 10⁻³. The data was obtained by averaging over an ensemble of 100 independent runs (data by Johan Fries).
∇_θ^(ℓ) H = ( ∂H/∂θ_1^(ℓ) , . . . , ∂H/∂θ_N^(ℓ) )^T .   (7.10)
Equation (6.12) shows that the errors in layer ℓ, and thus the weight updates, are proportional to −∇_θ^(ℓ) H. Figure 7.8 demonstrates that the norm of this gradient
tends to be very small for the first 20 training epochs. In this regime the gradient (and thus the speed of training) vanishes exponentially as ℓ → 1. This slowing down is the result of the diminished effect of the neurons in layer ℓ upon the output, when ℓ is small. This is the vanishing-gradient problem.
To explain this phenomenon, consider the very simple case shown in Figure 7.9:
a deep network with only one neuron per layer. To measure the effect of a given
neuron on the output, we calculate how the output of the network changes when
changing the state of a neuron in a particular layer. The output V (L ) is given by the
nested activation functions
V^(L) = g( w^(L) g( w^(L−1) · · · g( w^(2) g( w^(1) x − θ^(1) ) − θ^(2) ) · · · − θ^(L−1) ) − θ^(L) ) .   (7.11)
Figure 7.9: ‘Network’ illustrating the vanishing-gradient problem, with neurons V^(ℓ), weights w^(ℓ), and thresholds θ^(ℓ).
The effects of the neurons in Figure 7.9 are computed using the chain rule:

∂V^(L)/∂V^(L−1) = g′(b^(L)) w^(L) ,
∂V^(L)/∂V^(L−2) = [∂V^(L)/∂V^(L−1)] [∂V^(L−1)/∂V^(L−2)] = g′(b^(L)) w^(L) g′(b^(L−1)) w^(L−1) ,
and so forth,   (7.12)

where b^(k) = w^(k) V^(k−1) − θ^(k) is the local field for neuron k. This yields the following expression for J_{ℓ,L} ≡ ∂V^(L)/∂V^(ℓ):

J_{ℓ,L} = ∂V^(L)/∂V^(ℓ) = Π_{k=ℓ+1}^{L} [ g′(b^(k)) w^(k) ] .   (7.13)
Note that the error δ^(ℓ) of the hidden layer ℓ is determined by a closely related product. Algorithm 2 shows that the errors are given recursively by δ^(ℓ) = δ^(ℓ+1) w^(ℓ+1) g′(b^(ℓ)). Using δ^(L) = [t − V^(L)(x)] g′(b^(L)), we have

δ^(ℓ) = [t − V^(L)(x)] g′(b^(L)) Π_{k=ℓ+1}^{L} [ w^(k) g′(b^(k−1)) ] .   (7.14)

Equations (7.13) and (7.14) are consistent, since δ^(ℓ) = [t − V^(L)] ∂V^(L)/∂(−θ^(ℓ)) = [t − V^(L)] [∂V^(L)/∂V^(ℓ)] g′(b^(ℓ)).
Coming back to the product (7.13), consider first the early stages of training. If one initialises the weights as described in Chapter 6, to Gaussian random variables with mean zero and variance σ_w², and the thresholds to zero, then the factors w^(k) g′(b^(k−1)) are usually smaller than unity (for the activation functions (6.15), the maximum of g′(b) is ¼ and 1, respectively). The product of these factors vanishes quickly as ℓ decreases. So the slowing down is a consequence of multiplying many small numbers to get something really small (vanishing-gradient problem).
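The shrinking of the product (7.13) with depth can be illustrated for the one-neuron chain of Figure 7.9. A minimal sketch with identical weights and tanh activations (the parameter values are our own choice):

```python
import math

def gradient_product(L, w=1.0, x=0.9):
    """|J| = product of the factors g'(b^(k)) w^(k) in Eq. (7.13) for a chain
    of L one-neuron tanh layers with identical weights w, zero thresholds."""
    V, J = x, 1.0
    for _ in range(L):
        b = w * V
        V = math.tanh(b)
        J *= (1.0 - V * V) * w   # g'(b) = 1 - tanh(b)^2
    return abs(J)

# Every factor is smaller than unity, so the product shrinks with depth:
shallow, deep = gradient_product(3), gradient_product(30)
```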
What happens at later times? One might argue that the weights may grow during training. If that happened, the problem might become worse still, because g′(b) tends to zero exponentially as |b| grows. This indicates that the first layers may continue to learn slowly. Figure 7.8 shows that the effect persists
for about 20 epochs. But then even the first layers begin to learn faster. This does
not contradict the above discussion, because it assumed random weights. As the
network learns, the weights are no longer independent random numbers. But there
is to date no mathematical theory describing how this transition occurs.
More fundamentally, Equation (7.13) demonstrates that different layers of the
network learn at different speeds, because their neurons typically have different
effects on the output. This is due to the fact that the product in Equation (7.13) is
unlikely to remain of order unity when L is large. To see this, assume that the weights
are independently distributed random numbers. Taking the logarithm and using
the central-limit theorem shows that the distribution of the product is log normal.
This means that the learning speed can be substantially different in different layers. This is also referred to as the problem of unstable gradients. The example shown in Figure 7.9 illustrates the origin of this problem: it is due to the fact that multiplying many small numbers together produces a result that is very small. Multiplying many numbers that are larger than unity, by contrast, yields a large result.
In networks like the one shown in Figure 7.7 the principle is the same, but instead of multiplying numbers one multiplies matrices. The product (7.13) of random numbers becomes a product of random matrices. Assume that all layers ℓ = 1, . . . , L have N neurons. We denote the matrix with elements J_ij^(ℓ,L) = ∂V_i^(L)/∂V_j^(ℓ) by J_{ℓ,L}. Using the chain rule we find:

∂V_i^(L)/∂V_j^(ℓ) = Σ_{l m ··· n} (∂V_i^(L)/∂V_l^(L−1)) (∂V_l^(L−1)/∂V_m^(L−2)) · · · (∂V_n^(ℓ+1)/∂V_j^(ℓ)) .   (7.15)
Using the update rule

V_i^(k) = g( Σ_j w_ij^(k) V_j^(k−1) − θ_i^(k) ) ,   (7.16)
Figure 7.10: (a) Firing rate of a leaky integrate-and-fire neuron as a function of the electrical current I through the cell membrane, Equation (7.22) for τ = 25 and U_c/R = 2 (see text). (b) Rectified linear unit, g(b) = max{0, b}.
λ = lim_{k→∞} (1/(2k)) 〈 log tr( J_{0,k}^T J_{0,k} ) 〉 ,   (7.20)

where the average is over realisations of weights and thresholds. The matrix J_{0,k}^T J_{0,k} is called the right Cauchy-Green matrix, and tr denotes the trace of this matrix, the sum of its diagonal elements. The right Cauchy-Green matrix is symmetric, and it is positive definite. The eigenvectors of J_{0,k}^T J_{0,k} are called forward Lyapunov vectors. They describe how small corrections to the inputs rotate, shrink, or stretch as they propagate through the network.
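A maximal Lyapunov exponent of the form (7.20) can be estimated by multiplying random matrices and renormalising, as sketched below for a linear toy network with 2×2 Gaussian weight matrices (our own illustration; in a nonlinear network the factors g′(b) would enter as well):

```python
import math, random

def lyapunov_estimate(k=2000, sigma_w=1.0, seed=0):
    """Monte-Carlo estimate of the maximal Lyapunov exponent for a product of
    k random 2x2 weight matrices acting on a unit vector (linear network,
    so g' = 1 and J is just the product of the weight matrices)."""
    rng = random.Random(seed)
    v = [1.0, 0.0]
    lam = 0.0
    for _ in range(k):
        W = [[rng.gauss(0.0, sigma_w) for _ in range(2)] for _ in range(2)]
        v = [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]
        norm = math.hypot(*v)
        lam += math.log(norm)        # accumulate the logarithmic growth factor
        v = [v[0] / norm, v[1] / norm]
    return lam / k

lam = lyapunov_estimate()
# A positive estimate means small perturbations are, on average, stretched
# exponentially as they propagate; a negative one means they shrink.
```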
If we multiply the matrix Jk ,L from the left with the transpose of the vector δ(L )
of output errors, we see how the errors change as they propagate backwards from
layer k to the leftmost hidden layer, how this vector rotates, shrinks, or stretches.
There are a number of different tricks that help to suppress vanishing gradients,
to some extent at least. First, it is usually argued that it helps to use an activation
function that does not saturate at large b , such as the ReLU function introduced
in Section 7.2.2. But the results of Ref. [46] show that the effect is perhaps not as
strong as originally thought. Second, batch normalisation (Section 7.2.6) may help
against the unstable gradient problem. Third, introducing connections that skip
layers (residual network) can also reduce the unstable-gradient problem. This is
discussed in Section 7.6.
into the neuron cell, and the membrane potential U . The simplest models for the
dynamics of the membrane potential represent the neuron as a capacitor. In the
leaky integrate-and-fire neuron, leakage is added by a resistor R in parallel with the
capacitor C , so that
I = U/R + C dU/dt .   (7.21)
For a constant current, the membrane potential grows from zero as a function of time, U(t) = RI[1 − exp(−t/τ)], where τ = RC is the time constant of the model. One says that the neuron produces a spike when the membrane potential exceeds a critical value, U_c. Immediately after, the membrane potential is set to zero (and begins to grow again). In this model, the firing rate f(I) is thus given by t_c^{-1}, where t_c is the solution of U(t_c) = U_c. It follows that the firing rate exhibits a threshold behaviour (the system works like a rectifier):
    f(I) = 0                               for I ≤ U_c/R ,
    f(I) = [τ log( RI/(RI − U_c) )]^{-1}   for I > U_c/R .   (7.22)
This response curve is illustrated in Figure 7.10 (a). The main message is that there
is a threshold below which the response is strictly zero (this is not the case for the
activation function shown in Figure 1.6). The response function looks qualitatively
like the ReLU function
shown in panel (b). Neurons with this activation function are called rectified linear
units. The derivative of the ReLU function is discontinuous at b = 0. A common
convention is to set the derivative to zero at b = 0.
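The rectifier behaviour of Equation (7.22) is easy to explore numerically. The following Python sketch (not part of the original notes; the parameter values R = 1, C = 25, U_c = 2 are illustrative choices matching τ = 25 and U_c/R = 2 in Figure 7.10) evaluates the firing rate and the ReLU function:

```python
import math

def firing_rate(I, R=1.0, C=25.0, Uc=2.0):
    """Firing rate of the leaky integrate-and-fire neuron, Equation (7.22).

    The parameter values are illustrative; tau = R*C is the time constant."""
    tau = R * C
    if I <= Uc / R:
        return 0.0  # below threshold the neuron never spikes
    return 1.0 / (tau * math.log(R * I / (R * I - Uc)))

def relu(b):
    """Rectified linear unit, g(b) = max{0, b}."""
    return max(0.0, b)
```

Evaluating `firing_rate` on a grid of currents reproduces the threshold behaviour of Figure 7.10(a): zero response up to I = U_c/R, then a monotonically increasing firing rate.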
What is the point of using rectified linear units? When training a deep network
with ReLU functions it turns out that many of the hidden neurons (as many as
50%) produce outputs strictly equal to zero. This means that the network of active
neurons (non-zero output) is sparsely connected. It is thought that sparse networks
have desirable properties, and sparse representations of a classification problem are
more likely to be linearly separable (as shown in Section 10.1). Figure 7.11 illustrates
that for a given input pattern only a certain fraction of hidden neurons is active. For
these neurons the computation is linear, yet different input patterns give different
sets of active neurons. The product in Equation (7.18) acquires a particularly simple structure: the matrices D^(k) are diagonal with 0/1 entries. But while the weight matrices are independent, the D^(k)-matrices are correlated: which elements vanish depends on the states of the neurons in the corresponding layer, which in turn depend on the weights to the right of D^(k) in the matrix product.
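A quick numerical experiment illustrates this point. The sketch below (illustrative Python; zero-mean Gaussian weights and inputs are an assumption, not data from the notes) computes the ReLU outputs of one hidden layer and the 0/1 diagonal entries of the corresponding matrix D. For zero-mean local fields, roughly half of the neurons come out active:

```python
import random

random.seed(1)
N = 1000  # number of hidden neurons (illustrative size)

# Local fields b_i = sum_j w_ij x_j with zero-mean Gaussian weights and inputs.
x = [random.gauss(0.0, 1.0) for _ in range(N)]
b = [sum(random.gauss(0.0, 1.0) * xj for xj in x) for _ in range(N)]

# ReLU outputs and the diagonal 0/1 entries of the matrix D.
V = [max(0.0, bi) for bi in b]
D = [1 if bi > 0 else 0 for bi in b]

active_fraction = sum(D) / N
print(active_fraction)  # close to 0.5 for zero-mean local fields
```

Different input patterns x give different diagonals D, which is the correlation between the D^(k)-matrices and the neuron states described above.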
118 DEEP LEARNING
Figure 7.11: Sparse network of active neurons with ReLU activation functions. The red
paths correspond to active neurons with positive local fields.
A hidden layer with only one or very few active neurons might act as a bottleneck
preventing efficient backpropagation of output errors which could in principle slow
down training. For the examples given in Ref. [47] this does not occur.
The ReLU function is unbounded for large positive local fields. Therefore, the
vanishing-gradient problem (Section 7.2.1) is thought to be less severe in networks
made of rectified linear units, but see Ref. [46]. Since the ReLU function does not
saturate, the weights tend to increase. Glorot et al. [47] suggested to use L 1 -weight
decay (Section 6.3.3) to make sure that the weights do not grow.
Finally, using ReLU functions instead of sigmoid functions speeds up the training, because the ReLU function has piecewise constant derivatives, which are faster to evaluate than the derivatives of sigmoid functions.
are, first, that 0 ≤ O_i ≤ 1. Second, the values of the outputs sum to one,

    ∑_{i=1}^{M} O_i = 1 .   (7.25)
This means that the outputs of softmax units can be interpreted as probabilities. Third, the outputs are monotonic: when b_i^{(L)} increases, O_i increases while the values O_k of the other output neurons k ≠ i decrease.
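These properties are easy to verify numerically. Here is a minimal Python sketch of the softmax function; subtracting the largest local field before exponentiating is a standard numerical-stability trick, not part of the definition (7.24):

```python
import math

def softmax(b):
    """Softmax outputs O_i = exp(b_i) / sum_k exp(b_k).

    Subtracting max(b) avoids overflow without changing the result,
    since the common factor exp(-max(b)) cancels in the ratio."""
    bmax = max(b)
    e = [math.exp(bi - bmax) for bi in b]
    s = sum(e)
    return [ei / s for ei in e]

O = softmax([1.0, 2.0, 3.0])
print(sum(O))  # the outputs sum to one, Equation (7.25)
```

Without the max subtraction, `softmax([1000.0, 1000.0])` would overflow; with it, the result is simply [0.5, 0.5].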
Softmax output units can simplify interpreting the network output for classification problems where the inputs must be assigned to one of M classes. In this problem, the output O_i^{(µ)} of softmax unit i represents the probability that the input x^{(µ)} is in class i (in terms of the targets: t_i^{(µ)} = 1 while t_k^{(µ)} = 0 for k ≠ i). Softmax units are often used in conjunction with a different energy function (or cost function), defined in terms of negative log-likelihoods:

    H = − ∑_{iµ} t_i^{(µ)} log O_i^{(µ)} .   (7.26)
Here and in the following, log stands for the natural logarithm. The function (7.26) is minimal when O_i^{(µ)} = t_i^{(µ)}. Since the function (7.26) is different from the energy function used in Chapter 6, the details of the backpropagation algorithm are slightly different. To find the correct formula for backpropagation, we need to evaluate

    ∂H/∂w_mn = − ∑_{iµ} (t_i^{(µ)}/O_i^{(µ)}) ∂O_i^{(µ)}/∂w_mn .   (7.27)
Here I did not write out the labels L that denote the output layer, and in the following equations I also drop the index µ that refers to the input pattern. Using the identities

    ∂O_i/∂b_l = O_i (δ_il − O_l)   and   ∂b_l/∂w_mn = δ_lm V_n ,   (7.28)

one obtains

    ∂O_i/∂w_mn = ∑_l (∂O_i/∂b_l)(∂b_l/∂w_mn) = O_i (δ_im − O_m) V_n .   (7.29)
So

    δw_mn = −η ∂H/∂w_mn = η ∑_{iµ} t_i^{(µ)} (δ_im − O_m^{(µ)}) V_n^{(µ)} = η ∑_µ (t_m^{(µ)} − O_m^{(µ)}) V_n^{(µ)} ,   (7.30)

since ∑_{i=1}^{M} t_i^{(µ)} = 1 for the type of classification problem where each input belongs to precisely one class. The corresponding expression for the threshold updates reads

    δθ_m = −η ∂H/∂θ_m = −η ∑_µ (t_m^{(µ)} − O_m^{(µ)}) .   (7.31)
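Equation (7.30) can be checked numerically by comparing the analytic gradient with a finite-difference estimate. The following Python sketch does this for a single softmax output layer; the values of V, t, θ, and w are arbitrary illustrative numbers, not taken from the notes:

```python
import math

def softmax(b):
    bmax = max(b)
    e = [math.exp(x - bmax) for x in b]
    s = sum(e)
    return [x / s for x in e]

def energy(w, theta, V, t):
    """H = -sum_i t_i log O_i, Equation (7.26), for a single input pattern."""
    b = [sum(w[m][n] * V[n] for n in range(len(V))) - theta[m]
         for m in range(len(theta))]
    O = softmax(b)
    return -sum(t[i] * math.log(O[i]) for i in range(len(t)))

# Arbitrary illustrative numbers: three hidden outputs V_n, three softmax units.
V = [0.3, -0.7, 0.5]
t = [0.0, 1.0, 0.0]  # one-hot target
theta = [0.1, -0.2, 0.0]
w = [[0.2, -0.1, 0.4],
     [0.0, 0.3, -0.5],
     [-0.2, 0.1, 0.1]]

b = [sum(w[m][n] * V[n] for n in range(3)) - theta[m] for m in range(3)]
O = softmax(b)

# Analytic gradient from Equation (7.30): dH/dw_mn = -(t_m - O_m) V_n
m, n = 1, 2
analytic = -(t[m] - O[m]) * V[n]

# Finite-difference estimate by central differences
eps = 1e-6
w[m][n] += eps
Hp = energy(w, theta, V, t)
w[m][n] -= 2 * eps
Hm = energy(w, theta, V, t)
w[m][n] += eps
numerical = (Hp - Hm) / (2 * eps)
print(abs(analytic - numerical) < 1e-8)  # True
```

Note that no factor g′ appears in the analytic gradient, which is the learning speedup discussed below.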
Figure 7.12: The symbol for a softmax layer indicates that the neurons in this layer are
not independent.
Equations (7.30) and (7.31) highlight a further advantage of softmax output neurons (apart from the fact that they allow the output to be interpreted in terms of probabilities). The weight and threshold increments for the output layer derived in Chapter 6 [Equations (6.7) and (6.12a)] contain factors of the derivatives g′(B_m^{(µ)}). As noted earlier, these derivatives tend to zero when the activation function saturates, slowing down the learning. But Equations (7.30) and (7.31) do not contain such factors! Here the rate at which the neuron learns is simply proportional to the error, (t_m^{(µ)} − O_m^{(µ)}); no small factors reduce this rate.
Softmax units are normally only used in the output layer. First, the derivation
shows that the learning speedup mentioned above is coupled to the use of the log
likelihood function (7.26). Second, one usually tries to avoid dependence between
the neurons in a given hidden layer, but Equation (7.24) shows that the output
of neuron i depends on all local fields in the hidden layer (Figure 7.12). A better
alternative is usually the ReLU activation function discussed in Section 7.2.2.
There is an alternative way of choosing the cost function that is very similar to
the above, but works with sigmoid units:
    H = − ∑_{iµ} [ t_i^{(µ)} log O_i^{(µ)} + (1 − t_i^{(µ)}) log(1 − O_i^{(µ)}) ] ,   (7.32)

with O_i = σ(b_i), where σ is the sigmoid function (6.15a). The function (7.32) is called the
cross-entropy function. To compute the weight increments, we apply the chain rule:
    ∂H/∂w_mn = − ∑_{iµ} [ t_i^{(µ)}/O_i^{(µ)} − (1 − t_i^{(µ)})/(1 − O_i^{(µ)}) ] ∂O_i^{(µ)}/∂w_mn
             = − ∑_{iµ} [ (t_i^{(µ)} − O_i^{(µ)}) / (O_i^{(µ)}(1 − O_i^{(µ)})) ] ∂O_i^{(µ)}/∂w_mn .   (7.33)

For sigmoid units, ∂O_i^{(µ)}/∂w_mn = O_i^{(µ)}(1 − O_i^{(µ)}) δ_im V_n^{(µ)}, so the factors O_i^{(µ)}(1 − O_i^{(µ)}) cancel, and the resulting weight increments are identical to Equation (7.30). The threshold increments are also updated in the same way, Equation (7.31). Yet the interpretation of the outputs is slightly different, since
TRAINING DEEP NETWORKS 121
the values of the softmax units in the output layers sum to unity, while those of
the sigmoid units do not. In either case you can use the definition (6.23) for the
classification error.
    b_i = ∑_{j=1}^{N} w_ij x_j .   (7.35)
7.2.5 Regularisation
Deeper networks have more neurons, so the problem of overfitting (Figure 6.9) tends
to be more severe for deeper networks. Therefore regularisation schemes that limit
the tendency to overfit are more important for deeper networks. In Section 6.3.3 sev-
eral regularisation schemes were described, for example L 1 - and L 2 -regularisation.
In training deep networks, a number of other regularisation schemes have proved
useful: drop out, pruning (Section 6.3.5), and expanding the training set.
Drop out
In this scheme some neurons are ignored during training [50]. Usually this regularisation technique is applied to hidden neurons. The procedure is illustrated in Figure 7.13. In each step of the training algorithm (for each mini batch, or for each individual pattern) one ignores at random a fraction p of neurons from each hidden layer, and updates the weights in the remaining, diluted network in the usual fashion. The weights coming into the dropped neurons are not updated, and as a consequence neither are their outputs. For the next step in the training algorithm, the removed neurons are put back, and another set of hidden neurons is removed. Once the training is completed, all hidden neurons are activated, but their outputs are multiplied by 1 − p, the fraction of neurons kept during training, so that the expected output matches that seen during training.
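A minimal sketch of the two steps, assuming p denotes the fraction of dropped neurons (modern implementations often use "inverted" dropout instead, scaling by 1/(1 − p) during training so that no rescaling is needed at prediction time):

```python
import random

def dropout_train(V, p, rng=random):
    """Training step: each hidden output is dropped (set to zero) with probability p."""
    return [0.0 if rng.random() < p else v for v in V]

def dropout_predict(V, p):
    """After training all neurons are active; outputs are scaled by the
    keep fraction 1 - p so the expected value matches training conditions."""
    return [(1.0 - p) * v for v in V]
```

For example, with p = 0.5 the prediction step halves every hidden output, compensating for the fact that during training only half of the neurons contributed on average.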
Srivastava et al. [50] motivate this method by remarking that the performance
of machine-learning algorithms is usually improved by combining the results of
several learning attempts. In our case this corresponds to separately training several
n                   10     8     6     4     2
Training success (%):
without pruning     98.5   96.8   92.5   78.3   49.1
pruned network      –      –      –      97.9   83.3
Figure 7.14: Boolean XOR problem. The network has one hidden layer with n ReLU neurons. The output neuron has a sigmoid activation function. The network is trained with stochastic gradient descent for 10 000 iterations. The initial weights were Gaussian random numbers with mean zero, standard deviation 0.1, and max-norm regularisation |w_ij| < 2. The thresholds were initially zero. Training success was measured in an ensemble of 1000 independent training realisations. Data from Ref. [51].
networks with different layouts on different inputs, and then to average over their
outputs. However, for deep networks this is computationally very expensive. Drop
out is an attempt to achieve the same goal more efficiently. The idea is that dropout
corresponds to effectively training a large number of different networks. If there
are k hidden neurons, then there are 2k different combinations of neurons that
are turned on or off. The hope is that the network learns more robust features
of the input data in this way, and that this reduces overfitting. In practice the
method is usually applied together with another regularisation scheme, max-norm
regularisation. This means that weights are not allowed to grow larger than a given
constant: |wi j | ≤ c .
Pruning
Figure 5.14 shows that the XOR function can be represented by a hidden layer
with two neurons. Suitable weights and thresholds are given in this Figure. Frankle &
Carbin [51] point out that backpropagation takes a long time to find a valid solution,
for random initial weights. They observe that a network with many more neurons
in the hidden layer usually learns better. Figure 7.14 lists the fraction of successful
trainings for networks with different numbers of neurons in the hidden layer. With
two hidden neurons, only 49.1% of the networks learned the task in 10 000 training
steps of stochastic gradient descent. Networks with more neurons in the hidden
layer ensure better training success. The Figure also shows the training success of
pruned networks, that were initially trained with n = 10 neurons. Then networks
were pruned iteratively during training, removing the neurons with the largest
average magnitude. After training, the weights and threshold were reset to their
initial values, the values before training began. One can draw three conclusions
from this data (from Ref. [51]). First, iterative pruning during training singles out
neurons in the hidden layer that had initial weights and thresholds resulting in the
correct decision boundaries. Second, the pruned network with two hidden neurons
has much better training success than the network that was trained with only two
hidden neurons. Third, despite pruning more than 50% of the hidden neurons, the
network with n = 4 hidden neurons performs almost as well as the one with n = 10
hidden neurons. When training deep networks it is common to start with many
neurons in the hidden layers, and to prune up to 90% of them. This results in small
trained networks that can efficiently and reliably classify.
Expanding the training set
If one trains a network with a fixed number of hidden neurons on larger training sets,
one observes that the network generalises with higher accuracy (better classification
success). The reason is that overfitting is reduced when the training set is larger.
Thus, a way of avoiding overfitting is to expand or augment the training set. It
is sometimes argued that the recent success of deep neural networks in image
recognition and object recognition is in large part due to larger training sets. One
example is ImageNet, a database of more than 10⁷ hand-classified images, sorted into more than 20 000 categories [52]. Naturally it is expensive to improve training sets in this
way. Instead, one can expand a training set artificially. For digit recognition (Figure
2.1), for example, one can expand the training set by randomly shifting, rotating,
and shearing the digits.
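As an illustration, here is a minimal Python sketch of one such transformation, a rigid shift of a small binary image (rotation and shearing work analogously but require interpolation; the 3 × 3 "digit" is an illustrative toy example):

```python
def shift(image, dr, dc):
    """Shift a 2-d image (list of rows) by (dr, dc) pixels, padding with zeros.

    Pixels shifted outside the frame are discarded."""
    R, C = len(image), len(image[0])
    out = [[0] * C for _ in range(R)]
    for r in range(R):
        for c in range(C):
            rr, cc = r + dr, c + dc
            if 0 <= rr < R and 0 <= cc < C:
                out[rr][cc] = image[r][c]
    return out

# A toy vertical stroke; shifting it produces a new, equally valid training pattern.
digit = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]
print(shift(digit, 0, 1))  # the stroke moves one pixel to the right
```

Applying several random small shifts to each training image multiplies the effective size of the training set at essentially no cost.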
(Algorithm 4). One calculates the average and variance over each mini batch,

    V̄_j = (1/m_B) ∑_{µ=1}^{m_B} V_j^{(µ)}   and   σ_B² = (1/m_B) ∑_{µ=1}^{m_B} (V_j^{(µ)} − V̄_j)² ,   (7.38)

subtracts the mean from the V_j^{(µ)}, and divides by √(σ_B² + ε). The parameter ε > 0 is added to the denominator to avoid division by zero. There are two additional parameters in Algorithm 4, namely γ_j and β_j. If one were to set β_j = V̄_j and γ_j = √(σ_B² + ε) (Algorithm 4), then batch normalisation would leave the V_j^{(µ)} unchanged. But instead these two parameters are learnt by backpropagation, just like the weights and thresholds. In general the new parameters are allowed to differ from layer to layer, γ_j^{(ℓ)} and β_j^{(ℓ)}.
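The normalisation step (7.38) for a single neuron can be sketched as follows (γ and β are the learnable parameters; ε avoids division by zero):

```python
import math

def batch_normalise(V, gamma, beta, eps=1e-8):
    """Normalise one neuron's outputs V^(mu) over a mini batch, Equation (7.38),
    then scale and shift with the learnable parameters gamma and beta."""
    mB = len(V)
    mean = sum(V) / mB
    var = sum((v - mean) ** 2 for v in V) / mB
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in V]

out = batch_normalise([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
```

With γ = 1 and β = 0 the normalised outputs have (up to ε) zero mean and unit variance over the mini batch; during training, γ and β are adjusted by backpropagation just like the weights.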
Batch normalisation was originally motivated by arguing that it reduces possible
covariate shifts faced by hidden neurons in layer `: as the parameters of the neurons
in the preceding layer ` − 1 change, their outputs shift thus forcing the neurons
in layer ` to adapt. However in Ref. [54] it was argued that batch normalisation
does not reduce the internal covariate shift. It speeds up the training by effectively
smoothing the energy landscape.
Batch normalisation helps to combat the vanishing-gradient problem because it prevents the local fields of hidden neurons from growing. This makes it possible to use
sigmoid functions in deep networks, because the distribution of inputs remains
normalised. It is sometimes argued that batch normalisation has a regularising
effect, and it has been suggested [53] that batch normalisation can replace drop out
(Section 7.2.5). It is also argued that batch normalisation can help the network to
generalise better, in particular if each mini batch contains randomly picked inputs.
Then batch normalisation corresponds to randomly transforming the inputs to each
hidden neuron (by the randomly changing means and variances). This may help to
make the learning more robust. There is no theory that proves either of these claims,
but it is an empirical fact that batch normalisation often speeds up the training.
Figure 7.15: (a) layout of a convolution layer. (b) several convolution layers are con-
nected to the input layer to detect different features.
layers. Connected to the pooling layers, convolutional networks may also contain
several fully connected layers.
The activation function g can be the sigmoid function. Usually one connects several
convolution layers to the input layer, as shown in Figure 7.15(b). Different layers
contain different feature maps, one that detects edges for example, and another
one that detects corners, and so forth.
Figure 7.15 depicts a two-dimensional input array. For colour images there are usually three colour channels; in this case the input array is three-dimensional, and the input bits are labeled by three indices: two for position and the last one for colour, x_{ijk}. If one has several convolution layers that connect to the inputs, one groups the weights (and thresholds) into still higher-dimensional arrays (tensors). In this case the convolution takes the form:

    V_{ijk···} = g( ∑_{pqr···} w_{pqkr···} x_{p+i−1, q+j−1, r···} − θ_{k···} ) .   (7.40)
Figure 7.16: Layout of a convolutional neural network for object recognition and image
classification. The inputs are in a 10 × 10 array. They feed into four convolution layers
representing four different 3 × 3 feature maps. Each convolution layer feeds into its own
max-pooling layer. Between these and the output layer are a couple of fully connected
hidden layers.
If one couples several convolution layers together, the number of neurons in these
layers decreases rapidly as one moves to the right. In this case one can pad the
image (and the convolution layers) by adding rows and columns of bits set to zero. In
Figure 7.15(a), for example, one obtains a convolution layer of the same dimension
as the original image if one pads the image with two rows and columns of bits.
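A convolution layer of this kind is straightforward to sketch in plain Python. The function below implements a single feature map of the form of Equation (7.39); the zero-based indexing shifts the p + i − 1 offsets of the text to p + i, and the uniform 10 × 10 input with a 3 × 3 averaging kernel is an illustrative choice:

```python
def convolve(x, w, theta, g):
    """One feature map, V_ij = g( sum_pq w_pq x_{p+i, q+j} - theta ),
    cf. Equation (7.39) with zero-based indices.

    A 10x10 input and a 3x3 kernel give an 8x8 feature map, as in Figure 7.15."""
    P, Q = len(w), len(w[0])
    rows = len(x) - P + 1
    cols = len(x[0]) - Q + 1
    V = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            b = sum(w[p][q] * x[p + i][q + j]
                    for p in range(P) for q in range(Q)) - theta
            V[i][j] = g(b)
    return V

relu = lambda b: max(0.0, b)
x = [[1.0] * 10 for _ in range(10)]        # uniform illustrative input
w = [[1.0 / 9] * 3 for _ in range(3)]      # 3x3 averaging kernel
V = convolve(x, w, 0.0, relu)
print(len(V), len(V[0]))  # 8 8
```

Padding the input with extra rows and columns of zeros before calling `convolve`, as described above, keeps the feature map the same size as the image.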
Convolution layers are trained with backpropagation. Consider the simplest case, Equation (7.39). As usual, we use the chain rule to evaluate the gradients:

    ∂V_{ij}/∂w_{mn} = ∑_{rs} (∂V_{ij}/∂b_{rs}) (∂b_{rs}/∂w_{mn}) .   (7.42)

The derivatives of the local fields are evaluated by applying rule (5.27) to Equation (7.39):

    ∂b_{rs}/∂w_{mn} = ∑_{pq} δ_{mp} δ_{nq} x_{p+r−1, q+s−1} = x_{m+r−1, n+s−1} .   (7.43)
In this way one can train several stacked convolution layers too. It is important to
keep track of the summation boundaries. To that end it helps to pad out the image
and the convolution layers, so that the upper bounds remain the same in different
layers.
Details aside, the fundamental principle of feature maps is that the map is applied
in the same form to different parts of the image (translational invariance). In this
way the learning of parameters is shared between pixels, each weight in a given
feature map is trained on different parts of the image. This effectively increases the
training set for the feature map and combats overfitting.
Figure 7.17: Examples of digits from the MNIST data set of handwritten digits [57]. The images were produced using MATLAB. But note that by default MATLAB displays the digits white on a black background. Copyright for the data set: Y. LeCun and C. Cortes.
outputs into a single number. Max-pooling units, for example, summarise the
outputs of nearby feature maps (in a 2×2 square for instance) by taking the maximum
over the feature-map outputs. Alternatively, one may compute the root-mean square of the map values (L²-pooling). There are no weights or thresholds associated with the pooling layers; they compute the output from the inputs using a pre-defined prescription. Other ways of pooling are discussed in Ref. [4].
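Max-pooling can be sketched in a few lines (here for non-overlapping 2 × 2 blocks; the 4 × 4 input is an illustrative feature map):

```python
def max_pool(V, size=2):
    """Max-pooling: each output is the maximum over a size-x-size block of V.

    Blocks do not overlap, so the feature map shrinks by the factor `size`
    in each direction. No weights or thresholds are involved."""
    rows, cols = len(V) // size, len(V[0]) // size
    return [[max(V[size * i + a][size * j + b]
                 for a in range(size) for b in range(size))
             for j in range(cols)]
            for i in range(rows)]

V = [[1, 2, 0, 1],
     [3, 4, 1, 0],
     [0, 1, 5, 6],
     [1, 0, 7, 8]]
print(max_pool(V))  # [[4, 1], [1, 8]]
```

Replacing `max` by a root-mean-square over the block would give the L²-pooling variant mentioned above.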
Usually several feature maps are connected to the input. Pooling is performed
separately on each of them. The network layout looks like the one shown schemat-
ically in Figure 7.16. In this Figure, the pooling layers feed into a number of fully
connected hidden layers that connect to the output neurons. There are as many
output neurons as there are classes to be recognised. This layout is qualitatively
similar to the layout used by Krizhesvky et al. [55] in the ImageNet challenge (see
Section 7.5 below).
Algorithm 5 network layout and training options: no hidden layers, softmax output
layer with 10 units. Here net is the network object containing the training data set,
the network layout, and the training options.
layers = [imageInputLayer([28 28 1])
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
options = trainingOptions('sgdm',...
    'MiniBatchSize', 8192,...
    'ValidationData', {xValid, tValid},...
    'ValidationFrequency', 30,...
    'MaxEpochs', 200,...
    'Plots', 'training-progress',...
    'L2Regularization', 0,...
    'Momentum', 0.9,...
    'ValidationPatience', 5,...
    'Shuffle', 'every-epoch',...
    'InitialLearnRate', 0.001);
net = trainNetwork(xTrain, tTrain, layers, options);
with 10 000 digits. The latter is used for cross-validation and early stopping. The test
data is used for measuring the classification error after training. For this purpose
one should use a data set that was not involved in the training.
The goal of this Section is to show how the principles described in Chapters 6 and 7 allow one to learn the MNIST data with low classification error, as outlined in Ref. [5].
You can follow the steps described below with your own computer program, using
MATLAB 2017b which is available at StuDAT. But if you prefer you can also use other
software packages such as Keras [58], an interface for TensorFlow [56], Theano [59],
or PyTorch [60]. The networks described below use ReLU units (Section 7.2.2) in the
hidden layers and a softmax output layer (Section 7.2.3) with ten output units Oi
and energy function (7.26), so that output Oi is the probability that the pattern fed
to the network falls into category i .
mentioned above, early stopping caused the training of the larger network to abort
after 135 epochs, this corresponds to 824 iterations.
Now let us add more hidden layers. Experimenting shows that it is better to use a
slightly higher learning rate, η = 0.01. For two hidden layers we obtain classification
accuracies that are only slightly higher, 97.3%. Adding a third hidden layer does not
help much either. Try adding even more neurons, and/or more layers. You will see
that it is difficult to increase the classification accuracy further. Adding more fully
connected hidden layers does not necessarily improve the classification accuracy,
even if you train for more epochs. One possible reason is that the network overfits
the data (Section 6.3.2). This problem becomes more acute as you add more hidden
neurons. The tendency of the network to overfit is reduced by regularisation (Section
7.2.5). For the network with one hidden layer with 100 ReLU units, L 2 -regularisation
improves the classification accuracy to almost 98%. Here L2Regularization was
set to 0.03 and the learning rate to η = 0.03.
Figure 7.19: Some hand-written digits from the MNIST test set, misclassified by a
convolutional net that achieved an overall classification accuracy of 98%. Correct clas-
sification (top right), misclassification (bottom right). Data from Oleksandr Balabanov.
Figure 7.20: Examples of digits drawn on an Ipad. Data from Oleksandr Balabanov.
Figure 7.21: Same digits as in Figure 7.20, but preprocessed like the MNIST digits. Data
from Oleksandr Balabanov.
Figure 7.22: Same digits as in Figure 7.21. The difference is that the thickness of the
stroke was normalised (see text). Data from Oleksandr Balabanov.
that specifies the resulting line thickness. In Figure 7.22 it was taken to be T = 10, close to the line thickness of the MNIST digits; we measured the latter to be T ≈ 9.7 using the method described in Ref. [64].
If we run a MNIST-trained convolutional net (Algorithm 8) on a data set of 60
digits with normalised line thickness, it fails on only two digits. This corresponds to
a classification accuracy of roughly 97%, not so bad – but not as good as the best
results in Section 7.4.2. Note that we can only make a rough comparison. In order to obtain a better estimate of the classification accuracy we need to test many more than 60 digits. A question is of course whether there are perhaps other differences between our own hand-written digits and those in the MNIST data. It would also be of interest to try digits that were drawn using Paint or a similar program. How do MNIST-trained convolutional nets perform on computer-drawn digits?
At any rate, the results of this Section show that the way the input data are pro-
cessed can make a big difference. This raises a point of fundamental importance.
We have seen that convolutional nets can be trained to represent a distribution of
input patterns with very high accuracy. But if you test the network on a data set that
has a slightly different distribution, perhaps because it was preprocessed differently,
the network may not work as well.
DEEP LEARNING FOR OBJECT RECOGNITION 137
Figure 7.23: Object recognition using a deep convolutional network. Shown is a frame
from a movie recorded on a telephone. The network was trained on the Pascal VOC
data set [65] using YOLO [66]. Details on how to obtain the weights and how to install
the software are given on the YOLO website.
Figure 7.24: Smallest classification error for the ImageNet challenge [67]. The data up
to 2014 comes from Ref. [67]. The data for 2015 comes from Ref. [68], for 2016 from
Ref. [69], and for 2017 from Ref. [70]. From 2012 onwards the smallest error was achieved
by convolutional neural networks (CNN). After Goodfellow et al. [4].
set ran image classification challenges from 2005 to 2012. A more recent challenge
is the ImageNet large-scale visual recognition challenge (ILSVRC) [67], a competition
for image classification and object recognition using the ImageNet database [52].
The challenge is based on a subset of ImageNet. The training set contains more than 10⁶ images manually classified into one of 1000 classes. There are approximately 1000 images for each class. The validation set contains 50 000 images.
The ILSVRC challenge consists of several tasks. One task is image classification,
to list the object classes found in the image. A common measure for accuracy is the
so-called top-5 error for this classification task. The algorithm lists the five object
classes it identified with highest probabilities. The result is considered correct if
the annotated class is among these five. The error equals the fraction of incorrectly
classified images. Why does one not simply judge whether the most probable class
is the correct one? The reason is that the images in the ImageNet database are
annotated by a single-class identifier. Often this is not unique. The image in Figure
7.17, for example, shows not only a car but also trees, yet the image is annotated
with the class label car. This is ambiguous. The ambiguity is significantly smaller
if one considers the top five classes the algorithm gives, and checks whether the
annotated class is among them.
The tasks in the ILSVRC challenge are significantly more difficult than the digit
recognition described in Section 7.4, and also more difficult than the VOC challenges.
One reason is that the ImageNet classes are organised into a deep hierarchy of
subclasses. This results in highly specific sub classes that can be very difficult to
distinguish. The algorithm must be very sensitive to small differences between similar sub classes; one says that the classification problem has small inter-class variability [71]. Different images in the same sub class, on the other hand, may look quite different. The algorithm should nevertheless recognise them as similar, belonging to the same class; such a problem has large intra-class variability [71].
Since 2012, algorithms based on deep convolutional networks won the ILSVRC
challenge. Figure 7.24 shows that the error has significantly decreased until 2017, the
last year of the challenge in the form described above. We saw in previous Sections
that deep networks are difficult to train. So how can these algorithms work so well?
It is generally argued that the recent success of deep convolutional networks is
mainly due to three factors.
First, there are now much larger and better annotated training sets available. ImageNet is an example. Excellent training data is now recognised as one of the most important factors, and companies developing software for self-driving cars and accident-avoidance systems know that good training sets are difficult to achieve: to obtain reliable training data one must manually collect and annotate the data (Figure 7.25). This is costly, but at the same time it is important to have as large data sets as possible, to reduce
RESIDUAL NETWORKS 139
Figure 7.25: Reproduced from xkcd.com/1897 under the creative commons attribution-
noncommercial 2.5 license.
overfitting. In addition one must aim for a large variability in the collected data.
Second, the hardware is much better today. Deep networks are nowadays im-
plemented on single or multiple GPUs. There are also dedicated chips, such as the
tensor processing unit [72].
Third, improved regularisation techniques (Section 7.2.5) and weight sharing in
convolution layers help to fight overfitting, and ReLU units (Section 7.2.2) render
the networks less susceptible to the vanishing-gradient problem (Section 7.2.1).
The winning algorithm for 2012 was based on a network with five convolution
layers and three fully connected layers, using drop out, ReLU units, and data-set
augmentation [55]. The algorithm was implemented on GPU processors. The 2013
ILSVRC challenge was also won by a convolutional network [73], with 22 layers.
Nevertheless, the network has substantially fewer free parameters (weights and thresholds) than the 2012 network: 4 × 10⁶ instead of 60 × 10⁶. In 2015, the winning
algorithm [68] had 152 layers. One significant new element in the layout were
connections that skip layers (residual networks, Section 7.6). The 2016 [74] and 2017
[70] winning algorithms used ensembles of convolutional networks.
Figure 7.26: Schematic illustration of a network with skipping connections.
ordered in the same way as the subscripts: the right index refers to the layer on the left. Neuron j in layer ℓ computes

    V_j^(ℓ) = g( ∑_k w_jk^(ℓ,ℓ−1) V_k^(ℓ−1) + ∑_k w_jk^(ℓ,ℓ−2) V_k^(ℓ−2) − θ_j^(ℓ) ) ,

with weights w^(ℓ,ℓ−1) from the adjacent layer and w^(ℓ,ℓ−2) from the connections that skip a layer.
The weights of connections that skip layers are trained in the usual fashion, by
stochastic gradient descent. To illustrate the structure of the resulting formulae
consider a ‘network’ with just one neuron per layer (Figure 7.27). To begin with we
calculate the increments of the weights w^(ℓ,ℓ−1). To update w^(L,L−1) we require

    ∂V^(L)/∂w^(L,L−1) = g′(b^(L)) V^(L−1) .   (7.45)

This gives

    δw^(L,L−1) = η (t − V^(L)) g′(b^(L)) V^(L−1) = η δ^(L) V^(L−1) ,   (7.46)

as in Algorithm 2. The factor (t − V^(L)) comes from the outer derivative of the energy function (6.4). The outputs are O = V^(L). As in Algorithm 2, I have omitted the sum over µ (stochastic gradient descent, page 86).
Also the update for w^(L−1,L−2) is the same as in Algorithm 2:

    δw^(L−1,L−2) = η δ^(L−1) V^(L−2)   with   δ^(L−1) = δ^(L) w^(L,L−1) g′(b^(L−1)) .   (7.47)
But the update for w^(L−2,L−3) is different because the short cuts come into play. The extra connection from layer L − 2 to L gives rise to an extra term:

    ∂V^(L)/∂w^(L−2,L−3) = [ (∂V^(L)/∂V^(L−1)) (∂V^(L−1)/∂V^(L−2)) + ∂V^(L)/∂V^(L−2) ] ∂V^(L−2)/∂w^(L−2,L−3) .   (7.48)

Evaluating the partial derivatives we find

    ∂V^(L)/∂w^(L−2,L−3) = [ g′(b^(L)) w^(L,L−1) g′(b^(L−1)) w^(L−1,L−2) + g′(b^(L)) w^(L,L−2) ] g′(b^(L−2)) V^(L−3) .

This implies

    δ^(L−2) = δ^(L−1) w^(L−1,L−2) g′(b^(L−2)) + δ^(L) w^(L,L−2) g′(b^(L−2)) .   (7.49)

In general, the error-backpropagation rule reads

    δ^(ℓ−1) = δ^(ℓ) w^(ℓ,ℓ−1) g′(b^(ℓ−1)) + δ^(ℓ+1) w^(ℓ+1,ℓ−1) g′(b^(ℓ−1))   (7.50)

for ℓ = L − 1, L − 2, . . .. The first term is the same as in step 9 of Algorithm 2. The second term is due to the skipping connections.
The update formula for w^(ℓ,ℓ−1) is

    δw^(ℓ,ℓ−1) = η δ^(ℓ) V^(ℓ−1) .   (7.51)

The updates of the weights w^(ℓ+1,ℓ−1) are given by

    δw^(ℓ+1,ℓ−1) = η δ^(ℓ+1) V^(ℓ−1) ,   (7.52)

with the same errors as in Equation (7.51).
Skipping connections reduce the vanishing-gradient problem. To see this, note that we can write the error δ^(ℓ) as

    δ^(ℓ) = ∑_{ℓ1, ℓ2, ..., ℓn} δ^(L) w^(L,ℓn) g′(b^(ℓn)) ··· w^(ℓ2,ℓ1) g′(b^(ℓ1)) w^(ℓ1,ℓ) g′(b^(ℓ)) ,   (7.53)

where the sum is over all paths L > ℓn > ℓn−1 > ··· > ℓ1 > ℓ back through the network. The smallest gradients are dominated by the product corresponding to the path with the smallest number of steps (factors), resulting in a smaller probability to get small gradients. Introducing connections that skip more than one layer tends to increase the small gradients, as Equation (7.53) shows. Recently it has been suggested to randomise the layout by randomly short-circuiting the network. Equation (7.53) remains valid for this case too.
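The backpropagation rule (7.50) can be checked numerically for a one-neuron-per-layer chain as in Figure 7.27. The sketch below is illustrative Python, assuming g = tanh and arbitrary weight values; `w10` etc. denote the adjacent weights w^(1,0), w^(2,1), w^(3,2), and `s20`, `s31` the skipping weights w^(2,0), w^(3,1):

```python
import math

def forward(x, w10, w21, w32, s20, s31):
    """Chain of single neurons where layer l receives input from
    layers l-1 and l-2 (skipping connections); activation g = tanh."""
    b1 = w10 * x
    V1 = math.tanh(b1)
    b2 = w21 * V1 + s20 * x         # s20 skips layer 1
    V2 = math.tanh(b2)
    b3 = w32 * V2 + s31 * V1        # s31 skips layer 2
    V3 = math.tanh(b3)
    return b1, V1, b2, V2, b3, V3

x, t = 0.5, 0.8                     # input and target (illustrative)
w10, w21, w32, s20, s31 = 0.3, -0.4, 0.7, 0.2, -0.6
b1, V1, b2, V2, b3, V3 = forward(x, w10, w21, w32, s20, s31)
gp = lambda b: 1.0 - math.tanh(b) ** 2   # g'(b) for g = tanh

# Errors: delta3 from the output, then the recursion (7.50),
# whose second term carries the contribution of the skipping weight s31.
delta3 = (t - V3) * gp(b3)
delta2 = delta3 * w32 * gp(b2)
delta1 = delta2 * w21 * gp(b1) + delta3 * s31 * gp(b1)

# Check against a numerical derivative of H = (t - V3)^2 / 2 w.r.t. w10
eps = 1e-6
Vp = forward(x, w10 + eps, w21, w32, s20, s31)[-1]
Vm = forward(x, w10 - eps, w21, w32, s20, s31)[-1]
numerical = ((t - Vp) ** 2 - (t - Vm) ** 2) / (2 * 2 * eps)
print(abs(-delta1 * x - numerical) < 1e-9)  # True
```

The second term in the line computing `delta1` is exactly the extra path through the skipping connection; setting `s31 = 0` recovers the ordinary backpropagation recursion.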
The network described in Ref. [68] used unit weights for the skipping connections.
7.7 Summary
Networks with many hidden layers are called deep networks. It has recently been
shown that such networks can be trained to recognise objects in images with high
accuracy. It is sometimes stated that convolutional networks are now better than
Humans, in that they recognise objects with lower classification errors than Humans
[75]. This statement is problematic for several reasons. To start with, the article
refers to the 2015 ILSVRC competition, and the company mentioned in the Guardian
article was later caught out cheating. At any rate, this and similar statements refer to
an experiment showing that the Human classification error in recognising objects in
the ImageNet database is about 5.1% [76], worse than the most recent convolutional
neural-network algorithms (Figure 7.24).
Yet it is clear that these algorithms learn in quite a different way from Humans.
They can detect local features, but since these convolutional networks rely on trans-
lational invariance, they do not easily understand global features, and can mistake
a leopard-patterned sofa for a leopard [77]. It may help to include more sofas in
the training data set, but the essential difficulty remains: translational invariance
imposes constraints on what convolutional networks can learn [77].
More fundamentally one may argue that Humans learn differently, by abstraction
instead of going through vast training sets. Just try it out for yourself: this website
[78] allows you to learn like a convolutional network. Nevertheless, the examples
described in this Chapter illustrate the tremendous success of deep convolutional
networks.
We have also seen that training deep networks suffers from a number of fun-
damental problems. First, networks with many hidden neurons have many free
parameters (their weights and thresholds). This increases the risk of overfitting.
Overfitting reduces the power of the network to generalise. The tendency of deep
networks to overfit can be reduced by cross-validation (Section 6.3.2) and by regularisation (weight decay, dropout, pruning, and data-set augmentation, Section
7.2.5). In this regard convolutional nets have an advantage because they have fewer
weights, and the weights of a given feature map are trained on different parts of the
input images, effectively increasing the training set.
Second, the examples described in Section 7.4 show that convolutional nets are
sensitive to differences in how the input data are preprocessed. You may run into
problems if you train a network on given training and validation sets, but apply it to
a test set that was preprocessed in a different way – so that the test set corresponds
to a different input distribution. Convolutional nets excel at learning the properties
of a given input distribution, but they may have difficulties in recognising patterns
sampled from a slightly different distribution, even if the two distributions appear
very similar to the Human eye. Note also that this problem cannot be solved by
cross-validation, because training and validation sets are drawn from the same
input distribution, but here we are concerned with what happens when the network
is applied to an input distribution different from the one it was trained on. Here is
another example illustrating this point: the authors of Ref. [79] trained a convolutional network on perturbed grayscale images from the ImageNet database, adding
a little bit of noise independently to each pixel (white noise) before training. This
network failed to recognise images that were weakly perturbed in a different way, by
setting a small number of pixels to white or black. When we look at the images we
have no difficulties seeing through the noise.
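The two perturbations are easy to state in code. This is a hypothetical sketch (array sizes and noise amplitudes are my own choices, not those used in Ref. [79]):

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(32, 32)).astype(float)   # stand-in grayscale image

# white noise: a little Gaussian noise added independently to each pixel
white = np.clip(img + rng.normal(0.0, 5.0, size=img.shape), 0, 255)

# a different weak perturbation: a small number of pixels set to black or white
salt_pepper = img.copy()
idx = rng.choice(img.size, size=20, replace=False)
salt_pepper.flat[idx] = rng.choice([0.0, 255.0], size=20)
```

A network trained only on the first kind of perturbation may fail on the second, even though both look like minor noise to us.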
Third, error backpropagation in deep networks suffers from the vanishing-gradient
problem. This is more difficult to combat. It can be reduced by using ReLU units,
by initialising the weights in certain ways, and by networks with connections that
skip layers. Yet vanishing or exploding gradients remain a fundamental difficulty,
slowing learning down in the initial phase of training. Brute force (computer power)
helps to alleviate the problem. As a consequence, convolutional neural networks
have become immensely successful in object recognition, outperforming other
algorithms significantly.
Fourth, Refs. [80, 81] illustrate intriguing failures of convolutional networks. Sze-
gedy et al. [80] show that the way convolutional nets partition input space can lead
to surprising results. The authors took an image that the network classifies correctly
with high confidence, and perturbed it slightly. The difference between the original
and perturbed images (adversarial images) is undetectable to the Human eye, yet
the network misclassifies the perturbed image with high confidence [80]. This
indicates that decision boundaries are always close in input space; this is not intuitive, but it is possible in high dimensions. Figure 1 in Ref. [81] shows images that are completely
unrecognisable to the Human eye. Yet a convolutional network classifies these
images with high confidence. This illustrates that there is no telling what a network
may do if the input is far away from the training distribution. Unfortunately the
network can sometimes be highly confident yet wrong.
To conclude, convolutional networks are very good at recognising objects in
images. But we should not imagine that they understand what they see in the same
way as Humans. The theory of deep learning has somewhat lagged behind the
performance in practice. But some progress has been made in recent years, and
there are many interesting open questions.
7.8 Further reading
It is unclear precisely how hidden convolutional layers help the network to learn.
Therefore it is interesting to visualise the activity of deep layers by asking: which
input patterns maximise the outputs of the neurons in a certain layer [82]?
Another question concerns the structure of the energy landscape. It seems that
local minima are perhaps less important for deep networks, because their energy
functions tend to have more saddle points than minima [83].
Deep networks suffer from catastrophic forgetting: when you train a network on a
new input distribution that is quite different from the one the network was originally
trained on, then the network tends to forget what it learned initially. Recently there
has been much interest in this question. A good starting point is Ref. [84].
The stochastic-gradient descent algorithm (with or without minibatches) samples
the input-data distribution uniformly randomly. As mentioned in Section 6.3.1, it
may be advantageous to sample those inputs more frequently that initially cause
larger output errors. More generally, the algorithm may use other criteria to choose
certain input data more often, with the goal to speed up learning. It may even
suggest how to augment a given training set most efficiently, by asking to specifically
label certain types of input data (active learning) [85].
For connections to Mathematical Statistics (multinomial and binary logistic
regression), start with Ref. [4].
7.9 Exercises
Decision boundaries for XOR problem. Figure 7.5 shows the layout of a network
that solves the Boolean XOR problem. Draw the decision boundaries for the four
hidden neurons in the input plane, and label the boundaries and the regions as in
Figure 5.12.
Vanishing-gradient problem. Train the network shown in Figure 7.7 on the iris data
set, available from the Machine learning repository of the University of California
Irvine. Measure the effect of the vanishing-gradient problem upon the neurons in the different layers, by calculating
the derivative of the energy function H w.r.t. the thresholds of the neurons in
question.
where b_m^(L,µ) denotes the local field of the m-th output neuron:
b_m^(L,µ) = −θ_m^(L) + Σ_k w_mk^(L) V_k^(L−1,µ) .   (7.57)
Here θ_m^(L) and w_mk^(L) are thresholds and weights, and V_k^(L−1,µ) is the state of the k-th neuron in layer L − 1, evaluated for pattern µ.
(a) Compute the derivative of output O_i^(µ) with respect to the local field b_j^(L,µ) of the j-th output neuron. (1p).
(b) The network is trained by gradient descent on the negative log-likelihood function,
H = − Σ_{iµ} t_i^(µ) log O_i^(µ) .   (7.58)
The summation is over all patterns in the training set and over all output neurons, the logarithm is the natural logarithm, and the t_i^(µ) denote targets. The targets satisfy the constraint
Σ_i t_i^(µ) = 1   (7.59)
for all patterns µ. When updating, the increment of a weight w_nq^(ℓ) in layer ℓ is given by
δw_nq^(ℓ) = −η ∂H/∂w_nq^(ℓ) ,   (7.60)
where η denotes the learning rate. Derive the increment for weight w_nq^(L) in layer L. (1p).
Figure 8.1: Network with a feedback connection. Neurons 1 and 2 are hidden neurons. The weights from the input x_k to the neurons V_i are denoted by w_ik^(vx), the weight from neuron V_j to neuron V_i is w_ij^(vv). Neurons 3 and 4 are output neurons, with prescribed target values y_i. To avoid confusion with the iteration index t, the targets are denoted by y in this Chapter.
8 Recurrent networks
The layout of the perceptrons analysed in the previous Chapters is special. All
connections are one way, and only to the layer immediately to the right, so that the
update rule for the i-th neuron in layer ℓ becomes
or continuous in time,
τ dV_i/dt = −V_i + g( Σ_j w_ij^(vv) V_j(t) + Σ_k w_ik^(vx) x_k − θ_i^(v) ) ,   (8.3)
with time constant τ. The parameters θ_i^(v) are thresholds. We shall see in a moment
why it can be advantageous to use a dynamical network.
Recurrent networks can learn in different ways. One possibility is to use a training
set of pairs (x (µ) , y (µ) ) with µ = 1, . . . , p . To avoid confusion with the iteration index
t , the targets are denoted by y in this Chapter. One feeds a pattern from this set
and runs the dynamics (8.2) or (8.3) for the given x (µ) until the dynamics reaches a
steady state V ∗ (if this does not happen the training fails). Then one updates the
weights by gradient descent using the energy function
H = (1/2) Σ_k E_k²   where   E_k = y_k − V_k if V_k is an output unit, and E_k = 0 otherwise.   (8.4)
Instead of defining the energy function in terms of the mean-squared output errors, one could also use
the negative log-likelihood function (7.32). These steps are repeated until the steady-
state outputs yield the correct targets for all input patterns. This is reminiscent of the
algorithms discussed in Chapters 5 to 7, and we shall see that the backpropagation
algorithm can be modified (recurrent backpropagation) to make the networks learn
as described earlier.
Another possibility is that inputs and targets change as functions of time t while
the network dynamics runs. In this way the network can solve temporal association
tasks, where it learns to output the target sequence y(t) in response to the input sequence x(t). In this way recurrent networks can translate
written text or recognise speech. Such networks can be trained by unfolding their
dynamics in time as explained in Section 8.2 (backpropagation in time), although
this algorithm suffers from the vanishing-gradient problem discussed in Chapter 7.
may be connected to inputs x_k^(µ). Other units are output units with associated target values y_i^(µ). We take the dynamics to be continuous in time, Equation (8.3), and
assume that the dynamics runs into a steady state
V(t) → V*   so that   dV_i/dt = 0 .   (8.5)
From Equation (8.3) we deduce
V_i* = g( Σ_j w_ij^(vv) V_j* + Σ_k w_ik^(vx) x_k − θ_i^(v) ) .   (8.6)
In other words we assume that the dynamics (8.3) has a stable steady state, so that
small perturbations δVi away from Vi ∗ decay with time. Equation (8.6) is a nonlinear
self-consistent Equation for Vi ∗ , in general difficult to solve. However, if the fixed
points Vi ∗ are stable then we can use the dynamics (8.3) to automatically pick out the
steady-state solution V ∗ . This solution depends on the pattern x (µ) , but in Equations
(8.5) and (8.6) and also in the following I have left out the superscript (µ).
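A minimal sketch of this procedure (my own illustration, with g = tanh and small random weights so that the steady state is stable) integrates Equation (8.3) with the Euler method and checks the self-consistency condition (8.6):

```python
import numpy as np

def relax(w_vv, w_vx, theta, x, tau=1.0, dt=0.01, steps=20000):
    # Euler integration of Eq. (8.3) until the network reaches a steady state V*
    V = np.zeros(len(theta))
    for _ in range(steps):
        b = w_vv @ V + w_vx @ x - theta
        V += (dt / tau) * (-V + np.tanh(b))
    return V
```

If the dynamics does not reach a steady state, the procedure (and the training based on it) fails, as noted above.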
The goal is to find weights so that the outputs give the correct target values in
the steady state, those associated with x (µ) . To this end we use stochastic gradient
descent on the energy function (8.4). Consider first how to update the weights w_ij^(vv). We must evaluate
δw_mn^(vv) = −η ∂H/∂w_mn^(vv) = η Σ_k E_k* ∂V_k*/∂w_mn^(vv) .   (8.7)
To calculate the gradients of V* we use Equation (8.6):
∂V_i*/∂w_mn^(vv) = g′(b_i*) ∂b_i*/∂w_mn^(vv) = g′(b_i*) [ δ_im V_n* + Σ_j w_ij^(vv) ∂V_j*/∂w_mn^(vv) ] ,   (8.8)
where b_i* = Σ_j w_ij^(vv) V_j* + Σ_k w_ik^(vx) x_k − θ_i^(v). Equation (8.8) is a self-consistent equation for the gradient, as opposed to the explicit equations we found in Chapters 5 to 7. The reason for the difference is that the recurrent network has feedbacks. Since Equation (8.8) is linear in the gradients, we can solve it by matrix inversion. To this end, define the matrix L with elements L_ij = δ_ij − g′(b_i*) w_ij^(vv). Equation (8.8) can be written as
Σ_j L_ij ∂V_j*/∂w_mn^(vv) = δ_im g′(b_i*) V_n* .   (8.9)
Multiplying both sides by [L⁻¹]_ki and summing over i we find
∂V_k*/∂w_mn^(vv) = [L⁻¹]_km g′(b_m*) V_n* .   (8.10)
Inserting this result into (8.7) we finally obtain for the weight increments:
δw_mn^(vv) = η Σ_k E_k* [L⁻¹]_km g′(b_m*) V_n* .   (8.11)
This learning rule can be written in the form of the backpropagation rule by introducing the error
∆_m* = g′(b_m*) Σ_k E_k* [L⁻¹]_km .   (8.12)
The errors can also be computed from a relation, Equation (8.14), that does not involve the inverse of L. I used the index i there because it makes the
following calculation a bit easier to follow. The first step is to multiply both sides of Equation (8.14) with L_ij /g′(b_i*) and to sum over i. This gives
Σ_i ∆_i* L_ij /g′(b_i*) = Σ_{k,i} E_k* [L⁻¹]_ki L_ij = E_j* .   (8.15)
Using L_ij /g′(b_i*) = δ_ij /g′(b_j*) − w_ij^(vv) we find
Σ_i ∆_i* [ δ_ij − w_ij^(vv) g′(b_j*) ] = g′(b_j*) E_j* .   (8.16)
The trick is now to write down a dynamical equation for ∆_j that has a steady state at the solution of Equation (8.16):
τ d∆_j/dt = −∆_j + Σ_i ∆_i w_ij^(vv) g′(b_j*) + g′(b_j*) E_j* .   (8.17)
Compare this with the dynamical rule (8.3). Equations (8.3) and (8.17) exhibit the
same duality as Algorithm 2, between forward propagation of states of neurons
(step 5) and backpropagation of errors (step 9). The sum in Equation (8.17) has the
same form as the recursion for the errors in Algorithm 2 (step 9), except that there
are no layer indices ` here.
The solution of (8.16) is a fixed point of this Equation. But is it stable? To decide this we linearise the dynamical equations (8.3) and (8.17). To this end we write V_i(t) = V_i* + δV_i(t) and ∆_j(t) = ∆_j* + δ∆_j(t), and insert this ansatz into (8.3) and (8.17). To leading order we find:
τ d(δV_i)/dt = −δV_i + g′(b_i*) Σ_j w_ij^(vv) δV_j = −Σ_j L_ij δV_j ,   (8.19)
τ d(δ∆_j)/dt = −δ∆_j + Σ_i δ∆_i w_ij^(vv) g′(b_j*) = −Σ_i δ∆_i g′(b_j*) L_ij /g′(b_i*) .   (8.20)
Since the matrices with elements L_ij and g′(b_j*) L_ij /g′(b_i*) have the same eigenvalues, ∆_i* is a stable fixed point of (8.17) if V_i* is a stable fixed point of (8.3). This was assumed in the beginning, Equation (8.5). If this assumption does not hold, the algorithm does not converge.
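The duality can be checked numerically. In this sketch (my own illustration; the local fields b* and errors E* are arbitrary numbers, and the weights are small enough that the fixed point is stable) the errors ∆* are computed once by matrix inversion, Equation (8.12), and once by integrating the dynamics (8.17):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
w = 0.2 * rng.normal(size=(n, n))          # recurrent weights w^(vv)
b = rng.normal(size=n)                     # steady-state local fields b*
E = rng.normal(size=n)                     # steady-state output errors E*
gp = 1.0 - np.tanh(b) ** 2                 # g'(b*)

L = np.eye(n) - gp[:, None] * w            # L_ij = delta_ij - g'(b_i*) w_ij
Delta_inv = gp * np.linalg.solve(L.T, E)   # Eq. (8.12), via L^(-T) E

Delta = np.zeros(n)                        # Euler integration of Eq. (8.17), tau = 1
dt = 0.01
for _ in range(20000):
    Delta += dt * (-Delta + (Delta @ w) * gp + gp * E)
```

Both computations give the same ∆*, which is what makes the dynamical rule (8.17) a practical replacement for inverting L.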
Now consider the update formula for the weights w_mn^(vx) from the inputs:
δw_mn^(vx) = −η ∂H/∂w_mn^(vx) = η Σ_k E_k* ∂V_k*/∂w_mn^(vx) ,   (8.21a)
∂V_i*/∂w_mn^(vx) = g′(b_i*) [ δ_im x_n + Σ_j w_ij^(vv) ∂V_j*/∂w_mn^(vx) ] .   (8.21b)
Figure 8.2: Left: recurrent network with one hidden neuron V(t) (green) and one output neuron O(t) (blue). The input terminal x(t) is drawn red. Right: the same network unfolded in time; the time arguments are written as subscripts. The weights w^(vv) remain unchanged as drawn; the weights w^(vx) and w^(ov) also remain unchanged (not drawn).
The activation function of the outputs Oi can be different from that of the hidden
neurons V j . Often the softmax function is used for the outputs [86, 87].
To train recurrent networks with time-dependent inputs and targets and with
the dynamics (8.23) one uses backpropagation through time. The idea is to unfold
the network in time to get rid of the feedbacks, at the expense of as many copies
of the original neurons as there are time steps. This is illustrated in Figure 8.2 for
a recurrent network with one hidden neuron, one input, and one output. The
unfolded network has T inputs and outputs. It can be trained in the usual way with
stochastic gradient descent. The errors are calculated using backpropagation as in
Algorithm 2, but here the error is propagated back in time, not from layer to layer.
The energy function is the squared error summed over all time steps,
H = (1/2) Σ_{t=1}^T E_t²   with   E_t = y_t − O_t .   (8.24)
δw^(vv) = η Σ_{t=1}^T E_t ∂O_t/∂w^(vv) = η Σ_{t=1}^T ∆_t w^(ov) ∂V_t/∂w^(vv) .   (8.25)
∂V_t/∂w^(vv) = g′(b_t) [ V_{t−1} + w^(vv) ∂V_{t−1}/∂w^(vv) ] .   (8.26)
∂V_1/∂w^(vv) = g′(b_1) V_0 ,
∂V_2/∂w^(vv) = g′(b_2) V_1 + g′(b_2) w^(vv) g′(b_1) V_0 ,
∂V_3/∂w^(vv) = g′(b_3) V_2 + g′(b_3) w^(vv) g′(b_2) V_1 + g′(b_3) w^(vv) g′(b_2) w^(vv) g′(b_1) V_0 ,
...
∂V_{T−1}/∂w^(vv) = g′(b_{T−1}) V_{T−2} + g′(b_{T−1}) w^(vv) g′(b_{T−2}) V_{T−3} + ... ,
∂V_T/∂w^(vv) = g′(b_T) V_{T−1} + g′(b_T) w^(vv) g′(b_{T−1}) V_{T−2} + ... .
Equation (8.25) says that we must sum over t . Regrouping the terms in this sum
yields:
∆_1 ∂V_1/∂w^(vv) + ∆_2 ∂V_2/∂w^(vv) + ∆_3 ∂V_3/∂w^(vv) + ...
= [ ∆_1 g′(b_1) + ∆_2 g′(b_2) w^(vv) g′(b_1) + ∆_3 g′(b_3) w^(vv) g′(b_2) w^(vv) g′(b_1) + ... ] V_0
+ [ ∆_2 g′(b_2) + ∆_3 g′(b_3) w^(vv) g′(b_2) + ∆_4 g′(b_4) w^(vv) g′(b_3) w^(vv) g′(b_2) + ... ] V_1
+ [ ∆_3 g′(b_3) + ∆_4 g′(b_4) w^(vv) g′(b_3) + ∆_5 g′(b_5) w^(vv) g′(b_4) w^(vv) g′(b_3) + ... ] V_2
...
+ [ ∆_{T−1} g′(b_{T−1}) + ∆_T g′(b_T) w^(vv) g′(b_{T−1}) ] V_{T−2}
+ [ ∆_T g′(b_T) ] V_{T−1} .
To write the learning rule in the usual form, we define errors δ_t recursively:
δ_t = ∆_T w^(ov) g′(b_T)   for t = T ,
δ_t = ∆_t w^(ov) g′(b_t) + δ_{t+1} w^(vv) g′(b_t)   for 0 < t < T .   (8.27)
Finally, the update formulae for the weights w^(vx) are obtained in a similar fashion. Equation (8.23a) yields the recursion
∂V_t/∂w^(vx) = g′(b_t) [ x_t + w^(vv) ∂V_{t−1}/∂w^(vx) ] .   (8.29)
This looks just like Equation (8.26), except that Vt −1 is replaced by x t . As a conse-
quence we have
δw^(vx) = η Σ_{t=1}^T δ_t x_t .   (8.30)
The update formula for w^(ov) is simpler to derive. From Equation (8.23b) we find by differentiation w.r.t. w^(ov):
δw^(ov) = η Σ_{t=1}^T E_t g′(B_t) V_t .   (8.31)
How are the thresholds updated? Going through the above derivation we see that
we must replace Vt −1 and x t in Equations (8.28) and (8.30) by −1. It works in the
same way for the output threshold.
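For this scalar network, backpropagation through time fits in a few lines. The sketch below is my own illustration (g = tanh for hidden and output neuron, thresholds set to zero): it evaluates the recursion (8.27) and sums δ_t V_{t−1} to obtain δw^(vv), which equals −η ∂H/∂w^(vv) for the energy function (8.24).

```python
import numpy as np

def gp(b):
    return 1.0 - np.tanh(b) ** 2                # g'(b)

def forward(x, w_vv, w_vx, w_ov):
    T = len(x)
    V = np.zeros(T + 1); bV = np.zeros(T + 1)   # hidden states and local fields
    B = np.zeros(T + 1); O = np.zeros(T + 1)    # output local fields and outputs
    for t in range(1, T + 1):
        bV[t] = w_vv * V[t - 1] + w_vx * x[t - 1]
        V[t] = np.tanh(bV[t])
        B[t] = w_ov * V[t]
        O[t] = np.tanh(B[t])
    return V, bV, B, O

def energy(x, y, w_vv, w_vx, w_ov):
    O = forward(x, w_vv, w_vx, w_ov)[3]
    return 0.5 * np.sum((y - O[1:]) ** 2)       # Eq. (8.24)

def dw_vv(x, y, w_vv, w_vx, w_ov, eta=1.0):
    T = len(x)
    V, bV, B, O = forward(x, w_vv, w_vx, w_ov)
    Delta = (y - O[1:]) * gp(B[1:])             # output errors Delta_t
    delta = np.zeros(T + 2)                     # delta[T+1] stays zero
    for t in range(T, 0, -1):                   # recursion (8.27)
        delta[t] = (Delta[t - 1] * w_ov + delta[t + 1] * w_vv) * gp(bV[t])
    return eta * np.sum(delta[1:T + 1] * V[:T])  # sum of delta_t V_(t-1)
```

The same structure, with extra indices and sums, gives Equation (8.32) for many hidden and output neurons.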
In order to keep the formulae simple, I only described the algorithm for a single
hidden and a single output neuron, so that I could leave out the indices referring to
different hidden neurons and/or different output components. You can add those
indices yourself; the structure of the Equations remains exactly the same, save for a
number of extra sums over those indices:
δw_mn^(vv) = η Σ_{t=1}^T δ_m^(t) V_n^(t−1)   (8.32)
with
δ_j^(t) = Σ_i ∆_i^(t) w_ij^(ov) g′(b_j^(t))   for t = T ,
δ_j^(t) = Σ_i ∆_i^(t) w_ij^(ov) g′(b_j^(t)) + Σ_i δ_i^(t+1) w_ij^(vv) g′(b_j^(t))   for 0 < t < T .
The second term in the recursion for δ_j^(t) is analogous to the recursion in step 9 of
Algorithm 2. The time index t here plays the role of the layer index ℓ in Algorithm 2.
A difference is that the weights in Equation (8.32) are the same for all time steps.
In summary you see that backpropagation through time for recurrent networks is
similar to backpropagation for multilayer perceptrons. After the recurrent network is
unfolded to get rid of the feedback connections it can be trained by backpropagation.
The time index t takes the role of the layer index ℓ. Backpropagation through time is
the standard approach for training recurrent nets, despite the fact that it suffers from
the vanishing-gradient problem. The next Section describes how improvements to
the layout make it possible to efficiently train recurrent networks.
Figure 8.3: Schematic illustration of an unfolded recurrent network for machine translation, after Refs. [86, 87]. The green rectangular boxes represent the hidden states in the form of long short-term memory units (LSTMs). Otherwise the network layout is like the one shown in Figure 8.2. Sutskever et al. [86] found that the network translates much better if the sentence is read in reverse order, from the end. <EOS> denotes the end-of-sentence tag; since the sentence is read in reverse, here it marks the beginning of the sentence.
The networks are trained using backpropagation through time. The vanishing-
gradient problem is dealt with by improved network layouts. Hochreiter and Schmidhuber [89] suggested replacing the hidden neurons of the recurrent network with
computation units that are specially designed to eliminate the vanishing-gradient
problem. The method is referred to as long short-term memory (LSTM). The basic
ingredient is the same as in residual networks (Section 7.6): short cuts reduce the
vanishing-gradient problem. For our purposes we can think of LSTMs as units that
replace the hidden neurons.
How are the network inputs and outputs represented? For machine translation
one must represent words in the dictionary in terms of a code. The simplest code
is a binary code where 100. . . represents the first word in the dictionary, 010. . . the
second word, and so forth. Each input is a vector with as many components as there
are words in the dictionary. A sentence corresponds to a sequence x 1 , x 2 , . . . , x T .
Each sentence ends with an end-of-sentence tag, <EOS>. Softmax outputs give the probability p(O_1, . . . , O_{T′} | x_1, . . . , x_T) of an output sequence conditional on the input sequence. The translated sentence is the one with the highest probability (it also contains the end-of-sentence tag <EOS>). So both inputs and outputs are sequences of such vectors.
Figure 8.4: Schematic illustration of a bidirectional recurrent network. The net con-
sists of two hidden states that are unfolded in different ways. The hidden states are
represented by LSTMs.
p(O_1, . . . , O_{T′} | x_1, . . . , x_T) = Π_{t=1}^{T′} p(O_t | O_1, . . . , O_{t−1} ; x_1, . . . , x_T) ,   (8.33)
where p(O_t | O_1, . . . , O_{t−1} ; x_1, . . . , x_T) is the probability of the next word in the output sequence given the inputs and the output sequence up to O_{t−1} [7].
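The word code described above is usually called one-hot encoding. A minimal sketch (the vocabulary and function name are my own, purely illustrative):

```python
def one_hot(word, vocab):
    # 100..., 010..., ...: one component per dictionary word
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v

vocab = ["<EOS>", "the", "cat", "sat"]
print(one_hot("cat", vocab))   # [0, 0, 1, 0]
```

In practice the vocabulary contains many thousands of words, so each input vector is very high-dimensional and sparse.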
Advanced layouts
There is a large number of recent papers on machine translation with recurrent neu-
ral nets. Most studies are based on the training algorithm described in Section 8.2,
backpropagation through time. Different algorithms mainly differ in their network
layouts. Google’s machine translation system uses a deep network with layers of
LSTMs [7]. Different hidden states are unfolded forward as well as backwards in
time, as illustrated in Figure 8.4. In this Figure the hidden states are represented by
LSTMs. In the simplest case the hidden states are just encoded in hidden neurons,
as in Figure 8.2 and Equation (8.23). If we represent the hidden states by neurons,
as in Section 8.2, then the corresponding bidirectional network has the dynamics
V_i(t) = g( Σ_j w_ij^(vv) V_j(t−1) + Σ_k w_ik^(vx) x_k(t) − θ_i^(v) ) ,
U_i(t) = g( Σ_j w_ij^(uu) U_j(t+1) + Σ_k w_ik^(ux) x_k(t) − θ_i^(u) ) ,   (8.34)
O_i(t) = g( Σ_j w_ij^(ov) V_j(t) + Σ_j w_ij^(ou) U_j(t) − θ_i^(o) ) .
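The dynamics (8.34) translates directly into code. This sketch is my own illustration (g = tanh, plain neurons rather than LSTMs): V is unfolded forward in time and U backward.

```python
import numpy as np

def bidirectional(x, Wvv, Wvx, Wuu, Wux, Wov, Wou, th_v, th_u, th_o):
    # x has shape (T, N_in); implements Eq. (8.34) with g = tanh
    T = x.shape[0]
    V = np.zeros((T + 1, len(th_v)))    # V_i(t), forward in time
    U = np.zeros((T + 2, len(th_u)))    # U_i(t), backward in time
    for t in range(1, T + 1):
        V[t] = np.tanh(Wvv @ V[t - 1] + Wvx @ x[t - 1] - th_v)
    for t in range(T, 0, -1):
        U[t] = np.tanh(Wuu @ U[t + 1] + Wux @ x[t - 1] - th_u)
    O = np.stack([np.tanh(Wov @ V[t] + Wou @ U[t] - th_o)
                  for t in range(1, T + 1)])
    return O
```

Each output O_i(t) thus depends on the input sequence both before and after time t.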
Scores
Different schemes for scoring the accuracy of a translation are described by Lipton
et al. [87]. One difficulty is that there are often several different valid translations of
a given sentence, and the score must compare the machine translation with all of
them. Recent papers on machine translation usually use the so-called BLEU score
to evaluate the translation accuracy. The acronym stands for bilingual evaluation
understudy. The scheme was proposed by Papineni et al. [90], and it is commonly
judged to score not too differently from how Humans would score.
8.4 Summary
It is sometimes said that recurrent networks learn dynamical systems while multi-
layer perceptrons learn input-output maps. This notion refers to backpropagation
in time. I would emphasise, by contrast, that both networks are trained in similar
ways, by backpropagation. Neither is it given that the tasks must differ: recurrent
networks are also used to learn time-independent data. It is true though that tools
from dynamical-systems theory have been used with success to analyse the dynamics
of recurrent networks [88, 91].
Recurrent neural networks are trained by stochastic gradient descent after unfold-
ing the network in time to get rid of feedback connections. This algorithm suffers
from the vanishing-gradient problem. To overcome this difficulty, the hidden states
in the recurrent network are usually represented by LSTMs. Recent layouts for
machine translation use deep bidirectional networks with layers of LSTMs.
Figure 8.5: Recurrent network with one input unit x(t) (red), one hidden neuron V(t) (green), and one output neuron O(t) (blue).
8.6 Exercises
Recurrent backpropagation. Show that recurrent backpropagation is a special case
of the backpropagation algorithm for layered feed-forward networks.
Denote the learning rate by η. Hint: the update rule for w (o v ) is much simpler to
derive than those for w (v x ) and w (v v ) . (1p).
(d) Explain how recurrent networks are used for machine translation. Draw the
layout, describe how the inputs are encoded. How is the unstable-gradient problem
overcome? (Max one A4 page). (1p).
PART III
UNSUPERVISED LEARNING
Figure 9.1: Supervised learning finds decision boundaries (left). Unsupervised learning
can find clusters in the input data (right).
Figure 9.2: Network for unsupervised Hebbian learning, with a single linear output unit that computes y = w · x from the inputs x_1, . . . , x_N with weight vector w. The network output is denoted by y in this Chapter.
w = (w_1, . . . , w_N)ᵀ .   (9.2)
The network can detect how familiar certain input patterns are. The idea is that
the output is the larger the more frequently the input pattern occurs in P (x ). This
learning goal is achieved by Hebb’s rule:
w′ = w + δw   with   δw = η y x ,   (9.3)
where y = w · x is the output. The rule (9.3) is also called Hebbian unsupervised
learning. As usual, η > 0 is a small learning rate. How does this learning rule work?
Since we keep adding multiples of the pattern vectors x to the weights, the magni-
tude of the output |y | becomes the larger the more often the input pattern occurs
in the distribution P(x). So the most familiar pattern produces the largest output. A potential problem is that the weight vector may continue to grow as we keep on adding increments. This usually happens, which means that the simple Hebbian learning rule (9.3) does not converge to a steady state. To achieve definite learning outcomes we require the network to approach a steady state. Therefore the
learning rule (9.3) must be modified. One possibility is to introduce weight decay as
described in Section 6.3.3. This is discussed in the next Section.
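The unbounded growth is easy to demonstrate. In this sketch (my own illustration, with zero-mean Gaussian patterns as the input distribution) the norm of the weight vector keeps growing under the rule (9.3):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 2))   # input patterns drawn from P(x)
w = np.array([0.1, 0.0])
eta = 0.01
for x in X:
    y = w @ x                    # network output
    w = w + eta * y * x          # Hebb's rule, Eq. (9.3)
```

Each update multiplies w by the matrix 1 + η x xᵀ, whose eigenvalues are never smaller than one, so |w| can only grow.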
δw = η y (x − y w) = η { x xᵀ w − [w · (x xᵀ) w] w }   (9.4)
ensures that the weights remain normalised. For the second equality I used that the output is given by y = w · x = wᵀ x = xᵀ w. To see why Equation (9.4) does the trick,
For a general matrix A(t), the norm |q| may grow or shrink. We can ensure that q remains normalised by adding a term to Equation (9.5):
dw/dt = A(t) w − [w · A(t) w] w .   (9.6)
The vector w turns in the same way as q, and if we set |w| = 1 initially, then w remains normalised (w = q/|q|). You can see this by noting that (d/dt)|w|² = 2 w · (dw/dt) = 0. Equation (9.6) describes the dynamics of the normalised orientation vector of a
small rod in turbulence [93], where A(t ) is the matrix of fluid-velocity gradients.
But let us return to Equation (9.4). It is called Oja’s rule [94]. Oja’s learning algo-
rithm is summarised in Algorithm 10. One draws a pattern x from the distribution
P(x) of input patterns, applies it to the network, and updates the weights as prescribed in Equation (9.4). This is repeated many times. In the following we denote the average over T input patterns as ⟨· · ·⟩ = (1/T) Σ_{t=1}^T (· · ·).
Now we show that a steady state w ∗ of Algorithm 10 has the following properties:
1. |w ∗ | = 1
2. w* is the eigenvector of C′ = ⟨x xᵀ⟩ with maximal eigenvalue
3. w ∗ maximises 〈y 2 〉 over all w with |w | = 1.
In particular, the weight vector remains normalised. We first prove statement 1, as-
suming that a steady state w ∗ has been reached. In a steady state the increments δw
must average to zero because the weights would either grow or decrease otherwise:
0 = ⟨δw⟩|_{w*} .   (9.7)
Averaging Equation (9.4) then gives
0 = C′ w* − (w* · C′ w*) w* .   (9.8)
OJA’S RULE 165
w = w* + ε ,   (9.9)
the two expressions show that no other direction w gives a larger value of 〈y 2 〉. In
other words: w ∗ maximises 〈y 2 〉. This completes the proof.
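These properties are easy to observe numerically. A sketch (my own illustration; the covariance matrix is an arbitrary choice) runs Oja's rule on zero-mean Gaussian data and compares the result with the leading eigenvector of C′ = ⟨x xᵀ⟩:

```python
import numpy as np

rng = np.random.default_rng(7)
C = np.array([[3.0, 1.0], [1.0, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=50000)   # zero-mean patterns

w = np.array([1.0, 0.0])
eta = 0.002
for x in X:
    y = w @ x
    w = w + eta * y * (x - y * w)    # Oja's rule, Eq. (9.4)

Cp = X.T @ X / len(X)                # C' = <x x^T>
u = np.linalg.eigh(Cp)[1][:, -1]     # eigenvector with maximal eigenvalue
```

Up to an overall sign, w ends up aligned with u, and its norm stays close to unity, in line with properties 1 to 3.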
Figure 9.3: Maximal eigenvalue direction u_1 of the matrix C′ for input data with non-zero mean.
For zero-mean inputs, Oja's rule finds the maximal principal direction of the input data by maximising ⟨y²⟩ (note that ⟨y⟩ = 0 for zero-mean inputs). For inputs with non-zero means, maximising ⟨y²⟩ still finds the maximal eigenvalue direction of C′.
But for inputs with non-zero means, this direction is different from the maximal
principal direction (Section 6.3.1). Figure 9.3 illustrates this difference. The Figure
shows three data points in a two-dimensional input plane. The elements of C′ = ⟨x xᵀ⟩ are
C′ = (1/3) [ 2  1 ; 1  2 ] ,   (9.15)
with eigenvalues and eigenvectors
λ_1 = 1 , u_1 = (1/√2) (1, 1)ᵀ   and   λ_2 = 1/3 , u_2 = (1/√2) (−1, 1)ᵀ .   (9.16)
strategies to avoid this problem [1]. One possibility is to initialise the weights to
directions found in the inputs.
The solution of the clustering problem is not uniquely defined. One possibility is
to monitor progress by an energy function
H = (1/2T) Σ_{i j t} M_it ( x_j^(t) − w_ij )² .   (9.20)
Here x (t ) is the pattern fed in iteration number t , T is the total number of iterations,
and
M_it = 1 for i = i_0(x^(t), w_1, . . . , w_M), and M_it = 0 otherwise.   (9.21)
Note that i_0 is a function of the patterns x^(t) and of the weights w_1, . . . , w_M. For given patterns, the indicator M_it is a piecewise constant function of the weights.
Gradient descent on the energy function (9.20) gives
⟨δw_ij⟩ = −η ∂H/∂w_ij = (η/T) Σ_t M_it ( x_j^(t) − w_ij ) .   (9.22)
Apart from the sum over patterns this is the same as the competitive learning rule
(9.19). The angular brackets on the l.h.s. of this Equation indicate that the weight
increments are summed over patterns.
If we define
y_i = δ_{i i_0} = 1 for i = i_0, and y_i = 0 otherwise,   (9.23)
then the rule (9.19) can be written in the form of Oja's M-rule:
δw_ij = η y_i ( x_j − Σ_{k=1}^M y_k w_kj ) .   (9.24)
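A sketch of the winner-take-all update (my own illustration; the cluster data are invented, and the weights are initialised on input patterns to avoid the problem of units that never win, mentioned above):

```python
import numpy as np

rng = np.random.default_rng(8)
# input patterns from two well separated clusters
X = np.concatenate([rng.normal([0.0, 3.0], 0.3, size=(200, 2)),
                    rng.normal([3.0, 0.0], 0.3, size=(200, 2))])
rng.shuffle(X)

M, eta = 2, 0.1
w = X[rng.choice(len(X), size=M, replace=False)].copy()  # initialise on patterns
for _ in range(20):                                       # sweeps over the data
    for x in X:
        i0 = np.argmin(np.linalg.norm(w - x, axis=1))     # winning unit
        w[i0] += eta * (x - w[i0])                        # move winner towards x
```

After a few sweeps each weight vector sits near one cluster centre, so the network has found the clusters in the input data.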
Figure 9.5: Spatial map. The outputs are arranged in an array with coordinates r, the inputs have coordinates x. If patterns x^(1) and x^(2) are close in input space, then the two patterns activate neighbouring outputs, r_1 ≈ r_2.
Here η > 0 is the learning rate, as before. The function Λ(i, i_0) is called the neighbourhood function. A common choice is
Λ(i, i_0) = exp( −|r_i − r_{i_0}|² / (2σ²) ) .   (9.26)
As a result, nearby output units respond to inputs that are close in input space. Ko-
honen’s rule drags the winning weight vector w i 0 towards x , just as the competitive
learning rule (9.19), but it also drags the neighbouring weight vectors along.
Figure 9.6 illustrates a geometrical interpretation of Kohonen’s rule. We can think
of the weight vectors as pointing to the nodes of an elastic net that has the same
layout as the output array. As one feeds patterns from the input distribution, the
weights are updated, causing the nodes of the network to move. This changes the
shape of the elastic net. In the steady state, this shape resembles the shape defined
by the distribution of input patterns.
Kohonen’s rule has two parameters: the learning rate η, and the width σ of the
neighbourhood function. Usually one adjusts these parameters as the learning
Figure 9.6: Learning a shape with Kohonen’s algorithm. (a) Input-pattern distribution.
P (x ) is unity within a parallelogram with unit area, and zero outside. (b) to (d) Illustra-
tion of the dynamics in terms of an elastic net. (b) Initial condition. (c) Intermediate
stage (note the kink). (d) In the steady-state the elastic net resembles the shape defined
by the input-pattern distribution.
proceeds. Typically one begins with large values for η and σ (ordering phase), and
then reduces these parameters as the elastic net evolves (convergence phase): quickly
at first and then in smaller steps, until the algorithm converges. Details are given by
Hertz, Krogh and Palmer [1]. As for competitive learning one can monitor progress
of the learning with an energy function
    H = \frac{1}{2T} \sum_{it} \Lambda(i, i_0) \big| x^{(t)} - w_i \big|^2 .   (9.27)

Gradient descent gives

    \langle \delta w_{ij} \rangle = -\eta \frac{\partial H}{\partial w_{ij}} = \frac{\eta}{T} \sum_t \Lambda(i, i_0) \big( x_j^{(t)} - w_{ij} \big) .   (9.28)
Figure 9.6 shows how Kohonen’s network learns by unfolding the elastic net of weight
vectors until the shape of the net resembles the form of the input distribution
    P(x) = \begin{cases} 1 & \text{for } x \text{ in the parallelogram in Figure 9.6(a)} , \\ 0 & \text{otherwise} . \end{cases}   (9.29)
Figure 9.7: To find out how w^* varies near r, we expand w^* in \delta r around r. This gives w^*(r) + \frac{dw^*}{dr}\delta r + \frac{1}{2}\frac{d^2 w^*}{dr^2}\delta r^2 + \ldots. Here r_0(x) is the output position of the winning unit.
This is a condition for the steady state w ∗i . The condition is more complicated
than it looks at first sight, because i 0 depends on the weights and on the patterns,
as mentioned above. The steady-state condition (9.30) is very difficult to analyse
in general. One of the reasons is that global geometric information is difficult
to learn. It is usually much easier to learn local structures. This is particularly
true in the continuum limit where we can analyse local learning progress using
Taylor expansions. For this reason we assume now that we have a very dense net
of weights, so that we can replace i → r, i_0 → r_0, w_i → w(r), \Lambda(i, i_0) → \Lambda\big(r - r_0(x)\big), and \frac{1}{T}\sum_t → \int dx\, P(x). In this continuum approximation, Equation (9.30) reads

    \int dx\, P(x)\, \Lambda\big( r - r_0(x) \big) \big[ x - w^*(r) \big] = 0 .   (9.31)
This is an Equation for the spatial map w ∗ (r ). Equation (9.31) is still quite difficult
to analyse. So we specialise to one input and one output dimension, with spatial
output coordinate r . This has the added advantage that we can easily draw the
spatial map w ∗ (r ). It is the solution of
    \int dx\, P(x)\, \Lambda\big( r - r_0(x) \big) \big[ x - w^*(r) \big] = 0 .   (9.32)
The neighbourhood function is sharply peaked at r = r0 (x ). This means that the con-
dition (9.32) yields the local properties of w ∗ (r ) around r0 , where r0 is the coordinate
of the winning unit, x = w ∗ (r0 ). Equation (9.32) involves an integral over patterns
x . Using x = w ∗ (r0 ), this integral is expressed as an integral over r0 . Specifically we
consider how w (r0 ) changes in the vicinity of a given point r , as r0 (x ) changes. To
this end we expand w^* around r:

    w^*(r_0) = w^*(r) + w'(r)\,\delta r + \tfrac{1}{2} w''(r)\,\delta r^2 + \ldots \quad \text{with} \quad \delta r = r_0 - r .   (9.33)

Here w' denotes the derivative dw/dr evaluated at r, and I have dropped the asterisk. Using this expansion and x = w(r_0) we express dx in Equation (9.32) in terms of d\delta r:

    dx = \big( w' + w''\,\delta r + \ldots \big)\, d\delta r ,   (9.34a)

    P(x) = P(w) + w'\,\delta r\, \tfrac{d}{dw} P(w) + \ldots ,   (9.34b)

and

    x - w(r) = w'\,\delta r + \tfrac{1}{2} w''(r)\,\delta r^2 + \ldots .   (9.34c)
Inserting these expressions into Equation (9.32) we find

    0 = \int d\delta r\, \big( w' + \delta r\, w'' \big) \big[ P + \delta r\, w' \tfrac{d}{dw} P \big]\, \Lambda(\delta r)\, \big( \delta r\, w' + \tfrac{1}{2}\delta r^2 w'' \big)
      = w' \Big[ \tfrac{3}{2}\, w'' P(w) + (w')^2 \tfrac{d}{dw} P(w) \Big] \int_{-\infty}^{\infty} d\delta r\, \delta r^2\, \Lambda(\delta r) .   (9.35)
Since the last integral in Equation (9.35) is non-zero, we must either have

    w' = 0 \quad \text{or} \quad \tfrac{3}{2}\, w'' P(w) + (w')^2 \tfrac{d}{dw} P(w) = 0 .   (9.36)
The first solution can be excluded because it corresponds to a singular weight
distribution [see Equation (9.38)] that does not contain any geometrical information
about the input distribution P(x). The second solution gives

    \frac{w''}{w'} = -\frac{2}{3}\, w'\, \frac{ \tfrac{d}{dw} P(w) }{ P(w) } .   (9.37)

In other words, \frac{d}{dx} \log |w'| = -\frac{2}{3} \frac{d}{dx} \log P(w). This means that |w'| \propto P(w)^{-2/3}. So the distribution \varrho of output weights is

    \varrho(w) \equiv \frac{dr}{dw} = \frac{1}{|w'|} \propto P(w)^{2/3} .   (9.38)
This tells us that the Kohonen net learns the input distribution in the following
way: the distribution of output weights in the steady state reflects the distribution
of input patterns. Equation (9.38) tells us that the two distributions are not equal
(equality would have been a perfect outcome). The distribution of weights is instead proportional to P(w)^{2/3}. This is a consequence of the fact that the elastic net has difficulties reaching the corners and edges of the domain where the input distribution is non-zero.
Let us finally discuss the convergence of Kohonen's algorithm. The update rule (9.25) can be rewritten as

    w_i \to \big[ 1 - \eta \Lambda(i, i_0) \big]\, w_i + \eta \Lambda(i, i_0)\, x .   (9.39)

For small enough η the factor [1 − ηΛ] is positive. Then it follows from Equation
(9.39) that the order of weights in a monotonically increasing (decreasing) sequence
does not change under an update. What happens at the boundary between two
such regions, where a kink is formed? It turns out that kinks can only disappear
in two ways. Either they move to one of the boundaries, or two kinks (a minimum
and a maximum) annihilate each other if they collide. Both processes are slow.
This means that convergence to the steady state can be very slow. Therefore one
usually starts with a larger learning rate, to get rid of kinks. After this ordering phase,
one continues with a smaller step size to get the details of the distribution right
(convergence phase).
9.4 Summary
The unsupervised learning algorithms described above are based on Hebb’s rule:
certainly the Hebbian unsupervised learning rule (9.3) and Oja’s rule (9.4). Also the
competitive learning rule can be written in this form [Equation (9.24)]. Kohonen’s
algorithm is closely related to competitive learning, although the way in which
Kohonen’s rule learns spatial maps is better described by the notion of an elastic
net that represents the values of the output weights, as well as their spatial location
in the output array. Unsupervised learning rules can learn different features of
the distribution of input patterns. They can discover which patterns occur most
frequently, they can help to reduce the dimensionality of input space by finding
the principal directions of the input distribution, detect clusters in the input data,
compress data, and learn spatial input-output maps. The important point is that
the algorithms learn without training, unlike the algorithms in Chapters 5 to 8.
Supervised-learning algorithms are now widely used for different applications.
This is not really the case yet for unsupervised-learning algorithms, except that
similar (sometimes equivalent) algorithms are used in Mathematical Statistics (k -
means clustering) and Bioinformatics (structure [95]) where large data sets must
be analysed, such as Human sequence data (HGDP) [96]. But the simple algorithms
described in this Chapter provide a proof of concept: how machines can learn
without feedback. In addition there is one significant application of unsupervised
learning: where the network learns from incomplete feedback. This reinforcement
learning is introduced in the next Chapter.
9.5 Exercises
Kohonen net. Write a computer program that implements Kohonen’s algorithm
with a two-dimensional output array, to learn the properties of a two-dimensional
input distribution that is uniform inside an equilateral triangle with sides of unit
length, and zero outside. Hint: to generate this distribution, sample at least 1000
points uniformly distributed over the smallest square that contains the triangle, and
then accept only points that fall inside the triangle. Increase the number of weights
and study how the two-dimensional density of weights near the boundary depends
on the distance from the boundary.
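A possible solution sketch in Python; the grid size, the schedules for η and σ, and the number of iterations are illustrative choices, not prescribed by the exercise:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_triangle(n, rng):
    """Rejection sampling: uniform points in an equilateral triangle with
    unit sides and corners (0,0), (1,0), (1/2, sqrt(3)/2)."""
    h = np.sqrt(3) / 2
    pts = []
    while len(pts) < n:
        x, y = rng.uniform(0, 1), rng.uniform(0, h)
        if y <= np.sqrt(3) * x and y <= np.sqrt(3) * (1 - x):  # below both edges
            pts.append((x, y))
    return np.array(pts)

def kohonen(patterns, grid=10, steps=20000, rng=rng):
    """Kohonen's algorithm with a grid x grid output array. Learning rate and
    neighbourhood width shrink during learning (ordering phase, then
    convergence phase)."""
    # output coordinates r_i of the units
    r = np.array([(a, b) for a in range(grid) for b in range(grid)], dtype=float)
    w = rng.uniform(0, 1, size=(grid * grid, 2))     # initial weights
    for t in range(steps):
        eta = 0.1 * (0.01 / 0.1) ** (t / steps)      # decreasing learning rate
        sigma = 3.0 * (0.5 / 3.0) ** (t / steps)     # shrinking neighbourhood
        x = patterns[rng.integers(len(patterns))]
        i0 = np.argmin(np.sum((w - x) ** 2, axis=1))               # winning unit
        Lam = np.exp(-np.sum((r - r[i0]) ** 2, axis=1) / (2 * sigma ** 2))
        w += eta * Lam[:, None] * (x - w)            # Kohonen's rule
    return w

patterns = sample_triangle(1000, rng)
w = kohonen(patterns)
```

Plotting the weights after training (and again for a larger grid) shows the elastic net filling the triangle, with a depleted weight density near the boundary, in line with Equation (9.38).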
2. w^* is the leading eigenvector of the matrix C_0 with elements C_{ij}^0 = \langle x_i x_j \rangle. Here \langle \cdots \rangle denotes the average over P(x).
3. w^* maximises \langle y^2 \rangle.

and the update rule based on this pattern is w_i \to w_i + \delta w_i with

    \delta w_i = \eta\, y^{(\mu)} \big( x_i^{(\mu)} - y^{(\mu)} w_i \big)   (9.41)

with output y^{(\mu)} = \sum_i w_i x_i^{(\mu)}. Let \langle \delta w_i \rangle denote the update of w_i averaged over the input patterns.
(a) Show that \langle \delta w_i \rangle = 0 implies that the weight vector in the steady state is normalised to unity. (1 p).
(b) Calculate the principal component of the patterns in Figure 9.8. (1 p).
Figure 9.8: Data points x^{(1)}, \ldots, x^{(5)} in the x_1-x_2 plane. Calculate the principal component of this data set. Question 9.6.2.
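The steady-state properties in part (a) can be illustrated numerically. The data set below, with principal direction along (2, 1)/\sqrt{5}, is an assumed stand-in and not the data of Figure 9.8:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: dominant direction along u = (2, 1)/sqrt(5), plus noise.
u = np.array([2.0, 1.0]) / np.sqrt(5.0)
patterns = rng.normal(size=(5000, 1)) * u * 2.0 + rng.normal(size=(5000, 2)) * 0.3

w = rng.normal(size=2)
eta = 0.01
for x in patterns:
    y = w @ x                      # output y = w . x
    w += eta * y * (x - y * w)     # Oja's rule, Equation (9.41)

print(np.linalg.norm(w))   # close to 1: the steady-state weights are normalised
print(w)                   # aligned (up to sign) with the principal direction u
```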
Show how to draw an input pattern from this distribution using the following steps.
2. Compute x = \langle x \rangle + L z, where

    L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} .   (9.44)
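The two steps can be sketched as follows. The mean and covariance matrix below are hypothetical placeholders (the exercise's actual values are not reproduced here); L is obtained from the Cholesky decomposition C = L L^T, which has exactly the lower-triangular form (9.44):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical mean and covariance matrix for illustration.
mean = np.array([1.0, 2.0])
C = np.array([[2.0, 0.6], [0.6, 1.0]])

# Lower-triangular L with L L^T = C, of the form (9.44): [[L11, 0], [L21, L22]].
L = np.linalg.cholesky(C)

# Step 1: draw z with independent standard-normal components.
z = rng.normal(size=(100000, 2))
# Step 2: x = <x> + L z then has mean <x> and covariance C.
x = mean + z @ L.T

print(np.cov(x.T))   # close to C
```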
The Figure shows the non-separable problem in input space (in the x1 -x2 plane),
and in the new coordinates u 1 and u 2 . Since the problem is linearly separable in
the u 1 -u 2 plane we can solve it by a single McCulloch-Pitts neuron with weights W
and threshold Θ, parameterising the decision boundary as W · u (x ) = Θ. In fact, one
does not need the threshold Θ because the function u can have a constant part. For
instance, we could choose u 1 (x ) = 2(x2 − x1 )2 − 1. In the following we therefore set
Θ = 0.
We expect that it should be easier to achieve linear separability the higher the
embedding dimension is. This statement is quantified by Cover’s theorem, discussed
Figure 10.1: Left: input plane (x_1-x_2) for the XOR function (Figure 5.8), with legend t^{(\mu)} = 1 and t^{(\mu)} = -1. The problem is not linearly separable. Right: in the u_1-u_2 plane the problem is linearly separable.
Figure 10.2: Left: 5 points in general position in the plane. Right: these points are not
in general position because three points lie on a straight line.
in Section 10.1. The question is of course how to find the non-linear mapping u (x ).
One possibility is to use radial basis functions. This is a way of parameterising the
functions u j (x ) in terms of weight vectors w j , and to determine suitable weight
vectors iteratively. How this works is summarised in Section 10.2.
Figure 10.3: Probability (10.5) of separability for p = λm as a function of λ for three different values of the embedding dimension m. Note the pronounced threshold near λ = 2, for large values of m.
Here \binom{l}{k} = \frac{l!}{(l-k)!\,k!} are the binomial coefficients. Equation (10.5) is proven by recursion,
starting from a set of p − 1 points in general position. Assume that the number
C (p −1, m ) of homogeneously linearly separable classification problems given these
points is known. After adding one more point, one can compute C(p, m) in
terms of C (p − 1, m ). Recursion yields Equation (10.5).
To connect the result (10.5) to the discussion at the beginning of this Chapter, we
take u^{(\mu)} = u(x^{(\mu)}) where x^{(\mu)} are p patterns in N-dimensional input space,

    x^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{pmatrix} \quad \text{for } \mu = 1, \ldots, p ,   (10.6)
and we assume that u is a set of m polynomial functions of finite order. Then the
probability that the problem of the p points x (µ) in N -dimensional input space is
separable by a polynomial decision boundary is given by Equation (10.5) [2, 97].
Note that the probability P (p , m ) is independent of the dimension N of the input
space. Figure 10.3 shows this probability for p = λm as a function of λ for different
values of m. Note that P(2m, m) = 1/2. In the limit of large m, the function P(λm, m)
approaches a step function. In this limit one can separate at most 2m patterns
(separability threshold).
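Equation (10.5) is not reproduced in this excerpt; in its standard form, Cover's probability is P(p, m) = 2^{1-p} \sum_{k=0}^{m-1} \binom{p-1}{k}. Evaluating it numerically reproduces the behaviour of Figure 10.3:

```python
from math import comb

def P(p, m):
    """Cover's probability (10.5), in its standard form: the probability that
    p points in general position are homogeneously linearly separable in
    embedding dimension m."""
    if p <= m:
        return 1.0
    return 2.0 ** (1 - p) * sum(comb(p - 1, k) for k in range(m))

print(P(2 * 50, 50))     # ~ 1/2 at the threshold p = 2m
# step near lambda = 2 for large m: ~1 below the threshold, ~0 above it
print(P(300, 200), P(500, 200))
```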
Now consider a random sequence of patterns x_1, x_2, \ldots and targets t_1, t_2, \ldots, and ask [97]: what is the distribution of the largest integer n such that the problem x_1, x_2, \ldots, x_n is separable in embedding dimension m, but x_1, x_2, \ldots, x_n, x_{n+1} is not? P(n, m) is
the probability that n patterns are linearly separable in embedding dimension m .
We can write P (n + 1, m ) = q (n + 1|n )P (n, m ) where q (n + 1|n ) is the conditional
probability that n + 1 patterns are linearly separable if the n patterns were. Then
the probability that n + 1 patterns are not separable (but n patterns are) reads
(1 − q )P (n , m ) = P (n , m ) − P (n + 1, m ). We can interpret the right-hand side of this
Figure 10.4: Radial basis-function network for N = 2 inputs (x_1, x_2) and m = 4 radial basis functions u_1, \ldots, u_4 of the form (10.8). The linear output neuron computes O = \sum_{j=1}^{m} W_j u_j, with weights W and zero threshold.
These functions are not of the finite-order polynomial form that was assumed
in Cover’s theorem. This means that the theorem does not strictly apply. The
parameters s j parameterise the widths of the radial basis functions. In the simplest
Figure 10.5: Comparison between radial-basis function network and perceptron. Left:
the output of a radial basis function is localised in input space. Right: to achieve a
localised output with sigmoid units one needs two hidden layers (Section 7.1). One
layer determines the lightly shaded cross, the second layer localises the output to the
darker square.
version of the algorithm they are set to unity. Other choices for radial basis functions
are given by Haykin [2].
Figure 10.4 shows a radial basis-function network for N = 2 and m = 4. The
four neurons in the hidden layer stand for the four radial basis functions (10.8) that
map the inputs to four-dimensional u -space. The network looks like a perceptron
(Chapter 5). But here the hidden layer works in a different way. Perceptrons have
hidden McCulloch-Pitts neurons that compute non-local outputs σ(w j · x − θ ).
The output of radial basis functions u j (x ), by contrast, is localised in input space
[Figure 10.5(left)]. We saw in Section 7.1 how to make localised basis functions out
of McCulloch-Pitts neurons with sigmoid activation functions σ(b ), but one needs
two hidden layers to do that [Figure 10.5(right)].
Radial basis functions produce localised outputs with a single hidden layer, and
this makes it possible to divide up input space into localised regions, each corre-
sponding to one radial basis function. Imagine for a moment that we have as many
radial basis functions as input patterns. In this case we can simply take w_\mu = x^{(\mu)}
for µ = 1, . . . , p . Then the classification problem can be written as
UW = t , (10.9)
between w j and the centers of the surrounding radial basis functions. Once weights
and widths of the radial basis functions are found, the weights of the output neuron
are determined by minimising
1 X (µ) 2
H= t − O (µ) (10.10)
2 µ
with respect to W . If the problem is linearly separable in the u -plane, then the
minimum is given by Equation (5.23). An approximate solution can be obtained
by stochastic gradient descent on H keeping the parameters of the radial basis
functions fixed. Cover’s theorem indicates that the problem is more likely to be
separable if the embedding dimension m is higher. Figure 10.6 shows an example
for a non-linear decision boundary found by this algorithm.
Figure 10.6: Decision boundary in the x_1-x_2 plane for a classification problem with targets t = ±1, found by a radial basis-function network. The classification error is 0.26 for m = 5 and 0.06 for m = 20.
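To illustrate the construction, here is a sketch for the XOR problem with one centre per pattern (w_\mu = x^{(\mu)}) and widths set to unity, as in the simplest version of the algorithm; the Gaussian form of the basis functions is a common choice:

```python
import numpy as np

# XOR patterns and targets (in the +/-1 convention).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1.0, 1.0, 1.0, -1.0])

centres = X.copy()                       # w_mu = x^(mu), one centre per pattern

def rbf(x, w, s=1.0):
    # Gaussian radial basis function with width s (here s = 1)
    return np.exp(-np.sum((x - w) ** 2) / (2 * s ** 2))

U = np.array([[rbf(x, w) for w in centres] for x in X])

# Output weights from U W = t, Equation (10.9); U is invertible here.
W = np.linalg.solve(U, t)

O = np.sign(U @ W)       # linear output unit followed by sign
print(O)                 # reproduces the targets
```

With as many basis functions as patterns the linear system (10.9) is solved exactly; with fewer basis functions one minimises (10.10) instead.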
10.3 Summary
Radial basis-function networks are similar to the perceptrons described in Chap-
ters 5 to 7, in that they are feed-forward networks designed to solve classification
problems. The outputs work in similar ways.
A fundamental difference is that the parameters of the radial basis functions
are determined by unsupervised learning, whereas perceptrons are trained using
supervised learning for all units.
While the weights of McCulloch-Pitts neurons are adjusted to minimise the deviation of their outputs from given targets, the weights of the radial basis functions are found by a competitive rule, where the winning unit is the one with maximal output u_j. The algorithm for finding the weights of the radial basis functions is summarised in Algorithm 12.
summarised in Algorithm 12.
Further, as opposed to the deep networks from Chapter 7, radial basis-function
networks have only one hidden layer, and a linear output neuron. Radial-basis
function nets learn using a hybrid scheme: unsupervised learning for the parameters
of the radial basis functions, and supervised learning for the weights of the output
neuron.
10.5 Exercises
Expected maximal number of separable patterns. Show that the sum in Equation (10.7) evaluates to 2m.
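This can be checked numerically, using the standard form of Cover's probability (10.5), P(p, m) = 2^{1-p} \sum_{k<m} \binom{p-1}{k}: the expectation \sum_n n\,[P(n, m) - P(n+1, m)] comes out as 2m.

```python
from math import comb

def P(p, m):
    # Cover's probability (10.5), in its standard form.
    if p <= m:
        return 1.0
    return 2.0 ** (1 - p) * sum(comb(p - 1, k) for k in range(m))

def expected_max_separable(m, nmax=2000):
    # E[n] = sum_n n [P(n, m) - P(n+1, m)]; the tail beyond nmax is negligible.
    return sum(n * (P(n, m) - P(n + 1, m)) for n in range(1, nmax))

print(expected_max_separable(5))    # ~ 10 = 2m
print(expected_max_separable(10))   # ~ 20 = 2m
```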
x1 x2 t
0 0 0
0 1 1
1 0 1
1 1 0
Table 10.1: Inputs and target values for the XOR problem. Question 10.6.1.
2. Use the transformed input data as inputs to a simple perceptron with sgn(···) activation function. In the plot you drew in the previous step, draw also a decision boundary that solves the problem when a simple perceptron is applied to the transformed data. (0.5 p).
3. Compute a weight vector and a threshold for the simple perceptron corre-
sponding to the decision boundary you drew in the previous step. (0.5p)
11 Reinforcement learning
Reinforcement learning is a modified version of supervised learning. In the latter
case, target patterns t (µ) are available for all input patterns x (µ) . In practice, less
information may be needed. For instance, the feedback may consist only of a single
bit of information (output O1 , . . . , OM right or wrong).
To show how neural networks can learn from incomplete feedback, I explain the
associative reward penalty algorithm in the following. This algorithm is formulated
in terms of stochastic output units. So we must briefly discuss these first.
    H = \frac{1}{2} \sum_{i\mu} \big( t_i^{(\mu)} - O_i^{(\mu)} \big)^2 .   (11.1)
where b_i is the local field with activation function equal to the identity function:

    b_i^{(\mu)} = \sum_j W_{ij} V_j^{(\mu)} ,   (11.2b)
and P (b ) = (1 + e −2β b )−1 . The parameter β −1 is the noise level, as in Chapter 3. The
V j represent hidden neurons or inputs.
How does one train the weights Wi j of the stochastic neurons? To formulate a
definite learning rule, we average over the stochastic dynamics of the network, just
as in Chapter 3. But how can one learn on average? One possibility is to compare
the average output with the desired output. This corresponds to replacing O_i^{(\mu)} by \langle O_i^{(\mu)} \rangle in the energy function:
Here we used Equation (3.11). Gradient descent gives for the weight changes in the output layer:

    \delta W_{mn} = -\eta \frac{\partial H}{\partial W_{mn}} = \eta \sum_{i\mu} \big( t_i^{(\mu)} - \langle O_i^{(\mu)} \rangle \big)\, \beta \big( 1 - \langle O_i^{(\mu)} \rangle^2 \big)\, \delta_{im} V_n^{(\mu)} .   (11.4)

This update rule contains a sum over patterns (batch training). The corresponding sequential rule can be written in the usual form:

    \delta W_{mn} = \eta\, \delta_m^{(\mu)} V_n^{(\mu)} \quad \text{with} \quad \delta_m^{(\mu)} = \big( t_m^{(\mu)} - \langle O_m^{(\mu)} \rangle \big)\, \beta \big( 1 - \langle O_m^{(\mu)} \rangle^2 \big) .   (11.5)
The factor \beta (1 - \langle O_m^{(\mu)} \rangle^2) corresponds to the factor g'(B_m^{(\mu)}) in Equation (6.7). One can show that this rule decreases the energy, on average. To this end we use that O_i^{(\mu)} = \pm 1 and t_i^{(\mu)} = \pm 1. This means that the energy function can be rewritten as

    H = \sum_{i\mu} \big( 1 - t_i^{(\mu)} O_i^{(\mu)} \big) .   (11.6)
The change in the average energy due to the weight increment (11.5) is

    \delta \langle H \rangle = \sum_{mn} \frac{\partial \langle H \rangle}{\partial W_{mn}}\, \delta W_{mn} = -\sum_{mni\mu} t_i^{(\mu)}\, \frac{\partial \langle O_i^{(\mu)} \rangle}{\partial W_{mn}}\, \delta W_{mn} .   (11.7)
Using

    \frac{\partial \langle O_i^{(\mu)} \rangle}{\partial W_{mn}} = \beta \big[ 1 - \tanh^2(\beta b_i^{(\mu)}) \big]\, \delta_{im} V_n^{(\mu)}   (11.8)

one finds

    \delta \langle H \rangle = -\eta \beta^2 \sum_{mn\mu} \big[ 1 - \tanh^2(\beta b_m^{(\mu)}) \big]^2 \big[ 1 - t_m^{(\mu)} \tanh(\beta b_m^{(\mu)}) \big] \big( V_n^{(\mu)} \big)^2 < 0 .   (11.9)
Returning to the learning rule (11.5), note that the formula for the error is sometimes written without the factor \beta (1 - \langle O_m^{(\mu)} \rangle^2) (in Ref. [1] for instance):

    \delta W_{mn} = \eta\, \delta_m^{(\mu)} V_n^{(\mu)} \quad \text{with} \quad \delta_m^{(\mu)} = t_m^{(\mu)} - \langle O_m^{(\mu)} \rangle .   (11.10)

This choice corresponds to using the energy function (7.32), see Section 7.2.3.
    t_i^{(\mu)} = \begin{cases} O_i^{(\mu)} , & \text{if } r^{(\mu)} = +1 , \\ -O_i^{(\mu)} , & \text{if } r^{(\mu)} = -1 . \end{cases}   (11.12)
Substituting Equation (11.12) into the stochastic learning rule (11.10) yields the learning rule for the associative reward-penalty algorithm:

    \delta W_{mn} = \begin{cases} \eta^{(+)} \big( O_m^{(\mu)} - \langle O_m^{(\mu)} \rangle \big) V_n^{(\mu)} & \text{if } r^{(\mu)} = +1 , \\ \eta^{(-)} \big( -O_m^{(\mu)} - \langle O_m^{(\mu)} \rangle \big) V_n^{(\mu)} & \text{if } r^{(\mu)} = -1 . \end{cases}   (11.13)

Often \eta^{(+)} is taken to be larger than \eta^{(-)}, for example \eta^{(+)}/\eta^{(-)} \approx 10 or 100 [1].
A special case is a network with only one binary output neuron. In this case reinforcement learning determines the correct output (−O^{(\mu)} is correct if O^{(\mu)} is wrong), so the target patterns t^{(\mu)} are precisely reconstructed in this way. But there
is no general convergence proof for the algorithm for more than one binary output
unit.
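A minimal sketch of this special case, for a single stochastic output unit; the task, β, and the learning rates below are illustrative choices. The reward r, computed by a critic that knows the targets, is the only feedback the network receives:

```python
import numpy as np

rng = np.random.default_rng(3)

beta = 2.0
eta_plus, eta_minus = 0.1, 0.01          # eta(+) larger than eta(-)

patterns = np.array([[1.0, -1.0], [-1.0, 1.0]])
targets = np.array([1.0, -1.0])          # known only to the critic

W = rng.normal(size=2) * 0.1
for _ in range(2000):
    mu = rng.integers(2)
    x = patterns[mu]
    b = W @ x
    # stochastic output: O = +1 with probability P(b) = 1/(1 + exp(-2 beta b))
    O = 1.0 if rng.uniform() < 1.0 / (1.0 + np.exp(-2.0 * beta * b)) else -1.0
    r = 1.0 if O == targets[mu] else -1.0       # feedback: right or wrong
    t = O if r == 1.0 else -O                   # target from Equation (11.12)
    eta = eta_plus if r == 1.0 else eta_minus
    W += eta * (t - np.tanh(beta * b)) * x      # rule (11.10) with <O> = tanh(beta b)

# after training, the average output matches the (hidden) targets
print(np.tanh(beta * patterns @ W))
```

For a single binary unit the substitution t = −O on penalty steps always equals the true target, which is why the algorithm converges here; for more than one binary output unit there is no such guarantee.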
11.3 Summary
In Chapters 5 to 8 we discussed how to train networks to correctly associate in-
put/output pairs (x (µ) , t (µ) ) in a training set (supervised learning). Chapters 9 and 10
described unsupervised learning where the network learns without any feedback.
The reinforcement algorithm introduced in the present Chapter is in between. It
learns from incomplete feedback. Not all targets are known to the network. The feed-
back may simply be r = ±1 (good or bad solution). The associative reward-penalty
algorithm uses this feedback to build up a training set that allows the network to
learn.
Bibliography
[1] HERTZ, J, KROGH, A & PALMER, R 1991 Introduction to the Theory of Neural
Computation. Addison-Wesley.
[5] NIELSEN, M, Neural Networks and Deep Learning, [Online; accessed 13-
August-2018].
[6] LECUN, Y, BENGIO, Y & HINTON, G 2015 Deep learning. Nature 521, 436.
[10] GABBIANI, F & METZNER, W 1999 Encoding and processing of sensory infor-
mation in neuronal spike trains. Journal of Experimental Biology 202 (10),
1267.
[11] MCCULLOCH, W & PITTS, W 1943 A logical calculus of the ideas immanent in
nervous activity. Bull. Math. Biophys. 5, 115.
[12] ROSENBLATT, F 1958 A probabilistic model for information storage and orga-
nization in the brain. Psychological Review 65, 386.
[19] STEFFAN, H & KUEHN, R 1994 Replica symmetry breaking in attractor neural
network models arxiv:cond-mat/9404036.
[20] VOLK, D 1998 On the phase transition of Hopfield networks – another Monte
Carlo study. Int. J. Mod. Phys. C 9, 693.
[21] POTVIN, J. Y & SMITH, K. A 1999 Artificial neural networks for combinatorial
optimization. In Handbook of Metaheuristics (ed. F. Glover & G. Kochenberger).
Heidelberg: Springer.
[22] MANDZIUK, J 2002 Neural networks for the n -Queens problem: a review. Con-
trol and Cybernetics 31, 217, Special issue on neural networks for optimization
and control.
[26] KAMPEN, N. V 2007 Stochastic processes in physics and chemistry. North Hol-
land.
[28] BINDER, K, ed. 1986 Monte Carlo Methods in Statistical Physics. Heidelberg:
Springer.
[30] SALAMON, P, SIBANI, P & FROST, R 2002 Facts, conjectures, and improvements
for simulated annealing. SIAM.
[33] LÖWE, M 1998 On the storage capacity of Hopfield models with correlated patterns. Ann. Appl. Prob. 8, 1216.
[36] LECUN, Y, BOTTOU, L, ORR, G. B & MÜLLER, K.-R 1998 Efficient back prop. In
Neural networks: tricks of the trade (ed. G. B Orr & K.-R Müller). Springer.
[38] SUTSKEVER, I 2013 Training recurrent neural networks. PhD thesis, University
of Toronto, [Online; accessed 27-October-2018].
[39] HANSON, S. J & PRATT, L. Y 1989 Comparing biases for minimal network con-
struction with backpropagation. In Advances in Neural Information Processing
Systems 1.
[40] KROGH, A & HERTZ, J. A 1992 A simple weight decay can improve generalization.
In Advances in Neural Information Processing Systems 4.
[41] HASSIBI, B & STORK, D. G 1993 Second order derivatives for network pruning:
Optimal brain surgeon. In Advances in Neural Information Processing Systems
5.
[42] LECUN, Y, DENKER, J. S & SOLLA, S 1990 Optimal brain damage. In Advances
in Neural Information Processing Systems 2 (ed. D. S Touretzky), p. 598.
[44] HORNIK, K, STINCHCOMBE, M & WHITE, H 1989 Multilayer feedforward networks are universal approximators. Neural Networks 2, 359.
[47] GLOROT, X, BORDES, A & BENGIO, Y 2011 Deep sparse rectifier neural networks.
In Proceedings of Machine Learning Research.
[49] GLOROT, X & BENGIO, Y 2010 Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of Machine Learning Research.
[51] FRANKLE, J & CARBIN, M 2018 The lottery ticket hypothesis: Finding small,
trainable neural networks arxiv:1803.03635.
[53] IOFFE, S & SZEGEDY, C 2015 Batch normalization: Accelerating deep network
training by reducing internal covariate shift arxiv:1502.03167.
[54] SANTURKAR, S, TSIPRAS, D, ILYAS, A & MADRY, A 2018 How does batch nor-
malization help optimization? (No, it is not about internal covariate shift)
arxiv:1805.11604.
[61] SMITH, L. N 2015 Cyclical learning rates for training neural networks
arxiv:1506.01186.
[63] PICASSO, J. P, Pre-processing before digit recognition for nn and cnn trained
with mnist dataset, stackexchange, [Online; accessed 26-September-2018].
[65] PASCAL VOC DATA SET, Pascal VOC, [Online; accessed 6-September-2018].
[66] REDMON, J, DIVVALA, S, GIRSHICK, R & FARHADI, A, You only look once: Unified,
real-time object detection, arxiv:1506.02640.
[68] HE, K, ZHANG, X, REN, S & SUN, J 2015 Deep residual learning for image
recognition arxiv:1512.03385.
[69] LI, F, JOHNSON, J & YEUNG, S, CNN architectures, CNN architectures, [Online;
accessed 23-September-2018].
[71] SEIF, G, Deep learning for image recognition: why it’s challenging, where we’ve
been, and what’s next, Towards Data Science, [Online; accessed 26-September-
2018].
[74] ZENG, X, OUYANG, W, YAN, J, LI, H, XIAO, T, WANG, K, LIU, Y, ZHOU, Y, YANG,
B, WANG, Z, ZHOU, H & WANG, X 2016 Crafting GBD-net for object detection
arxiv:1610.02579.
[75] HERN, A, Computers now better than humans at recognising and sorting
images, The Guardian, [Online; accessed 26-September-2018].
[81] NGUYEN, A, YOSINSKI, J & CLUNE, J 2015 Deep neural networks are easily fooled:
high confidence predictions for unrecognisable images arxiv:1412.1897.
[85] SETTLES, B 2009 Active Learning Literature Survey. Tech. Rep. 1648. University
of Wisconsin–Madison.
[86] SUTSKEVER, I, VINYALS, O & LE, Q. V 2014 Sequence to sequence learning with
neural networks arxiv:1409.3215.
[88] PASCANU, R, MIKOLOV, T & BENGIO, Y 2012 On the difficulty of training recur-
rent neural networks arxiv:1211.5063.
[90] PAPINENI, K, ROUKOS, S, WARD, T & ZHU, W.-J 2002 Bleu: a method for auto-
matic evaluation of machine translation. In Proceedings of the 40th annual
meeting on association for computational linguistics, p. 311.
[100] POGGIO, T & GIROSI, F 1990 Networks for approximation and learning. In Proceedings of the IEEE, vol. 78, p. 1481.