
LECTURE NOTES

arXiv:1901.05639v2 [cs.LG] 1 Feb 2019

Artificial Neural Networks

B. MEHLIG

Department of Physics
University of Gothenburg
Göteborg, Sweden 2019
PREFACE
These are lecture notes for my course on Artificial Neural Networks that I have given
at Chalmers (FFR135) and Gothenburg University (FIM720). This course describes
the use of neural networks in machine learning: deep learning, recurrent networks,
reinforcement learning, and other supervised and unsupervised machine-learning
algorithms.
When I first developed my lectures, my main source was the book by Hertz, Krogh,
and Palmer [1]. Other sources were the book by Haykin [2], as well as the lecture
notes of Horner [3]. My main sources for the Chapter on deep learning were the
deep-learning book by Goodfellow, Bengio & Courville [4], and the online-book by
Nielsen [5].
I am grateful to Martin Čejka who typed the first version of my hand-written lec-
ture notes and made most of the Figures, and to Erik Werner and Hampus Linander
for their interest and their help in preparing Chapter 7. I would like to thank also
Johan Fries and Oleksandr Balabanov for implementing the algorithms described
in Section 7.4. Johan Fries and Marina Rafajlovic made most of the exam questions.
Finally many students – past and present – pointed out misprints and errors and
suggested improvements. I thank them all.
CONTENTS

Preface iii

Contents v

1 Introduction 1
1.1 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 McCulloch-Pitts neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Other models for neural computation . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

I Hopfield networks 9
2 Deterministic Hopfield networks 10
2.1 Associative memory problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Hopfield network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Energy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Spurious states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Stochastic Hopfield networks 33


3.1 Noisy dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Order parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Mean-field theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Storage capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Beyond mean-field theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.9 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4 Stochastic optimisation 51
4.1 Combinatorial optimisation problems . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Energy functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Simulated annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Monte-Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

II Supervised learning 61
5 Perceptrons 63
5.1 A classification task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Iterative learning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Gradient-descent learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Multi-layer perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.8 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6 Stochastic gradient descent 82


6.1 Chain rule and error backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Stochastic gradient-descent algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3 Recipes for improving the performance . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.7 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7 Deep learning 107


7.1 How many hidden layers? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 Training deep networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Convolutional networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4 Learning to read handwritten digits . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.5 Deep learning for object recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.6 Residual networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.8 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.10 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8 Recurrent networks 147


8.1 Recurrent backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.2 Backpropagation through time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.3 Recurrent networks for machine translation . . . . . . . . . . . . . . . . . . . . . 156
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.5 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.7 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

III Unsupervised learning 161


9 Unsupervised Hebbian learning 162
9.1 Oja’s rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.2 Competitive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.3 Kohonen’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
9.6 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

10 Radial basis-function networks 177


10.1 Separating capacity of a surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
10.2 Radial basis-function networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.6 Exam questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

11 Reinforcement learning 186


11.1 Stochastic output units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
11.2 Associative reward-penalty algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

1 Introduction
The term neural networks refers to networks of neurons in the mammalian brain.
Neurons are its fundamental units of computation. In the brain they are connected
together in networks to process data. This can be a very complex task, and the
dynamics of neural networks in the mammalian brain in response to external stimuli
can therefore be quite intricate. Inputs and outputs of each neuron vary as functions
of time, in the form of so-called spike trains, but also the network itself changes. We
learn and improve our data-processing capacities by establishing reconnections
between neurons.
Neural-network algorithms are inspired by the architecture and the dynamics
of networks of neurons in the brain. Yet the algorithms use neuron models that
are highly simplified, compared with real neurons. Nevertheless, the fundamental
principle is the same: artificial neural networks learn by reconnection. Such net-
works can perform a multitude of information-processing tasks. They can learn to
recognise structures in a set of training data and generalise what they have learnt to
other data sets (supervised learning). A training set contains a list of input data sets,
together with a list of the corresponding target values that encode the properties of
the input data that the network is supposed to learn. Solving such association tasks
with artificial neural networks can work well, provided the new data sets are governed
by the same principles that gave rise to the training data.
A prime example for a problem of this type is object recognition in images, for
instance in the sequence of camera images of a self-driving car. Recently the use
of neural networks for object recognition has exploded. There are several reasons
for this strong interest. It is driven, first, by the acute need for such algorithms in
industry. Second, there are now much better image data bases available for training
the networks. Third, there is better hardware so that networks with many layers
containing many neurons can be efficiently trained (deep learning) [4, 6].
Another task where neural networks excel is machine translation. These networks
are dynamical (recurrent). They take an input sequence of words or sometimes
single letters. As one feeds the inputs word by word, the network outputs the words
in the translated sentence. Recurrent networks can be efficiently trained on large
training sets of input sentences and their translations. Google translate works in
this way [7].
Artificial neural networks are good at analysing large sets of high-dimensional
data where it may be difficult to determine a priori which properties are of interest.
In this case one often relies on unsupervised learning algorithms where the network
learns without a training set. Instead it determines in terms of which categories the
data can be analysed. In this way, artificial neural networks can detect familiarity


Figure 1.1: Neurons in the cerebral cortex (outer layer of the cerebrum, the largest and
best developed part of the mammalian brain) of a macaque, an Asian monkey. Repro-
duced by permission of brainmaps.org [9] under the Creative Commons Attribution 3.0
License. The labels were added.

(which input patterns occur most often), clusters, and other structures in the input
data. Unsupervised-learning algorithms work well when there is redundancy in the
input data that is not immediately obvious because the data is high dimensional.

In many problems some information about targets is known, yet incomplete. In


this case one uses algorithms that contain elements of both supervised and unsu-
pervised learning (reinforcement learning). Such algorithms are used, for instance,
in the software AlphaGo [8] that plays the game of go.

The different algorithms have much in common. They share the same building
blocks: the neurons are modeled as linear threshold units (McCulloch-Pitts neurons),
and the learning rules are similar (Hebb’s rule). Closely related questions arise also
regarding the network dynamics. A little bit of noise (not too much!) can improve the
performance, and ensures that the long-time dynamics approaches a steady state.
This makes it possible to analyse the convergence of the algorithms using the central-limit
theorem.

There are many connections to methods used in Mathematical Statistics, such


as Markov-chain Monte-Carlo algorithms and simulated annealing. Certain unsu-
pervised learning algorithms are related to principal component analysis, others to
clustering algorithms such as k -means clustering.


Figure 1.2: Schematic image of a neuron. Dendrites receive input in the form of
electrical signals, via synapses. The signals are processed in the cell body of the neuron.
The output travels from the neural cell body to other neurons through the axon.

1.1 Neural networks


The mammalian brain consists of different regions that perform different tasks. The
cerebral cortex is the outer layer of the mammalian brain. We can think of it as a thin
sheet (about 2 to 5 mm thick) that folds upon itself to increase its surface area. The
cortex is the largest and best developed part of the Human brain. It contains large
numbers of nerve cells, neurons. The Human cerebral cortex contains about $10^{10}$
neurons. They are linked together by nerve strands (axons) that branch and end
in synapses. These synapses are the connections to other neurons. The synapses
connect to dendrites, branched extensions from the neural cell body designed to
receive input from other neurons in the form of electrical signals. A neuron in the
Human brain may have thousands of synaptic connections with other neurons. The
resulting network of connected neurons in the cerebral cortex is responsible for
processing of visual, audio, and sensory data.
Figure 1.1 shows neurons in the cerebral cortex of the macaque, an Asian monkey.
The image displays a silver-stained cross section through the cerebral cortex. The
brown and black parts are the neurons. One can distinguish the cell bodies of the
neural cells, their axons, and their dendrites.
Figure 1.2 shows a more schematic view of a neuron. Information is processed
from left to right. On the left are the dendrites that receive signals and connect
to the cell body of the neuron where the signal is processed. The right part of the
Figure shows the axon, through which the output is sent to the dendrites of other
neurons.
Information is transmitted as an electrical signal. Figure 1.3 shows an example of
the time series of the electric potential for a pyramidal neuron in fish [10]. The time
series consists of an intermittent series of electrical-potential spikes. Quiescent

Figure 1.3: Spike train in electrosensory pyramidal neuron in fish (eigenmannia). Time
series from Ref. [10]. Reproduced by permission of the publisher.

periods without spikes occur when the neuron is inactive; during spike-rich periods
the neuron is active.

1.2 McCulloch-Pitts neurons


In artificial networks, the ways in which information is processed and signals are
transferred are highly simplified. The model we use nowadays for the computational
unit, the artificial neuron, goes back to McCulloch and Pitts [11]. Rosenblatt [12,
13] described how to connect such units in artificial neural networks to process
information. He referred to these networks as perceptrons.
In its simplest form, the model for the artificial neuron has only two states, active
or inactive. The model works as a linear threshold unit: it processes all input signals
and computes an output. If the output exceeds a given threshold then the state of
the neuron is said to be active, otherwise inactive. The model is illustrated in Figure
1.4. Neurons usually perform repeated computations, and one divides up time into
discrete time steps t = 0, 1, 2, 3, . . .. The state of neuron number j at time step t is
denoted by ¨
0 inactive ,
n j (t ) = (1.1)
1 active .
Given the signals n j (t ), neuron number i computes
X ‹
ni (t + 1) = θH wi j n j (t ) − µi . (1.2)
j

As written, this computation is performed for all neurons $i$ in parallel, and the outputs $n_i$ are the inputs to all neurons at the next time step; therefore the outputs have the time argument $t+1$. These steps are repeated many times, resulting in time series of the activity levels of all neurons in the network, referred to as neural dynamics.

Figure 1.4: Schematic diagram of a McCulloch-Pitts neuron. The index of the neuron is $i$, it receives inputs from $N$ other neurons. The strength of the connection from neuron $j$ to neuron $i$ is denoted by $w_{ij}$. The function $\theta_H(b)$ (activation function) is the Heaviside function. It is equal to zero for $b<0$ and equal to unity for $b>0$. The threshold value for neuron $i$ is denoted by $\mu_i$. The index $t = 0,1,2,3,\dots$ labels the discrete time sequence of computation steps.

The procedure described above is called synchronous updating. An alternative is


to choose a neuron randomly (or following a prescribed deterministic rule), and to
update only this one, instead of all together. This scheme is called asynchronous
updating. If there are N neurons, then one synchronous step corresponds to N asyn-
chronous steps, on average. This difference in time scales is not the only difference
between synchronous and asynchronous updating. In general the two schemes
yield different neural dynamics.

Now consider the details of the computation step, Equation (1.2). The function $\theta_H(b)$ is the activation function. Its argument is often referred to as the local field, $b_i(t) = \sum_j w_{ij}\, n_j(t) - \mu_i$. Since the neurons can only assume the states 0/1, the activation function is taken to be the Heaviside function, $\theta_H(b)=0$ if $b<0$ and $\theta_H(b)=1$ if $b>0$ (Figure 1.5). The Heaviside function is not defined at $b=0$. To avoid problems in our computer algorithms, we usually take $\theta_H(0)=1$.
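As an illustration of Equation (1.2), here is a minimal NumPy sketch of one synchronous update step; the weights, thresholds, and initial states are arbitrary made-up values, and $\theta_H(0)=1$ as discussed above.

```python
import numpy as np

def theta_H(b):
    # Heaviside activation: 0 for b < 0, 1 for b > 0; we take theta_H(0) = 1
    return np.where(b >= 0, 1, 0)

def synchronous_step(n, w, mu):
    # Equation (1.2): all neurons i are updated in parallel from the stored inputs n(t)
    b = w @ n - mu            # local fields b_i(t)
    return theta_H(b)         # new states n_i(t+1)

# arbitrary example values: N = 3 neurons
w = np.array([[ 0.0, 1.0, -0.5],
              [ 1.0, 0.0,  0.5],
              [-0.5, 0.5,  0.0]])
mu = np.array([0.5, 0.5, 0.5])   # thresholds mu_i
n  = np.array([1, 0, 1])         # states n_j(t)
print(synchronous_step(n, w, mu))   # [0 1 0]
```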

Equation (1.2) shows that the neuron performs a weighted linear average of the inputs $n_j(t)$. The weights $w_{ij}$ are called synaptic weights. Here the first index, $i$, refers to the neuron that does the computation, and $j$ labels all neurons that connect to neuron $i$. The connection strengths between different pairs of neurons are in general different, reflecting different strengths of the synaptic couplings. When the value of $w_{ij}$ is positive, the coupling is called excitatory. When $w_{ij}$ is negative, the connection is called inhibitory. When $w_{ij}=0$ there is no connection:
$$w_{ij} \;\begin{cases} > 0 & \text{excitatory connection}\,,\\ = 0 & \text{no connection from } j \text{ to } i\,,\\ < 0 & \text{inhibitory connection}\,. \end{cases} \qquad (1.3)$$

Figure 1.5: Heaviside function.
Finally, the threshold for neuron i is denoted by µi .

1.3 Other models for neural computation


The dynamics defined by Equations (1.1) and (1.2) is just a caricature of the time
series of electrical signals in the cortex. For a start, the neural model described in the
previous Section can only assume two states, instead of a continuous range of signal
strengths. While real neurons produce time series of spikes, the model gives rise to
time sequences of zeros and ones. The two states, 0 and 1, are meant to model the
inactive and active periods shown in Figure 1.3. For many computation tasks this is
quite sufficient, and for our purposes it does not matter that the dynamics of real
neurons is so different. The aim is not to model the neural dynamics in the brain,
but to construct computation models inspired by real neural dynamics.
In the course of these lectures it will become apparent that the simplest model
described above must be generalised to achieve certain tasks. The most important
generalisations are the following. Sometimes it is necessary to allow the neuron to
respond continuously to its inputs. To this end one replaces Eq. (1.2) by
$$n_i(t+1) = g\Big(\sum_j w_{ij}\, n_j(t) - \mu_i\Big) \quad \text{for all } i\,. \qquad (1.4)$$

Here g (b ) is a continuous activation function. An example is shown in Figure 1.6.


This dictates that the states assume continuous values too, not just the discrete
values 0 and 1 as given in Equation (1.1).
Equations (1.2) and (1.4) describe synchronous updating schemes, as mentioned
above. At time step t all inputs n j (t ) are stored. All neurons i are simultaneously
updated using the stored inputs.

Figure 1.6: Continuous activation function.

Sometimes asynchronous updating is preferable. At each updating step one chooses a single neuron. Say that we chose neuron number $i$. Then only this neuron is updated, according to the rule
$$n_i(t+1) = \theta_H\Big(\sum_j w_{ij}\, n_j(t) - \mu_i\Big) \quad \text{for one chosen value of } i\,. \qquad (1.5)$$

Different schemes for choosing neurons are used. One possibility is to arrange the
neurons into an array and to update them one by one, in a certain order (type-
writer scheme). A second possibility is to choose randomly which neuron to update.
This introduces stochasticity into the neural dynamics. This is very important, and
we will see that there are different ways of introducing stochasticity. Random asyn-
chronous updating is one example. In many scientific problems it is advantageous
to avoid stochasticity, when randomness is due to errors (multiplicative or additive
noise) that diminish the performance of the system. In neural-network dynamics,
by contrast, stochasticity is often helpful, as we shall see below.

1.4 Summary
Artificial neural networks use a highly simplified model for the fundamental compu-
tation unit, the neuron. In its simplest form, the model is just a binary threshold unit.
The units are linked together by weights wi j , and each unit computes a weighted av-
erage of its inputs. The network performs these computations in sequence. Usually
one considers discrete sequences of computation time steps, t = 0, 1, 2, 3, . . .. Either
all neurons are updated simultaneously in one time step (synchronous updating), or
only one chosen neuron is updated (asynchronous updating). Most neural-network
algorithms are built using the model described in this Chapter.

PART I
HOPFIELD NETWORKS

The Hopfield network [14] is an artificial neural network that can recognise or
reconstruct images. Consider for example the binary images of digits in Figure 2.1.
These images can be stored in the artificial neural network by assigning the weights
wi j in a certain way (called Hebb’s rule). Then one feeds a distorted image of one of
the digits (Figure 2.2) to the network by assigning the initial states of the neurons in
the network to the bits in the distorted image. The idea is that the neural-network
dynamics converges to the correct undistorted digit. In this way the network can
recognise the input as a distorted image of the correct digit (retrieve this digit). The
point is that the network may recognise patterns with many bits very efficiently.
This idea is quite old though. In the past such networks were used to perform
pattern recognition tasks. Today there are more efficient algorithms for this purpose
(Chapter 7).
Yet the first part of these lectures deals with Hopfield networks, for several reasons.
First, Hopfield nets form the basis for more recent algorithms such as Boltzmann
machines [2] and deep-belief networks [2]. Second, all other neural-network algo-
rithms discussed in these lectures are built from the same building blocks and
use learning rules that are closely related to Hebb’s rule. Third, Hopfield networks
can solve optimisation problems, and the resulting algorithm is closely related to
Markov-chain Monte-Carlo algorithms which are much used for a wide range of
problems in Physics and Mathematical Statistics. Fourth, and most importantly, a
certain degree of noise (not too much) can substantially improve the performance
of Hopfield networks, and it is understood in detail why. The reason that so much is
known about the role of noise in Hopfield networks is that they are closely related to
stochastic systems studied in Physics, namely random magnets and spin glasses. The
point is: understanding the effect of noise on the dynamics of Hopfield networks
helps to analyse the performance of other neural-network models.

2 Deterministic Hopfield networks


2.1 Associative memory problem
The pattern-recognition task described above is an example of an associative-memory
problem: there are p images (patterns), each with N bits. Examples for such sets
of patterns are the letters in the alphabet, or the digits shown in Figure 2.1. The
different patterns are labeled by the index $\mu = 1,\dots,p$. The bits of pattern $\mu$ are denoted by $x_i^{(\mu)}$. The index $i$ labels the bits of a given pattern, it ranges from 1
to N . The bits are binary: they can take only the values 0 and 1, as illustrated in
Figure 2.2. To determine the generic properties of the algorithm, one often turns

Figure 2.1: Binary representation of the digits 0 to 4. Each digit has 16 × 10 pixels.

to random patterns where each bit $x_i^{(\mu)}$ is chosen randomly. Each bit takes either value with probability $\tfrac{1}{2}$, and different bits (in the same and in different patterns) are independent. It is convenient to gather the bits of a pattern in a column vector
$$\boldsymbol{x}^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{pmatrix}. \qquad (2.1)$$

In the following, vectors are written in bold math font.


The first part of the problem is to store the patterns x (1) to x (p ) . Second, one feeds
a test pattern x , a distorted version of one of the binary patterns in the problem.
The aim is to determine which one of the stored patterns x (µ) most closely resembles
x . The problem is, in other words, to associate the test pattern with the closest one
of the stored patterns. The formulation of the problem requires to define how close
two given patterns are to each other. One possibility is to use the Hamming distance.
For patterns with 0/1 bits, the Hamming distance $h_\mu$ between the patterns $\boldsymbol{x}$ and $\boldsymbol{x}^{(\mu)}$ is defined as
$$h_\mu \equiv \sum_{i=1}^{N} \Big[ x_i^{(\mu)}\big(1-x_i\big) + \big(1-x_i^{(\mu)}\big)\, x_i \Big]. \qquad (2.2)$$

The Hamming distance equals the number of bits by which the patterns differ.
Two patterns are identical if they have Hamming distance zero. For 0/1 patterns,
Equation (2.2) is equivalent to:
$$\frac{h_\mu}{N} = \frac{1}{N}\sum_{i=1}^{N}\big(x_i^{(\mu)} - x_i\big)^2. \qquad (2.3)$$

This means that the Hamming distance is given by the mean-squared error, summed
over all bits. Note that the Hamming distance does not refer to distortions by trans-
lations, rotations, or shearing. An improved version of the distance involves taking the minimum distance between the patterns subject to all possible translations, rotations, and so forth.

Figure 2.2: Binary image ($N = 160$) of the digit 0, and a distorted version of the same image.
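As a concrete check of Equations (2.2) and (2.3), the following short sketch computes the Hamming distance between a random 0/1 pattern and a distorted copy (the pattern and the number of flipped bits are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 160
x_mu = rng.integers(0, 2, N)   # stored pattern, bits 0/1
x    = x_mu.copy()
x[:5] = 1 - x[:5]              # distort the first five bits

# Equation (2.2): number of bits by which the patterns differ
h = np.sum(x_mu * (1 - x) + (1 - x_mu) * x)

# Equation (2.3): the same quantity, written as a summed squared error
h_alt = np.sum((x_mu - x) ** 2)

print(h, h_alt)   # both equal 5
```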
In summary, the association task is to find the index ν for which the Hamming
distance hν is minimal, hν ≤ hµ for all µ = 1, . . . , p . How can one solve this task using
a neural network? One feeds the distorted pattern x with bits xi into the network by
assigning ni (t = 0) = xi . Assume that x is a distorted version of x (ν) . Now the idea is
to find a set of weights wi j so that the network dynamics converges to the correct
stored pattern:
$$n_i(t) \to x_i^{(\nu)} \quad \text{as} \quad t \to \infty\,. \qquad (2.4)$$

Which weights to choose depends on the patterns x (µ) , so the weights must be
functions of x (µ) . We say that we store these patterns in the network by choosing
the appropriate weights. If the network converges as in Equation (2.4), the pattern
x (ν) is said to be an attractor in the space of all possible states of this network, the
so-called configuration space or state space.

2.2 Hopfield network


Hopfield [14] used a network of McCulloch-Pitts neurons to solve the associative
memory problem described in the previous Section. The states of neuron i in the
Hopfield network take the values
$$S_i(t) = \begin{cases} -1 & \text{inactive}\,, \\ \phantom{-}1 & \text{active}\,, \end{cases} \qquad (2.5)$$

Figure 2.3: Signum function.

instead of 0/1, because this simplifies the mathematical analysis as we shall see.
The transformation from $n_i \in \{0,1\}$ to $S_i \in \{-1,1\}$ is straightforward:
$$S_i = 2 n_i - 1\,. \qquad (2.6)$$
The thresholds are transformed accordingly, $\theta_i = 2\mu_i - \sum_j w_{ij}$. For the bits of the patterns we keep the symbol $x_i^{(\mu)}$, but it is important to remember that now $x_i^{(\mu)} = \pm 1$. To ensure that the $S_i$ can only take the values $\pm 1$, the activation function is taken to be the signum function (Figure 2.3),
$$\operatorname{sgn}(b) = \begin{cases} -1\,, & b < 0\,, \\ +1\,, & b > 0\,. \end{cases} \qquad (2.7)$$

The signum function is not defined at b = 0. To avoid problems in our computer


algorithms we usually define sgn(0) = 1.
The asynchronous update rule takes the form
$$S_i \leftarrow \operatorname{sgn}\Big(\underbrace{\sum_j w_{ij}\, S_j - \theta_i}_{\equiv\, b_i}\Big) \quad \text{for one chosen value of } i\,. \qquad (2.8)$$
As before, $i$ is the index of the chosen neuron. The arrow indicates that $S_i(t+1)$ is assigned the r.h.s. of this equation evaluated at $S_j(t)$. The synchronous update rule reads
$$S_i \leftarrow \operatorname{sgn}\Big(\underbrace{\sum_j w_{ij}\, S_j - \theta_i}_{\equiv\, b_i}\Big)\,, \qquad (2.9)$$

where all bits are updated in parallel. In Eqs. (2.8) and (2.9) the argument of the
activation function is denoted by bi , sometimes called the local field.
Now we need a strategy for choosing the weights wi j , so that the patterns x (µ) are
attractors. If one feeds a pattern x close to x (ν) to the network, we want the network
to converge to x (ν)

S (t = 0) = x ≈ x (ν) ; S (t ) → x (ν) as t → ∞. (2.10)



This means that the network succeeds in correcting a small number of errors. If the
number of errors is too large, the network may converge to another pattern. The
region in configuration space around pattern x (ν) in which all patterns converge to
x (ν) is called the region of attraction of x (ν) .
However, we shall see that it is in general very difficult to prove convergence
according to Eq. (2.10). Therefore we try to answer a different question first: if one
feeds one of the undistorted patterns x (ν) , does the network recognise that it is one
of the stored, undistorted patterns? The network should not make any changes to
x (ν) because all bits are correct:

S (t = 0) = x (ν) ; S (t ) = x (ν) for all t = 0, 1, 2, . . . . (2.11)

Even this question is in general difficult to answer. We therefore consider a simple


limit of the problem first, namely p = 1. There is only one pattern to recognize, x (1) .
A suitable choice of weights wi j is given by Hebb’s rule

$$w_{ij} = \frac{1}{N}\, x_i^{(1)} x_j^{(1)} \quad \text{and} \quad \theta_i = 0\,. \qquad (2.12)$$
We say that the pattern x (1) is stored in the network by assigning the weights wi j using
the rule (2.12). Note that the weights are symmetric, wi j = w j i . To check that the
rule (2.12) does the trick, feed the pattern to the network by assigning $S_j(t=0) = x_j^{(1)}$, and evaluate Equation (2.8):
$$\sum_{j=1}^{N} w_{ij}\, x_j^{(1)} = \frac{1}{N}\sum_{j=1}^{N} x_i^{(1)} x_j^{(1)} x_j^{(1)} = \frac{1}{N}\sum_{j=1}^{N} x_i^{(1)}\,. \qquad (2.13)$$
The last equality follows because $x_j^{(1)}$ can only take the values $\pm 1$, so that $\big(x_j^{(1)}\big)^2 = 1$. The remaining sum over $j$ evaluates to $N$ and cancels the factor $1/N$, so that
$$\operatorname{sgn}\Big(\sum_{j=1}^{N} w_{ij}\, x_j^{(1)}\Big) = x_i^{(1)}\,. \qquad (2.14)$$
Recall that $x_i^{(\mu)} = \pm 1$, so that $\operatorname{sgn}\big(x_i^{(\mu)}\big) = x_i^{(\mu)}$. Comparing Equation (2.14) with the update rule (2.8) shows that the bits $x_j^{(1)}$ of the pattern $\boldsymbol{x}^{(1)}$ remain unchanged under the update, as required by Eq. (2.11). The network recognises the pattern as a stored one, so Hebb’s rule (2.12) does what we asked.
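The calculation (2.13)–(2.14) is easy to verify numerically. A minimal sketch, assuming one random $\pm 1$ pattern:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x1 = rng.choice([-1, 1], size=N)        # one stored pattern, bits +/-1

W = np.outer(x1, x1) / N                # Hebb's rule (2.12), theta_i = 0

# Equation (2.14): the stored pattern is not changed by the update
S_new = np.sign(W @ x1)
print(np.array_equal(S_new, x1))        # True
```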
But does the network correct small errors? In other words, is the pattern x (1) an
attractor [Eq. (2.10)]? This question cannot be answered in general. Yet in practice
Hopfield models work often very well! It is a fundamental insight that neural net-
works may work well although it is impossible to strictly prove that their dynamics
converges to the correct solution.

Figure 2.4: Hopfield network with $N = 4$ neurons. (a) Network layout: four neurons (circles) with symmetric connections (arrows). (b) Pattern $\boldsymbol{x}^{(1)\mathsf{T}} = (1, -1, -1, 1)$, i.e. $x_1^{(1)} = x_4^{(1)} = 1$ and $x_2^{(1)} = x_3^{(1)} = -1$. Here $\mathsf{T}$ denotes the transpose of the column vector $\boldsymbol{x}^{(1)}$.

To illustrate the difficulties consider an example, a Hopfield network with p = 1


and N = 4 (Figure 2.4). Store the pattern x (1) shown in Figure 2.4 by assigning the
weights wi j using Hebb’s rule (2.12). Now feed a distorted pattern x to the network
that has a non-zero distance to x (1) :
$$h_1 = \frac{1}{4}\sum_{i=1}^{4}\big(x_i^{(1)} - x_i\big)^2 > 0\,. \qquad (2.15)$$
The factor $\tfrac{1}{4}$ takes into account that the patterns take the values $\pm 1$ and not 0/1 as in
Section 2.1. To feed the pattern to the network, one sets Si (t = 0) = xi . Now iterate the
dynamics using synchronous updating (2.9). Results for different distorted patterns
are shown in Figure 2.5. We see that the first two distorted patterns (distance 1)
converge to the stored pattern, cases (a) and (b). But the third distorted pattern
does not [case (c)].
To understand this behaviour it is most convenient to analyse the synchronous
dynamics using the weight matrix
$$W = \frac{1}{N}\, \boldsymbol{x}^{(1)} \boldsymbol{x}^{(1)\mathsf{T}}\,. \qquad (2.16)$$
Here x (1)T denotes the transpose of the column vector x (1) , so that x (1)T is a row vector.
The standard rules for matrix multiplication apply also to column and row vectors,
they are just N × 1 and 1 × N matrices. This means that the product on the r.h.s. of
Equation (2.16) is an N × N matrix. In the following, matrices with elements A i j or
Bi j are written as A, B, and so forth. The product in Equation (2.16) is also referred
to as an outer product. The product
$$\boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{x}^{(1)} = \sum_{j=1}^{N} \big[x_j^{(1)}\big]^2 = N\,, \qquad (2.17)$$
by contrast, is just a number (equal to $N$). The product (2.17) is also called scalar product. It is denoted by $\boldsymbol{x}^{(1)} \cdot \boldsymbol{x}^{(1)} = \boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{x}^{(1)}$.

Figure 2.5: Reconstruction of a distorted image (left). Under synchronous updating (2.9) the first two distorted images (a) and (b) converge to the stored pattern $\boldsymbol{x}^{(1)}$ (right), but pattern (c) does not.

Using Equation (2.17) we see that $W$ projects onto the vector $\boldsymbol{x}^{(1)}$,
$$W \boldsymbol{x}^{(1)} = \boldsymbol{x}^{(1)}\,. \qquad (2.18)$$

In the same way we can show that the matrix W is idempotent:

$$W^{\,n} = W \quad \text{for} \quad n = 1, 2, 3, \dots\,. \qquad (2.19)$$

Equations (2.18) and (2.19) mean that the network recognises the pattern x (1) as
the stored one. The pattern is not updated [Eq. (2.11)]. This example illustrates the
general proof, Equations (2.13) and (2.14).
Now consider the distorted pattern (a) in Figure 2.5. We feed this pattern to the
network by assigning
−1
 
−1
S (t = 0) =   . (2.20)
−1
1

To compute one step in the synchronous dynamics (2.9) we simply apply W to


S (t = 0). This is done in two steps, using the outer-product form (2.16) of the weight
matrix. We first multiply $\boldsymbol{S}(t=0)$ with $\boldsymbol{x}^{(1)\mathsf{T}}$ from the left,
$$\boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{S}(t=0) = (1, -1, -1, 1) \begin{pmatrix} -1 \\ -1 \\ -1 \\ \phantom{-}1 \end{pmatrix} = 2\,, \qquad (2.21)$$
and then we multiply this result with $\boldsymbol{x}^{(1)}$. This gives:
$$W \boldsymbol{S}(t=0) = \tfrac{1}{2}\, \boldsymbol{x}^{(1)}\,. \qquad (2.22)$$
The signum of the $i$-th component of the vector $W \boldsymbol{S}(t=0)$ yields $S_i(t=1)$:
$$S_i(t=1) = \operatorname{sgn}\Big(\sum_{j=1}^{N} w_{ij}\, S_j(t=0)\Big) = x_i^{(1)}\,. \qquad (2.23)$$

This means that the state of the network converges to the stored pattern, in one
synchronous update. Since W is idempotent, the network stays there: the pattern
x (1) is an attractor. Case (b) in Figure 2.5 works in a similar way.
Now look at case (c), where the network fails to converge to the stored pattern.
We feed this pattern to the network by assigning S (t = 0) = [−1, 1, −1, −1]T . For one
iteration of the synchronous dynamics we first evaluate

$$\boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{S}(0) = (1, -1, -1, 1) \begin{pmatrix} -1 \\ \phantom{-}1 \\ -1 \\ -1 \end{pmatrix} = -2\,. \qquad (2.24)$$
It follows that
$$W \boldsymbol{S}(t=0) = -\tfrac{1}{2}\, \boldsymbol{x}^{(1)}\,. \qquad (2.25)$$
Using the update rule (2.9) we find
$$\boldsymbol{S}(t=1) = -\boldsymbol{x}^{(1)}\,. \qquad (2.26)$$
Equation (2.19) implies that
$$\boldsymbol{S}(t) = -\boldsymbol{x}^{(1)} \quad \text{for} \quad t \geq 1\,. \qquad (2.27)$$

Thus the network shown in Figure 2.4 has two attractors, the pattern x (1) as well as
the inverted pattern −x (1) . This is a general property of McCulloch-Pitts dynamics
with Hebb’s rule: if x (1) is an attractor, then the pattern −x (1) is an attractor too. But
one ends up in the correct pattern $\boldsymbol{x}^{(1)}$ when more than half of the bits in $\boldsymbol{S}(t=0)$ are correct.
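The $N=4$ example above can be reproduced with a few lines of code. The following sketch uses synchronous updating (2.9), Hebb's rule (2.12), and the convention $\operatorname{sgn}(0)=1$:

```python
import numpy as np

x1 = np.array([1, -1, -1, 1])          # stored pattern of Figure 2.4
W  = np.outer(x1, x1) / 4              # Hebb's rule (2.12), outer-product form (2.16)

def sgn(b):
    # signum function with the convention sgn(0) = 1
    return np.where(b >= 0, 1, -1)

def synchronous_update(S):
    # one step of Equation (2.9) with thresholds theta_i = 0
    return sgn(W @ S)

# distorted patterns: cases (a) and (c) discussed in the text
S_a = np.array([-1, -1, -1, 1])        # one wrong bit: converges to  x1
S_c = np.array([-1, 1, -1, -1])        # three wrong bits: converges to -x1

print(synchronous_update(S_a))         # [ 1 -1 -1  1]  =  x1
print(synchronous_update(S_c))         # [-1  1  1 -1]  = -x1
```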

In summary we have shown that Hebb’s rule (2.12) allows the Hopfield network
to recognise a stored pattern: if we feed the stored pattern without any distortions
to the network, then it does not change the bits. This does not mean, however, that
the network recognises distorted patterns. It may or may not converge to the correct
pattern. We expect that convergence is more likely when the number of wrong bits
is small. If all distorted patterns near the stored pattern x (1) converge to x (1) then we
say that x (1) is an attractor. If x (1) is an attractor, then −x (1) is too.
When there is more than one pattern, Hebb’s rule (2.12) must be generalised. A natural guess is to simply sum Equation (2.12) over the stored patterns:
$$w_{ij} = \frac{1}{N}\sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)} \quad \text{and} \quad \theta_i = 0 \qquad (2.28)$$
(Hebb’s rule for $p > 1$ patterns). As for $p=1$ the weight matrix is symmetric, $W = W^{\mathsf{T}}$, so that $w_{ij} = w_{ji}$. The diagonal weights are not zero in general. An alternative version of Hebb’s rule [2] sets the diagonal weights to zero:
$$w_{ij} = \frac{1}{N}\sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)} \;\;\text{for}\;\; i \neq j\,, \quad w_{ii} = 0\,, \quad \text{and} \quad \theta_i = 0\,. \qquad (2.29)$$
If we store only one pattern, $p = 1$, this modified Hebb’s rule (2.29) satisfies Equation (2.11). In this Section we use Equation (2.29).
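As a sketch, the two variants (2.28) and (2.29) of Hebb's rule can be constructed from an array holding one random pattern per row (the values of $p$ and $N$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 5, 100
X = rng.choice([-1, 1], size=(p, N))   # p random patterns, one per row

# Equation (2.28): sum of outer products, diagonal not zero
W_full = (X.T @ X) / N                 # w_ij = (1/N) sum_mu x_i^mu x_j^mu

# Equation (2.29): same off-diagonal weights, but w_ii = 0
W_zero = W_full.copy()
np.fill_diagonal(W_zero, 0.0)

print(W_full[0, 0], W_zero[0, 0])      # p/N = 0.05 versus 0.0
```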
If we assign the weights according to Equation (2.29), does the network recognise
distorted patterns? We saw in the previous Section that this question is difficult
to answer in general, even for p = 1. Therefore we ask, first, whether the network
recognises the stored pattern x (ν) . The question is whether
$$\operatorname{sgn}\Big(\underbrace{\frac{1}{N}\sum_{j \neq i}\sum_{\mu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}}_{\equiv\, b_i^{(\nu)}}\Big) = x_i^{(\nu)}\,. \qquad (2.30)$$

To check whether Equation (2.30) holds, we must repeat the calculation described
on page 14. As a first step we evaluate the argument of the signum function,

$$b_i^{(\nu)} = \Big(1 - \tfrac{1}{N}\Big)\, x_i^{(\nu)} + \frac{1}{N}\sum_{j \neq i}\sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}\,. \qquad (2.31)$$

Here we have split the sum over the patterns into two contributions. The first term
corresponds to µ = ν, where ν refers to the pattern that was fed to the network, the
one that we want the network to recognise. The second term in Equation (2.31)

contains the sum over the remaining patterns. For large $N$ we can approximate $\big(1-\tfrac{1}{N}\big) \approx 1$. It follows that condition (2.30) is satisfied if the second term in (2.31) does not affect the sign of the r.h.s. of this Equation. This second term is called the cross-talk term.
Whether adding the cross-talk term to x (ν) changes the signum of the r.h.s. of
Equation (2.31) or not, depends on the stored patterns. Since the cross-talk term
contains a sum over µ we may expect that the cross-talk term does not matter if p is
small enough. If this is true for all i and ν then all p stored patterns are recognised.
Furthermore, by analogy with the example described in the previous Section, we
may expect that the stored patterns are then also attractors, so that slightly distorted
patterns converge to the correct stored pattern, patterns close to x (ν) converge to
x (ν) under the network dynamics (but this is not guaranteed).
For a more quantitative analysis of the effect of the cross-talk term we store
patterns with random bits (random patterns),
$$\text{Prob}\big(x_i^{(\nu)} = \pm 1\big) = \tfrac{1}{2}\,. \qquad (2.32)$$

Different bits (different values of i and/or µ) are assigned independent random


values. This means that different patterns are uncorrelated because their covariance
vanishes:
$$\big\langle x_i^{(\mu)} x_j^{(\nu)} \big\rangle = \delta_{ij}\, \delta_{\mu\nu}\,. \qquad (2.33)$$
Here $\langle\cdots\rangle$ denotes an ensemble average over many realisations of random patterns, and $\delta_{ij}$ is the Kronecker delta, equal to unity if $i=j$ but zero otherwise. Note that it follows from Equation (2.32) that $\big\langle x_j^{(\mu)} \big\rangle = 0$.
We now ask: what is the probability that the cross-talk term changes the signum
of the r.h.s. of Equation (2.31)? In other words, what is the probability that the
network produces a wrong bit in one asynchronous update, if all bits were initially
correct? The magnitude of the cross-talk term does not matter when it has the same sign as $x_i^{(\nu)}$. If it has a different sign, then the cross-talk term may matter. It does if its magnitude is larger than unity (the magnitude of $x_i^{(\nu)}$). To simplify the analysis one wants to avoid having to distinguish between the two cases, whether or not the cross-talk term has the same sign as $x_i^{(\nu)}$. To this end one defines
$$C_i^{(\nu)} \equiv -x_i^{(\nu)}\, \underbrace{\frac{1}{N}\sum_{j \neq i}\sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}}_{\text{cross-talk term}}\,. \qquad (2.34)$$
If $C_i^{(\nu)} < 0$ then the cross-talk term has the same sign as $x_i^{(\nu)}$, so that it does not matter. If $0 < C_i^{(\nu)} < 1$ it does not matter either; it matters only when $C_i^{(\nu)} > 1$. The network produces an error in updating neuron $i$ if $C_i^{(\nu)} > 1$ for particular values $i$ and $\nu$: if initially

$S_i(0) = x_i^{(\nu)}$, the sign of the bit changes under the update although it should not, so that an error results.
How frequently does this happen? For random patterns we can answer this question by computing the one-step ($t=1$) error probability:
$$P_{\text{error}}^{t=1} = \text{Prob}\big(C_i^{(\nu)} > 1\big)\,. \qquad (2.35)$$
Since patterns and bits are identically distributed, $\text{Prob}\big(C_i^{(\nu)} > 1\big)$ does not depend on $i$ or $\nu$. Therefore $P_{\text{error}}^{t=1}$ does not carry any indices.
How does $P_{\text{error}}^{t=1}$ depend on the parameters of the problem, $p$ and $N$? When both $p$ and $N$ are large we can use the central-limit theorem to answer this question. Since different bits/patterns are independent, we can think of $C_i^{(\nu)}$ as a sum of independent random numbers $c_m$ that take the values $-1$ and $+1$ with equal probabilities,
$$C_i^{(\nu)} = -\frac{1}{N}\sum_{j \neq i}\sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)} x_i^{(\nu)} = -\frac{1}{N}\sum_{m=1}^{(N-1)(p-1)} c_m\,. \qquad (2.36)$$

There are $M = (N-1)(p-1)$ terms in the sum on the r.h.s. because terms with $\mu = \nu$ are excluded, and also those with $j = i$ [Equation (2.29)]. If we use Equation (2.28) instead, then there is a correction to Equation (2.36) from the diagonal weights. For $p \ll N$ this correction is small.
When $p$ and $N$ are large, the sum $\sum_m c_m$ contains a large number of independently identically distributed random numbers with mean zero and variance unity. It follows from the central-limit theorem that $\frac{1}{N}\sum_m c_m$ is Gaussian distributed with mean zero, and with variance
$$\sigma_C^2 = \frac{1}{N^2}\Big\langle \Big(\sum_{m=1}^{M} c_m\Big)^2 \Big\rangle = \frac{1}{N^2}\sum_{n=1}^{M}\sum_{m=1}^{M} \langle c_n c_m \rangle\,. \qquad (2.37)$$
Here $\langle\cdots\rangle$ denotes an average over the ensemble of realisations of $c_m$. Since the random numbers $c_m$ are independent for different indices, $\langle c_n c_m \rangle = \delta_{nm}$. So only the diagonal terms in the double sum contribute, summing up to $M \approx N p$. Therefore
$$\sigma_C^2 \approx \frac{p}{N}\,. \qquad (2.38)$$
One way of showing that the distribution of $\sum_m c_m$ is approximately Gaussian is to represent it in terms of Bernoulli trials. The sum $\sum_{m=1}^{M} c_m$ equals $2k - M$, where $k$ is the number of occurrences of $+1$ in the sum. Since the probability of $c_m = \pm 1$ is $\tfrac{1}{2}$, the probability of drawing $k$ times $+1$ and $M-k$ times $-1$ is
$$P_{k,M} = \binom{M}{k} \Big(\tfrac{1}{2}\Big)^{k} \Big(\tfrac{1}{2}\Big)^{M-k}. \qquad (2.39)$$
Figure 2.6: Gaussian distribution of $C_i^{(\nu)}$. The hashed area (to the right of $C_i^{(\nu)} = 1$) equals the error probability $P_{\text{error}}^{t=1}$.

Here
$$\binom{M}{k} = \frac{M!}{k!\,(M-k)!} \qquad (2.40)$$
is the number of ways in which $k$ occurrences of $+1$ can be distributed over $M$ places. We expect that the quantity $2k-M$ is Gaussian distributed with mean zero and variance $M$. To demonstrate this, it is convenient to use the variable $z = (2k-M)/\sqrt{M}$, which is Gaussian with mean zero and unit variance. Therefore we substitute $k = \tfrac{M}{2} + \tfrac{\sqrt{M}}{2}\, z$ into Equation (2.39) and take the limit of large $M$ using Stirling’s approximation
$$n! = \mathrm{e}^{\,n \log n - n + \frac{1}{2}\log 2\pi n}\,. \qquad (2.41)$$
Expanding $P_{k,M}$ to leading order in $M^{-1}$, assuming that $z$ remains of order unity, gives $P_{k,M} = \sqrt{2/(\pi M)}\, \exp(-z^2/2)$. From $P(z)\,\mathrm{d}z = P(k)\,\mathrm{d}k$ it follows that $P(z) = (\sqrt{M}/2)\, P(k)$, so that $P(z) = (2\pi)^{-1/2}\exp(-z^2/2)$. So the distribution of $z$ is Gaussian with zero mean and unit variance, as we intended to show.
In summary, the distribution of $C$ is Gaussian,
$$P(C) = \big(2\pi\sigma_C^2\big)^{-1/2} \exp\big[-C^2/\big(2\sigma_C^2\big)\big]\,, \qquad (2.42)$$
with mean zero and variance $\sigma_C^2 \approx \frac{p}{N}$, as illustrated in Figure 2.6. To determine the one-step error probability we must integrate this distribution from 1 to $\infty$:
$$P_{\text{error}}^{t=1} = \frac{1}{\sqrt{2\pi}\,\sigma_C} \int_1^{\infty} \mathrm{d}C\; \mathrm{e}^{-\frac{C^2}{2\sigma_C^2}} = \frac{1}{2}\bigg[1 - \operatorname{erf}\bigg(\sqrt{\frac{N}{2p}}\,\bigg)\bigg]\,. \qquad (2.43)$$

Here erf is the error function, defined as
$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z \mathrm{d}x\; \mathrm{e}^{-x^2}\,. \qquad (2.44)$$
This function is tabulated. Since $\operatorname{erf}(z)$ increases as $z$ increases, we conclude that $P_{\text{error}}^{t=1}$ increases as $p$ increases, or as $N$ decreases. This is expected: it is more difficult for the network to recognise stored patterns when there are more of them. On the
other hand, it is easier to distinguish between stored patterns if they have more bits. We also see that the one-step error probability depends on $p$ and $N$ only through the combination
$$\alpha \equiv \frac{p}{N}\,. \qquad (2.45)$$
The parameter $\alpha$ is called the storage capacity of the network. Figure 2.7 shows how $P_{\text{error}}^{t=1}$ depends on the storage capacity. Take $\alpha = 0.185$ for example. Then the one-step error probability (the probability of an error in one asynchronous attempt to update a bit) is about 1%.

Figure 2.7: Dependence of the one-step error probability on the storage capacity $\alpha$ according to Equation (2.43), schematic.
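Equation (2.43) can be checked against a direct Monte-Carlo estimate of $\text{Prob}\big(C_i^{(\nu)} > 1\big)$ for random patterns. The sketch below does this for one illustrative choice of $p$ and $N$; the two numbers should agree up to Monte-Carlo noise and the approximation $M \approx Np$ made above.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
p, N, trials = 20, 100, 5000

errors = 0
for _ in range(trials):
    X = rng.choice([-1, 1], size=(p, N))            # random patterns, Eq. (2.32)
    # cross-talk quantity C_i^(nu) of Eq. (2.34), evaluated for i = 0 and nu = 0
    cross_talk = (X[1:, 0] @ (X[1:, 1:] * X[0, 1:])).sum() / N
    C = -X[0, 0] * cross_talk
    errors += (C > 1)

print("Monte-Carlo estimate:", errors / trials)
print("Equation (2.43):     ", 0.5 * (1 - erf(sqrt(N / (2 * p)))))
```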
The error probability defined in this Section refers only to the initial update, the first iteration. What happens in the next iteration, and after many iterations? Numerical experiments show that the error probability can be much higher in later iterations, because more errors tend to increase the probability of making another error. The estimate $P_{\text{error}}^{t=1}$ is a lower bound.
Also: realistic patterns are not random with independent bits. We nevertheless expect that $P_{\text{error}}^{t=1}$ describes the typical one-step error probability of the Hopfield network when $p$ and $N$ are large. However, it is straightforward to construct counter-examples. Consider for example orthogonal patterns:
$$\boldsymbol{x}^{(\mu)} \cdot \boldsymbol{x}^{(\nu)} = 0 \quad \text{for} \quad \mu \neq \nu\,. \qquad (2.46)$$
The cross-talk term vanishes in this case, so that $P_{\text{error}}^{t=1} = 0$.

2.3 Energy function


The energy function is defined as

$$H = -\frac{1}{2}\sum_{ij} w_{ij}\, S_i S_j\,. \qquad (2.47)$$

The name comes from an analogy to spin systems in Physics. An alternative name for
H is Hamiltonian. The energy function (2.47) allows us to analyse the convergence of
the dynamics of the Hopfield model. More generally, energy functions are important
tools in analysing the convergence of different kinds of neural networks. A second
reason for considering the energy function is that it allows us to derive Hebb’s rule
in a different way.
We can write the energy function as
$$H = -\frac{1}{2}\sum_{\langle ij \rangle} (w_{ij} + w_{ji})\, S_i S_j + \text{const.} \qquad (2.48)$$

The constant is independent of Si and S j . Further, 〈i j 〉 denotes that the sum is


performed over connections (or bonds) between pairs of neurons. Note that Hebb’s
rule yields symmetric weights, wi j = w j i . For symmetric weights it follows that H
cannot increase under the dynamics of the Hopfield model. In each step H either
decreases, or it remains constant. To show this consider the update
$$S_i' = \operatorname{sgn}\Big(\sum_j w_{ij}\, S_j\Big)\,. \qquad (2.49)$$

There are two possibilities, either $S_i' = S_i$ or $S_i' = -S_i$. In the first case $H$ remains unchanged, $H' = H$. Here $H'$ refers to the value of the energy function after the update (2.49). The other case is $S_i' = -S_i$. Then
$$H' - H = -\frac{1}{2}\sum_j (w_{ij} + w_{ji})\big(S_i' S_j - S_i S_j\big) = \sum_j (w_{ij} + w_{ji})\, S_i S_j\,. \qquad (2.50)$$
The sum goes over all neurons $j$ that are connected to the neuron $i$ that is updated in Equation (2.49). Now if the weights are symmetric, $H' - H$ equals
$$H' - H = 2\sum_j w_{ij}\, S_i S_j\,. \qquad (2.51)$$
Since the sign of $\sum_j w_{ij} S_j$ is that of $S_i'$, and since the sign of $S_i'$ differs from that of $S_i$, it follows from Equation (2.51) that
$$H' - H < 0\,. \qquad (2.52)$$

So either H remains constant, or its value decreases in one update step.
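To see this monotonicity in action, the following sketch tracks $H$ of Equation (2.47) during random asynchronous updates; the symmetric weights are random made-up values, with zero diagonal as in Equation (2.29).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20
A = rng.normal(size=(N, N))
W = (A + A.T) / 2                      # symmetric weights, as required by the argument
np.fill_diagonal(W, 0.0)

S = rng.choice([-1, 1], size=N)        # random initial state

def energy(S):
    # Equation (2.47) with theta_i = 0
    return -0.5 * S @ W @ S

H_old = energy(S)
for _ in range(200):                   # random asynchronous updates, Eq. (2.8)
    i = rng.integers(N)
    S[i] = 1 if W[i] @ S >= 0 else -1  # sgn with the convention sgn(0) = 1
    H_new = energy(S)
    assert H_new <= H_old + 1e-12      # H never increases
    H_old = H_new
print("final energy:", H_old)
```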


Since the energy H cannot increase in the Hopfield dynamics, we see that minima
of the energy function must correspond to attractors, as illustrated schematically in
Figure 2.8. The state space of the network – corresponding to all possible choices of
$(S_1, \dots, S_N)$ – is illustrated schematically by a single axis, the $x$-axis. But when $N$ is large, the state space is really very high-dimensional.

Figure 2.8: Minima in the energy function are attractors in state space. Not all minima correspond to stored patterns, and stored patterns need not correspond to minima.
Not all stored patterns are attractors. Our analysis of the cross-talk term showed
this. If the cross-talk term causes errors for a certain stored pattern that is fed into
the network, then this pattern is not located at a minimum of the energy function.
Conversely there may be minima that do not correspond to stored patterns. Such
states are referred to as spurious states. The network may converge to spurious
states; this is undesirable but inevitable.
We now turn to an alternative derivation of Hebb’s rule that uses the energy
function. To keep things simple we assume p = 1 at first. The idea is to write down
an energy function that assumes a minimum at the stored pattern x (1) :
$$H = -\frac{1}{2N}\Big(\sum_{i=1}^{N} S_i\, x_i^{(1)}\Big)^2. \qquad (2.53)$$
This function is minimal when $S_i = x_i^{(1)}$ for all $i$ (and also when $S_i = -x_i^{(1)}$). You will see in a moment why the factor $1/(2N)$ is inserted. Now we evaluate the expression on the r.h.s.:
$$H = -\frac{1}{2N}\Big(\sum_i S_i\, x_i^{(1)}\Big)\Big(\sum_j S_j\, x_j^{(1)}\Big) = -\frac{1}{2}\sum_{ij} \underbrace{\frac{1}{N}\, x_i^{(1)} x_j^{(1)}}_{=\,w_{ij}}\, S_i S_j\,. \qquad (2.54)$$

This shows that the function H has the same form as the energy function (2.47) for
the Hopfield model, if we assign the weights wi j according to Hebb’s rule (2.12).
Thus this argument provides an alternative motivation for this rule: we write down
an energy function that has a minimum at the stored pattern x (1) . This ensures that

this pattern is an attractor. Evaluating the function we see that it corresponds to


choosing the weights according to Hebb’s rule (2.12).
We know that this strategy can fail when p > 1. How can this happen? For p > 1
the analogue of Equation (2.53) is

p  N ‹2
1 X X (µ)
H =− Si xi , (2.55)
2N µ=1 i =1

Here the patterns x (ν) are not necessarily minima of H , because a maximal value
PN (ν) 2

of S x
i =1 i i
may be compensated by terms stemming from other patterns. But
one can hope that this happens rarely when p is small (Section 2.2).

2.4 Spurious states


Stored patterns may be minima of the energy function (attractors), but they need
not be. In addition there can be other attractors (spurious states), different from the
stored patterns. For example, since H is invariant under S → −S , it follows that the
pattern $-\boldsymbol{x}^{(\mu)}$ is an attractor if $\boldsymbol{x}^{(\mu)}$ is an attractor. We consider the inverted patterns as spurious states. The network may converge to the inverted patterns, as we saw in Section 2.2.
There are other types of spurious states. An example is linear combinations of an odd number $n$ of patterns. Such states are called mixed states. For $n = 3$, for example, the bits are given by
$$x_i^{(\text{mix})} = \operatorname{sgn}\big(\pm x_i^{(1)} \pm x_i^{(2)} \pm x_i^{(3)}\big)\,. \qquad (2.56)$$
Mixed states come in large numbers, $2^{2n+1}\binom{p}{2n+1}$, the more the larger $n$.
p


It is difficult to determine under which circumstances the network dynamics


converges to a certain mixed state. But we can at least check whether a mixed state
is recognised by the network (although we do not want it to do that). As an example
consider the mixed state

$$x_i^{(\text{mix})} = \operatorname{sgn}\big(x_i^{(1)} + x_i^{(2)} + x_i^{(3)}\big)\,. \qquad (2.57)$$
To check whether this state is recognised, we must determine whether or not
$$\operatorname{sgn}\Big(\frac{1}{N}\sum_{\mu=1}^{p}\sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\text{mix})}\Big) = x_i^{(\text{mix})}\,, \qquad (2.58)$$
x_j^(1)  x_j^(2)  x_j^(3)  x_j^(mix)   s_1   s_2   s_3
   1        1        1        1         1     1     1
   1        1       -1        1         1     1    -1
   1       -1        1        1         1    -1     1
   1       -1       -1       -1        -1     1     1
  -1        1        1        1        -1     1     1
  -1        1       -1       -1         1    -1     1
  -1       -1        1       -1         1     1    -1
  -1       -1       -1       -1         1     1     1

Table 2.1: Signs of $s_\mu = x_j^{(\mu)} x_j^{(\text{mix})}$ for all $2^3$ combinations of the bits $x_j^{(1)}, x_j^{(2)}, x_j^{(3)}$.

under the update (2.8) using Hebb’s rule (2.28). To this end we split the sum in the
usual fashion

$$\frac{1}{N}\sum_{\mu=1}^{p}\sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\text{mix})} = \sum_{\mu=1}^{3} x_i^{(\mu)}\, \frac{1}{N}\sum_{j=1}^{N} x_j^{(\mu)} x_j^{(\text{mix})} + \text{cross-talk term}\,. \qquad (2.59)$$
Let us ignore the cross-talk term for the moment and check whether the first term reproduces $x_i^{(\text{mix})}$. To make progress we assume random patterns [Equation (2.32)], and compute the probability that the sum on the r.h.s. of Equation (2.59) yields $x_i^{(\text{mix})}$. Bits $i$ and $j$ are uncorrelated, and the sum over $j$ on the r.h.s. of Equation (2.59) is an average over $s_\mu = x_j^{(\mu)} x_j^{(\text{mix})}$. Table 2.1 lists all possible combinations of the bits of pattern $j$ and the corresponding values of $s_\mu$. We see that on average $\langle s_\mu \rangle = \tfrac{1}{2}$, so that
$$\frac{1}{N}\sum_{\mu=1}^{p}\sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\text{mix})} = \frac{1}{2}\sum_{\mu=1}^{3} x_i^{(\mu)} + \text{cross-talk term}\,. \qquad (2.60)$$

Neglecting the cross-talk term and taking the sgn-function we see that x (mix) is
reproduced. So mixed states such as (2.57) are recognised, at least for small α, and
it may happen that the network converges to these states, loosely referred to as
superpositions of odd numbers of patterns.
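The average $\langle s_\mu \rangle = \tfrac{1}{2}$ read off from Table 2.1 can be verified by enumerating all $2^3$ bit combinations, as in this small sketch:

```python
import itertools
import numpy as np

s_values = []
for bits in itertools.product([-1, 1], repeat=3):     # all 2^3 combinations of x_j^(1..3)
    x_mix = np.sign(sum(bits))                        # Equation (2.57); the sum is odd, never 0
    s_values.append([b * x_mix for b in bits])        # s_mu = x_j^(mu) * x_j^(mix)

print(np.mean(s_values, axis=0))                      # [0.5 0.5 0.5], i.e. <s_mu> = 1/2
```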
Finally, for large values of $p$ there are local minima of $H$ that are not correlated with any number of the stored patterns $\boldsymbol{x}^{(\mu)}$. Such spin-glass states are discussed further in the book by Hertz, Krogh and Palmer [1].

Algorithm 1 pattern recognition with deterministic Hopfield model
1: store patterns $\boldsymbol{x}^{(\mu)}$ using Hebb’s rule;
2: feed distorted pattern $\boldsymbol{x}$ into network by assigning $S_i(t=0) \leftarrow x_i$;
3: for $t = 1, \dots, T$ do
4:   choose a value of $i$ and update $S_i(t) \leftarrow \operatorname{sgn}\big(\sum_j w_{ij}\, S_j(t-1)\big)$;
5: end for
6: read out pattern $\boldsymbol{S}(T)$;
7: end;
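A possible Python implementation of Algorithm 1 (a sketch, using the zero-diagonal Hebb rule (2.29) and random asynchronous updates; the stored patterns and the number of distorted bits are illustrative choices):

```python
import numpy as np

def hopfield_retrieve(x_distorted, X, T=10_000, seed=0):
    """Algorithm 1: asynchronous deterministic Hopfield dynamics."""
    rng = np.random.default_rng(seed)
    p, N = X.shape
    W = (X.T @ X) / N                       # Hebb's rule (2.28) ...
    np.fill_diagonal(W, 0.0)                # ... with zero diagonal, Eq. (2.29)
    S = x_distorted.copy()                  # step 2: feed the distorted pattern
    for _ in range(T):                      # steps 3-5: asynchronous updates
        i = rng.integers(N)
        S[i] = 1 if W[i] @ S >= 0 else -1   # Eq. (2.8) with sgn(0) = 1
    return S                                # step 6: read out the pattern

# illustrative usage: store 3 random patterns, distort one by flipping 10 bits
rng = np.random.default_rng(1)
p, N = 3, 200
X = rng.choice([-1, 1], size=(p, N))
x = X[0].copy()
flip = rng.choice(N, size=10, replace=False)
x[flip] *= -1

S = hopfield_retrieve(x, X)
print("bits recovered:", int(np.sum(S == X[0])), "out of", N)
```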

2.5 Summary
We have analysed the dynamics of the Hopfield network as a means of solving the
associative memory problem (Algorithm 1). The Hopfield network is a network of
McCulloch-Pitts neurons. Its layout is defined by connection strengths wi j , chosen
according to Hebb’s rule. These weights wi j are symmetric, and the network is in
general fully connected. Hebb’s rule ensures that stored patterns x (µ) are recognised,
at least most of the time if the storage capacity is not too large. A single-step estimate
for the error probability was given in Section 2.2. If one iterates several steps, the
error probability is generally much larger, and it is difficult to evaluate it. It turns out
that it is much simpler to compute the error probability when noise is introduced
into the network dynamics (not just random patterns).

2.6 Exercises
Modified Hebb’s rule. Show that the modified Hebb’s rule (2.29) satisfies Equation (2.11) if we store only one pattern, $p = 1$.

Orthogonal patterns. Show that the cross-talk term vanishes for orthogonal patterns, so that $P_{\text{error}}^{t=1} = 0$.

Correction to cross-talk term. Evaluate the magnitude of the correction term in Equation (2.36) and show that it is negligible if $p \ll N$. Show that the correction term vanishes if we set the diagonal weights to zero, $w_{ii} = 0$. Compare the error probabilities for large values of $p/N$ when $w_{ii} = 0$ and when $w_{ii} \neq 0$. Explain why the error probability for large $\alpha$ is much smaller in the latter case.

Mixed states. Explain why there are no mixed states that are superpositions of an
even number of stored patterns.

Figure 2.9: Hopfield network with two neurons, 1 and 2, connected by the weights w_12 and w_21.

One-step error probability for mixed states. Write a computer program imple-
menting the asynchronous deterministic dynamics of a Hopfield network to deter-
mine the one-step error probability for the mixed state (2.57). Plot how the one-step
error probability depends on α for N = 50 and N = 100. Repeat this exercise for
mixed patterns that are superpositions of the bits of 5 and 7 patterns.

Energy function. For the Hopfield network with two neurons shown in Figure 2.9,
demonstrate that the energy function cannot increase under the deterministic
dynamics. Write the energy function as H = −[(w_12 + w_21)/2] S_1 S_2 and use the update rule
S_1' = sgn(w_12 S_2). In which step do you need to assume that the weights are symmetric,
to prove that H cannot increase?

2.7 Exam questions

2.7.1 One-step error probability


In a deterministic Hopfield network, the state S_i of the i-th neuron is updated according
to the McCulloch-Pitts rule S_i ← sgn( Σ_{j=1}^{N} w_ij S_j ), where N is the number of
neurons in the model, w_ij are the weights, and p patterns x^{(µ)} are stored by assigning
the weights according to Hebb's rule, w_ij = (1/N) Σ_{µ=1}^{p} x_i^{(µ)} x_j^{(µ)} for i ≠ j, and w_ii = 0.
(a) Apply pattern x^{(ν)} to the network. Derive the condition for bit x_j^{(ν)} of this pattern
to be unchanged after a single asynchronous update. Express this condition in
terms of the cross-talk term. (0.5p).
(b) Store p patterns with random bits (x_j^{(µ)} = ±1 with probability 1/2) in the Hopfield
network using Hebb's rule. Apply pattern x^{(ν)} to the network. For large p and N,
derive an approximate expression for the probability that bit x_j^{(ν)} is unchanged after
a single asynchronous update. (1p).

2.7.2 Hopfield network with four neurons


The pattern shown in Fig. 2.10 is stored in a Hopfield network using Hebb's rule
w_ij = (1/N) x_i^{(1)} x_j^{(1)}. There are 2^4 = 16 four-bit patterns. Apply each of these to the Hopfield
network, and perform one synchronous update. List the patterns you obtain and
discuss your results. (1p).

Figure 2.10: The pattern x^{(1)} has N = 4 bits arranged on a 2×2 grid (i = 1, 2 in the top row, i = 3, 4 in the bottom row), with x_1^{(1)} = 1 and x_i^{(1)} = −1 for i = 2, 3, 4. Question 2.7.2.

2.7.3 Recognising letters with a Hopfield network


Figure 2.11 shows five patterns, each with N = 32 bits. Store the patterns x^{(1)} and
x^{(2)} in a Hopfield network using Hebb's rule w_ij = (1/N) Σ_{µ=1}^{2} x_i^{(µ)} x_j^{(µ)} with i, j = 1, ..., N.
Use the update rule S_i ← sgn( Σ_{j=1}^{N} w_ij S_j ). Feed the patterns into the network. To
determine their fate, follow the steps outlined below.
(a) Compute Σ_{j=1}^{N} x_j^{(µ)} x_j^{(ν)}, for µ = 1, ν = 1, ..., 5, and also for µ = 2, ν = 1, ..., 5. Hint:
the result can be read off from the Hamming distances between the patterns shown
in Figure 2.11. (0.5p).
(b) Consider the quantity b_i^{(ν)} = Σ_{j=1}^{N} w_ij x_j^{(ν)}, where w_ij are the weights obtained
by storing patterns x^{(1)} and x^{(2)}. Compute b_i^{(ν)} for ν = 1, ..., 5. Express your result as
linear combinations of x_i^{(1)} and x_i^{(2)}. Hint: use your answer to the first part of this
question. (1p).
(c) Feed the patterns in Figure 2.11 to the network. Which of the patterns remain
the same after one synchronous update using (2.49)? (0.5p).

Figure 2.11: Five patterns x^{(1)}, ..., x^{(5)}. Each of the five patterns consists of 32 bits x_i^{(µ)}. A black pixel i in pattern µ corresponds to x_i^{(µ)} = 1, a white one to x_i^{(µ)} = −1. Question 2.7.3.

2.7.4 Energy function for deterministic Hopfield network


In a deterministic Hopfield network the energy function H is defined as

H = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} w_ij S_i S_j .   (2.61)

Here N is the number of neurons, w_ij are the weights, and the state S_i of neuron i
is equal to ±1. The update rule is

S_i ← sgn( Σ_{j=1}^{N} w_ij S_j ) .   (2.62)

(a) Use Hebb's rule: w_ij = (1/N) Σ_{µ=1}^{p} x_i^{(µ)} x_j^{(µ)} for i ≠ j, and w_ii = 0. Show that H
either decreases or stays constant after a single asynchronous update (2.62). Which
property of the weights assures that this is the case? (1p).
(b) Assume that the weights are w_ij = (1/N) Σ_{µ=1}^{p} x_i^{(µ)} x_j^{(µ)} for all i, j. In this case, how
does H change after a single asynchronous update according to Eq. (2.62)? Compare
to the result of (a). Discuss. (1p).

2.7.5 Diluted Hopfield network


In the diluted Hopfield network with N neurons, only a fraction K/N ≪ 1 of the weights
w_ij is active:

w_ij = (K_ij / K) Σ_{µ=1}^{p} x_i^{(µ)} x_j^{(µ)} .   (2.63)

Here K_ij is an element of a random connectivity matrix K with elements

K_ij = 1 with probability K/N, and K_ij = 0 otherwise.   (2.64)

Here K is the average number of connections to neuron i, 〈Σ_{j=1}^{N} K_ij〉_c = K, where
〈...〉_c denotes the average over random realisations of K. The bits x_i^{(µ)} of the stored
pattern x^{(µ)} (µ = 1, ..., p and i = 1, ..., N) are random: x_i^{(µ)} = 1 or −1 with probability
1/2. The update rule for S_i is the usual one:

S_i ← sgn(b_i)   with   b_i = Σ_{j=1}^{N} w_ij S_j .   (2.65)

Following the steps (a)-(c) below, derive an approximate expression for

m_ν = (1/N) Σ_{i=1}^{N} x_i^{(ν)} 〈〈S_i〉_c〉   (2.66)

in the limit of N ≫ 1, K ≫ 1, and 1 ≪ p ≪ N. Here 〈...〉_c is the average defined above,
and the outer average is over the network dynamics.
(a) Assume that S = (S_1, ..., S_N)^T is equal to x^{(ν)}. Assuming a single synchronous
update [Eq. (2.65)] with weights given by Equation (2.63), derive an expression for
〈b_i^{(ν)}〉_c. Write the expression using the average 〈C_i^{(ν)}〉_c of the "cross-talk term" over
random connections. (0.5p).
(b) Show that the distribution P(〈C_i^{(ν)}〉_c) of the cross-talk term is Gaussian in the
limit of K ≫ 1, p ≫ 1, N ≫ 1, and determine the mean and the variance of the
distribution. (1p)
(c) In the limit of N ≫ 1, 〈〈S_i〉_c〉 ≈ 〈S_i〉_c. Use this and replace 〈S_i〉_c on the right-hand
side of Eq. (2.66) by sgn(〈b_i^{(ν)}〉_c), where 〈b_i^{(ν)}〉_c is the expression from (a). Then use that
(1/K) 〈Σ_{j=1}^{N} K_ij x_j^{(ν)} S_j〉_c ≈ m_ν for K ≫ 1. Finally, on the right-hand side of the resulting
expression, approximate (1/N) Σ_{i=1}^{N} by an integral over the distribution P(〈C_i^{(ν)}〉_c) you
obtained in (b). Evaluate the integral to find an approximate expression for m_ν.
(1.5p)

2.7.6 Mixed states


Consider p random patterns x^{(µ)} (µ = 1, ..., p) with N bits x_i^{(µ)} (i = 1, ..., N), equal
to 1 or −1 with probability 1/2. Store the patterns in a deterministic Hopfield network
using Hebb's rule w_ij = (1/N) Σ_{µ=1}^{p} x_i^{(µ)} x_j^{(µ)}. In the limit of N ≫ 1, p ≫ 1, p ≪ N, show
that the network recognises bit x_i^{(mix)} of the mixed state x^{(mix)} with bits

x_i^{(mix)} = sgn( x_i^{(1)} + x_i^{(2)} + x_i^{(3)} ) ,   (2.67)

after a single asynchronous update S_i ← sgn( Σ_{j=1}^{N} w_ij S_j ). Follow the steps outlined
below.
(a) Feed the mixed state (2.67) to the network. Use the weights w_ij you obtained by
applying Hebb's rule and express Σ_{j=1}^{N} w_ij x_j^{(mix)} in terms of 〈s_µ〉, defined by
〈s_µ〉 = (1/N) Σ_{j=1}^{N} x_j^{(µ)} x_j^{(mix)}, for µ = 1, ..., p. (0.5p).
(b) Assume that the bits x_i^{(µ)} are independent random numbers, equal to 1 or −1 with
equal probabilities. What is the value of 〈s_µ〉 for µ = 1, 2 and 3? What is the value of
〈s_µ〉 for µ > 3? (1p).

(c) Rewrite the expression you derived in (a) as a sum of two terms. The first term
is a sum over µ = 1, 2, 3. The second term is the cross-talk term, a sum over the
remaining values of µ. Explain why the cross-talk term can be neglected in the limit
stated above. (0.5p).
(d) Combine the results of (a), (b) and (c) to show that the network recognises the
mixed state (2.67). (0.5p).

3 Stochastic Hopfield networks


Two related problems became apparent in the previous Chapter. First, the Hopfield
dynamics may get stuck in spurious minima. In fact, if there is a local minimum
downhill from a given initial state, between this state and the correct attractor, then
the dynamics gets stuck in the local minimum, so that the algorithm fails to converge
to the correct attractor. Second, the energy function usually is a strongly varying
function over a high-dimensional state space. Therefore it is difficult to predict the
long-time dynamics of the network. Which is the first local minimum encountered
on the down-hill path that the network takes?
Both problems are solved by introducing a little bit of noise into the dynamics.
This is a trick that works for many neural-network algorithms. But in general it
is very challenging to analyse the noisy dynamics. For the Hopfield network, by
contrast, much is known. The reason is that the stochastic Hopfield network is
closely related to systems studied in statistical mechanics, so-called spin glasses.
Like these systems – and like many other physical systems – the stochastic Hopfield
network exhibits an order-disorder transition. This transition becomes sharp in the
limit of large N . It may be that the network produces satisfactory results for a given
number of patterns with a certain number of bits. But if one tries to store just one
more pattern, the network may fail to recognise anything. The goal of this Chapter
is to explain why this occurs, and how it can be avoided.

3.1 Noisy dynamics


The update rule (2.8) can be written as

Si ← sgn(bi ) , (3.1)

where bi is the local field. This rule is called deterministic, because a given set of
states Si determines the outcome of the update. To introduce noise, one replaces
the rule (3.1) by the stochastic rule
S_i = +1 with probability P(b_i) ,
      −1 with probability 1 − P(b_i) ,        (3.2)

where P(b) is the function

P(b) = 1 / (1 + e^{−2βb})   (3.3)

shown in Figure 3.1. The stochastic algorithm is very similar to the deterministic
algorithm for the Hopfield model (Algorithm 1), only step 4 is different. The param-

Figure 3.1: Probability function (3.3) used in the definition of the stochastic rule (3.2), plotted as a function of b for β = 10 and β = 0.

eter β is the noise parameter. Unfortunately it is defined the wrong way around.
When β is large the noise level is small. In particular one obtains the deterministic
dynamics as β tends to infinity (3.1):

β → ∞ deterministic dynamics . (3.4)

In this limit, the function P(b) tends to zero if b is negative, and to unity if b is
positive. So for β → ∞, the stochastic update rule (3.2) is precisely equivalent to the
deterministic rule (3.1). Conversely, when β = 0, the function P(b) simply equals 1/2.
In this case S_i is updated randomly to −1 or +1 with equal probability. The dynamics
bears no reference to the stored patterns (contained in the local field b_i).
The idea is to run the network at a small but finite noise level, that is, at a large
value of β. Then the dynamics is very similar to the deterministic Hopfield dynamics
analysed in the previous Chapter. But the noise allows the system to sometimes
also go uphill, making it possible to escape spurious minima. Since the dynamics is
noisy, it is necessary to rephrase the convergence criterion (2.4). This is discussed
next.
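The stochastic rule (3.2)-(3.3) is straightforward to implement. A minimal sketch (Python/NumPy; the function and variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_update(S, W, beta, i):
        """One asynchronous update of neuron i with the stochastic rule (3.2)-(3.3)."""
        b = W[i] @ S                                  # local field b_i
        P = 1.0 / (1.0 + np.exp(-2.0 * beta * b))     # Equation (3.3)
        S[i] = 1 if rng.random() < P else -1
        return S

Repeated for randomly chosen i, this generates the noisy dynamics whose steady-state averages are analysed in the following Sections.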

3.2 Order parameters


If we feed one of the stored patterns, x (1) for example, then we want the noisy
dynamics to stay in the vicinity of x (1) . This can only work if the noise is weak
enough. Success is measured by the order parameter mµ

m_µ ≡ lim_{T→∞} m_µ(T) .   (3.5)

Here

m_µ(T) = (1/T) Σ_{t=1}^{T} [ (1/N) Σ_{i=1}^{N} S_i(t) x_i^{(µ)} ]   (3.6)

is the average of (1/N) Σ_{i=1}^{N} S_i(t) x_i^{(µ)} over the noisy dynamics of the network, for given
bits x_i^{(µ)}. Since we feed pattern x^{(1)} to the network, we have initially S_i(t = 0) = x_i^{(1)}

Figure 3.2: Illustrates how the average m_1(T) depends upon the total iteration time T.
The light grey lines show different realisations of m_1(T) for different realisations of the
stored patterns, at a large but finite value of N. The thick red line is the average over
the different realisations of patterns.

and thus
(1/N) Σ_{i=1}^{N} S_i(t = 0) x_i^{(1)} = 1 .   (3.7)

After a transient, the quantity (1/N) Σ_{i=1}^{N} S_i(t) x_i^{(1)} settles into a steady state, where it
fluctuates around a mean value with a definite distribution that is independent of
time t. If the network works well, we expect that S_i(t) remains close to x_i^{(1)}, so that
m_1 converges to a value of order unity as T → ∞. Since there is noise, this mean
value is usually smaller than unity.
Figure 3.2 illustrates how the average (3.6) converges to a definite value when T
becomes large. For finite values of N the mean m1 depends upon the stored patterns.
In this case it is useful to average m1 over different realisations of stored patterns
(thick red line in Figure 3.2). In the limit of N → ∞, the mean m_1 is independent of
the stored patterns; we say that the system is self-averaging.
The other order parameters are expected to be small, because the bits of the
patterns x (2) to x (p ) are independent from those of x (1) . As a consequence the indi-
(1)
vidual terms in the sum in Equation (3.6) cancel upon summation, if Si (t ) ≈ xi . In
summary, we expect that
¨
1 if µ = 1,
mµ ≈ (3.8)
0 otherwise.

The aim is now to compute how m1 depends on the values of p , N , and β .

3.3 Mean-field theory


Assume that the dynamics of the stochastic Hopfield network reaches a steady state,
as illustrated in Figure 3.2. The order parameter is defined as a time average over the

Figure 3.3: Average 〈S_1〉 under the dynamics (3.2), as a function of b_1 for different noise levels (β = 10 and β = 0).

stochastic dynamics of the network, in the limit of T → ∞. In practice we cannot


choose T to be infinite, but T must be large enough so that the average can be
estimated accurately, and so that any initial transient does not matter.
It is a challenging task to compute the average over the dynamics. To simplify the
problem we consider first the case of only one neuron, N = 1. In this case there
are no connections to other neurons, but we can still assume that the network is
defined by a local field b1 , corresponding to the threshold θ1 in Equation (2.11). To
compute the order parameter we must evaluate the time average of S1 (t ). We denote
this time average as follows:
〈S_1〉 = lim_{T→∞} (1/T) Σ_{t=1}^{T} S_1(t) .   (3.9)

Note that this is a different average from the one defined on page 19. The average
there is over random bits, here it is over the stochastic network dynamics (3.2). We
can evaluate the time average 〈S1 〉 using Equation (3.2):

〈S1 〉 = Prob(S1 = +1) − Prob(S1 = −1) = P (b1 ) − [1 − P (b1 )] . (3.10)

Equation (3.3) yields:

〈S_1〉 = e^{β b_1} / (e^{β b_1} + e^{−β b_1}) − e^{−β b_1} / (e^{β b_1} + e^{−β b_1}) = tanh(β b_1) .   (3.11)
Figure 3.3 illustrates how 〈S1 〉 depends upon b1 . For weak noise levels (large β ), the
threshold b1 acts as a bias. When b1 is negative then 〈S1 〉 ≈ −1, while 〈S1 〉 ≈ 1 when
b1 is positive. So the state S1 reflects the bias most of the time. When however the
noise is large (small β ), then 〈S1 〉 ≈ 0. In this case the state variable S1 is essentially
unaffected by the bias, S1 is equally likely to be negative as positive so that its average
evaluates to zero.
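Equation (3.11) can be checked directly by averaging the stochastic rule (3.2) over many updates of a single neuron with fixed local field. A minimal sketch (the parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    beta, b1, T = 2.0, 0.5, 100000
    P = 1.0 / (1.0 + np.exp(-2.0 * beta * b1))   # P(b_1), Equation (3.3)
    S1 = np.where(rng.random(T) < P, 1, -1)      # T independent stochastic updates of S_1
    print(S1.mean(), np.tanh(beta * b1))         # the two numbers agree up to statistical error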
How can we generalise this calculation to a Hopfield model with many neurons?
It can be done approximately at least, in the limit of large values of N . Consider

neuron number i . The fate of Si is determined by bi . But bi in turn depends on all


S j in the network:
b_i(t) = Σ_{j=1}^{N} w_ij S_j(t) .   (3.12)

This makes the problem challenging. But when N is large we may assume that bi (t )
remains essentially constant in the steady state, independent of t . The argument
is that fluctuations of S j (t ) average out when summing over j , at least when N is
large. In this case bi (t ) is approximately given by its time average:

b_i(t) = 〈b_i〉 + small fluctuations   in the limit N → ∞ .   (3.13)

Here the average local field 〈bi 〉 is called the mean field, and theories that neglect the
small corrections in Equation (3.13) are called mean-field theories. Let us assume
that the fluctuations in Equation (3.13) are negligible. Then the local field for neuron
i is approximately constant
bi (t ) ≈ 〈bi 〉 , (3.14)
so that the problem of evaluating the average of Si (t ) reduces to the case discussed
above, N = 1. From Equations (2.28) and (3.11) one deduces that

〈S_i〉 = tanh(β 〈b_i〉)   with   〈b_i〉 = (1/N) Σ_µ Σ_{j≠i} x_i^{(µ)} x_j^{(µ)} 〈S_j〉 .   (3.15)

These mean-field equations are a set of N non-linear equations for 〈S_i〉, for given
fixed patterns x_i^{(µ)}. The aim is to solve these equations to determine the time averages
〈S_i〉, and then the order parameters from

m_µ = (1/N) Σ_{j=1}^{N} 〈S_j〉 x_j^{(µ)} .   (3.16)

If we initially feed pattern x^{(1)} we hope that m_1 ≈ 1 while m_µ ≈ 0 for µ ≠ 1. To
determine under which circumstances this has a chance to work, we express the
mean field in terms of the order parameters m_µ:

〈b_i〉 = (1/N) Σ_{µ=1}^{p} Σ_{j≠i} x_i^{(µ)} x_j^{(µ)} 〈S_j〉 ≈ Σ_µ x_i^{(µ)} m_µ .   (3.17)

The last equality is only approximate because the j -sum in the definition of mµ
contains the term j = i . Whether or not to include this term makes only a small
difference to mµ , in the limit of large N .

Now assume that m_1 ≈ m with m of order unity, and m_µ ≈ 0 for µ ≠ 1. Then the
first term in the sum over µ dominates, provided that the small terms do not add up
to a contribution of order unity. This is the case if

α = p/N   (3.18)

is small enough. Roughly speaking this works if α is of order N^{−1} (a more precise
estimate is that α is at most (log N)/N [15]). Now we use the mean-field equations
(3.15) to approximate

〈S_i〉 = tanh(β 〈b_i〉) ≈ tanh( β m_1 x_i^{(1)} ) .   (3.19)

Applying the definition of the order parameter

m_1 = (1/N) Σ_{i=1}^{N} 〈S_i〉 x_i^{(1)}

we find

m_1 = (1/N) Σ_{i=1}^{N} tanh( β m_1 x_i^{(1)} ) x_i^{(1)} .   (3.20)

Using that tanh(z) = −tanh(−z), as well as the fact that the bits x_i^{(µ)} can only assume
the values ±1, we get

m_1 = tanh(β m_1) .   (3.21)
This is a self-consistent equation for m1 . We assumed that m1 is of order unity. So
the question is: does this equation admit such solutions? For β → 0 there is one
solution, m1 = 0. This is not the desired one. For β → ∞, by contrast, there are
three solutions, m1 = 0, ±1. Figure 3.4 shows results of the numerical evaluation of
Equation (3.21) for intermediate values of β . Below a critical noise level there are
three solutions, namely for β larger than

βc = 1 . (3.22)

For β > βc , the solution m1 = 0 is unstable (this can be shown by computing the
derivatives of the free energy of the Hopfield network [1]). Even if we were to start
with an initial condition that corresponds to m1 = 0, the network would not stay
there. The other two solutions are stable: when the network is initialised close to
x (1) , then it converges to m1 = +m (with m > 0).
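The self-consistent equation (3.21) is easily solved numerically, for instance by fixed-point iteration. A minimal sketch (Python; the iteration parameters are illustrative choices):

    import numpy as np

    def order_parameter(beta, m0=1.0, tol=1e-12, max_iter=100000):
        """Solve m = tanh(beta*m), Equation (3.21), by fixed-point iteration."""
        m = m0
        for _ in range(max_iter):
            m_new = np.tanh(beta * m)
            if abs(m_new - m) < tol:
                break
            m = m_new
        return m

    for beta in [0.5, 1.0, 1.5, 2.0]:
        print(beta, order_parameter(beta))   # tends to zero for beta <= 1, non-zero for beta > 1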
The symmetry of the problem dictates that there must also be a solution with
m1 = −m at small noise levels. This solution corresponds to the inverted pattern
−x (1) (Section 2.4). If we start in the vicinity of x (1) , then the network is unlikely to
Figure 3.4: Solutions of the mean-field equation (3.21), plotted as a function of the noise level β^{−1}. The critical noise level is β_c = 1. The dashed line corresponds to an unstable solution.

converge to −x (1) , provided that N is large enough. The probability of x (1) → −x (1)
vanishes very rapidly as N increases and as the noise level decreases. If this transition
were to happen in a simulation, the network would then stay near −x (1) for a very
long time. Consider the limit where T tends to ∞ at a finite but possibly large value
of N . Then the network would (at a very small rate) jump back and forth between
x (1) and −x (1) , so that the order parameter would average to zero. This shows that
the limits of large N and large T do not commute

lim_{T→∞} lim_{N→∞} m_1(T) ≠ lim_{N→∞} lim_{T→∞} m_1(T) .   (3.23)

In practice the interesting limit is the left one, that of a large network run for a time
T much longer than the initial transient, but not infinite. This is precisely where the
mean-field theory applies. It corresponds to taking the limit N → ∞ first, at finite
but large T . This describes simulations where the transition x (1) → −x (1) does not
occur.
In summary, Equation (3.21) predicts that the order parameter converges to a
definite value, m1 , independent of the stored patterns when N is large enough.
Figure 3.2 shows that the order parameter converges for large but finite values of N .
However, the limiting value does depend on the stored patterns, as mentioned above.
The system is not self averaging. When N is finite we should therefore average the
result over different realisations of the stored patterns.
The value of 〈m1 〉 determines the average number of correctly retrieved bits in
the steady state:
〈N_correct〉 = (1/2) Σ_{i=1}^{N} ( 1 + 〈S_i〉 x_i^{(1)} ) = (N/2) (1 + 〈m_1〉) .   (3.24)

The outer average is over different realisations of random patterns (the inner average
is over the network dynamics). Since m1 → 1 as β → ∞ we see that the number of

correctly retrieved bits approaches N

〈Ncorrect 〉 → N . (3.25)

This is expected since the stored patterns x (µ) are recognised for small enough values
of α in the deterministic limit, because the cross-talk term is negligible. But it is
important to know that the stochastic dynamics slows down as the noise level tends
to zero. The lower the noise level, the longer the network remains stuck in local
minima, so that it takes longer time to reach the steady state, and to sample the
steady-state statistics of H . Conversely when the noise is strong, then

〈N_correct〉 → N/2 .   (3.26)

In this limit the stochastic network ceases to function. If one assigns N bits entirely
randomly, then half of them are correct, on average.
We define the error probability in the steady state as

P_error^{t=∞} = ( N − 〈N_correct〉 ) / N .   (3.27)
From Equation (3.24) we find
P_error^{t=∞} = (1/2) (1 − 〈m_1〉) .   (3.28)

In the deterministic limit the steady-state error probability approaches zero as m_1
tends to one. Let us compare this result with the one-step error probability P_error^{t=1}
derived in Chapter 2 in the deterministic limit. We should take the limit α = p/N → 0
in Equation (2.43), because the result (3.28) was derived assuming that α is very small.
In this limit we find that the one-step and the steady-state error probabilities agree
(they are both equal to zero).
Above the critical noise level, for β < β_c, the order parameter vanishes, so that
P_error^{t=∞} equals 1/2. So when the noise is too large the network fails.
It is important to note that noise can also help, because mixed states have lower
critical noise levels than the stored patterns x^{(µ)}. This can be seen as follows. To
derive the above mean-field result we assumed that m_µ = m δ_{µ1}. Mixed states
correspond to solutions where an odd number of components of m is non-zero, for
example:

m = (m, m, m, 0, ..., 0)^T .   (3.29)

Neglecting the cross-talk term, the mean-field equation reads

〈S_i〉 = tanh( β Σ_{µ=1}^{p} m_µ x_i^{(µ)} ) .   (3.30)

In the limit of β → ∞, the 〈S_i〉 converge to the mixed states (2.57) when m is given
by Equation (3.29). Using the definition of m_µ and averaging over the bits of the
random patterns one finds:

m_µ = 〈 x_i^{(µ)} tanh( β Σ_{ν=1}^{p} m_ν x_i^{(ν)} ) 〉 .   (3.31)

The numerical solution of Equation (3.31) shows that there is a solution with m ≠ 0
for β > β_c = 1. Yet this solution is unstable close to the critical noise level, more
precisely for 1 < β < 2.17 [16].

3.4 Storage capacity


The preceding analysis replaced the sum (3.17) by its first term, x_i^{(1)} m_1. This
corresponds to neglecting the cross-talk term. We expect that this can only work if
α = p/N is small enough. The influence of the cross-talk term was studied in Section
2.2, where the storage capacity

α = p/N

was defined. When we computed P_error^{t=1} in Section 2.2, only the first initial update
step was considered, because it was too difficult to analyse the long-time limit of
the deterministic dynamics. It is expected that the error probability increases as t
increases, at least when α is large enough so that the cross-talk term matters.
For the stochastic update rule it is easier to compute the long-time dynamics
(if there is a steady state). The remainder of this Section describes the mean-field
analysis of the steady state for larger values of α. It turns out that there is a critical
storage capacity (that depends on the noise level) above which the network ceases
to function.
We store p patterns in the network using Hebb’s rule (2.29) and feed pattern x (1) to
the network. The aim is to determine the order parameter m1 and the corresponding
error probability in the steady state for p ∼ N , so that α remains finite as N → ∞.
In this case we can no longer approximate the sum in Equation (3.17) just by its first
term, because the other terms for µ > 1 can give a contribution of order m1 . Instead
we must evaluate all mµ to compute the mean field 〈bi 〉.

The relevant calculation is summarised in Section 2.5 of Hertz, Krogh and Palmer [1],
and the remainder of this Section follows that outline quite closely. One starts by
rewriting the mean-field equations (3.15) in terms of the order parameters mµ . Using

〈S_i〉 = tanh( β Σ_µ x_i^{(µ)} m_µ )   (3.32)

we find

m_ν = (1/N) Σ_i x_i^{(ν)} 〈S_i〉 = (1/N) Σ_i x_i^{(ν)} tanh( β Σ_µ x_i^{(µ)} m_µ ) .   (3.33)
N i i N i i µ

This coupled set of p non-linear equations is equivalent to the mean-field equation


(3.15).
Now feed pattern x^{(1)} to the network. The strategy of solving Equation (3.33) is to
assume that the network stays close to the pattern x^{(1)} in the steady state, so that
m_1 remains of order unity. But we must also allow that the other order parameters

m_µ = (1/N) Σ_j x_j^{(µ)} 〈S_j〉   for µ ≠ 1

remain finite (yet small). The trick to evaluate these order parameters is to assume
random patterns, so that the mµ become random numbers that fluctuate around
zero with variance 〈mµ2 〉 (this average is an average over the ensemble of random
patterns). We use Equation (3.33) to compute the variance approximately. In the µ-
sum on the r.h.s of Equation (3.33) we must treat the term µ = ν separately (because
the index ν appears also on the l.h.s. of this equation). Also the term µ = 1 must be
treated separately, as before, because µ = 1 is the index of the pattern that was fed
to the network. Therefore it is necessary to distinguish between ν = 1 and ν ≠ 1. We
start with the second case and write

m_ν = (1/N) Σ_i x_i^{(ν)} tanh( β x_i^{(1)} m_1 + β x_i^{(ν)} m_ν + β Σ_{µ≠1,ν} x_i^{(µ)} m_µ )
    = (1/N) Σ_i x_i^{(ν)} x_i^{(1)} tanh( β m_1 + β x_i^{(1)} x_i^{(ν)} m_ν + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ ) .   (3.34)

In the second line we label the three terms in the argument of tanh: term 1 is β m_1, term 2 is β x_i^{(1)} x_i^{(ν)} m_ν, and term 3 is the remaining sum over µ.

Now consider the three terms in the argument of tanh(...). Term 1 is of order unity, that is,
independent of N. Term 3 is also large, because the sum contains many terms, of
the order of pN terms. However, term 2 is small for large values of N. Therefore

it is a good approximation to Taylor expand the argument of tanh(...) in term 2:

tanh(term 1 + term 2 + term 3) ≈ tanh(term 1 + term 3) + term 2 · [d/dx tanh(x)] evaluated at x = term 1 + term 3 + ... .   (3.35)

Using d/dx tanh(x) = 1 − tanh²(x) one gets

m_ν = (1/N) Σ_i x_i^{(ν)} x_i^{(1)} tanh( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ )
    + (1/N) Σ_i x_i^{(ν)} x_i^{(1)} β x_i^{(1)} x_i^{(ν)} m_ν [ 1 − tanh²( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ ) ] .   (3.36)

Using the fact that x_i^{(µ)} = ±1, and thus [x_i^{(µ)}]² = 1, this expression simplifies somewhat:

m_ν = (1/N) Σ_i x_i^{(ν)} x_i^{(1)} tanh( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ )
    + β m_ν (1/N) Σ_i [ 1 − tanh²( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ ) ] .   (3.37)

The situation is similar to our discussion of the sum defining the cross-talk term in
Section 2.2. The sums in Equation (3.37) depend on the stored patterns, through
the x_i^{(µ)}. To estimate the order parameters m_µ we proceed as in Section 2.2 and
assume that the pattern bits x_i^{(µ)} are independently randomly distributed. Since the
sums in Equation (3.37) contain many terms, we can estimate the sums using the
central-limit theorem. If the patterns are random, then

z ≡ Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ   (3.38)

is approximately Gaussian with mean zero. The variance of z is given by an average
over a double sum. Since the distributions of the x_i^{(µ)} are independent, only the diagonal
in this double sum contributes:

σ_z² = Σ_{µ≠1,ν} 〈m_µ²〉 ≈ p 〈m_µ²〉   for any µ ≠ 1, ν .   (3.39)

Now return to Equation (3.37). The sum (1/N) Σ_i in the second line can be approximated
as an average over the Gaussian distributed variable z:

β m_ν ∫_{−∞}^{∞} dz [1/(√(2π) σ_z)] e^{−z²/(2σ_z²)} [ 1 − tanh²(β m_1 + β z) ] .   (3.40)

We write the expression (3.40) as

β m_ν { 1 − ∫_{−∞}^{∞} dz [1/(√(2π) σ_z)] e^{−z²/(2σ_z²)} tanh²(β m_1 + β z) } ≡ β m_ν (1 − q) ,   (3.41a)

using the following definition of the parameter q:

q = ∫_{−∞}^{∞} dz [1/(√(2π) σ_z)] e^{−z²/(2σ_z²)} tanh²(β m_1 + β z) .   (3.41b)
Returning to Equation (3.37) we see that it takes the form

m_ν = (1/N) Σ_i x_i^{(ν)} x_i^{(1)} tanh( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ ) + (1 − q) β m_ν .   (3.42)

Solving for m_ν we find:

m_ν = [ (1/N) Σ_i x_i^{(ν)} x_i^{(1)} tanh( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ ) ] / [ 1 − β(1 − q) ] ,   (3.43)

for ν ≠ 1. This expression allows us to compute the variance σ_z², defined by Equation
(3.39). Equation (3.43) shows that the average 〈m_ν²〉 contains a double sum over the
bit index, i. Since the bits are independent, only the diagonal terms contribute, so
that

〈m_ν²〉 ≈ [ (1/N²) Σ_i tanh²( β m_1 + β Σ_{µ≠1,ν} x_i^{(µ)} x_i^{(1)} m_µ ) ] / [ 1 − β(1 − q) ]² ,   (3.44)

independent of ν. The numerator is just q/N, from Equation (3.41b). So the variance
evaluates to

σ_z² = α q / [ 1 − β(1 − q) ]² .   (3.45)
Up to now it was assumed that ν ≠ 1. Now consider ν = 1. One can derive an
equation for m_1 by repeating almost the same steps as for m_ν with ν ≠ 1. The result
is:

m_1 = ∫ dz [1/(√(2π) σ_z)] e^{−z²/(2σ_z²)} tanh(β m_1 + β z) .   (3.46)

This is a self-consistent equation for m_1. In summary there are three coupled equations
for m_1, q, and σ_z: Equations (3.41a), (3.45), and (3.46). They must be solved
together to determine how m_1 depends on β and α.
To compare with the results described in Section 2.2 we must take the deterministic
limit, β → ∞. In this limit the coupled equations simplify [1]:

β(1 − q) = √( 2/(π σ_z²) ) e^{−m_1²/(2σ_z²)} ,   (3.47a)
σ_z² = α / [ 1 − β(1 − q) ]² ,   (3.47b)
m_1 = erf( m_1 / √(2σ_z²) ) .   (3.47c)

Note that q approaches unity in the limit β → ∞, yet β(1 − q) remains finite in
this limit. Recall the definition (3.27) of the steady-state error probability. Inserting
Equation (3.47c) for m_1 into this expression we find, in the deterministic limit:

P_error^{t=∞} = (1/2) [ 1 − erf( m_1 / √(2σ_z²) ) ] .   (3.48)

Compare this with Equation (2.43) for the one-step error probability in the deterministic
limit. That equation was derived for only one step in the dynamics of the
network, while Equation (3.48) describes the long-time limit. Yet it turns out that
Equation (3.48) reduces to (2.43) in the limit of α → 0. To see this one solves Equation
(3.47) by introducing the variable y = m_1 / √(2σ_z²) [1]. One obtains a one-dimensional
equation for y [17]:

y ( √(2α) + (2/√π) e^{−y²} ) = erf(y) .   (3.49)
The physical solutions are those satisfying 0 ≤ erf(y ) ≤ 1, because the order parame-
ter is restricted to this range (transitions to −m1 do not occur in the limit N → ∞).
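Equation (3.49) can be solved numerically by root finding. A minimal sketch (Python, assuming SciPy is available; the bracketing grid and the test values of α are illustrative choices): it locates the non-trivial roots of (3.49) and evaluates the steady-state error probability (3.48) from the largest one, the retrieval solution.

    import numpy as np
    from scipy.optimize import brentq
    from scipy.special import erf

    def f(y, alpha):
        """Left-hand side minus right-hand side of Equation (3.49)."""
        return y * (np.sqrt(2 * alpha) + 2 / np.sqrt(np.pi) * np.exp(-y**2)) - erf(y)

    def steady_state_error(alpha):
        """Steady-state error probability (3.48) in the deterministic limit."""
        ys = np.linspace(1e-6, 20.0, 20001)
        vals = f(ys, alpha)
        roots = [brentq(f, ys[i], ys[i + 1], args=(alpha,))
                 for i in range(len(ys) - 1) if vals[i] * vals[i + 1] < 0]
        if not roots:                        # above the critical storage capacity
            return 0.5
        return 0.5 * (1 - erf(max(roots)))   # largest root = retrieval solution

    for alpha in [0.05, 0.10, 0.12, 0.135]:
        print(alpha, steady_state_error(alpha))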
Figure 3.5 shows the steady-state error probability obtained from Equations (3.48)
and (3.49). Also shown is the one-step error probability

P_error^{t=1} = (1/2) [ 1 − erf( 1/√(2α) ) ]

derived in Section 2.2. You see that P_error^{t=∞} approaches P_error^{t=1} for small α. This means
that the error probability does not increase significantly as one iterates the network,
at least for small α. In this case errors in earlier iterations have little effect on the
probability that later errors occur. But the situation is different at larger values of
α. In that case P_error^{t=1} significantly underestimates the steady-state error probability.

Figure 3.5: Error probability as a function of the storage capacity α in the deterministic
limit. The one-step error probability P_error^{t=1} [Equation (2.43)] is shown as a black line,
the steady-state error probability P_error^{t=∞} [Equation (3.27)] is shown as a red line. In the
hashed region error avalanches increase the error probability. Similar to Figure 1 in
Ref. [17].

In the hashed region, errors in the dynamics increase the probability of errors in
subsequent steps, giving rise to error avalanches.
Figure 3.5 illustrates that there is a critical value αc where the steady-state error
probability tends to 1/2. Solution of the mean-field equations gives

αc ≈ 0.1379 . (3.50)

When α > α_c the steady-state error probability equals 1/2; in this region the network
produces just noise. When α is small, the error probability is small, and the network
works well. Figure 3.5 shows that the steady-state error probability changes very
abruptly near αc . Assume you store 137 patterns with 1000 bits in a Hopfield network.
Figure 3.5 demonstrates that the network can reliably retrieve the patterns. However,
if you try to store one or two more patterns, the network fails to produce any output
meaningfully related to the stored patterns. This rapid change is an example of a
phase transition. In many physical systems one observes similar transitions between
ordered and disordered phases.
What happens at higher noise levels? The numerical solution of Equations (3.41a),
(3.46), and (3.45) shows that the critical storage capacity αc decreases as the noise
level increases (smaller values of β ). This is shown schematically in Figure 3.6. In-
side the hashed region the error probability is smaller than 1/2, so that the network
operates reliably (although less so as one approaches the phase-transition boundary).
Outside this region the error probability equals 1/2. In this region the network
fails. In the limit of small α the critical noise level is βc = 1. In this limit the network
is described by the theory explained in Section 3.3, Equation (3.21).
Often these two different phases of the Hopfield network are characterised in
terms of the order parameter m1 . We see that m1 > 0 in the hashed region, while
Figure 3.6: Phase diagram of the Hopfield network in the limit of large N (schematic),
in the plane spanned by the storage capacity α and the noise level β^{−1}. The region with
P_error^{t=∞} < 1/2 is the ordered phase, the region with P_error^{t=∞} = 1/2 is the disordered phase.
After Figure 2 in Ref. [17].

m1 = 0 outside.

3.5 Beyond mean-field theory


The theory summarised in this Chapter rests on a mean-field approximation for the
local field, Equation (3.14). The main result is the phase diagram shown in Figure
3.6. It is important to note that it was derived in the limit N → ∞. For smaller
values of N one expects the transition to be less sharp, so that m1 is non-zero for
values of α larger than αc .
But even for large values of N the question remains how accurate the mean-field
theory is. To answer this question, one must take into account fluctuations. The
corresponding calculation is more difficult than the ones outlined earlier in this
Chapter, and it requires several steps. One starts from the steady-state distribution
of S for fixed patterns x (µ) . In Chapter 4 we will see that it takes the form of a
Boltzmann distribution
P_β(S) = Z^{−1} e^{−β H(S)} .   (3.51)

The normalisation factor Z is called the partition function

Z = Σ_S e^{−β H(S)} .   (3.52)

One can compute the order parameter by adding a threshold term to the energy
function (2.47)
H = −(1/2) Σ_{ij} w_ij S_i S_j − λ Σ_i S_i .   (3.53)

Then the order parameter m_µ is obtained by taking a derivative w.r.t. λ:

m_µ = 〈 (1/N) Σ_i x_i^{(µ)} 〈S_i〉 〉 = −(1/(Nβ)) ∂/∂λ log Z .   (3.54)

The outer average is over different realisations of random patterns. Since the log-
arithm of Z is difficult to average, one resorts to the replica trick. The idea is to
represent the average of the logarithm as

〈log Z〉 = lim_{n→0} (1/n) ( 〈Z^n〉 − 1 ) .   (3.55)

The function Z^n looks like the partition function of n copies of the system, thus the
name replica trick. It is still debated when the replica trick works and when not [18].
name replica trick. It is still debated when the replica trick works and when not [18].
Nevertheless, the most accurate theoretical result for the critical storage capacity is
obtained in this way [19]
αc = 0.138187 . (3.56)
The mean-field result (3.50) is different from (3.56), but it is very close. Most of the
time, mean-field theories do not give such good results; they are usually used to gain
merely a qualitative understanding of phase transitions. In the Hopfield model the
mean-field theory works so well because the connections are global: every neuron
is connected with every other neuron. This helps to average out the fluctuations in
Equation (3.14). The most precise Monte-Carlo simulations (Section 4.4) for finite
values of N [20] yield upon extrapolation to N = ∞

αc = 0.143 ± 0.002 . (3.57)

This is close to, yet significantly different from the best theoretical estimate, Equation
(3.56), and also different from the mean-field result (3.50).

3.6 Summary
In this Chapter the use of Hopfield networks for pattern recognition was discussed.
Hopfield networks share many properties with the networks discussed later on in
these lectures. The most important point is perhaps that introducing noise in the
dynamics makes it possible to study the convergence and performance of the network: in the
presence of noise there is a well-defined steady state that can be analysed. Without
noise, in the deterministic limit, the network dynamics can get stuck in local minima
of the energy function, and may not reach the stored patterns. Naturally the noise
must be small enough for the network to function reliably. Apart from the noise level
there is a second significant parameter, the storage capacity α, equal to the ratio

of the number of patterns to the number of bits per pattern. When α is small the
network is reliable. A mean-field analysis of the N → ∞-limit shows that there is a
phase transition in the parameter plane (phase diagram) of the Hopfield network,
Figure 3.6.
The building blocks of Hopfield networks are McCulloch-Pitts neurons and Hebb’s
rule for the weights. These elements are fundamental to the networks discussed in
the coming Chapters.

3.7 Further reading

The statistical mechanics of Hopfield networks is explained in the book by Hertz,


Krogh, and Palmer [1]. Starting from the Boltzmann distribution, Chapter 10 in
this book explains how to compute the order parameters, and how to evaluate the
stability of the corresponding solutions. For more details on the replica trick, refer
to Ref. [15].

3.8 Exercises

Mixed states. Write a computer program that implements the stochastic dynamics
of a Hopfield model. Compute how the order parameter for mixed states that are
superpositions of the bits of three stored patterns depends on the noise level for
0.5 ≤ β ≤ 2.5. Compare your numerical results with the predictions in Section 3.3.
Repeat the exercise for mixed states that consist of superpositions of the bits of five
stored patterns. To this end, first derive the equation for the order parameter and
solve this equation numerically. Second, perform your computer simulations and
compare.

Phase diagram of the Hopfield network. Derive Equation (3.49) from Equation
(3.47). Numerically solve (3.49) to find the critical storage capacity αc in the deter-
ministic limit. Quote your result with three-digit accuracy. To determine how the
critical storage capacity depends on the noise level, numerically solve the three cou-
pled Equations (3.46), (3.41b), and (3.45). Compare your result with the schematic
Figure 3.6.

3.9 Exam questions


3.9.1 Stochastic Hopfield network
In the stochastic Hopfield network the state S_i of the i-th neuron is updated asynchronously
according to

S_i ← +1 with probability P(b_i) ,
      −1 with probability 1 − P(b_i) ,        (3.58)

where

b_i = Σ_{j=1}^{N} w_ij S_j ,   P(b_i) = 1 / (1 + e^{−2β b_i}) ,   (3.59)

and β is the noise parameter. The weights w_ij are given by Hebb's rule,
w_ij = (1/N) Σ_{µ=1}^{p} x_i^{(µ)} x_j^{(µ)} for i ≠ j, and w_ii = 0. Whether the stochastic Hopfield network can
reliably retrieve stored patterns depends upon

• the level of noise (measured by β ),


• the number of bits per pattern (N ),
• the number of stored patterns (p ),
• correlations between stored patterns.

Explain and discuss how each of these factors influences the reliability of retrieval
of stored patterns in the stochastic Hopfield model. In the discussion, refer to and
explain the following terms: "storage capacity", "order parameter", "phase diagram",
"local minima", "attractors", "spurious states". Your answer must not be longer than
one A4 page. (1p).

4 Stochastic optimisation
Hopfield networks were introduced as a means of solving the associative mem-
ory problem. In Section 2.3 it was shown that this corresponds to minimizing an
energy function H . In this Chapter we see how this can be used to solve combinato-
rial optimisation problems using neural networks. Such problems admit 2^k or k!
configurations - too many to list and check in a serial approach when k is large.
The idea is to write down an energy function that is at most quadratic in the
state variables, like Equation (2.47). Then one can use the Hopfield dynamics to
minimize H. The problem is of course that the deterministic network dynamics
gets stuck in the first local minimum encountered, usually not the desired optimum
(Figure 4.1). Therefore it is important to introduce a little bit of noise. As discussed
in Chapter 3, the noise helps the network to escape local minima.
A common strategy is to lower the noise level on the fly. In the beginning of the
simulation the noise level is high, so that the network explores the rough features of
the energy landscape. When the noise level is lowered, the network can see finer
and finer features, and the hope is that it ends up in the global minimum. This
method is called simulated annealing.

4.1 Combinatorial optimisation problems


A well-known combinatorial optimisation problem is the traveling-salesman prob-
lem. Given the coordinates of k cities, the goal is to determine the shortest journey
visiting each city exactly once before returning to the starting point. The coordinates

Figure 4.1: Spurious local minima in the energy function H. The global minimum
corresponds to one of the stored patterns, x^{(µ)}.

Figure 4.2: Traveling-salesman problem for k = 7. Given are the coordinates of the k
cities as points in the unit square:

city   coordinates
A      (0.1, 0.15)
B      (0.4, 0.2)
C      (0.5, 0.7)
D      (0.2, 0.1)
E      (0.1, 0.8)
F      (0.8, 0.9)
G      (0.9, 0.3)

The problem is to find the shortest connected path that joins
all cities, visits each city exactly once, and returns to the starting point. (a) best solution,
(b) a second solution with a longer path. Also given in the Figure are the matrix representations of
these solutions.

of the seven cities A, ..., G are given in Figure 4.2 (this Figure illustrates the problem for
k = 7). The Figure shows two different solutions. Denoting the distance between
city A and B by d AB and so forth, the length of the path in panel (a) is

L = d AD + d D B + d B G + d G F + d F C + d C E + d E A . (4.1)

The Figure also demonstrates how paths are represented in terms of k × k matrices.
Path (a) corresponds to
M = [ 1 0 0 0 0 0 0
      0 0 1 0 0 0 0
      0 0 0 0 0 1 0
      0 1 0 0 0 0 0
      0 0 0 0 0 0 1
      0 0 0 0 1 0 0
      0 0 0 1 0 0 0 ] ,   (4.2)

with rows ordered A to G from top to bottom and columns ordered by stop number.
Each row corresponds to a city, and the j -th element in this row has the entry 1
if the city is the j -th stop in the tour. The other entries are 0. Since each city is
visited only once, there can be only one 1 in each row. Since each visit corresponds
to exactly one city, there can be only one 1 in each column. Any permutation of
the elements that satisfies these constraints is an allowed path. There are k ! such
permutations. They are 2k -fold degenerate (there are k paths of the same length

Figure 4.3: One solution of the k -queens problem for k = 8.

that differ by which city is visited first, and each path can be traveled clockwise or
anti-clockwise). Therefore there are k !/(2k ) possible paths to consider in trying
to determine the shortest one. This makes the problem hard. Note that integer
linear-programming methods for solving the travelling salesman problem usually
use a different representation of the paths [21].
The k -queens problem is derived from the game of chess. The question is how to
arrange k queens on a k × k chess board so that they cannot take each other out.
This means that each row and column as well as each diagonal can have only one
queen. The problem is illustrated in Figure 4.3. This Figure shows one solution
for k = 8. The task is to find all solutions. Each arrangement of queens can be
represented as a matrix M, where one sets M i j = 1 if there is a queen on site (i , j ).
All other elements are set to zero. To represent valid solutions, M must satisfy the
following constraints [22]
Σ_{i=1}^{k} Σ_{j=1}^{k} M_ij = k ,   (4.3)

if M_ij = M_pq = 1 for (i, j) ≠ (p, q), then
i ≠ p  and  j ≠ q  and  i − j ≠ p − q  and  i + j ≠ p + q .   (4.4)
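These constraints are easy to check for a candidate placement. A minimal sketch (Python/NumPy; the function name is illustrative):

    import numpy as np

    def is_valid_queens(M):
        """Check the k-queens constraints (4.3) and (4.4) for a 0/1 placement matrix M."""
        k = M.shape[0]
        if M.sum() != k:                        # constraint (4.3): exactly k queens
            return False
        rows, cols = np.nonzero(M)
        for a in range(len(rows)):
            for b in range(a + 1, len(rows)):   # constraint (4.4) for every pair of queens
                if (rows[a] == rows[b] or cols[a] == cols[b]
                        or rows[a] - cols[a] == rows[b] - cols[b]
                        or rows[a] + cols[a] == rows[b] + cols[b]):
                    return False
        return True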

The double-digest problem. The Human Genome sequence was first assembled
by piecing together overlapping DNA segments in the right order by making sure
that overlapping segments share the same DNA sequence. To this end it is necessary
to uniquely identify the DNA segments. The actual DNA sequence of a segment is a
unique identifier. But it is sufficient and more efficient to identify a DNA segment
by a fingerprint, for example the sequence of restriction sites. These are short subse-
quences (four or six base pairs long) that are recognised by enzymes that cut (digest)
the DNA strand precisely at these sites. A DNA segment is identified by the types
and locations of restriction sites that it contains, the so-called restriction map.

When a DNA segment is cut by two different enzymes one can experimentally
determine the lengths of the resulting fragments. Is it possible to determine how
the cuts were ordered in the DNA sequence of the segment from the fragment
lengths? This is the double-digest problem [23]. The order of cut sites is precisely
the restriction map. In a double-digest experiment, a given DNA sequence is first
digested by one enzyme (A say). Assume that this results in n fragments with lengths
a i (i = 1, . . . , n ). Second, the DNA sequence is digested by another enzyme, B . In this
case m fragments are found, with lengths b1 , b2 , . . . , bm . Third, the DNA sequence
is digested with both enzymes A and B , yielding l fragments with lengths c1 , . . . , cl .
The question is now to determine all possible orderings of the a- and b-cuts that
result in l fragments with lengths c_1, c_2, ..., c_l.

4.2 Energy functions


The first two problems introduced in the previous Section are similar in that possible
solutions can be represented in terms of k ×k matrices M with 0/1 entries and certain
constraints. It turns out that one can represent these problems in terms of a Hopfield
network with N = k² neurons n_ij that take the values 0/1 [24].
As an example, consider the travelling-salesman problem. We label the cities A
to G by the integer m = 1, . . . , k , and denote their distances by d mn . Then the path
length to be minimised is
L = (1/2) Σ_{mnj} d_mn M_mj ( M_{n,j−1} + M_{n,j+1} ) .   (4.5)

The column- and row-constraints upon the matrix M,

Σ_j M_mj = 1   (row constraint) ,   (4.6a)
Σ_m M_mj = 1   (column constraint) ,   (4.6b)

are incorporated using Lagrange multipliers A and B (both positive), so that the
function to be minimised becomes

H = L + (A/2) Σ_m ( 1 − Σ_j M_mj )² + (B/2) Σ_j ( 1 − Σ_m M_mj )² .   (4.7)

When the constraints (4.6) are satisfied, their contributions to H vanish, otherwise
they are positive. We conclude that H has a global minimum at the desired solution.
If we use a stochastic method to minimise H , it is not guaranteed that the algorithm

finds the global minimum, either because the constraints are not exactly satisfied,
or because the path found is not the shortest one. The magnitude of the Lagrange
multipliers A and B determines how strongly the constraints are enforced, during
the search and in sub-optimal solutions.
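To make the construction concrete, the following is a minimal sketch that evaluates the path length (4.5) and the energy (4.7) for a candidate tour matrix M (Python/NumPy; it assumes the city coordinates of Figure 4.2, a closed tour so that the stop index j wraps around, and illustrative values for the Lagrange multipliers A and B):

    import numpy as np

    coords = np.array([[0.1, 0.15], [0.4, 0.2], [0.5, 0.7], [0.2, 0.1],
                       [0.1, 0.8], [0.8, 0.9], [0.9, 0.3]])               # cities A, ..., G
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # distances d_mn

    def tsp_energy(M, A=1.0, B=1.0):
        k = M.shape[0]
        # path length (4.5); the stop index j wraps around because the tour is closed
        L = 0.5 * sum(d[m, n] * M[m, j] * (M[n, (j - 1) % k] + M[n, (j + 1) % k])
                      for m in range(k) for n in range(k) for j in range(k))
        # constraint terms of Equation (4.7)
        H = L + 0.5 * A * np.sum((1 - M.sum(axis=1))**2) \
              + 0.5 * B * np.sum((1 - M.sum(axis=0))**2)
        return L, H

    M = np.zeros((7, 7), dtype=int)                      # tour of Figure 4.2(a): A-D-B-G-F-C-E
    for stop, city in enumerate([0, 3, 1, 6, 5, 2, 4]):  # city indices of A, D, B, G, F, C, E
        M[city, stop] = 1
    print(tsp_energy(M))

For a matrix that satisfies the constraints (4.6), the penalty terms vanish and H equals the path length L.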
The expression (4.7) is a quadratic function of M m j . This suggests that we can
write H as the energy function of a Hopfield model with 0/1 neurons ni j :
H = −(1/2) Σ_{ijkl} w_ijkl n_ij n_kl + Σ_{ij} µ_ij n_ij + const.   (4.8)

The weights wi j k l and thresholds µi j are determined by comparing Equations (4.5),


(4.7), and (4.8). Note that the neurons carry two indices (not one as in Section 1.2).
The first term in Equation (4.8) is familiar [Equation (2.47)]. The second term is
a threshold term. In Sections 2.2 and 2.3 all thresholds were set to zero (because
they were not needed). If there are thresholds in the update rule [as in Equation
(1.5)] then the energy function must have a threshold term to ensure that the energy
cannot increase when the network is updated. The constant term is irrelevant, it
does not affect the minimisation. This term can be left out. We write:
H = −Σ_{ij} b_ij n_ij ,   (4.9a)

where

b_ij = (1/2) Σ_{kl} w_ijkl n_kl − µ_ij   (4.9b)

are local fields including thresholds.

4.3 Simulated annealing


Under the deterministic dynamics (1.2)

n'_ij = θ_H(b_ij) ,   (4.10)

only updates satisfying H(n'_ij) − H(n_ij) ≤ 0 occur, so that the network gets stuck in
local minima. To avoid this, one introduces noise, as explained in Section 3.1. The
analogue of Equation (3.2) is

n'_ij = 1 with probability P(b_ij) ,
        0 with probability 1 − P(b_ij) .        (4.11)

Here we take P(b) = 1/(1 + e^{−βb}). This form differs slightly from the definition (3.3). For
β → ∞ the stochastic rule reduces to the deterministic dynamics, where H(n'_ij) −
H(n_ij) ≤ 0.

Now consider large but finite values of β , corresponding to small noise levels. In
this case the energy function may increase in an update, yet down-moves are much
more likely than up-moves. This becomes clear by considering an alternative (yet
equivalent) formulation of the network dynamics:

1. Choose at random a unit i j .

2. Change n_ij to n'_ij ≠ n_ij with probability

Prob(n_ij → n'_ij) = 1 / (1 + e^{β ΔH_ij}) ,   (4.12a)

with

ΔH_ij = H(n'_ij) − H(n_ij) = −b_ij (n'_ij − n_ij) .   (4.12b)

Equation (4.12) shows that moves with ∆Hi j > 0 are less likely when β is large. To
demonstrate that this scheme is equivalent to stochastic dynamics (4.11) we break
the prescription (4.12) up into different cases. Changes occur with the following
probabilities:

if n_ij = 0, obtain n'_ij = 1 with prob. 1/(1 + e^{−β b_ij}) = P(b_ij) ,   (4.13a)
if n_ij = 1, obtain n'_ij = 0 with prob. 1/(1 + e^{β b_ij}) = 1 − P(b_ij) .   (4.13b)

In the second row we used that 1 − P(b) = 1 − 1/(1 + e^{−βb}) = e^{−βb}/(1 + e^{−βb}) = 1/(1 + e^{βb}). The state
remains unchanged with the complementary probabilities:

if n_ij = 0, obtain n_ij = 0 with prob. 1/(1 + e^{β b_ij}) = 1 − P(b_ij) ,   (4.13c)
if n_ij = 1, obtain n_ij = 1 with prob. 1/(1 + e^{−β b_ij}) = P(b_ij) .   (4.13d)

Comparing with Equation (4.11) we see that the two schemes are equivalent.
How does the network find a solution of the optimisation problem? We let it run
with stochastic dynamics and compute 〈ni j 〉. If 〈ni j 〉 ' 1, we set M i j = 1, otherwise
0. When the noise is weak, we expect that all 〈n_ij〉 are either close to zero or to
one. One strategy is to change the noise level as the simulation proceeds. One starts
with larger noise, so that the network explores first the rough features of the energy
landscape. As the simulation proceeds, one reduces the noise level, so that the
network can learn finer features of the landscape. This is called simulated annealing.
See Section 10.9 in Numerical Recipes [25].
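A minimal simulated-annealing sketch based on the stochastic rule (4.12) (Python/NumPy; the energy H is assumed to be supplied as a function of the full 0/1 state array, and the geometric annealing schedule is one common, illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)

    def simulated_annealing(H, n, beta0=0.5, beta_max=50.0, factor=1.05, sweeps=10):
        """Minimise an energy function H(n) over 0/1 variables n with the stochastic
        dynamics (4.11)/(4.12), slowly lowering the noise level (increasing beta)."""
        beta = beta0
        while beta < beta_max:
            for _ in range(sweeps * n.size):
                idx = rng.integers(n.size)                 # pick a unit at random
                trial = n.copy()
                trial.flat[idx] = 1 - trial.flat[idx]      # propose to flip it
                dH = H(trial) - H(n)
                if rng.random() < 1.0 / (1.0 + np.exp(np.clip(beta * dH, -500, 500))):
                    n = trial                              # accept with probability (4.12a)
            beta *= factor                                 # anneal: reduce the noise
        return n

The function H and the schedule parameters are placeholders; for the travelling-salesman problem one would pass the energy (4.7).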

4.4 Monte-Carlo simulation


The algorithms described in the previous Section are equivalent to the Markov-chain
Monte-Carlo algorithm, and it is closely related to the Metropolis algorithm. This
method is widely used in Statistical Physics and in Mathematical Statistics. It is
therefore important to understand the connections between the different formula-
tions.
The Markov-chain Monte-Carlo algorithm is a method to sample from a distri-
bution that is too expensive to compute. This is not a contradiction in terms! The
distribution in question is the Boltzmann distribution
P_β(n) = Z^{−1} e^{−β H(n)} .   (4.14)

Here Z = Σ_n e^{−β H(n)} is a normalisation factor, called the partition function. The vector
n represents the state vector of the system, as in the Hopfield network. The distri-
bution (4.14) plays an important role in the equilibrium statistical mechanics of
systems with energy function (Hamiltonian) H . In this context β −1 = kB T where kB
is the Boltzmann constant and T is the temperature of the system.
For systems with a large number of degrees of freedom, the function Pβ (n ) can be
very expensive to compute, because of the sum over n in the normalisation factor
Z . Is there a way of sampling this distribution without actually evaluating it? The
answer is yes, by constructing a Markov chain [26] of states n that are distributed
according to Pβ (n ), after an initial transient. A Markov chain is a memoryless random
sequence of states defined by transition probabilities pl →k from state n l to n k . The
transition probability pl →k connects arbitrary states, allowing for local moves (as in
the previous Section where only one element of n was changed) or global moves.
The Monte-Carlo algorithm proceeds in two steps. First, given the state n_l, a new
state n_k is suggested with probability p^s_{l→k}. Second, n_k is accepted with probability
p^a_{l→k}, so that the transition probability is given by p_{l→k} = p^s_{l→k} p^a_{l→k}. These steps
are repeated many times, creating a sequence of states. If the process satisfies the
detailed-balance condition
Pβ (n l )pl →k = Pβ (n k )pk →l , (4.15)
then the Markov chain of states n l has the steady-state distribution Pβ (n ). Usually
this means that the distribution of states generated in this way converges to Pβ (n )
(see Ref. [26] for details).
Now consider the formulation of the network dynamics on page 56. The dynamics
is local: a single neuron is picked randomly so that p s is a constant, independent of
l or k , and then accepted with probability
p^a_{l→k} = 1 / (1 + e^{β ΔH}) ,   (4.16)

where ΔH = H(n_k) − H(n_l). Equation (4.16) shows that downhill steps are more
frequently accepted than uphill steps. To prove that condition (4.15) holds, we use
that p^s is symmetric (because it is independent of l and k), and we use Equations
(4.14) and (4.16):

p^s e^{−β H(n_l)} / (1 + e^{β [H(n_k) − H(n_l)]}) = p^s / (e^{β H(n_l)} + e^{β H(n_k)}) = p^s e^{−β H(n_k)} / (1 + e^{β [H(n_l) − H(n_k)]}) .   (4.17)
This demonstrates that the Boltzmann distribution is a steady state of the Markov
chain. If the simulation converges to the steady state (as it usually does), then states
visited by the Markov chain are distributed according to the Boltzmann distribution.
While this distribution may be difficult to evaluate (Z involves the sum over all
possible states), ∆H is cheap to compute for local moves. The sequence of states
are correlated, in particular when the moves are local, because then subsequent
configurations are similar.
Returning for a moment to Chapter 3, the above reasoning implies that the steady-
state distribution for the Hopfield model is the Boltzmann distribution, as stated
in Section 3.5. But note that the algorithm applies to energy functions of arbitrary
form, not just the particular form (4.8) for the Hopfield network.
It is important to stress that the detailed-balance condition must hold for the
transition probability p_{l→k} = p^s_{l→k} p^a_{l→k}, not just for the acceptance probability p^a_{l→k}.
For the local moves discussed above and in the previous Chapters, p^s is a constant, so
that p_{l→k} ∝ p^a_{l→k}. In this case it is sufficient to check the detailed-balance condition
for the acceptance probability. In general, and in particular for global moves, it is
necessary to include p^s_{l→k} in the detailed-balance check [27].
In practice one uses a slightly different form of the transition probabilities (Metropolis
algorithm). Assuming that p^s is constant, one takes

p_{l→k} = p^s e^{−β ΔH}  when ΔH > 0 ,  and  p_{l→k} = p^s  when ΔH ≤ 0 ,    (4.18)
with ∆H = H (n k ) − H (n l ). Equation (4.18) has the advantage that the transition
probabilities are higher than in (4.16) so that moves are more frequently accepted.
That the Metropolis rates obey the detailed-balance condition (4.15) can be seen
using Equations (4.14) and (4.18):
P_β(n_l) p_{l→k} = Z^{−1} p^s e^{−β H(n_l)} × [ e^{−β[H(n_k)−H(n_l)]} if H(n_k) > H(n_l), 1 otherwise ]
               = Z^{−1} p^s e^{−β max{H(n_k), H(n_l)}}    (4.19)
               = Z^{−1} p^s e^{−β H(n_k)} × [ e^{−β[H(n_l)−H(n_k)]} if H(n_l) > H(n_k), 1 otherwise ]
               = P_β(n_k) p_{k→l} .
The fact that the algorithm produces states distributed according to Equation (4.14)
offers a different perspective upon the idea of simulated annealing. Slowly lowering
the temperature through the simulation mimics slow cooling of a physical system.
It passes through a sequence of quasi-equilibrium Boltzmann distributions with
lower and lower temperatures, until the system finds the global minimum Hmin of
the energy function at zero temperature, where Pβ (n ) = 0 when H (n ) > Hmin , but
Pβ (n ) > 0 when H (n ) = Hmin .
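To make the above concrete, here is a minimal Python sketch of the Metropolis rule (4.18) combined with a simple annealing schedule. The function names energy and propose_move, the geometric cooling factor, and all numerical parameters are illustrative assumptions, not prescribed by the text:

import numpy as np

def metropolis_anneal(energy, propose_move, n0, beta0=0.1, cooling=1.001, steps=10000):
    """Minimal Metropolis sketch with geometric annealing (illustrative only)."""
    n = n0
    H = energy(n)
    beta = beta0
    for _ in range(steps):
        n_new = propose_move(n)              # suggest a state (symmetric suggestion probability p^s)
        dH = energy(n_new) - H
        # Metropolis acceptance rule, Equation (4.18)
        if dH <= 0 or np.random.rand() < np.exp(-beta * dH):
            n, H = n_new, H + dH
        beta *= cooling                      # slowly lower the temperature (increase beta)
    return n, H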

4.5 Summary
In this Chapter it was shown how Hopfield networks can perform optimisation tasks,
exploring the energy function with stochastic Hopfield dynamics. This approach
is equivalent to the Markov-chain Monte-Carlo algorithm. In simulated annealing
one gradually reduces the noise level as the simulation proceeds. This mimics the
slow cooling of a physical system, an efficient way of bringing the system into its
global optimum.

4.6 Further reading


An older but still good reference for Monte-Carlo methods in Statistical Physics is
the book Monte Carlo methods in Statistical Physics edited by Binder [28]. A more
recent source is the book by Newman and Barkema [29]. For simulated annealing,
you can refer to Ref. [30].

4.7 Exercises
Travelling-salesman problem. Derive Equation (4.5) for the path length in the
travelling-salesman problem.

Double-digest problem. Implement the Metropolis algorithm for the double-digest


problem. Denote the ordered set of fragment lengths produced by digesting with
enzyme A by a = {a 1 , . . . , a n }, where a 1 ≥ a 2 ≥ . . . ≥ a n . Similarly b = {b1 , . . . , bm }
(b1 ≥ b2 ≥ . . . ≥ bm ) for fragment lengths produced by digesting with enzyme B ,
and c = {c1 , . . . , cl } (c1 ≥ c2 ≥ . . . ≥ cl ) for fragment lengths produced by digesting
first with A and then with B. Permutations σ and µ of the sets a and b correspond
to a set of c-fragments, which we denote by ĉ(σ, µ). Use the energy function
H(σ, µ) = Σ_j c_j^{−1} [c_j − ĉ_j(σ, µ)]^2. Configuration space is the space of all permutation
pairs (σ, µ). Local moves correspond to inversions of short subsequences of σ and/or
µ. Check that the scheme of suggesting new states is symmetric. This is necessary
for the algorithm to converge. The solutions of the double-digest problem are
degenerate. Determine the degeneracy of the solutions for the fragment sets shown
in Table 4.1.
L = 10000
a = [5976, 1543, 1319, 1120, 42]
b = [4513, 2823, 2057, 607]
c = [4513, 1543, 1319, 1120, 607, 514, 342, 42]

L = 20000
a = [8479, 4868, 3696, 2646, 169, 142]
b = [11968, 5026, 1081, 1050, 691, 184]
c = [8479, 4167, 2646, 1081, 881, 859, 701, 691, 184, 169, 142]

L = 40000
a = [9979, 9348, 8022, 4020, 2693, 1892, 1714, 1371, 510, 451]
b = [9492, 8453, 7749, 7365, 2292, 2180, 1023, 959, 278, 124, 85]
c = [7042, 5608, 5464, 4371, 3884, 3121, 1901, 1768, 1590, 959, 899, 707, 702, 510, 451, 412, 278, 124, 124, 85]

Table 4.1: Example configurations for the double-digest problem for three different
chromosome lengths L . For each example, three ordered fragment sets are given,
corresponding to the result of digestion with A, with B, and with both A and B.
PART II
SUPERVISED LEARNING
The Hopfield networks described in Part I solve auto-association tasks, where the
neurons of the network act as both inputs and outputs. In the pattern-recognition
problem, a distorted pattern is fed into the network, and the recursive network dynamics
is run until a steady state is reached. The aim is that the steady-state values of the
neurons converge to those of the correct pattern associated with the distorted one.

A related and very common type of problem is the classification task. The
machine-learning repository [31] at the University of California Irvine contains a
large number of such problems. A well-known example is the iris data set. It lists
attributes of 150 iris plants. The data set was described by the geneticist R. A. Fisher
[32]. For each plant four attributes are given (Figure 5.1): its sepal length, sepal width,
petal length, and petal width. Also, each plant is classified into one of three classes:
iris setosa, iris versicolor, or iris virginica. The task is to program a neural network
that determines the class of a plant from its attributes. To each input (attributes of
an iris plant) the network should associate the correct output, the class of the plant.
The correct output is referred to as the target.

In supervised learning one uses a training data set of correct input/output pairs.
One feeds an input from the training data into the input terminals of the network
and compares the states of the output neurons to the target values. The weights and
thresholds are changed to minimise the differences between network outputs and
targets for all input patterns in the training set. In this way the network learns to
associate input patterns in the training set with the correct target values. A crucial
question is whether the trained network can generalise: does it find the correct
targets for input patterns that were not in the training set?

The networks used for supervised learning are called perceptrons [12, 13]. They
consist of layers of McCulloch-Pitts neurons: an input layer, a number of hidden
layers, and an output layer. The layers are usually arranged from the left (input)
to the right (output). All connections are one-way, from neurons in one layer to
neurons in the layer immediately to the right. There are no connections between
neurons in a given layer, or back to layers on the left. This arrangement ensures
convergence of the training algorithm (stochastic gradient descent). During training
with this algorithm the weights are updated iteratively. In each step, an input is
applied and the weights of the network are updated to reduce the error in the output.
In a sense each step corresponds to adding a little bit of Hebb’s rule to the weights.
This is repeated until the network classifies the training set correctly.

Stochastic gradient descent for multi-layer perceptrons has received much
attention recently, after it was realised that networks with many hidden layers can
be trained to reliably recognise and classify image data, for self-driving cars for
instance but also for other applications (deep learning).
sepal length  sepal width  petal length  petal width  classification
6.3           2.5          5.0           1.9          virginica
5.1           3.5          1.4           0.2          setosa
5.5           2.6          4.4           1.2          versicolor
4.9           3.0          1.4           0.2          setosa
6.1           3.0          4.6           1.4          versicolor
6.5           3.0          5.2           2.0          virginica
Figure 5.1: Left: petals and sepals of the iris flower. Right: six entries of the iris data set
[31]. All lengths in cm. The whole data set contains 150 entries.

5 Perceptrons
Perceptrons [12, 13] are trained by iteratively updating their weights and thresholds.
In the Hopfield networks described in Part I, by contrast, the weights were always
assigned using Hebb’s rule. To motivate the idea of updating weights iteratively,
consider Hebb’s rule, Equation (2.29). We estimated in Section 2.2 how frequently
neurons are erroneously updated because the cross-talk term in Equation (2.31)
changed the sign of the bit in question. To this end we assumed that all bits of
all patterns were independently identically randomly distributed, and we used
the central-limit theorem. For correlated patterns the effect of the cross-talk term
is different from the results calculated in Chapter 3. It has been argued that the
storage capacity increases when the patterns are more strongly correlated, others
have claimed that the capacity decreases in this limit (see Ref. [33] for a discussion).
When we must deal with a definite set of patterns (no randomness to average
over), the situation seems to be even more challenging. Is there a way of modifying
Hebb’s rule to deal with this problem? Yes there is! We simply incorporate the
overlaps
Q_{µν} = (1/N) x^{(µ)} · x^{(ν)}    (5.1)
into Hebb’s rule. To this end, define the p × p overlap matrix Q with elements Qµν .
The modified Hebb’s rule reads:
w_{ij} = (1/N) Σ_{µν} x_i^{(µ)} (Q^{−1})_{µν} x_j^{(ν)} .    (5.2)

For orthogonal patterns (Qµν = δµν ) this rule is identical to Equation (2.28). For
non-orthogonal patterns, the rule (5.2) ensures that all patterns are recognised,
provided that the matrix Q is invertible. In this case one can find the weights wi j
iteratively, by successive improvement from an arbitrary starting point. We can say


that the network learns the task through a sequence of weight changes. This is the
idea used to solve classification tasks with perceptrons (Section 5.3). You will see in
the following that this usually works even when Equation (5.2) fails.
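As an illustration of the rule (5.2), the following minimal Python sketch builds the overlap matrix Q for a given set of patterns and computes the weights by inverting Q. It assumes (illustratively) that the patterns are stored as the rows of an array and that Q is invertible:

import numpy as np

def hebb_with_overlaps(patterns):
    """Weights according to Equation (5.2); patterns has shape (p, N)."""
    p, N = patterns.shape
    Q = patterns @ patterns.T / N            # overlap matrix, Equation (5.1)
    Qinv = np.linalg.inv(Q)                  # requires linearly independent patterns
    W = patterns.T @ Qinv @ patterns / N     # w_ij = (1/N) sum_{mu,nu} x_i^(mu) (Q^-1)_{mu nu} x_j^(nu)
    return W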
A perceptron is a layered feed-forward network (Figure 5.2). The leftmost layer
contains input terminals (black in Figure 5.2). To the right follows a number of
layers of McCulloch-Pitts neurons. The right-most layer of neurons is the output
layer where the output of the network is read out. The other neuron layers are called
hidden layers, because they feed into other neurons, their states are not read out.
All connections wi j are one-way: every neuron (or input terminal) feeds only to
neurons in the layer immediately to the right. There are no connections within
layers, or back connections, or connections that jump over a layer.
There are N input terminals. As in Part I we denote the input patterns by
x^{(µ)} = (x_1^{(µ)}, x_2^{(µ)}, . . . , x_N^{(µ)})^T .    (5.3)

The index µ labels the different input patterns in the training set. It ranges from 1 to
p . All neurons are McCulloch-Pitts neurons. The output neurons in the network on
the left of Figure 5.2, for example, perform the computation:
O_i = g(B_i)  with  B_i = Σ_j W_{ij} x_j − Θ_i    (5.4)

The index i labels the output neurons, it ranges from 1 to M . Each output neuron
has a threshold, Θi . In the literature on deep learning the thresholds are sometimes
referred to as biases, defined as −Θi . The function g is an activation function as

Figure 5.2: Feed-forward network without hidden layer (left), and with one hidden
layer (right). The input terminals are coloured black.
Figure 5.3: Left: classification problem with two-dimensional real-valued inputs and
target values equal to ±1. The red line is the decision boundary (see text). Right:
corresponding perceptron.

described in Section 1.2. Now consider the network on the right of Figure 5.2. The
states of the neurons in the hidden layer are denoted by V j , with thresholds θ j and
weights w j k . In summary:
V_j = θ_H(b_j)  with  b_j = Σ_k w_{jk} x_k − θ_j ,    (5.5a)
O_i = θ_H(B_i)  with  B_i = Σ_j W_{ij} V_j − Θ_i .    (5.5b)

A classification task is given by a training set of input patterns x (µ) and corresponding
target values

t^{(µ)} = (t_1^{(µ)}, t_2^{(µ)}, . . . , t_M^{(µ)})^T .    (5.6)
The perceptron is trained by choosing its weights and thresholds so that the network
produces the desired output.
O_i^{(µ)} = t_i^{(µ)}  for all i and µ.    (5.7)

Remark: if we take t_i^{(µ)} = x_i^{(µ)} for i = 1, . . . , N, the task is the associative memory
problem discussed in Part I.

5.1 A classification task


To illustrate how perceptrons can solve classification tasks, we consider a very simple
example (Figure 5.3). There are ten patterns, each has two real-valued components:
Figure 5.4: The perceptron classifies the patterns correctly for the weight vector w
shown, orthogonal to the decision boundary.

x^{(µ)} = (x_1^{(µ)}, x_2^{(µ)})^T .    (5.8)
In Figure 5.3 the patterns are drawn as points in the x1 -x2 plane, the input plane.
There are two classes of patterns, with targets ±1

t^{(µ)} = 1 for ■ and t^{(µ)} = −1 for □ .    (5.9)

The activation function consistent with the possible target values is the signum
function, g (b ) = sgn(b ). The perceptron has two input terminals connected to a
single output neuron. Since there is only one neuron, we can arrange the weights
into a weight vector

w = (w_1, w_2)^T .    (5.10)
The network performs the computation

O = sgn(w1 x1 + w2 x2 − θ ) = sgn(w · x − θ ) . (5.11)

Here w · x = w1 x1 + w2 x2 is the scalar product between the vectors w and x . This


allows us to find a geometrical interpretation of the classification task. We see in
Figure 5.3 that the patterns fall into two clusters: □ to the right and ■ to the left. We
can classify the patterns by drawing a line that separates the two clusters, so that
everything on the right of the line has t = 1, while the patterns on the left of the
line have t = −1. This line is called the decision boundary. To find the geometrical
significance of Equation (5.11), let us ignore the threshold for a moment, so that

O = sgn(w · x ) . (5.12)

The classification task takes the form

sgn(w · x^{(µ)}) = t^{(µ)} .    (5.13)
Figure 5.5: Decision boundaries without and with threshold.

To evaluate the scalar product we write the vectors as

w = |w| (cos β, sin β)^T  and  x = |x| (cos α, sin α)^T .    (5.14)

Here |w| = (w_1^2 + w_2^2)^{1/2} denotes the norm of the vector w, and α and β are the angles
of the vectors with the x1 -axis. Then w · x = |w ||x | cos(α−β ) = |w ||x | cos ϕ, where ϕ
is the angle between the two vectors. When ϕ is between −π/2 and π/2, the scalar
product is positive, otherwise negative. As a consequence, the network classifies
the patterns in Figure 5.3 correctly if the weight vector is orthogonal to the decision
boundary drawn in Figure 5.4.
What is the role of the threshold θ ? Equation (5.11) shows that the decision
boundary is parameterised by w · x = θ , or

x2 = −(w1 /w2 ) x1 + θ /w2 . (5.15)

Therefore the threshold determines the intersection of the decision boundary with
the x2 -axis (equal to θ /w2 ). This is illustrated in Figure 5.5.
The decision boundary – the straight line orthogonal to w – should divide inputs
with positive and negative targets. If no such line can be found, then the problem
cannot be solved with a single neuron. Conversely, if such a line exists, the problem
can be solved (and it is called linearly separable). Otherwise the problem is not
linearly separable. This can occur only when p > N . Examples of problems that are
linearly separable and not linearly separable are shown in Figure 5.6.

Figure 5.6: Linearly separable and non-separable data.


Value table of the Boolean AND function (decision boundary w · x^{(µ)} = θ with θ = 3/2):
x1  x2  t
0   0   −1
0   1   −1
1   0   −1
1   1   +1
Figure 5.7: Boolean AND function: value table, geometrical representation, and net-
work layout. The weight values are written next to the connections.

Other examples are Boolean functions. A Boolean function takes N binary inputs
and has one binary output. The Boolean AND function (two inputs) is illustrated
in Figure 5.7. The value table of the function is shown on the left. The graphical
representation is shown in the centre of the Figure (□ corresponds to t = −1 and
■ to t = +1). Also shown is the decision boundary, the weight vector w , and the
network layout with the corresponding values of the weights and the threshold. It is
important to note that the decision boundary is not unique, neither are the weight
and threshold values that solve the problem. The norm of the weight vector, in
particular, is arbitrary. Neither is its direction uniquely specified.
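As a quick check, here is a minimal Python sketch verifying that one such choice of parameters — both weights equal to one and threshold θ = 3/2, consistent with Figure 5.7 — solves the AND problem with the signum activation function:

import numpy as np

# one possible choice of weights and threshold for Boolean AND (Figure 5.7)
w, theta = np.array([1.0, 1.0]), 1.5

patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([-1, -1, -1, 1])

outputs = np.sign(patterns @ w - theta)   # O = sgn(w . x - theta), Equation (5.11)
print(outputs)                            # [-1. -1. -1.  1.], equal to the targets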

Figure 5.8 shows that the Boolean XOR function is not linearly separable [34].
There are 16 different Boolean functions of two variables. Only two are not linearly
separable, the XOR and the NOT XOR function.

Up to now we discussed only one output unit. If the classification task requires
several output units, each has its own weight vector w i and threshold θi . We can
group the weight vectors into a weight matrix as in Part I, so that the w i are the rows
of W.

x1  x2  t
0   0   −1
0   1   +1
1   0   +1
1   1   −1
Figure 5.8: The Boolean XOR function is not linearly separable.
Figure 5.9: Illustration of the learning algorithm. In panel (a) the t = 1 pattern x (8) is on
the wrong side of the decision boundary. To correct the error the weight must be rotated
anti-clockwise [panel (b)]. In panel (c) the t = −1 pattern x (4) is on the wrong side of
the decision boundary. To correct the error the weight must be rotated anti-clockwise
[panel (d)].

5.2 Iterative learning algorithm


In the previous Section we determined weights and thresholds by
inspection. Now we discuss an algorithm that allows a computer to find
the weights iteratively. How this works is illustrated in Figure 5.9. In panel (a), the
pattern x (8) (t (8) = 1) is on the wrong side of the decision boundary. To turn the
decision boundary anti-clockwise one adds a small multiple of the pattern vector
x (8) to the weight vector

w′ = w + δw  with  δw = η x^{(8)} .    (5.16)

The parameter η > 0 is called the learning rate. It must be small, so that the decision
boundary is not rotated too far. The result is shown in panel (b). Panel (c) shows
another case, where pattern x (4) (t (4) = −1) is on the wrong side of the decision
boundary. In order to turn the decision boundary in the right way, anti-clockwise,
one subtracts a small multiple of x (4) :

w′ = w + δw  with  δw = −η x^{(4)} .    (5.17)

Note the minus sign. These two learning rules combine to

w′ = w + δw^{(µ)}  with  δw^{(µ)} = η t^{(µ)} x^{(µ)} .    (5.18)


For more than one output unit the rule reads

w′_{ij} = w_{ij} + δw_{ij}^{(µ)}  with  δw_{ij}^{(µ)} = η t_i^{(µ)} x_j^{(µ)} .    (5.19)
This rule is of the same form as Hebb’s rule (2.12). One applies (5.19) iteratively
for a sequence of randomly chosen patterns µ, until the problem is solved. This
corresponds to adding a little bit of Hebb’s rule in each iteration. To ensure that the
algorithm stops when the problem is solved, one can use
δw_{ij}^{(µ)} = η (t_i^{(µ)} − O_i^{(µ)}) x_j^{(µ)} .    (5.20)
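A minimal Python sketch of this iterative scheme for a single ±1 output unit, using the rule (5.20); the learning rate, the number of epochs, and the threshold update (with the opposite sign, in the spirit of Equation (6.13) later) are illustrative choices:

import numpy as np

def train_perceptron(x, t, eta=0.05, epochs=100):
    """Iterative learning with the rule (5.20); x has shape (p, N), t contains +/-1."""
    p, N = x.shape
    w, theta = np.zeros(N), 0.0
    for _ in range(epochs):
        for mu in np.random.permutation(p):       # randomly chosen patterns
            O = np.sign(w @ x[mu] - theta)
            w += eta * (t[mu] - O) * x[mu]        # delta w = eta (t - O) x
            theta -= eta * (t[mu] - O)            # threshold update (illustrative)
    return w, theta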

5.3 Gradient-descent learning


In this Section the learning algorithm (5.20) is derived in a different way, by min-
imising an energy function using gradient descent. This requires differentiation,
therefore we must choose a differentiable activation function. The simplest choice
is g (b ) = b , so that the network computes:
O_i^{(µ)} = Σ_k w_{ik} x_k^{(µ)}    (5.21)
(linear unit). The outputs O_i^{(µ)} assume continuous values, but not necessarily the
target values t_i^{(µ)}. For linear units, the classification problem

O_i^{(µ)} = t_i^{(µ)}  for i = 1, . . . , N and µ = 1, . . . , p    (5.22)
has the formal solution

w_{ik} = (1/N) Σ_{µν} t_i^{(µ)} (Q^{−1})_{µν} x_k^{(ν)} ,    (5.23)

as you can verify by inserting Equation (5.23) into (5.21). Here Q is the overlap matrix
with elements

Q_{µν} = (1/N) x^{(µ)} · x^{(ν)}    (5.24)
(page 63). For the solution (5.23) to exist, the matrix Q must be invertible. This
requires that p ≤ N , because otherwise the pattern vectors are linearly dependent,
and thus also the columns (or rows) of Q. If the matrix Q has linearly dependent
columns or rows it cannot be inverted.
Let us assume that the patterns are linearly independent, so that the solution
(5.23) exists. In this case we can find the solution iteratively. To this end one defines
the energy function
H({w_{ij}}) = (1/2) Σ_{iµ} (t_i^{(µ)} − O_i^{(µ)})^2 = (1/2) Σ_{iµ} (t_i^{(µ)} − Σ_j w_{ij} x_j^{(µ)})^2 .    (5.25)
Here H is regarded as a function of the weights wi j , unlike the energy function in


Part I which is a function of the state-variables of the neurons. The energy function
(5.25) is non-negative, and it vanishes for the optimal wi j if the pattern vectors x (µ)
are linearly independent. This solution of the classification problem corresponds
to the global minimum of H .
To find the global minimum of H one uses gradient descent: one repeatedly
updates the weights by adding increments

w′_{mn} = w_{mn} + δw_{mn}  with  δw_{mn} = −η ∂H/∂w_{mn} .    (5.26)

The small parameter η > 0 is the learning rate. The negative gradient points in the
direction of steepest descent of H . The idea is to take many downhill steps until
one hopefully (but not necessarily) reaches the global minimum.
To evaluate the derivatives one uses the chain rule together with

∂w_{ij} / ∂w_{mn} = δ_{im} δ_{jn} .    (5.27)

Here δk l is the Kronecker delta, δk l = 1 if k = l and zero otherwise. So δi m δ j n = 1


only if i = m and j = n . Otherwise the product of Kronecker deltas equals zero.
Illustration. The linear function, x , and the constant function are going for a walk.
When they suddenly see the derivative approaching, the constant function gets wor-
ried. "I’m not worried" says the function x confidently, "I’m not put to zero by the
derivative." When the derivative comes closer, it says "Hi! I’m ∂ /∂ y . How are you?"
Moral: when i ≠ m or j ≠ n, then w_{ij} and w_{mn} are independent variables, so that
the derivative (5.27) vanishes.
Equation (5.27) gives
δw_{mn} = η Σ_µ (t_m^{(µ)} − O_m^{(µ)}) x_n^{(µ)} .    (5.28)

This learning rule is very similar to Equation (5.20). The difference is that Equa-
tion (5.28) contains a sum over all patterns (batch training). An advantage of the
rule (5.28) is that it is derived from an energy function. This makes it possible to analyse the
convergence of the algorithm.
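A minimal Python sketch of batch gradient descent with the rule (5.28) for a single linear unit; the learning rate and number of iterations are illustrative:

import numpy as np

def train_linear_unit(x, t, eta=0.01, steps=1000):
    """Batch gradient descent on the energy function (5.25) for one linear output."""
    p, N = x.shape
    w = np.zeros(N)
    for _ in range(steps):
        O = x @ w                   # linear unit, Equation (5.21)
        w += eta * (t - O) @ x      # delta w = eta * sum_mu (t - O) x, Equation (5.28)
    return w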
Linear units [Equation (5.21)] are special. You cannot solve the Boolean AND
problem (Figure 5.7) with a linear unit – although the problem is linearly separable
– because the pattern vectors x (µ) are linearly dependent. This means that the
solution (5.23) does not exist. Shifting the patterns or introducing a threshold does
not change this fact. Linear separability does not imply linear independence (but
the converse is true).
(a) Linearly separable problem. (b) Problem not linearly separable.

Figure 5.10: Problems that are not linearly separable can be solved by a piecewise
linear decision boundary.

Therefore we usually use non-linear units, McCulloch-Pitts neurons with non-


linear activation functions g (b ). There are four important points to keep in mind.
First, for non-linear units it matters less whether or not the patterns are linearly
dependent, but it is important whether the problem is linearly separable or not.
Second, if the problem is linearly separable then we can use gradient descent to
determine suitable weights (and thresholds). Third, for gradient descent we must
require that the activation function g (b ) is differentiable, or at least piecewise differ-
entiable. Fourth, we calculate the gradients using the chain rule, resulting in factors
of derivatives g 0 (b ) = db
d
g (b ). This is the origin of the vanishing-gradient problem
(Chapter 7).

5.4 Multi-layer perceptrons


In Sections 5.1 and 5.2 we discussed how to solve linearly separable problems [Figure
5.10(a)]. The aim of this Section is to show that non-separable problems like the one
in Figure 5.10(b) can be solved by a perceptron with one hidden layer. A network
that does the trick for the classification problem in Figure 5.10(b) is depicted in
Figure 5.11. Here the hidden neurons are 0/1 units, but the output neuron gives ±1,
as in the previous Section. The network computes with the following rules:
V_j^{(µ)} = θ_H(b_j^{(µ)})  with  b_j^{(µ)} = Σ_k w_{jk} x_k^{(µ)} − θ_j ,
O_1^{(µ)} = sgn(B_1^{(µ)})  with  B_1^{(µ)} = Σ_j W_{1j} V_j^{(µ)} − Θ_1 .    (5.29)

Here θH (b ) is the Heaviside function. Each of the three neurons in the hidden layer
has its own decision boundary. The idea is to choose weights and thresholds in such
a way that the three decision boundaries divide the input plane into distinct regions,
so that each region contains either only t = −1 patterns or only t = +1 patterns. We shall


see that the values of the hidden neurons encode the different regions. Finally, the
output neuron associates the correct target value with each region.
How this construction works is shown in Figure 5.12. The left part of the Figure
shows the three decision boundaries. The indices of the corresponding hidden
neurons are drawn in blue. Also shown are the weight vectors. The regions are
encoded with a three-digit binary code. The value of the j -th digit is the value of the
j -th hidden neuron: V j = 1 if the pattern is on the weight-vector side of the decision
boundary, and V j = 0 on the other side. The Table shows the targets associated with
each region, together with the code of the region.
A graphical representation of the output problem is shown in Figure 5.13. The
problem is linearly separable. The following function computes the correct output
for each region:
O_1^{(µ)} = sgn(V_1^{(µ)} + V_2^{(µ)} + V_3^{(µ)} − 3/2) .    (5.30)
This completes the construction of a solution. It is not unique.
In summary, one can solve non-linearly separable problems by adding a hidden
layer. The neurons in the hidden layer define segments of a piecewise linear decision
boundary. More neurons are needed if the decision boundary is very wiggly.
Figure 5.14 shows another example: how to solve the Boolean XOR problem with
a perceptron that has two 0/1 neurons in a hidden layer, with thresholds 1/2 and 3/2,
and all weights equal to unity. The output neuron has weights +1 and −1 and
threshold 1/2:

O_1 = sgn(V_1 − V_2 − 1/2) .    (5.31)
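A minimal Python sketch checking that this construction computes the XOR function; the Heaviside hidden units and the signum output follow Figure 5.14 and Equation (5.31):

import numpy as np

def heaviside(b):
    return (b > 0).astype(float)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, 1, 1, -1])                    # XOR targets

V1 = heaviside(x.sum(axis=1) - 0.5)             # hidden neuron with threshold 1/2
V2 = heaviside(x.sum(axis=1) - 1.5)             # hidden neuron with threshold 3/2
O = np.sign(V1 - V2 - 0.5)                      # output, Equation (5.31)
print(np.all(O == t))                           # True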
Minsky and Papert [34] proved in 1969 that all Boolean functions can be rep-
resented by multilayer perceptrons, but that at least one hidden neuron must be
connected to all input terminals. This means that not all neurons in the network are
locally connected (have only a few incoming weights). Since fully connected net-
works are much harder to train, Minsky and Papert offered a somewhat pessimistic

Figure 5.11: Hidden-layer perceptron to solve the problem shown in Figure 5.10 (b).
The three hidden neurons are 0/1 neurons, the output neuron produces ±1.
V1  V2  V3  target
0   0   0   −1
1   0   0    –
0   1   0   −1
0   0   1   −1
1   1   0   +1
1   0   1   +1
0   1   1   +1
1   1   1   +1
Figure 5.12: Left: decision boundaries and regions. Right: encoding of the regions and
corresponding targets. The region 100 does not exist.

view of learning with perceptrons, resulting in a controversy [35]. Now, almost 50


years later, the perspective has changed. Convolutional networks (Chapter 7) have
only local connections to the inputs and can be trained to recognise objects in
images with high accuracy.
In summary, perceptrons are trained on a training set (x (µ) , t (µ) ) with µ = 1, . . . , p
by moving the decision boundaries into the correct positions. This is achieved by
repeatedly applying Hebb’s rule to adjust all weights. This corresponds to using
gradient-descent learning on the energy function (5.25). We have not discussed how
to update the thresholds yet, but it is clear that they can be updated with gradient-
descent learning (Section 5.3).
Once all decision boundaries are in the right place we must ask: what happens
when we apply the trained network to a new dataset? Does it classify the new inputs
correctly? In other words, can the network generalise? An example is shown in
Figure 5.15. Panel (a) shows the result of training the network on a training set.
The decision boundary separates t = −1 patterns from t = 1 patterns, so that the
network classifies all patterns in the training set correctly. In panel (b) the trained

Figure 5.13: Graphical representation of the output problem for the classification
problem shown in Figure 5.12.
V1  V2  t
0   0   −1
1   0   +1
0   1   +1
1   1   −1
Figure 5.14: Boolean XOR function: value table, geometrical representation, and network
layout. The two hidden neurons are 0/1 neurons, the output produces ±1.
Figure 5.15: (a) Result of training the network on a training set. (b) Validation by
feeding the patterns of a validation set.

network is applied to patterns in a validation set. We see that most patterns are
correctly classified, save for one error. This means that the energy function (5.25)
is not exactly zero for the validation set. Nevertheless, the network does quite a
good job. Usually it is not a good idea to try to precisely classify all patterns near the
decision boundary, because real-world data sets are subject to noise. It is a futile
effort to try to learn and predict noise.

5.5 Summary
Perceptrons are layered feed-forward networks that can learn to classify data in
a training set (x (µ) , t (µ) ). For each input pattern x (µ) the network finds the correct
targets t (µ) . We discussed the learning algorithm for a simple example: real-valued
patterns with just two components, and one binary target. This allowed us to
represent the classification problem graphically. There are three different ways of
understanding how the perceptron learns. First, geometrically, to learn means to
move the decision boundaries into the right places. Second, this can be achieved by
repeatedly adding a little bit of Hebb’s rule. Third, this algorithm corresponds to
gradient descent on the energy function (5.25).
Figure 5.16: Alternative solution of the classification problem shown in Figure 5.12.

5.6 Further reading


A short account of the history of perceptron research is the review by Kanal [35]. He
discusses the work of Rosenblatt [12, 13], McCulloch and Pitts [11], as well as the
early controversy around the book by Minsky and Papert [34].

5.7 Exercises
Non-orthogonal patterns. Show that the rule (5.2) ensures that all patterns are
recognised, for any set of non-orthogonal patterns that gives rise to an invertible
matrix Q. Demonstrate this by showing that the cross-talk term evaluates to zero,
assuming that Q−1 exists.

Boolean functions. How many Boolean functions with three-dimensional inputs


are there? How many of them are linearly separable?

Piecewise linear decision boundary. Find an alternative solution for the classifica-
tion problem shown in Figure 5.12, where the weight vectors are chosen as depicted
in Figure 5.16.

5.8 Exam questions

5.8.1 Linear separability of Boolean functions


(a) The value table for the Boolean XOR problem is shown in Table 5.1. Show that
this problem cannot be solved by a simple perceptron, with two input terminals
and one output unit (no hidden layer) (0.5p).
(b) Now consider Boolean functions with three-dimensional inputs. How many of
the 256 Boolean functions with three-dimensional inputs can be solved by a per-
ceptron with three input terminals and one output unit (no hidden layer)? Describe
how you arrive at the answer. Hint: visualise the functions by colouring the corners
of a cube. Use symmetries to reduce the number of cases. (1.5p).

x1 x2 t
0 0 −1
0 1 +1
1 0 +1
1 1 −1

Table 5.1: Value table for the XOR problem. Question 5.8.1.

5.8.2 Boolean functions


Any N -dimensional Boolean function can be represented using a perceptron with
one hidden layer consisting of 2N neurons. Here we consider N = 3 and N = 2. The
(µ)
three-dimensional parity problem is specified in Figure 5.17. The input bits xk for
k = 1 , 2 , 3 are either +1 or -1. The output O (µ) of the network is +1 if there is an odd
number of positive bits in x (µ) , and -1 if the number of positive bits are even. In one
solution, the state V_j^{(µ)} of neuron j = 1, . . . , 2^N in the hidden layer is given by

V_j^{(µ)} = 1 if −θ_j + Σ_k w_{jk} x_k^{(µ)} > 0 ,  and  V_j^{(µ)} = 0 if −θ_j + Σ_k w_{jk} x_k^{(µ)} ≤ 0 ,    (5.32)

where the weights and thresholds are given by w_{jk} = x_k^{(j)} and θ_j = 2. The output
is given by O^{(µ)} = Σ_j W_j V_j^{(µ)}.

(a) Determine the weights Wj . (1 p).


(b) In two dimensions the problem in (a) is the XOR problem. Specialise the solu-
tion from (a) to the XOR problem and draw the decision boundaries of the hidden
neurons. (0.5 p)
Figure 5.17: The three-dimensional parity problem. A white ball indicates O (µ) = −1,
and a black ball indicates O (µ) = +1. Question 5.8.2.

Figure 5.18: Classification problem. Input space is the x1 -x2 -plane. Question 5.8.3.
5.8.3 Linearly inseparable problem


A classification problem is specified in Figure 5.18. The aim is to map input patterns
x (µ) to outputs O (µ) as follows: if a point x (µ) lies inside the triangle it is mapped to
O (µ) = 1, but if x (µ) is outside the triangle it is mapped to O (µ) = 0. How patterns on
the triangle boundary are classified is not important.
(a) Show that this problem is not linearly separable. (0.5 p).
(b) The problem can be solved by a perceptron with one hidden layer with three
neurons ( j = 1, 2, 3)
V_j^{(µ)} = θ_H( −θ_j + Σ_{k=1}^{2} w_{jk} x_k^{(µ)} ) .    (5.33)

The output is computed as

O^{(µ)} = θ_H( −Θ + Σ_{j=1}^{3} W_j V_j^{(µ)} ) .    (5.34)

Here w_{jk} and W_j are weights, θ_j and Θ are thresholds, and θ_H is the Heaviside
function:

θ_H(b) = 1 for b > 0 ,  and  θ_H(b) = 0 for b ≤ 0 .    (5.35)

Find weights and thresholds that solve the classification problem. (1 p).

5.8.4 Perceptron with one hidden layer


A perceptron has one input layer, one layer of hidden neurons, and one output unit.
It receives two-dimensional input patterns x^{(µ)} = (x_1^{(µ)}, x_2^{(µ)})^T. They are mapped to
four hidden neurons V_j^{(µ)} according to

V_j^{(µ)} = 0 if −θ_j + Σ_k w_{jk} x_k^{(µ)} ≤ 0 ,  and  V_j^{(µ)} = 1 if −θ_j + Σ_k w_{jk} x_k^{(µ)} > 0 ,    (5.36)

where w_{jk} and θ_j are weights and thresholds of the hidden neurons. The output is
given by

O^{(µ)} = 0 if −Θ + Σ_j W_j V_j^{(µ)} ≤ 0 ,  and  O^{(µ)} = 1 if −Θ + Σ_j W_j V_j^{(µ)} > 0 .    (5.37)
Weights W_j and threshold Θ of the output are given by:

W_1 = W_3 = W_4 = 1 ,  W_2 = −1 ,  and  Θ = 1/2 .    (5.38)

(a) Figure 5.19(left) shows how input space is mapped to the hidden neurons.
Draw the decision boundary of the network, given the weights and thresholds in
Equation (5.38). (0.5p).
(b) Show that one cannot map the input space to the space of hidden neurons as in
Figure 5.19(right). (0.5 p).
(c) Give values of wi j and θi that yield the pattern in Figure 5.19(left). (1 p).

Left panel of Figure 5.19 (rows from top to bottom):
(0, 0, 1, 1)^T  (0, 1, 1, 1)^T  (1, 1, 1, 1)^T
(0, 0, 0, 1)^T  (0, 1, 0, 1)^T  (1, 1, 0, 1)^T
(0, 0, 0, 0)^T  (0, 1, 0, 0)^T  (1, 1, 0, 0)^T

Right panel of Figure 5.19 (rows from top to bottom):
(0, 0, 1, 1)^T  (0, 1, 1, 1)^T  (1, 1, 0, 1)^T
(0, 0, 0, 1)^T  (0, 1, 0, 1)^T  (1, 1, 1, 1)^T
(0, 0, 0, 0)^T  (0, 1, 0, 0)^T  (1, 1, 0, 0)^T

Figure 5.19: Left: input space with decision boundaries of the hidden neurons V j
(black lines). These decision boundaries divide input space into nine zones, each with a
certain coordinate V = (V1 , V2 , V3 , V4 )T in the space of the hidden neurons. Right: same,
but here the indicated mapping to the space of the hidden neurons is not possible.
Question 5.8.4.

5.8.5 Linearly inseparable problem


A classification problem is specified in Figure 5.20, where a grey triangle in input
space is shown. The aim is to map input patterns x (µ) to outputs O (µ) as follows: if a
point x (µ) lies inside the triangle it is mapped to O (µ) = +1, but if x (µ) is outside the
triangle it is mapped to O (µ) = −1. How patterns on the boundary of the triangle are
classified is not important.

(a) Show that this problem is not linearly separable by constructing a counter-
example using four input patterns. (0.5p).
(b) The problem can be solved by a perceptron with one hidden layer with three
neurons ( j = 1 , 2 , 3)
V_j^{(µ)} = sgn( −θ_j + Σ_{k=1}^{2} w_{jk} x_k^{(µ)} )    (5.39)

and output

O^{(µ)} = sgn( −Θ + Σ_{j=1}^{3} W_j V_j^{(µ)} ) .    (5.40)

Here w j k and Wj are weights and θ j and Θ are thresholds. The weights w j k are
subject to the following constraints. First, the three weights w j 1 are all equal to
one, w11 = w21 = w31 = 1. Second, the three weights w j 2 are such that x (µ) = (-4,-1)T
maps to V (µ) = (1,-1,-1)T and x (µ) = (-1,5)T maps to V (µ) = (-1,-1,1)T . Given these
constraints, find values of w j k , Wj , θ j and Θ that solve the classification problem.
Hint: The constraints uniquely determine the hidden thresholds, the orientations
of the hidden weight vectors, and their order in the weight matrix. (1p).

Figure 5.20: Classification problem. Input space is the x1 -x2 -plane. Question 5.8.5.
6 Stochastic gradient descent


In Chapter 5 we discussed how a hidden layer helps to classify problems that are
not linearly separable. We explained how the decision boundary in Figure 5.12 is
represented in terms of the weights and thresholds of the hidden neurons, and
introduced a training algorithm based on gradient descent. In this Section, the
training algorithm is discussed in more detail. It is explained how it is implemented,
why it converges, and how its performance can be improved. Figure 6.1 shows
the layout of the network to be trained. There are p input patterns x (µ) with N
components each, as before. The output of the network has M components:

O^{(µ)} = (O_1^{(µ)}, O_2^{(µ)}, . . . , O_M^{(µ)})^T ,    (6.1)

to be matched to the targets t (µ) . The activation functions must be differentiable (or
at least piecewise differentiable), but apart from that there is no need to specify them
further at this point. The network shown in Figure 6.1 performs the computation

V_j^{(µ)} = g(b_j^{(µ)})  with  b_j^{(µ)} = Σ_k w_{jk} x_k^{(µ)} − θ_j ,
O_i^{(µ)} = g(B_i^{(µ)})  with  B_i^{(µ)} = Σ_j W_{ij} V_j^{(µ)} − Θ_i .    (6.2)

So the outputs are computed in terms of nested activation functions:

   
O_i^{(µ)} = g( Σ_j W_{ij} g( Σ_k w_{jk} x_k^{(µ)} − θ_j ) − Θ_i ) ,    (6.3)

where the argument of the outer activation function contains the hidden-neuron values V_j^{(µ)}.

This is a consequence of the network layout of the perceptron: all incoming con-
nections to a given neuron are from the layer immediately to the left, all outgoing
connections to the layer immediately to the right. The more hidden layers a network
has, the deeper is the nesting of the activation functions.
Figure 6.1: Neural network with one hidden layer. Illustrates the notation used in
Section 6.1.

6.1 Chain rule and error backpropagation


The network is trained by gradient-descent learning on the energy function (5.25),
in the same way as in Section 5.3:
H = (1/2) Σ_{µi} (t_i^{(µ)} − O_i^{(µ)})^2 .    (6.4)

The weights are updated using the increments


δW_{mn} = −η ∂H/∂W_{mn}  and  δw_{mn} = −η ∂H/∂w_{mn} .    (6.5)
As in Section 5.3, the small parameter η > 0 is the learning rate. The derivatives of
the energy function are evaluated with the chain rule. For the weights connecting
to the output layer we apply the chain rule once
∂H/∂W_{mn} = −Σ_{µi} (t_i^{(µ)} − O_i^{(µ)}) ∂O_i^{(µ)}/∂W_{mn} ,    (6.6a)

and then once more:


∂O_i^{(µ)}/∂W_{mn} = ∂/∂W_{mn} g( Σ_j W_{ij} V_j^{(µ)} − Θ_i ) = g′(B_i^{(µ)}) δ_{im} V_n^{(µ)} .    (6.6b)

Here g′(B) = dg/dB is the derivative of the activation function with respect to
the local field B. An important point here is that the values V_j of the neurons in
the hidden layer do not depend on Wmn . The neurons V j do not have incoming
connections with these weights, a consequence of the feed-forward layout of the
network. In summary we obtain for the increments of the weights connecting to
the output layer:
δW_{mn} = −η ∂H/∂W_{mn} = η Σ_{µ=1}^{p} (t_m^{(µ)} − O_m^{(µ)}) g′(B_m^{(µ)}) V_n^{(µ)} ≡ η Σ_{µ=1}^{p} ∆_m^{(µ)} V_n^{(µ)} .    (6.7)

The quantity ∆_m^{(µ)} = (t_m^{(µ)} − O_m^{(µ)}) g′(B_m^{(µ)}) is a weighted error: it vanishes when O_m^{(µ)} = t_m^{(µ)}.
The weights connecting to the hidden layer are updated in a similar fashion, by
applying the chain rule three times:
∂H/∂w_{mn} = −Σ_{µi} (t_i^{(µ)} − O_i^{(µ)}) ∂O_i^{(µ)}/∂w_{mn} ,    (6.8a)

∂O_i^{(µ)}/∂w_{mn} = ∂/∂w_{mn} g( Σ_j W_{ij} V_j^{(µ)} − Θ_i ) = g′(B_i^{(µ)}) Σ_j W_{ij} ∂V_j^{(µ)}/∂w_{mn} ,    (6.8b)

∂V_j^{(µ)}/∂w_{mn} = ∂/∂w_{mn} g( Σ_k w_{jk} x_k^{(µ)} − θ_j ) = g′(b_j^{(µ)}) δ_{jm} x_n^{(µ)} .    (6.8c)

Here δ_{jm} is the Kronecker delta (δ_{jm} = 1 if j = m and zero otherwise). Taking these
results together, one has

δw_{mn} = η Σ_µ Σ_i ∆_i^{(µ)} W_{im} g′(b_m^{(µ)}) x_n^{(µ)} = η Σ_µ δ_m^{(µ)} x_n^{(µ)} .    (6.9)

The quantities δ_m^{(µ)} are errors associated with the hidden layer (they vanish when
the output errors ∆_i^{(µ)} are zero). Equation (6.9) shows that the errors are determined
recursively, in terms of the errors in the layer to the right:

δ_m^{(µ)} = Σ_i ∆_i^{(µ)} W_{im} g′(b_m^{(µ)}) .    (6.10)

In other words, the error δ_m^{(µ)} for the hidden layer is computed in terms of the output
errors ∆_i^{(µ)}. Equations (6.7) and (6.9) show that the weight increments have the same
form:

δW_{mn} = η Σ_{µ=1}^{p} ∆_m^{(µ)} V_n^{(µ)}  and  δw_{mn} = η Σ_{µ=1}^{p} δ_m^{(µ)} x_n^{(µ)} .    (6.11)
Figure 6.2: Backpropagation algorithm: the states of the neurons are updated forward
(from left to right) while errors are updated backward (right to left).

The rule (6.11) is sometimes referred to as the δ-rule. It is local: the increments of
the weights feeding into a certain layer are determined by the errors associated with
that layer, and by the states of the neurons in the layer immediately to the left.
If the network has more hidden layers, then their errors are computed recursively
using Equation (6.10), and the formula for the weight increments has the same form
as Equation (6.11) (Algorithm 2). Figure 6.2 illustrates the different ways in which
neurons and errors are updated. The feed-forward structure of the layered network
means that the neurons are updated from left to right (blue arrows). Equation (6.10),
by contrast, implies that the errors are updated from the right to the left (red arrows),
from the output layer to the hidden layer. The term backpropagation refers to this
difference: the neurons are updated forward, the errors are updated backward.
The thresholds are updated in a similar way:

δΘ_m = −η ∂H/∂Θ_m = η Σ_µ (t_m^{(µ)} − O_m^{(µ)}) [−g′(B_m^{(µ)})] = −η Σ_µ ∆_m^{(µ)} ,    (6.12a)

δθ_m = −η ∂H/∂θ_m = η Σ_µ Σ_i ∆_i^{(µ)} W_{im} [−g′(b_m^{(µ)})] = −η Σ_µ δ_m^{(µ)} .    (6.12b)

The general form for the threshold increments looks like Equation (6.11)
δΘ_m = −η Σ_µ ∆_m^{(µ)}  and  δθ_m = −η Σ_µ δ_m^{(µ)} ,    (6.13)

but without the state variables of the neurons (or the inputs), as expected. A way
to remember the difference between Equations (6.11) and (6.13) is to note that the
formula for the threshold increments looks like the one for the weight increments if
one sets the values of the neurons to −1.
Stochastic gradient descent

The backpropagation rules (6.7), (6.9), and (6.12) contain sums over patterns. This
corresponds to feeding all patterns at the same time to compute the increments
of weights and thresholds (batch training). Alternatively one may choose a single
pattern, update the weights by backpropagation, and then continue to iterate these
training steps many times. This is called sequential training. One iteration cor-
responds to feeding a single pattern, p iterations are called one epoch (in batch
training, one iteration corresponds to one epoch). If one chooses the patterns ran-
domly, then sequential training results in stochastic gradient descent. Since the sum
over patterns is absent, the steps do not necessarily point downhill; their directions
fluctuate. This yields a stochastic path through weight and threshold space, less
prone to getting stuck in local minima (Chapters 3 and 7).
The stochastic gradient-descent algorithm is summarised in Section 6.2. It ap-
plies to networks with feed-forward layout, where neurons in a given layer take input
only from the neurons in the layer immediately to the left.

Mini batches

In practice, the stochastic gradient-descent dynamics may be too noisy. It is often


better to average over a small number of randomly chosen patterns. Such a set is
called a mini batch, of size m_B say. In stochastic gradient descent with mini batches
one replaces Equations (6.11) and (6.13) by

δW_{mn} = η Σ_{µ=1}^{m_B} ∆_m^{(µ)} V_n^{(µ)}  and  δΘ_m = −η Σ_{µ=1}^{m_B} ∆_m^{(µ)} ,    (6.14)

δw_{mn} = η Σ_{µ=1}^{m_B} δ_m^{(µ)} x_n^{(µ)}  and  δθ_m = −η Σ_{µ=1}^{m_B} δ_m^{(µ)} .

Sometimes the mini-batch rule is quoted with prefactors of m_B^{−1} before the sums.
This does not make any fundamental difference; the factors m_B^{−1} can simply be
absorbed into the learning rate. But when you compare learning rates for different
implementations, it is important to check whether or not there are factors of m_B^{−1} in
front of the sums in Equation (6.14).
How does one assign inputs to mini batches? This is discussed in Section 6.3.1: at
the beginning of each epoch, one should randomly shuffle the sequence of the input
patterns in the training set. Then the first mini batch contains patterns µ = 1, . . . , m B ,
and so forth.
Figure 6.3: Saturation of the activation functions (6.15): the derivative g′(b) tends to
zero for large values of |b|.

Activation functions
Common choices for g(b) are the sigmoid function or tanh:

g(b) = 1/(1 + e^{−b}) ≡ σ(b) ,    (6.15a)
g(b) = tanh(b) .    (6.15b)

Their derivatives can be expressed in terms of the function itself:

(d/db) σ(b) = σ(b)[1 − σ(b)] ,    (6.16a)
(d/db) tanh(b) = 1 − tanh^2(b) .    (6.16b)
Other activation functions are discussed in Chapter 7.
As illustrated in Figure 6.3, the activation functions (6.15) saturate at large values
of |b|, so that the derivative g′(b) tends to zero. Since the backpropagation rules
(6.7), (6.9), and (6.12) contain factors of g′(b), this implies that the algorithm slows
down. It is a good idea to monitor the values of the local fields during training, to
check that they do not become too large.

Initialisation of weights and thresholds


The initial weights and thresholds must be chosen so that the local fields are not
too large (but not too small either). A standard procedure is to take all weights
to be initially randomly distributed, for example Gaussian with mean zero and a
suitable variance. The performance of networks with many hidden layers (deep
networks) can be very sensitive to the initialisation of the weights (Section 7.2.4).
The thresholds are usually set to zero. The initial values of the thresholds are not so
critical, because thresholds are often learned more rapidly than the weights, at least
initially.

Training
The training continues until the global minimum of H has been reached, or until H
is deemed sufficiently small. The resulting weights and thresholds are not unique.
In Figure 5.14 all weights for the Boolean XOR function are equal to ±1. But the
training algorithm (6.7), (6.9), and (6.12) corresponds to repeatedly adding weight
increments. This may cause the weights to grow.

6.2 Stochastic gradient-descent algorithm


The algorithm described in the previous Section applies to networks with any
number of layers. We label the layers by the index ℓ = 0, . . . , L. The layer of input
terminals has label ℓ = 0, while ℓ = L denotes the layer of output neurons. The state
variables for the neurons in layer ℓ are V_j^{(ℓ)}, the weights connecting into these
neurons from the left are w_{jk}^{(ℓ)}, the local fields needed to update V_j^{(ℓ)} are
b_j^{(ℓ)} = Σ_k w_{jk}^{(ℓ)} V_k^{(ℓ−1)} − θ_j^{(ℓ)}, and the errors associated with layer ℓ are denoted by δ_k^{(ℓ)}.
This is illustrated in Figure 6.4. The algorithm is summarised below (Algorithm 2).
Figure 6.4: Illustrates notation for stochastic gradient-descent algorithm.

Algorithm 2 stochastic gradient descent

1: initialise weights w_{kj}^{(ℓ)} to Gaussian random numbers, thresholds to zero, θ_j^{(ℓ)} = 0;
2: for t = 1, . . . , T do
3:   choose a value of µ and apply pattern x^{(µ)} to the input layer, V^{(0)} ← x^{(µ)};
4:   for ℓ = 1, . . . , L do
5:     propagate forward: V_k^{(ℓ)} ← g( Σ_j w_{kj}^{(ℓ)} V_j^{(ℓ−1)} − θ_k^{(ℓ)} );
6:   end for
7:   compute errors for output layer: δ_i^{(L)} ← g′(b_i^{(L)}) (t_i − V_i^{(L)});
8:   for ℓ = L, . . . , 2 do
9:     propagate backward: δ_j^{(ℓ−1)} ← Σ_i δ_i^{(ℓ)} w_{ij}^{(ℓ)} g′(b_j^{(ℓ−1)});
10:  end for
11:  for ℓ = 1, . . . , L do
12:    update: w_{ij}^{(ℓ)} ← w_{ij}^{(ℓ)} + η δ_i^{(ℓ)} V_j^{(ℓ−1)}  and  θ_i^{(ℓ)} ← θ_i^{(ℓ)} − η δ_i^{(ℓ)};
13:  end for
14: end for
15: end;
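A minimal Python sketch of Algorithm 2 for a network with one hidden layer and tanh activations; the network size, weight initialisation, learning rate, and number of iterations are illustrative choices:

import numpy as np

def g(b):  return np.tanh(b)
def gp(b): return 1.0 - np.tanh(b)**2           # g'(b), Equation (6.16b)

def sgd(x, t, n_hidden=10, eta=0.1, T=10000):
    """Stochastic gradient descent (Algorithm 2) for one hidden layer."""
    p, N = x.shape
    M = t.shape[1]
    rng = np.random.default_rng()
    w = rng.normal(0, 1/np.sqrt(N), (n_hidden, N));        theta = np.zeros(n_hidden)
    W = rng.normal(0, 1/np.sqrt(n_hidden), (M, n_hidden));  Theta = np.zeros(M)
    for _ in range(T):
        mu = rng.integers(p)                    # choose a pattern at random
        b = w @ x[mu] - theta;  V = g(b)        # propagate forward
        B = W @ V - Theta;      O = g(B)
        Delta = gp(B) * (t[mu] - O)             # output errors, Equation (6.7)
        delta = gp(b) * (W.T @ Delta)           # backpropagated errors, Equation (6.10)
        W += eta * np.outer(Delta, V);    Theta -= eta * Delta     # Equations (6.11), (6.13)
        w += eta * np.outer(delta, x[mu]); theta -= eta * delta
    return w, theta, W, Theta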
x1   x2   t
100  101  −1
100   99   1
Figure 6.5: Illustrates the effect of non-zero input mean upon the energy function for
one output neuron with tanh activation function and two input terminals. The graph
plots the contours of H for θ = 0 for the training set on the left. The plot illustrates that
H is close to zero only at the bottom of a very narrow trough (hashed region) with steep
sides.

6.3 Recipes for improving the performance


6.3.1 Preprocessing the input data
It can be useful to preprocess the input data, although any preprocessing may
remove information from the data. Nevertheless, it is usually advisable to rigidly
shift the data so that its mean vanishes
⟨x_k⟩ = (1/p) Σ_{µ=1}^{p} x_k^{(µ)} = 0 .    (6.17)

There are several reasons for this. The first one is illustrated in Figure 6.5. The Figure
shows the energy function for a single output neuron with tanh activation function
and two input terminals. The classification problem is given in the Table. The input
data has large mean values in both components, x1 and x2 . Since it is difficult to
visualise the dependence of H on both weights and threshold, the graph on the
right shows how the energy function H depends on the weights for zero threshold.
The large mean values of the inputs cause steep cliffs in the energy function that
are difficult to maneuver with gradient descent. Different input-data variances in
different directions have a similar effect. Therefore one usually scales the inputs so
that the input-data distribution has the same variance in all directions (Figure 6.6),
equal to unity for instance:
σ_k^2 = (1/p) Σ_{µ=1}^{p} (x_k^{(µ)} − ⟨x_k⟩)^2 = 1 .    (6.18)
Figure 6.6: Shift and scale the input data to achieve zero mean and unit variance.

Second, to avoid saturation of the neurons connected to the inputs, their local
fields must not be too large. If one initialises the weights in the above example to
Gaussian random numbers with mean zero and unit variance, large activations are
quite likely.
Third, enforcing zero input mean by shifting the input data avoids a situation where the weights
of the neurons in the first hidden layer can only decrease or increase together [36].
Equation (6.14) shows that the increments δw_m into neuron m are likely to have
the same signs if the inputs have large mean values. In that case all weight increments
have the same sign, which makes it difficult for the network to learn to
differentiate.
In summary, one usually shifts and scales the input-data distribution so that it
has mean zero and unit variance. This is illustrated in Figure 6.6. The same trans-
formation (using the mean values and scaling factors determined for the training
set) should be applied to any new data set that the network is supposed to classify
after it has been trained on the training set.
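A minimal Python sketch of this standardisation; note that the mean values and scaling factors are computed from the training set and then reused for any new data, as described above:

import numpy as np

def fit_standardiser(x_train):
    """Per-component mean and standard deviation of the training inputs."""
    return x_train.mean(axis=0), x_train.std(axis=0)

def standardise(x, mean, std):
    """Shift and scale inputs to zero mean and unit variance (Figure 6.6)."""
    return (x - mean) / std

# usage: statistics from the training set are applied to the validation set as well
# mean, std = fit_standardiser(x_train)
# x_train_s = standardise(x_train, mean, std)
# x_valid_s = standardise(x_valid, mean, std)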
Figure 6.7 shows a distribution of inputs that falls into two distinct clusters. The
difference between the clusters is sometimes called covariate shift, here covariate is
just another term for input. Imagine feeding first just inputs from one of the clusters
to the network. It will learn local properties of the decision boundary, instead of
its global features. Such global properties are efficiently learned if the network is
more frequently confronted with unfamiliar data. For sequential training (stochastic
gradient descent) this is not a problem, because the sequence of input patterns
presented to the network is random. However, if one trains with mini batches, the


Figure 6.7: When the input data falls into clusters as shown in this Figure, one should
randomly pick data from either cluster, to avoid that patterns become too familiar. The
decision boundary is shown in red.
Figure 6.8: Illustration of principal-component analysis.

mini batches should contain randomly chosen patterns in order to avoid covariate
shifts. To this end one randomly shuffles the sequence of the input patterns in the
training set, at the beginning of each epoch.
It is sometimes recommended [36] to observe the output errors during training. If
the errors are similar for a number of subsequent learning steps, the corresponding
inputs appear familiar to the network. Larger errors correspond to unfamiliar inputs,
and Ref. [36] suggests to feed such inputs more often.
Often the input data is very high dimensional, requiring many input terminals.
This usually means that there are many neurons in the hidden layers, and the large
number of neurons makes the training computationally very expensive. To avoid this
as far as possible, one can reduce the dimensionality of the input data by principal-
component analysis. This method makes it possible to project high-dimensional data onto a
lower-dimensional subspace. How this can work is illustrated for a simple example
in Figure 6.8. You see that all data points fall onto a one-dimensional subspace,
the solid line with slope 1/2 (principal direction). The coordinate orthogonal to the
principal direction is not useful in classifying the data, for the example shown. This
coordinate can be removed in the following way. One uses the fact that the principal
direction points in the direction of the leading eigenvector of the data-covariance
matrix C, that is to the eigenvector with the largest eigenvalue. The data-covariance
matrix has elements
C_{ij} = (1/p) Σ_{µ=1}^{p} (x_i^{(µ)} − ⟨x_i⟩)(x_j^{(µ)} − ⟨x_j⟩)  with  ⟨x_i⟩ = (1/p) Σ_{µ=1}^{p} x_i^{(µ)} .    (6.19)

For the example shown in Figure 6.8, the data-covariance matrix reads
 
C = (1/4) ( 10    5
             5  10/4 ) .    (6.20)

Its eigenvalues and eigenvectors are:


   
λ_1 = 25/8 ,  u_1 = (1/√5) (2, 1)^T  and  λ_2 = 0 ,  u_2 = (1/√5) (1, −2)^T .    (6.21)
Figure 6.9: Overfitting. Left: accurate representation of the decision boundary in the
training set, for a network with 15 neurons in the hidden layer. Right: this new data
set differs from the first one just by a little bit of noise. The points in the vicinity of the
decision boundary are not correctly classified.

We see that the leading eigenvector u_1 defines the principal direction. Figure 6.8 is
an extreme example. Usually there is noise, causing the data to scatter around
the principal direction. This does not change much. The result is that the smaller
eigenvalue is no longer equal to zero, but still small if the data does not scatter
too much about the principal direction. When there are many dimensions, we
inspect the ordered sequence of eigenvalues. Often there is a gap between the small
eigenvalues (close to zero), and larger ones. Then one can safely throw away the
small eigenvalues. If there is no gap, it is less clear what to do.
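A minimal Python sketch of this dimensionality reduction; the number of retained components is an illustrative parameter, to be chosen by inspecting the eigenvalue spectrum as described above:

import numpy as np

def pca_project(x, n_components):
    """Project the data onto the leading eigenvectors of the covariance matrix (6.19)."""
    x_centred = x - x.mean(axis=0)
    C = x_centred.T @ x_centred / x.shape[0]        # data-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # sort descending
    principal = eigvecs[:, order[:n_components]]    # leading eigenvectors
    return x_centred @ principal                    # projected coordinates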

6.3.2 Overfitting
The goal of supervised learning is to generalise from a training set to new data. Only
general properties of the training set can be generalised, not specific ones that are
particular to the training set and that could be very different in new data. A network
with more neurons may classify the training data better, because it accurately repre-
sents all specific features of the data. But those specific properties could look quite
different in new data (Figure 6.9). As a consequence, we must look for a compromise:
between accurate classification of the training set and the ability of the network to
generalise. The problem illustrated in Figure 6.9 is also referred to as overfitting:
the network fits overly fine details (for instance noise in the training set) that have no
general meaning. The tendency to overfit is larger for networks with more neurons.
One way of avoiding overfitting is to use cross validation and early stopping. One
splits the training data into two sets: a training set and a validation set. The idea
is that these sets share the general features to be learnt. But although training and
validation data are drawn from the same distribution, they differ in details that
are not of interest. The network is trained on the training set. During training
one monitors not only the energy function for the training set, but also the energy
Figure 6.10: Progress of training and validation errors. The plot is schematic, and
the data is smoothed. Shown is the natural logarithm of the energy functions for the
training set (solid line) and the validation set (dashed line) as a function of the number
of training iterations. The training is stopped when the validation energy begins to
increase. In Section a precise criterion for this early stopping is introduced, one that
works for fluctuating data.

function evaluated on the validation data. As long as the network learns general
features of the input distribution, both training and validation energies decrease.
But when the network starts to learn specific features of the training set, then the
validation energy saturates, or may start to increase. At this point the training should
be stopped. This scheme is illustrated in Figure 6.10.
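
The following sketch illustrates the scheme for a toy problem (linear network, quadratic energy function); the patience parameter, i.e. the number of epochs one waits for the validation energy to improve, is an assumption chosen only for illustration:

import numpy as np

rng = np.random.default_rng(1)

# toy problem: fit targets t = w_true . x with a linear "network"
w_true = np.array([1.0, -2.0, 0.5])
x_train = rng.normal(size=(50, 3)); t_train = x_train @ w_true + 0.3 * rng.normal(size=50)
x_val   = rng.normal(size=(30, 3)); t_val   = x_val   @ w_true + 0.3 * rng.normal(size=30)

def energy(w, x, t):                     # quadratic energy function H
    return 0.5 * np.mean((t - x @ w) ** 2)

w = rng.normal(size=3)
eta, patience = 0.05, 10
best_H_val, best_w, bad_epochs = np.inf, w.copy(), 0
for epoch in range(1000):
    grad = -(t_train - x_train @ w) @ x_train / len(t_train)   # gradient of H on the training set
    w -= eta * grad                                            # gradient-descent step
    H_val = energy(w, x_val, t_val)
    if H_val < best_H_val:                                     # validation energy still decreases
        best_H_val, best_w, bad_epochs = H_val, w.copy(), 0
    else:
        bad_epochs += 1                                        # no improvement this epoch
        if bad_epochs >= patience:                             # early stopping
            break
print(best_H_val)                                              # energy at the early-stopping point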
Often the possible values of the output neurons are continuous while the targets
assume only discrete values. Then it is important to also monitor the classification
error of the validation set. The definition of the classification error depends on the
type of the classification problem. Assume first that there is one output unit, and
that the targets take the values t = 0/1. Then the classification error can be defined
as

    C = (1/p) Σ_{µ=1}^p | t^(µ) − θ_H( O^(µ) − 1/2 ) | .      (6.22a)

If, by contrast, the targets take the values t = ±1, then the classification error is
defined as
    C = (1/(2p)) Σ_{µ=1}^p | t^(µ) − sgn( O^(µ) ) | .      (6.22b)

Now consider a classification problem where inputs must be classified into M
mutually exclusive classes. An example is the MNIST data set of hand-written digits
(Section 7.4), where M = 10. Another example is given in Table 6.1, with M = 3. In
both examples one of the targets t_i^(µ) equals unity while the others equal zero, for a given input
x^(µ). As a consequence, Σ_{i=1}^M t_i^(µ) = 1. Assume that the network has sigmoid outputs,
O_i^(µ) = σ(b_i^(µ)). To classify input x^(µ) from the network outputs O_i^(µ) we compute for

outputs              targets                  correct?
0.4   0.4   0.55     0  0  1   setosa         yes
0.4   0.55  0.4      0  1  0   versicolor     yes
0.1   0.2   0.8      1  0  0   virginica      no

outputs              targets                  correct?
0.1   0.2   0.8      0  0  1   setosa         yes
0.1   0.8   0.2      0  1  0   versicolor     yes
0.4   0.4   0.55     1  0  0   virginica      no

Table 6.1: Illustrates the difference between energy function and classification error.
Each table shows network outputs for three different inputs from the iris data set, as
well as the correct classifications.

the given value of µ:

    y_i^(µ) = 1   if O_i^(µ) is the largest of all outputs i = 1, . . . , M ,
    y_i^(µ) = 0   otherwise.      (6.23a)

Then the classification error is defined as

    C = (1/(2p)) Σ_{µ=1}^p Σ_{i=1}^M | t_i^(µ) − y_i^(µ) | .      (6.23b)

In all cases, the classification accuracy is defined as (1 − C) × 100%; it is usually quoted
in percent.
While the classification error is designed to show the fraction of inputs that are
classified wrongly, it contains less information than the energy function (which is in
fact a mean-squared error of the outputs). This is illustrated by the two problems in
Table 6.1. Both problems have the same classification error, but the energy function
is much lower for the second problem, reflecting the better quality of its solution. Yet
another measure of classification success is the cross-entropy error. It is discussed
in Chapter 7.
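
A short numpy sketch evaluating Equations (6.23) and the quadratic energy function for the two problems of Table 6.1 makes the difference explicit:

import numpy as np

# network outputs and targets for the two problems of Table 6.1 (three inputs, M = 3 classes)
outputs_1 = np.array([[0.4, 0.4, 0.55], [0.4, 0.55, 0.4], [0.1, 0.2, 0.8]])
outputs_2 = np.array([[0.1, 0.2, 0.8],  [0.1, 0.8, 0.2],  [0.4, 0.4, 0.55]])
targets   = np.array([[0, 0, 1],        [0, 1, 0],        [1, 0, 0]])

def classification_error(O, t):
    """Equation (6.23): winner-takes-all outputs y, then C = (1/2p) sum |t - y|."""
    y = (O == O.max(axis=1, keepdims=True)).astype(float)    # Equation (6.23a)
    return np.abs(t - y).sum() / (2 * len(O))                # Equation (6.23b)

def energy(O, t):
    """Quadratic energy function, H = (1/2) sum (t - O)^2."""
    return 0.5 * ((t - O) ** 2).sum()

for O in (outputs_1, outputs_2):
    print(classification_error(O, targets), energy(O, targets))
# both problems give the same classification error (1/3), but the second has lower energy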

6.3.3 Weight decay and regularisation


Figure 5.14 shows a solution of the classification problem defined by the Boolean
XOR function. All weights are of unit modulus, and also the thresholds are of order
unity. If one uses the backpropagation algorithm to find a solution to this prob-
lem, one may find that the weights continue to grow during training. This can be
problematic, if it means that the local fields become too large, so that the algo-
rithm reaches the plateau of the activation function. Then training slows down, as
explained in Section 6.1.
One solution to this problem is to reduce the weights by some factor during
training, either at each iteration or in regular intervals, w_ij → (1 − ε) w_ij for 0 < ε < 1, or equivalently

    δw_mn = −ε w_mn .      (6.24)
This is achieved by adding a term to the energy function
    H = H_0 + (γ/2) Σ_{ij} w_ij^2   with   H_0 = (1/2) Σ_{iµ} ( t_i^(µ) − O_i^(µ) )^2 .      (6.25)

Gradient descent on H gives:


    δw_mn = −η ∂H_0/∂w_mn − ε w_mn      (6.26)

with ε = ηγ. One can add a corresponding term for the thresholds, but this is usually
not necessary. The scheme summarised here is sometimes called L 2 -regularisation.
An alternative scheme is L 1 -regularisation. It amounts to
    H = (1/2) Σ_{iµ} ( t_i^(µ) − O_i^(µ) )^2 + (γ/2) Σ_{ij} | w_ij | .      (6.27)

This gives the update rule


    δw_mn = −η ∂H_0/∂w_mn − ε sgn(w_mn) .      (6.28)
The discontinuity of the update rule at wmn = 0 is cured by defining sgn(0) = 0.
Comparing Equations (6.26) and (6.28) we see that L1-regularisation reduces small
weights much more than L2-regularisation. We therefore expect the L1-scheme to
set more weights exactly to zero than the L2-scheme.
These two weight-decay schemes are referred to as regularisation schemes be-
cause they tend to help against overfitting. How does this work? Weight decay
adds a constraint to the problem of minimising the energy function. The result is a
compromise, depending upon the value of γ, between a small value of H and small
weight values. The idea is that a network with smaller weights is more robust to
the effect of noise. When the weights are small, then small changes in some of the
patterns do not give a substantially different training result. When the network
has large weights, by contrast, it may happen that small changes in the input give
significant differences in the training result that are difficult to generalise (Figure
6.9). Other regularisation schemes are discussed in Chapter 7.
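
A sketch of a single weight-decay update in Python; grad_H0 stands for the gradient of H_0 obtained by backpropagation (an assumption, since the full network is not spelled out here):

import numpy as np

def weight_decay_update(w, grad_H0, eta=0.01, gamma=0.001, scheme="L2"):
    """One gradient-descent step with weight decay, Equations (6.26) and (6.28).

    w       : array of weights
    grad_H0 : gradient of the unregularised energy H_0 w.r.t. w (from backpropagation)
    eta     : learning rate; gamma : weight-decay parameter, epsilon = eta * gamma
    """
    epsilon = eta * gamma
    if scheme == "L2":
        dw = -eta * grad_H0 - epsilon * w              # Equation (6.26)
    else:                                              # L1 scheme
        dw = -eta * grad_H0 - epsilon * np.sign(w)     # Equation (6.28), with sgn(0) = 0
    return w + dw

# example: weights decay towards zero even when the gradient of H_0 vanishes
w = np.array([0.5, -0.2, 1.5])
for _ in range(100):
    w = weight_decay_update(w, grad_H0=np.zeros_like(w), scheme="L2")
print(w)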

6.3.4 Adaptation of the learning rate


It is tempting to choose larger learning rates, because they enable the network to
escape more efficiently from shallow minima. But clearly this causes problems
when the energy function varies rapidly. As a result the training may fail because the
training dynamics starts to oscillate. This can be avoided by changing the learning
rule somewhat
    δw_ij^(t) = −η ∂H/∂w_ij^(t) + α δw_ij^(t−1) .      (6.29)
Here t = 0, 1, 2, . . . , n labels the iteration number, and δw_ij^(0) = −η ∂H/∂w_ij^(0). You
see that the increment at step t depends not only on the instantaneous gradient,
but also on the weight increment δw_ij^(t−1) of the previous iteration. We say that the
dynamics becomes inertial: the weights gain momentum. The parameter α ≥ 0 is
called the momentum constant. It determines how strong the inertial effect is. Obviously
α = 0 corresponds to the usual backpropagation rule. When α is positive, how
does inertia change the learning rule? Iterating Equation (6.29) gives
    δw_ij^(n) = −η Σ_{t=0}^n α^(n−t) ∂H/∂w_ij^(t) .      (6.30)

This Equation shows that δw_ij^(n) is a weighted average of the gradients encountered
during training. Now assume that the training is stuck in a shallow minimum. Then
the gradient ∂H/∂w_ij^(t) remains roughly constant through many time steps, so that
we can write
    δw_ij^(n) ≈ −η ∂H/∂w_ij^(n) Σ_{t=0}^n α^(n−t) = −η (α^(n+1) − 1)/(α − 1) ∂H/∂w_ij^(n) .      (6.31)

In this situation, convergence is accelerated when α is close to unity. We also see


that it is necessary that α < 1 for the sum in Equation (6.31) to converge. The other
limit to consider is that the gradient changes rapidly from iteration to iteration. How
is the learning rule modified in this case? To make the point let us assume that the
gradient remains of the same magnitude, but that its sign oscillates. Then we get
    δw_ij^(n) ≈ −η ∂H/∂w_ij^(n) Σ_{t=0}^n (−1)^t α^(n−t) = −η (α^(n+1) + (−1)^n)/(α + 1) ∂H/∂w_ij^(n) ,      (6.32)

so that the increments are much smaller. This shows that introducing inertia can
substantially accelerate convergence without sacrificing accuracy. The disadvantage
is, of course, that there is yet another parameter to choose, the momentum constant
α.
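
A minimal sketch of the momentum rule (6.29) for a one-dimensional toy energy function H(w) = w^2; the function and parameter values are assumptions chosen only to illustrate the update:

def grad_H(w):
    return 2.0 * w                 # gradient of the toy energy H(w) = w^2

eta, alpha = 0.01, 0.9             # learning rate and momentum constant
w, dw_prev = 5.0, 0.0
for t in range(500):
    dw = -eta * grad_H(w) + alpha * dw_prev    # Equation (6.29)
    w, dw_prev = w + dw, dw
print(w)                           # converges towards the minimum at w = 0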
Figure 6.11: Left: Momentum method (6.29). Right: Nesterov’s accelerated-gradient
method (6.33).

Nesterov’s accelerated gradient method [37] is another way of implementing


momentum. The algorithm was developed for smooth optimisation problems, but
it is often used in stochastic gradient descent when training deep neural networks.
The algorithm can be summarised as follows [38]:

    δw_ij^(t) = −η ∂H/∂w_ij |_{w_ij = w_ij^(t) + α_{t−1} δw_ij^(t−1)} + α_{t−1} δw_ij^(t−1) .      (6.33)

A suitable sequence of coefficients αt is defined by recursion [38]. The coefficients


αt approach unity from below as t increases.
Nesterov’s accelerated-gradient method is more efficient than the simple momen-
tum method, because the accelerated-gradient method evaluates the gradient at an
extrapolated point, not at the initial point. Figure 6.11 illustrates why this is better. In
practice, Nesterov’s method often works better than the simple momentum scheme.
Since it is not much more difficult to implement, it is now used quite frequently.
There are other ways of adapting the learning rate during training, see Section 4.10
in the book by Haykin [2].
Finally, the learning rate need not be the same for all neurons. It is often the
case that the weights feeding into neurons in the layers close to the output layer
experience larger energy gradients than the weights close to the input layer [2]. To
make all neurons learn at approximately the same rate, one can reduce the learning
rate for the layers that are further to the right in the network layout.
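
For comparison, a sketch of Nesterov's rule (6.33) on the same kind of toy problem as above. The simple schedule α_t = t/(t + 3) is an assumed stand-in for the recursion of Ref. [38]; it approaches unity from below, as required:

def grad_H(w):
    return 2.0 * w                     # gradient of the toy energy H(w) = w^2

eta, w, dw_prev = 0.01, 5.0, 0.0
for t in range(500):
    alpha = t / (t + 3.0)              # coefficients approaching unity from below (assumed form)
    lookahead = w + alpha * dw_prev    # gradient is evaluated at the extrapolated point
    dw = -eta * grad_H(lookahead) + alpha * dw_prev    # Equation (6.33)
    w, dw_prev = w + dw, dw
print(w)                               # converges towards the minimum at w = 0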

6.3.5 Pruning
The term pruning refers to removing unnecessary weights or neurons from the
network, to improve its efficiency. The simplest approach is weight elimination by
weight decay [39]. Weights that tend to remain very close to zero during training are
removed by setting them to zero and not updating them anymore. Neurons that
have zero weights for all incoming connections are effectively removed (pruned). It
has been shown that this method can help the network to generalise [40].

An efficient pruning algorithm is based on the idea to remove the weights that
have least effect upon the energy function (that minimise the increase in H upon
removing them) [41]. Assume that the network was trained, so that the network
reached a (local) minimum of the energy function. One starts by expanding the
energy function around this minimum. To write this expansion down in a convenient
form, one groups all weights in the network in a long weight vector w (as opposed to
grouping them in a weight matrix W as we did in Chapter 2). A particular component
wq is extracted from the vector w as follows:

    w_q = ê_q · w   where   ê_q = (0, . . . , 0, 1, 0, . . . , 0)^T      (6.34)

with the entry 1 in position q.

Here êq is the Cartesian unit vector in the direction q , with components eq j = δq j .
The expansion of H reads:

    H = H_min + (1/2) δw · M δw + higher orders in δw .      (6.35)

The term linear in δw vanishes because we expand around a local minimum. The
matrix M is the Hessian, the matrix of second derivatives of the energy function.
Eliminating the weight δwq amounts to setting

δwq + wq = 0 . (6.36)

The idea is to minimise the damage to the network by eliminating the weight that
has least effect upon H

    min_q min_{δw} { (1/2) δw · M δw }   subject to the constraint   ê_q · δw + w_q = 0 .      (6.37)

I omitted the constant term Hmin because it does not matter. Now we first minimise
H w.r.t. δw , for a given value of q . The linear constraint is incorporated using a
Lagrange multiplier as in Chapter 4, to form the Lagrangian

    L = (1/2) δw · M δw + λ ( ê_q · δw + w_q ) .      (6.38)

A necessary condition for a minimum (δw , λ) satisfying the constraint is


    ∂L/∂δw = M δw + λ ê_q = 0   and   ∂L/∂λ = ê_q · δw + w_q = 0 .      (6.39)
We denote the solution of these Equations by δw ∗ and λ∗ . It is obtained by solving
the linear system

    [ M      ê_q ] [ δw ]   [   0  ]
    [ ê_q^T   0  ] [ λ  ] = [ −w_q ] .      (6.40)

If M is invertible we can use a standard formula for the inversion of 2 × 2 block


matrices to find the inverse of the matrix in Equation (6.40):
    [ M      ê_q ]^−1   [ M^−1 − M^−1 ê_q (ê_q^T M^−1 ê_q)^−1 ê_q^T M^−1     M^−1 ê_q (ê_q^T M^−1 ê_q)^−1 ]
    [ ê_q^T   0  ]    = [ (ê_q^T M^−1 ê_q)^−1 ê_q^T M^−1                    −(ê_q^T M^−1 ê_q)^−1          ] .      (6.41)
From Equations (6.40) and (6.41) we find that
    δw* = −M^−1 ê_q w_q (ê_q^T M^−1 ê_q)^−1   and   λ* = w_q (ê_q^T M^−1 ê_q)^−1 .      (6.42)
The second step is to find the optimal q by minimising
    L(δw*, λ*; q) = (1/2) w_q^2 (ê_q^T M^−1 ê_q)^−1 .      (6.43)
The Hessian of the energy function is expensive to evaluate, and so is the inverse
of this matrix. Usually one resorts to an approximate expression for M−1 [41]. One
possibility is to set the off-diagonal elements of M to zero [42].

Algorithm 3 pruning the least important weight

1: train the network to reach H_min ;
2: compute M^−1 approximately;
3: determine q* as the value of q for which L(δw*, λ*; q) is minimal;
4: if L(δw*, λ*; q*) ≪ H_min then
5:    update all weights using δw = −w_q* M^−1 ê_q* (ê_q*^T M^−1 ê_q*)^−1 ;
6:    goto 2;
7: else
8:    end;
9: end if

The algorithm is summarised in Algorithm 3. It turns out that this algorithm


succeeds better than weight decay in reducing the unnecessary weights in the
network [41]. Weight decay eliminates the smallest weights. But small weights are
often needed to achieve a small training error. Another point is that one obtains
the weight elimination of the smallest weights by substituting M = I in the algorithm
described above [Equation (6.43)]. But this is often not a good approximation.
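
A sketch of a single pruning step in the diagonal approximation mentioned above. Here hessian_diag is assumed to contain the diagonal elements M_qq of the Hessian, estimated elsewhere; with a diagonal M one has (ê_q^T M^−1 ê_q)^−1 = M_qq, so Equation (6.43) reduces to the saliency (1/2) w_q^2 M_qq, and Equation (6.42) reduces to simply deleting the selected weight:

import numpy as np

def prune_least_important_weight(w, hessian_diag):
    """One step of Algorithm 3 with a diagonal approximation of the Hessian M."""
    saliency = 0.5 * w ** 2 * hessian_diag        # Equation (6.43) in the diagonal approximation
    q = int(np.argmin(saliency))                  # least important weight
    w_new = w.copy()
    w_new[q] = 0.0                                # delta w* just removes w_q in this approximation
    return w_new, q, saliency[q]

# illustrative example: the smallest weight is not necessarily the one that is pruned
w = np.array([0.05, 0.4, -0.3])
hessian_diag = np.array([100.0, 0.1, 0.2])
print(prune_least_important_weight(w, hessian_diag))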

6.4 Summary
Backpropagation is an efficient algorithm for stochastic gradient-descent on the
energy function in weight space, because it refers only to quantities that are local to
the weight to be updated. It is sometimes argued that biological neural networks
are local in this way [2].

6.5 Further reading

The backpropagation algorithm is explained in Section 6.1 of Hertz, Krogh and
Palmer [1], and in Chapter 4 of the book by Haykin [2]. The paper [36] by LeCun et
al. predates deep learning, but it is still a very nice collection of recipes for making
backpropagation more efficient. Historical note: one of the first papers on error
backpropagation is the one by Rumelhart et al. [43]. Have a look! The paper gives an
excellent explanation and summary of the backpropagation algorithm. The authors
also describe results of different numerical experiments, one of them introduces
convolutional nets (Section 7.3) to learn to tell the difference between the letters T
and C (Figure 6.12).

Figure 6.12: Patterns detected by the convolutional net of Ref. [43]. After Fig. 13 in
Ref. [43].

6.6 Exercises

Covariance matrix. Show that the covariance matrix C defined in Equation (6.19)
is positive semidefinite.

Nesterov’s accelerated-gradient method. Show how the version (6.33) of Nesterov’s


algorithm corresponds to the original formulation [37]. This point is discussed in
[38].

Pruning. Show that the expression (6.42) for the weight increment δw ∗ minimises
the Lagrangian (6.38) subject to the constraint.

Skipping layers. Show how the backpropagation algorithm can be generalised for
feed-forward networks that allow for connections from the two nearest layers to the
left, not only from the nearest layer to the left.

6.7 Exam questions


6.7.1 Multilayer perceptron
A classification problem is shown in Figure 6.13. Here, µ = 1, 2, . . . , 11 is the index of
the input pattern x^(µ) = (x_1^(µ), x_2^(µ))^T, and t^(µ) is the corresponding target value.
(a) Can this problem be solved by a simple perceptron with 2 input units, and a
single output unit? Explain. (0.5p).
(b) This problem can be solved by a multilayer perceptron with two input units x_i^(µ)
(i = 1, 2), three hidden units V_j^(µ) = θ_H( Σ_{k=1}^2 w_jk x_k^(µ) − θ_j ) where j = 1, 2, 3, and one
output unit O^(µ) = θ_H( Σ_{j=1}^3 W_j V_j^(µ) − Θ ). Here
    θ_H(b) = 1 for b > 0 ,   θ_H(b) = 0 for b ≤ 0      (6.44)

is the Heaviside function, and w j k and Wj are weights for the hidden, and the output
layer, respectively. Finally, θ j and Θ are the thresholds assigned to the hidden units,
and to the output unit. One way of solving the classification problem is illustrated in
Fig. 6.13 where the three lines (solid, dashed, and dash-dotted line) are determined
by weights w j k and thresholds θ j assigned to the three hidden units in the hidden
layer. Compute w j k and θ j corresponding to the lines shown in Fig. 6.13. Note
that the point where the dashed and dash-dotted lines intersect has the following
coordinates (0.5, 0.8)T . (0.5p).
(c) For each pattern x^(µ) write its coordinates V_j^(µ) in the transformed (hidden) space.
(0.5p).
(d) Graphically illustrate the problem in this transformed space. Is the problem
linearly separable in the transformed space, or not? If yes, illustrate a possible
solution to the problem in this space. (0.5p)
(e) Compute the weights W_j and the threshold Θ corresponding to
the solution you illustrated in (d). (0.5p).

 µ     x1^(µ)   x2^(µ)   t^(µ)
 1     0.1      0.95     0
 2     0.2      0.85     0
 3     0.2      0.9      0
 4     0.3      0.75     1
 5     0.4      0.65     1
 6     0.4      0.75     1
 7     0.6      0.45     0
 8     0.8      0.25     0
 9     0.1      0.65     1
10     0.2      0.75     1
11     0.7      0.2      1

[Right panel of Figure 6.13: the patterns plotted in the (x1, x2)-plane, with both axes ranging from 0 to 1, together with the three decision-boundary lines.]
Figure 6.13: Left: Inputs and target values for a classification problem. The target
output for each pattern µ is either t (µ) = 0 (white circles) or t (µ) = 1 (black circles). Right:
the three lines illustrate a solution to the problem by a multilayer perceptron. Question
6.7.1.

6.7.2 Backpropagation

Explain how to train a multi-layer perceptron by back-propagation. Draw a flow-


chart of the algorithm. In your discussion, refer to and explain the following terms:
forward propagation, backward propagation, hidden layer, energy function, gradient
descent, local energy minima, batch mode, training set, validation set, classification
error, and overfitting. Your answer must not be longer than one A4 page. (1p).

6.7.3 Stochastic gradient descent

To train a multi-layer perceptron using stochastic gradient descent one needs up-
date formulae for the weights and thresholds in the network. Derive these update
formulae for sequential training using backpropagation for the network shown in
Fig. 6.14. The weights for the first and second hidden layer, and for the output layer
(1) (2)
are denoted by w j k , wm j , and W1m . The corresponding thresholds are denoted by
(1)
θ j , θm(2) , and Θ1 , and the activation function by g (· · · ). The target value for input
(µ)
pattern x (µ) is t 1 , and the pattern index µ ranges from 1 to p . The energy function
Pp (µ) (µ)
is H = 12 µ=1 (t 1 − O1 )2 . (2p).

Figure 6.14: Multi-layer perceptron with three input terminals, two hidden layers, and
one output unit. Question 6.7.3.

6.7.4 Backpropagation
A multilayer perceptron has L − 1 hidden layers (ℓ = 1, . . . , L − 1) and one output layer
(ℓ = L). The state of a neuron V_j^(ℓ,µ) in layer ℓ is given by

    V_j^(ℓ,µ) = g( b_j^(ℓ,µ) ) ,   with   b_j^(ℓ,µ) = −θ_j^(ℓ) + Σ_k w_jk^(ℓ) V_k^(ℓ−1,µ) ,      (6.45)

where V^(0,µ) equals the inputs x^(µ) (V_k^(0,µ) = x_k^(µ)), w_jk^(ℓ) are weights, θ_j^(ℓ) are thresholds,
and g(b) is the activation function. The output is O_j^(µ) = g( b_j^(L,µ) ).
(a) Draw this network. Indicate where the elements x_k^(µ), b_j^(ℓ,µ), V_j^(ℓ,µ), O_i^(µ), w_jk^(ℓ) and
θ_j^(ℓ) for ℓ = 1, . . . , L belong. (0.5 p)
(b) Derive the recursive rule for how the derivatives ∂V_i^(ℓ,µ)/∂w_mn^(p) depend on the
derivatives ∂V_j^(ℓ−1,µ)/∂w_mn^(p) for p < ℓ. (1 p).
(c) Evaluate the derivative ∂V_j^(ℓ,µ)/∂w_mn^(p) for p = ℓ. (0.5 p).
(d) Use gradient descent on the energy function

    H = (1/2) Σ_{iµ} ( t_i^(µ) − O_i^(µ) )^2 ,      (6.46)

where t_i^(µ) is the target value for input pattern x^(µ). Find the batch-mode update rule
for the weight w_mn^(L−2) with learning rate η. No momentum regularisation. (1 p).

6.7.5 Backpropagation
To train a multi-layer perceptron with stochastic gradient descent one needs update
formulae for the weights and thresholds in the network. Derive these update formu-
lae for the network shown in Fig. 6.15 using the gradient-descent algorithm with
constant learning rate, no momentum and no weight decay. The weights for the
hidden layer and for the output layer are denoted by w j k and W1 j , respectively. The
corresponding thresholds are denoted by θ_j and Θ_1, and the activation function by
g(· · ·). The target value for input pattern x^(µ) is t_1^(µ), and the network output is O_1^(µ).
The energy function is H = (1/2) Σ_{µ=1}^p ( t_1^(µ) − O_1^(µ) )^2. (2p)

Figure 6.15: Multi-layer perceptron. Question 6.7.5.

6.7.6 True/false questions


Indicate whether the following statements are true or false. 13 correct answers give
1.5 points, 12 correct answers give 1.0 point, 10-11 correct answers give 0.5
points, and 0-9 correct answers give zero points. (1.5 p).

1. Backpropagation is a form of unsupervised learning.


2. For backpropagation to work the weights must be symmetric.
3. For backpropagation it is necessary to know the target outputs of input pat-
terns in the training set.
4. Early stopping in backpropagation is a way to avoid overfitting.
5. Early stopping in backpropagation helps to avoid getting stuck in local minima
of energy.
6. The randomness in stochastic gradient descent helps to avoid being stuck in
local minima of energy.
7. The randomness in stochastic gradient descent prevents overfitting.
8. There are 2^(2^n) Boolean functions with n inputs.

9. None of the Boolean functions with 5 inputs are linearly separable.



10. There are 24 Boolean functions with three input units (and output 0/1) where
exactly three input patterns map to 0.
11. When solving a t = ±1-problem in two dimensions using a decision boundary,
the resulting output problem may sometimes not be linearly separable.
12. The training time for stochastic gradient descent may depend on how the
weights are initialised.
13. The number of neurons in the input layer of a perceptron is equal to the
number of input patterns.

Figure 7.1: Images of iris flowers. From left to right: iris setosa (copyright T. Monto),
iris versicolor (copyright R. A. Nonemacher), and iris virginica (copyright A. Wester-
moreland). All images are copyrighted under the creative commons license.

7 Deep learning
7.1 How many hidden layers?
In Chapter 5 we saw why it is sometimes necessary to have a hidden layer: this
makes it possible to solve problems that are not linearly separable. Under which
circumstances is one hidden layer sufficient? Are there problems that require more
than one hidden layer? Even if not necessary, may additional hidden layers improve
the performance of the network?
The second question is more difficult to answer than the first, so we start with
the first question. To understand how many hidden layers are necessary it is useful
to view the classification problem as an approximation problem [44]. Consider
the classification problem (x (µ) , t (µ) ) for µ = 1, . . . , p . This problem defines a target
function t (x ). Training a network to solve this task corresponds to approximating the
target function t (x ) by the output function O (x ) of the network, from N -dimensional
input space to one-dimensional output space.
How many hidden layers are necessary or sufficient to approximate a given set of
Figure 7.2: Multi-layer perceptron for function approximation.


Figure 7.3: The neural-network output O (x ) approximates the target function t (x ).

functions to a certain accuracy, by choosing weights and thresholds? The answer


depends on the nature of the set of functions. Are they real-valued or do they assume
only discrete values? If the functions are real-valued, are they continuous or not?
We start by considering real-valued inputs and output. Consider the network
drawn in Figure 7.2. The neurons in the hidden layers have sigmoid activation
functions σ(b ) = (1 + e−b )−1 . The output is continuous, with activation function
g (b ) = b . With two hidden layers the task is to approximate the function t (x ) by
 
    O(x) = Σ_j W_j g( Σ_l w_jl^(2) g( Σ_k w_lk^(1) x_k − θ_l^(1) ) − θ_j^(2) ) − Θ .      (7.1)

In the simplest case the inputs are one-dimensional (Figure 7.3). The training
set consists of pairs (x (µ) , t (µ) ). The task is then to approximate the corresponding
target function t (x ) by the network output O (x ):

O (x ) ≈ t (x ) . (7.2)

We approximate the real-valued function t (x ) by linear combinations of the basis


functions b(x) shown in Figure 7.4(a). Any reasonable real-valued function t(x) can
be approximated by a sum of such basis functions, each suitably shifted and scaled.
Furthermore, these basis functions can be expressed as differences of activation
functions [Figure 7.4(b)]

b (x ) = σ(w1 x − θ1 ) − σ(w2 x − θ2 ) . (7.3)

Comparison with Equation (7.1) shows that one hidden layer is sufficient to con-
struct the function O (x ) in this way.
Now consider two-dimensional inputs. In this case, suitable basis functions are

σ [σ(x1 ) − σ(x1 − θ1 ) + σ(x2 ) − σ(x2 − θ2 ) − θ ] . (7.4)


Figure 7.4: (a) basis function. (b) linear combination of two sigmoid functions for a
large value of a 1 = a 2 .

So for two input dimensions two hidden layers are sufficient, with four neurons in
the first layer, and one neuron per basis function in the second hidden layer. In
general, for N inputs, two hidden layers are sufficient, with 2N units in the first
layer, and one unit per basis function in the second layer.
Yet it is not always necessary to use two hidden layers for real-valued functions. For
continuous functions, one hidden layer is sufficient. This is ensured by the universal
approximation theorem [2]. This theorem says that any continuous function can be
approximated to arbitrary accuracy by a network with a single hidden layer, with
sufficiently many neurons in the hidden layer.
In Chapter 5 we considered discrete Boolean functions. It turns out that any
Boolean function with N-dimensional inputs can be represented by a network with
one hidden layer, using 2^N neurons in the hidden layer:

    x_k ∈ {+1, −1},  k = 1, . . . , N          inputs
    V_j ,  j = 0, . . . , 2^N − 1               hidden neurons
    g(b) = tanh(b)                             activation function of the hidden neurons      (7.5)
    g(b) = sgn(b)                              activation function of the output unit

A difference compared with the Boolean networks in Section 5.4 is that here the
inputs take the values ±1. The reason is that this simplifies the proof.
This proof goes by construction [1]. For each hidden neuron one assigns the
weights as follows
    w_jk = δ    if the k-th digit of the binary representation of j is 1 ,
    w_jk = −δ   otherwise,      (7.6)

with δ > 1 (see below). The thresholds θ j of all hidden neurons are the same, equal
to N (δ − 1). The idea is that each input pattern turns on exactly one neuron in the
hidden layer (called the winning unit). This requires that δ is large enough, as we
shall see. The weights feeding into the output neuron are assigned as follows. If the
[Figure 7.5, panel (a): value table of the XOR function, (x1, x2, t) = (−1, −1, −1), (+1, −1, +1), (−1, +1, +1), (+1, +1, −1). Panel (c): binary representations j = 0 ↔ 00, j = 1 ↔ 01, j = 2 ↔ 10, j = 3 ↔ 11.]
Figure 7.5: Boolean XOR function. (a) value table, (b) network layout. For the weights
feeding into the hidden layer, dashed lines correspond to w_jk = −δ, solid lines to
w_jk = δ. For the weights feeding into the output neuron, dashed lines correspond to
W_1j = −γ, and solid lines to W_1j = γ. (c) construction principle for the weights of the
hidden layer.

output for the pattern represented by neuron V_j is +1, let W_1j = γ > 0, otherwise
W_1j = −γ. The threshold is Θ = Σ_j W_1j .
To show how this construction works, consider the Boolean XOR function as an
example. First, for each pattern only the corresponding winning neuron gives a
positive signal. For pattern x (1) = [−1, −1]T , for example, this is the first neuron in
the hidden layer ( j = 0). To see this, compute the local fields for this input pattern:

    b_0^(1) = 2δ − 2(δ − 1) = 2 ,      (7.7)
    b_1^(1) = −2(δ − 1) = 2 − 2δ ,
    b_2^(1) = −2(δ − 1) = 2 − 2δ ,
    b_3^(1) = −2δ − 2(δ − 1) = 2 − 4δ .

If we choose δ > 1 then the first hidden neuron gives a positive output
(V_0 > 0), while the other neurons produce negative outputs, V_j < 0 for j = 1, 2, 3. Now
consider x^(3) = [−1, +1]^T. In this case

    b_0^(3) = −2(δ − 1) = 2 − 2δ ,      (7.8)
    b_1^(3) = −2δ − 2(δ − 1) = 2 − 4δ ,
    b_2^(3) = 2δ − 2(δ − 1) = 2 ,
    b_3^(3) = −2(δ − 1) = 2 − 2δ .

So in this case the third hidden neuron gives a positive output, while the others
yield negative outputs. It works in the same way for the other two patterns, x (2) and
x^(4). This confirms that there is a unique winning neuron for each pattern. That
pattern µ = k gives the winning neuron j = k − 1 is of no importance; it is just a
Figure 7.6: Solution of the parity problem for N -dimensional inputs. The network is
built from XOR units (Figure 5.14). Each XOR unit has a hidden layer with two neurons.
Above only the states of the inputs and outputs of the XOR units are shown, not those
of the hidden neurons. In total, the whole network has O (N ) neurons.

consequence of how the patterns are ordered in the value table in Figure 7.5. Second, the
output neuron computes

    O_1 = sgn( −γ V_0 + γ V_1 + γ V_2 − γ V_3 )      (7.9)

with γ > 0, and Θ = Σ_j W_1j = 0. For x^(1) and x^(4) we find the correct result O_1 = −1.
The same is true for x^(2) and x^(3), for which we obtain O_1 = +1. In summary, this example
illustrates how an N-dimensional Boolean function is represented by a network
with one hidden layer of 2^N neurons. The problem is of course that this network
is expensive to train for large N because the number of hidden neurons is very large.
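
The construction (7.5)–(7.9) is easy to verify numerically. The following sketch checks it for the XOR example; the values δ = 2 and γ = 1 are assumptions satisfying the conditions stated above:

import numpy as np

N, delta, gamma = 2, 2.0, 1.0
patterns = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]], dtype=float)   # inputs in {-1, +1}
targets = np.array([-1.0, 1.0, 1.0, -1.0])                               # Boolean XOR function

# hidden-layer weights, Equation (7.6): digit k of the binary representation of j
w = np.array([[delta if (j >> k) & 1 else -delta for k in range(N)]
              for j in range(2 ** N)])
theta = N * (delta - 1.0)                      # thresholds of the hidden neurons

V = np.tanh(patterns @ w.T - theta)            # 2^N hidden neurons
winners = np.argmax(V, axis=1)                 # exactly one neuron per pattern has V_j > 0

W = np.empty(2 ** N)                           # output weights W_1j = +gamma or -gamma,
W[winners] = np.where(targets > 0, gamma, -gamma)   # depending on the target of the winning unit
Theta = W.sum()                                # output threshold
O = np.sign(V @ W - Theta)
print(np.array_equal(O, targets))              # True: the network represents XOR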
There are more efficient layouts if one uses more than one hidden layer. As
an example, consider the parity function for N binary inputs equal to 0 or 1. The
function measures the parity of the input sequence: it gives 1 if there is an odd
number of ones in the input, otherwise 0. A construction similar to the above yields
a network layout with 2^N neurons in the hidden layer. If one instead wires together
the XOR networks shown in Figure 5.14, one can solve the parity problem with O(N)
neurons, as Figure 7.6 demonstrates. When N is a power of two, this network
has 3(N − 1) neurons. To see this, set the number of inputs to N = 2^k. Figure 7.6
shows that the number N_k of neurons satisfies the recursion N_{k+1} = 2N_k + 3 with
N_1 = 3. The solution of this recursion is N_k = 3(2^k − 1).
This example also illustrates a second reason why it may be useful to have more
than one hidden layer. To design a network for a certain task it is often convenient

Figure 7.7: Fully connected deep network with five hidden layers. How deep is deep?
Usually one says: deep networks have two or more hidden layers.

to build the network from building blocks. One wires them together, often in a
hierarchical fashion. In Figure 7.6 there is only one building block, the XOR network
from Figure 5.14.
Another example is given by convolutional networks for image analysis (Section 7.3).
Here the fundamental building blocks are feature maps; they recognise different
geometrical features in the image, such as edges or corners.

7.2 Training deep networks


It was believed for a long time that networks with many hidden layers (deep networks)
are so difficult to train that it is not practical to use many hidden layers. But the
past few years have witnessed a paradigm shift regarding this question. It has been
demonstrated that fully connected deep networks can in fact be trained efficiently
with backpropagation (Section 6.2), and that these networks can solve complex
classification tasks with very small error rates.
Browsing through the recent literature on the subject one may get the impression
that training deep networks such as the one shown in Figure 7.7 is more an art than
a science: some papers read like manuals, or like collections of engineering recipes. But
there are several fundamental facts about how deep networks learn. I summarise
them in this Section. A comprehensive summary of these (and more) principles is
given in the book Deep learning [4].

7.2.1 Unstable gradients


In Chapter 6 we discussed that the learning slows down when the derivatives g′(b)
of the activation functions become small. When the network has many hidden layers, this problem becomes
worse. One finds that the neurons in hidden layers close to the input layer (small
values of ℓ in Figure 7.7) change only by small amounts, and the more hidden
layers the network has, the smaller the changes. One way of quantifying this effect is to measure the gradient
Figure 7.8: Shows how the norm of the gradient of H w.r.t. the thresholds θ_j^(ℓ) in layer ℓ depends on
the number of training epochs. Gradient in hidden layer ℓ = 1 (◦), ℓ = 2 (□), ℓ = 3 (◇),
ℓ = 4 (△). The data was obtained by training a network with four fully connected hidden
layers with N = 30 neurons each on the MNIST data set (Section 7.4). The output layer
has 10 neurons. Sigmoid activation functions (6.15a), quadratic energy function (6.4),
learning rate η = 3 × 10^−3. The data was obtained by averaging over an ensemble of 100
independent runs (data by Johan Fries).

of the energy function with respect to the thresholds in layer `

 ∂ 
(`)
∂ θ1
(`)  . 
∇θ H =  ..  H . (7.10)

(`)
∂ θN

Equation (6.12) shows that the errors in layer ℓ, and thus the weight updates, are
proportional to −∇_θ^(ℓ) H. Figure 7.8 demonstrates that the norm of this gradient
tends to be very small for the first 20 training epochs. In this regime the gradient
(and thus the speed of training) vanishes exponentially as ℓ → 1. This slowing down
is the result of the diminished effect of the neurons in layer ℓ upon the output, when
ℓ is small. This is the vanishing-gradient problem.
To explain this phenomenon, consider the very simple case shown in Figure 7.9:
a deep network with only one neuron per layer. To measure the effect of a given
neuron on the output, we calculate how the output of the network changes when
changing the state of a neuron in a particular layer. The output V (L ) is given by the
nested activation functions

    V^(L) = g( w^(L) g( w^(L−1) · · · g( w^(2) g( w^(1) x − θ^(1) ) − θ^(2) ) · · · − θ^(L−1) ) − θ^(L) ) .      (7.11)

Figure 7.9: ‘Network’ illustrating the vanishing-gradient problem, with neurons V (`) ,
weights w (`) , and thresholds θ (`) .

The effects of the neurons in Figure 7.9 are computed using the chain rule:

    ∂V^(L)/∂V^(L−1) = g′(b^(L)) w^(L) ,
    ∂V^(L)/∂V^(L−2) = [∂V^(L)/∂V^(L−1)] [∂V^(L−1)/∂V^(L−2)] = g′(b^(L)) w^(L) g′(b^(L−1)) w^(L−1) ,
    . . .      (7.12)

where b^(k) = w^(k) V^(k−1) − θ^(k) is the local field for neuron k. This yields the following
expression for J_{ℓ,L} ≡ ∂V^(L)/∂V^(ℓ):

    J_{ℓ,L} = ∂V^(L)/∂V^(ℓ) = Π_{k=L}^{ℓ+1} [ g′(b^(k)) w^(k) ] .      (7.13)

Note that the error δ^(ℓ) of the hidden layer ℓ is determined by a closely related product.
Algorithm 2 shows that the errors are given recursively by δ^(ℓ) = δ^(ℓ+1) w^(ℓ+1) g′(b^(ℓ)).
Using δ^(L) = [t − V^(L)(x)] g′(b^(L)) we have

    δ^(ℓ) = [t − V^(L)(x)] g′(b^(L)) Π_{k=L}^{ℓ+1} [ w^(k) g′(b^(k−1)) ] .      (7.14)

Equations (7.13) and (7.14) are consistent since δ^(ℓ) = [t − V^(L)] ∂V^(L)/∂(−θ^(ℓ)) =
[t − V^(L)] ∂V^(L)/∂V^(ℓ) g′(b^(ℓ)).
Coming back to the product (7.13), consider first the early stages of training. If one
initialises the weights as described in Chapter 6 to Gaussian random variables with
mean zero and variance σ_w^2, and the thresholds to zero, then the factors w^(k) g′(b^(k−1))
are usually smaller than unity (for the activation functions (6.15), the maximum
of g′(b) is 1/4 and 1, respectively). The product of these factors vanishes quickly
as ℓ decreases. So the slowing down is a consequence of multiplying many small
numbers to get something really small (vanishing-gradient problem).
What happens at later times? One might argue that the weights may grow during
training, as a function of `. If that happened, the problem might become worse
TRAINING DEEP NETWORKS 115

still, because g′(b) tends to zero exponentially as |b| grows. This indicates that the
first layers may continue to learn slowly. Figure 7.8 shows that the effect persists
for about 20 epochs. But then even the first layers begin to learn faster. This does
not contradict the above discussion, because it assumed random weights. As the
network learns, the weights are no longer independent random numbers. But there
is to date no mathematical theory describing how this transition occurs.
More fundamentally, Equation (7.13) demonstrates that different layers of the
network learn at different speeds, because their neurons typically have different
effects on the output. This is due to the fact that the product in Equation (7.13) is
unlikely to remain of order unity when L is large. To see this, assume that the weights
are independently distributed random numbers. Taking the logarithm and using
the central-limit theorem shows that the distribution of the product is log normal.
This means that the learning speed can be substantially different in different layers.
This is also referred to as the problem of unstable gradients. The example shown in
Figure 7.9 illustrates the origin of this problem: it is due to the fact that multiplying
many small numbers together produces a result that is very small. Multiplying many
numbers that are larger than unity, by contrast, yields a large result.
In networks like the one shown in Figure 7.7 the principle is the same, but instead
of multiplying numbers one multiplies matrices. The product (7.13) of random
numbers becomes a product of random matrices. Assume that all layers ℓ = 1, . . . , L
have N neurons. We denote the matrix with elements J_ij^(ℓ,L) = ∂V_i^(L)/∂V_j^(ℓ) by J_{ℓ,L}.
Using the chain rule we find:

    ∂V_i^(L)/∂V_j^(ℓ) = Σ_{l m ··· n} ∂V_i^(L)/∂V_l^(L−1) · ∂V_l^(L−1)/∂V_m^(L−2) · · · ∂V_n^(ℓ+1)/∂V_j^(ℓ) .      (7.15)
Using the update rule
    V_i^(k) = g( Σ_j w_ij^(k) V_j^(k−1) − θ_i^(k) )      (7.16)

we can evaluate each factor:

    ∂V_p^(k)/∂V_l^(k−1) = g′(b_p^(k)) w_pl^(k) .      (7.17)
In summary, this yields the following expression for J`,L :
J`,L = D(L ) W(L ) D(L −1) W(L −1) · · · D(`+1) W(`+1) , (7.18)
where W(k ) is the matrix of weights feeding into layer k , and
    D^(k) = diag( g′(b_1^(k)), . . . , g′(b_N^(k)) ) .      (7.19)
Figure 7.10: (a) Firing rate of a leaky integrate-and-fire neuron as a function of the
electrical current I through the cell membrane, Equation (7.22) for τ = 25 and Uc /R = 2
(see text). (b) Rectified linear unit, g (b ) = max{0, b }.

This expression is analogous to Equation (7.13).


The eigenvalues of the matrix J0,k describe how small changes δV (0) to the inputs
V (0) (or small differences between the inputs) grow as they propagate through the
layers. If the maximal eigenvalue is larger than unity, then |δV (0) | grows exponen-
tially as a function of layer index k . This is quantified by the maximal Lyapunov
exponent [45]
    λ = lim_{k→∞} (1/(2k)) 〈 log tr( J_{0,k}^T J_{0,k} ) 〉      (7.20)

where the average 〈· · ·〉 is over realisations of weights and thresholds. The matrix J_{0,k}^T J_{0,k}
is called the right Cauchy-Green matrix, and tr denotes the trace of this matrix, the
sum of its diagonal elements. The right Cauchy-Green matrix is symmetric, and it is
positive definite. The eigenvectors of J_{0,k}^T J_{0,k} are called forward Lyapunov vectors.
They describe how small corrections to the inputs rotate, shrink, or stretch as they
propagate through the network.
If we multiply the matrix Jk ,L from the left with the transpose of the vector δ(L )
of output errors, we see how the errors change as they propagate backwards from
layer k to the leftmost hidden layer, how this vector rotates, shrinks, or stretches.
There are a number of different tricks that help to suppress vanishing gradients,
to some extent at least. First, it is usually argued that it helps to use an activation
function that does not saturate at large b , such as the ReLU function introduced
in Section 7.2.2. But the results of Ref. [46] show that the effect is perhaps not as
strong as originally thought. Second, batch normalisation (Section 7.2.6) may help
against the unstable gradient problem. Third, introducing connections that skip
layers (residual network) can also reduce the unstable-gradient problem. This is
discussed in Section 7.6.
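
The shrinking of the product (7.13) is easily demonstrated numerically for the chain of Figure 7.9. In this sketch the weights are Gaussian with unit variance and the thresholds are zero (assumptions matching the discussion above); the typical magnitude of J_{0,L} decays rapidly with the number of layers:

import numpy as np

rng = np.random.default_rng(2)

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def jacobian_product(L, sigma_w=1.0, x=0.5):
    """One realisation of J_{0,L} = prod_k g'(b^(k)) w^(k) for the chain of Figure 7.9."""
    V, J = x, 1.0
    for _ in range(L):
        w = sigma_w * rng.normal()                  # random weight, threshold zero
        b = w * V                                   # local field
        V = sigmoid(b)                              # state of the next neuron
        J *= sigmoid(b) * (1 - sigmoid(b)) * w      # factor g'(b) w of Equation (7.13)
    return J

for L in (2, 5, 10, 20):
    samples = np.array([abs(jacobian_product(L)) for _ in range(1000)])
    print(L, samples.mean())                        # typical magnitude decays rapidly with depth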

7.2.2 Rectified linear units


Glorot et al. [47] suggested to use a different activation function. Their choice
is motivated by the response curve of leaky integrate-and-fire neurons. This is a
model for the relation between the electrical current I through the cell membrane
into the neuron cell, and the membrane potential U . The simplest models for the
dynamics of the membrane potential represent the neuron as a capacitor. In the
leaky integrate-and-fire neuron, leakage is added by a resistor R in parallel with the
capacitor C , so that
    I = U/R + C dU/dt .      (7.21)
For a constant current, the membrane potential grows from zero as a function of
time, U (t ) = R I [1 − exp(−t /τ)], where τ = R C is the time constant of the model.
One says that the neuron produces a spike when the membrane potential exceeds
a critical value, Uc . Immediately after, the membrane potential is set to zero (and
begins to grow again). In this model, the firing rate f(I) is thus given by 1/t_c, where
t_c is the solution of U(t_c) = U_c. It follows that the firing rate exhibits a threshold
behaviour (the system works like a rectifier):

    f(I) = 0                                        for I ≤ U_c/R ,
    f(I) = [ τ log( R I / (R I − U_c) ) ]^−1        for I > U_c/R .      (7.22)

This response curve is illustrated in Figure 7.10 (a). The main message is that there
is a threshold below which the response is strictly zero (this is not the case for the
activation function shown in Figure 1.6). The response function looks qualitatively
like the ReLU function

g (b ) = ReLU(b ) ≡ max{0, b } , (7.23)

shown in panel (b). Neurons with this activation function are called rectified linear
units. The derivative of the ReLU function is discontinuous at b = 0. A common
convention is to set the derivative to zero at b = 0.
What is the point of using rectified linear units? When training a deep network
with ReLU functions it turns out that many of the hidden neurons (as many as
50%) produce outputs strictly equal to zero. This means that the network of active
neurons (non-zero output) is sparsely connected. It is thought that sparse networks
have desirable properties, and sparse representations of a classification problem are
more likely to be linearly separable (as shown in Section 10.1). Figure 7.11 illustrates
that for a given input pattern only a certain fraction of hidden neurons is active. For
these neurons the computation is linear, yet different input patterns give different
sets of active neurons. The product in Equation (7.18) acquires a particularly simple
structure: the matrices D(k ) are diagonal with 0/1 entries. But while the weight
matrices are independent, the D(k ) -matrices are correlated: which elements vanish
depends on the states of the neurons in the corresponding layer, which in turn
depend on the weights to the right of D(k ) in the matrix product.

Figure 7.11: Sparse network of active neurons with ReLU activation functions. The red
paths correspond to active neurons with positive local fields.

A hidden layer with only one or very few active neurons might act as a bottleneck,
preventing efficient backpropagation of the output errors, which could in principle slow
down training. For the examples given in Ref. [47] this does not occur.
The ReLU function is unbounded for large positive local fields. Therefore, the
vanishing-gradient problem (Section 7.2.1) is thought to be less severe in networks
made of rectified linear units, but see Ref. [46]. Since the ReLU function does not
saturate, the weights tend to increase. Glorot et al. [47] suggested to use L 1 -weight
decay (Section 6.3.3) to make sure that the weights do not grow.
Finally, using ReLU functions instead of sigmoid functions speeds up the training,
because the ReLU function has piecewise constant derivatives. Such function calls
are faster to evaluate than sigmoid functions, for example.
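
A small sketch of the ReLU function, its piecewise constant derivative, and the sparsity it produces in a randomly initialised hidden layer (the layer size is an illustrative assumption; with zero thresholds roughly half of the local fields are negative, so roughly half of the neurons are inactive):

import numpy as np

rng = np.random.default_rng(3)

def relu(b):
    return np.maximum(0.0, b)              # Equation (7.23)

def relu_prime(b):
    return (b > 0).astype(float)           # piecewise constant derivative, with g'(0) = 0

N = 1000
x = rng.normal(size=N)
w = rng.normal(size=(N, N)) / np.sqrt(N)   # initialisation with variance 1/N, Equation (7.37)
V = relu(w @ x)                            # states of the hidden layer
print("fraction of inactive neurons:", np.mean(V == 0.0))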

7.2.3 Outputs and cost functions


Up to now we discussed networks that have the same activation functions for all
neurons in all layers, either sigmoid or tanh activation functions (Equation 6.15), or
ReLU functions (Section 7.2.2). These networks are trained by stochastic gradient
descent on the quadratic energy function (5.25). It has been shown that it may be
advantageous to employ a different energy function, and to use slightly different
activation functions for the neurons in the output layer, so-called softmax outputs,
defined as
    O_i = exp( α b_i^(L) ) / Σ_{k=1}^M exp( α b_k^(L) ) .      (7.24)
Here b_i^(L) = Σ_j w_ij^(L) V_j^(L−1) − θ_i^(L) are the local fields in the output layer. In the
limit α → ∞, you see that O_i = δ_{i i_0}, where i_0 is the index of the winning output unit,
the one with the largest value of b_i^(L) (Chapter 9). Usually one takes α = 1; then
Equation (7.24) is a soft version of this maximum criterion, thus the name softmax.
Three important properties of softmax outputs
are, first, that 0 ≤ Oi ≤ 1. Second, the values of the outputs sum to one
    Σ_{i=1}^M O_i = 1 .      (7.25)

This means that the outputs of softmax units can be interpreted as probabilities.
Third, the outputs are monotonic: when b_i^(L) increases then O_i increases, but the
values O_k of the other output neurons k ≠ i decrease.
Softmax output units can simplify interpreting the network output for classifi-
cation problems where the inputs must be assigned to one of M classes. In this
problem, the output O_i^(µ) of softmax unit i represents the probability that the input
x^(µ) is in class i (in terms of the targets: t_i^(µ) = 1 while t_k^(µ) = 0 for k ≠ i). Softmax units
are often used in conjunction with a different energy function (or cost function), defined
in terms of negative log-likelihoods:

    H = − Σ_{iµ} t_i^(µ) log O_i^(µ) .      (7.26)

Here and in the following, log stands for the natural logarithm. The function (7.26)
is minimal when O_i^(µ) = t_i^(µ). Since the function (7.26) is different from the energy
function used in Chapter 6, the details of the backpropagation algorithm are slightly
different. To find the correct formula for backpropagation, we need to evaluate

    ∂H/∂w_mn = − Σ_{iµ} ( t_i^(µ) / O_i^(µ) ) ∂O_i^(µ)/∂w_mn .      (7.27)

Here I did not write out the labels L that denote the output layer, and in the following
equations I also drop the index µ that refers to the input pattern. Using the identities

∂ Oi ∂ bl
= Oi (δi l − Ol ) and = δl m Vn , (7.28)
∂ bl ∂ wmn
one obtains
∂ Oi X∂O ∂b
i l
= = Oi (δi m − Om )Vn . (7.29)
∂ wmn l
∂ b l ∂ w mn

So

    δw_mn = −η ∂H/∂w_mn = η Σ_{iµ} t_i^(µ) ( δ_im − O_m^(µ) ) V_n^(µ) = η Σ_µ ( t_m^(µ) − O_m^(µ) ) V_n^(µ) ,      (7.30)

since Σ_{i=1}^M t_i^(µ) = 1 for the type of classification problem where each input belongs to
precisely one class. The corresponding expression for the threshold updates reads

    δθ_m = −η ∂H/∂θ_m = −η Σ_µ ( t_m^(µ) − O_m^(µ) ) .      (7.31)
Figure 7.12: The symbol for a softmax layer indicates that the neurons in this layer are
not independent.

Equations (7.30) and (7.31) highlight a further advantage of softmax output neurons
(apart from the fact that they allow the output to be interpreted in terms of probabil-
ities). The weight and threshold increments for the output layer derived in Chapter 6
[Equations (6.7) and (6.12a)] contain factors of the derivatives g′(B_m^(µ)). As noted earlier,
these derivatives tend to zero when the activation function saturates, slowing down
the learning. But Equations (7.30) and (7.31) do not contain such factors! Here the
rate at which the neuron learns is simply proportional to the error (t_m^(µ) − O_m^(µ)); no
small factors reduce this rate.
Softmax units are normally only used in the output layer. First, the derivation
shows that the learning speedup mentioned above is coupled to the use of the log
likelihood function (7.26). Second, one usually tries to avoid dependence between
the neurons in a given hidden layer, but Equation (7.24) shows that the output
of neuron i depends on all local fields in the hidden layer (Figure 7.12). A better
alternative is usually the ReLU activation function discussed in Section 7.2.2.
There is an alternative way of choosing the cost function that is very similar to
the above, but works with sigmoid units:
    H = − Σ_{iµ} [ t_i^(µ) log O_i^(µ) + ( 1 − t_i^(µ) ) log( 1 − O_i^(µ) ) ] ,      (7.32)

with Oi = σ(bi ) where σ is the sigmoid function (6.15a). The function (7.32) is called
cross-entropy function. To compute the weight increments, we apply the chain rule:

    ∂H/∂w_mn = − Σ_{iµ} [ t_i^(µ)/O_i^(µ) − ( 1 − t_i^(µ) )/( 1 − O_i^(µ) ) ] ∂O_i^(µ)/∂w_mn
             = − Σ_{iµ} ( t_i^(µ) − O_i^(µ) ) / [ O_i^(µ) ( 1 − O_i^(µ) ) ] ∂O_i^(µ)/∂w_mn .      (7.33)

Using Equation (6.16a) we obtain


    δw_mn = η Σ_µ ( t_m^(µ) − O_m^(µ) ) V_n^(µ) ,      (7.34)

identical to Equation (7.30). The threshold increments are also updated in the same
way, Equation (7.31). Yet the interpretation of the outputs is slightly different, since
the values of the softmax units in the output layer sum to unity, while those of
the sigmoid units do not. In either case you can use the definition (6.23) for the
classification error.
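
A numpy sketch of softmax outputs (7.24) with the negative log-likelihood (7.26), together with a numerical check of the simple gradient formula (7.30); the small layer sizes are assumptions chosen for illustration:

import numpy as np

rng = np.random.default_rng(4)

def softmax(b):
    e = np.exp(b - b.max())                 # subtract the maximum for numerical stability
    return e / e.sum()

M, N = 3, 5                                 # number of classes and of hidden neurons
w = rng.normal(size=(M, N)); theta = np.zeros(M)
V = rng.normal(size=N)                      # states of the last hidden layer
t = np.array([0.0, 1.0, 0.0])               # target: the input belongs to class 2

b = w @ V - theta
O = softmax(b)
H = -np.sum(t * np.log(O))                  # negative log-likelihood, Equation (7.26)

grad_w = -np.outer(t - O, V)                # Equation (7.30): dH/dw_mn = -(t_m - O_m) V_n

# numerical check of one component of the gradient
m, n, eps = 1, 2, 1e-6
w_pert = w.copy(); w_pert[m, n] += eps
H_pert = -np.sum(t * np.log(softmax(w_pert @ V - theta)))
print(grad_w[m, n], (H_pert - H) / eps)     # the two numbers agree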

7.2.4 Weight initialisation


The results of Section 7.2.1 point to the importance of initialising the weights in the
right way, so that the learning does not slow down. This is significant because it is
often found that the initial transient learning phase poses a substantial bottleneck
to learning [48]. For this initial transient, correct weight initialisation can give a
substantial improvement.
Moreover, when training deep networks with sigmoid activation functions in the
hidden layers, it was observed that the values of the output neurons remain very
close to zero for many training iterations (Figure 2 in Ref. [49]), slowing down the
training substantially. It is argued that this is a consequence of the way the weights
are initialised, in combination with the particular shape of the sigmoid activation
function. It is sometimes argued that tanh activation functions work better than
sigmoids, although the reasons are not quite clear.
So, how should the weights be initialised? The standard choice is to initialise
the weights to independent Gaussian random numbers with mean zero and unit
variance and the thresholds to zero (Section 6.1). But in networks that have large
hidden layers with many neurons, this scheme may fail. Consider a neuron i in the
first hidden layer with N incoming connections. Its local field

    b_i = Σ_{j=1}^N w_ij x_j      (7.35)

is a sum of many independent, identically distributed random numbers. Assume
that the input patterns have independent random bits, equal to 0 or 1 with prob-
ability 1/2. From the central-limit theorem we find that the local field is Gaussian
distributed in the limit of large N, with mean zero and variance

    σ_b^2 = σ_w^2 N / 2 .      (7.36)


This means that the local field is typically quite large, of order √N, and this implies
that the units of the first hidden layer saturate, slowing down the learning. This
conclusion rests on our particular assumption concerning the input patterns, but it
is in general much better to initialise the weights uniformly or Gaussian with mean
zero and with variance

    σ_w^2 ∝ N^−1 ,      (7.37)
Figure 7.13: Illustrates regularisation by drop out.

to cancel the factor of N in Equation (7.36). The thresholds can be initialised to


zero, as described in Section 6.1.
The normalisation (7.37) makes sure that the weights are not too large initially,
but it does not circumvent the vanishing-gradient problem discussed in Section
7.2.1. There the problem was illustrated for N = 1, so unit variance for the initial
weight distribution corresponds to Equation (7.37).
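
A short numerical check of Equations (7.36) and (7.37); the inputs are random 0/1 bits as assumed in the text, and the layer size is an illustrative choice:

import numpy as np

rng = np.random.default_rng(5)
N, trials = 1000, 5000

x = rng.integers(0, 2, size=(trials, N)).astype(float)       # random 0/1 input bits

# unit-variance weights: local field b = sum_j w_j x_j is of order sqrt(N), the units saturate
W = rng.normal(0.0, 1.0, size=(trials, N))
b = (W * x).sum(axis=1)
print("sigma_w^2 = 1:   std of local field =", b.std())      # about sqrt(N/2), Equation (7.36)

# scaled weights, Equation (7.37): local fields of order unity
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(trials, N))
b = (W * x).sum(axis=1)
print("sigma_w^2 = 1/N: std of local field =", b.std())      # about 1/sqrt(2)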

7.2.5 Regularisation
Deeper networks have more neurons, so the problem of overfitting (Figure 6.9) tends
to be more severe for deeper networks. Therefore regularisation schemes that limit
the tendency to overfit are more important for deeper networks. In Section 6.3.3 sev-
eral regularisation schemes were described, for example L 1 - and L 2 -regularisation.
In training deep networks, a number of other regularisation schemes have proved
useful: drop out, pruning (Section 6.3.5), and expanding the training set.

Drop out
In this scheme some neurons are ignored during training [50]. Usually this reg-
ularisation technique is applied to hidden neurons. The procedure is illustrated
in Figure 7.13. In each step of the training algorithm (for each mini batch, or for
each individual pattern) one ignores at random a fraction p of neurons from each
hidden layer, and updates the weights in the remaining, diluted network in the usual
fashion. The weights coming into the dropped neurons are not updated, and as a
consequence neither are their outputs. For the next step in the training algorithm,
the removed neurons at put back, and another set of hidden neurons is removed.
Once the training is completed, all hidden neurons are activated, but their outputs
are multiplied by p .
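
A sketch of the forward pass through one hidden layer with drop out; p denotes the fraction of dropped neurons as above, and the layer sizes and the ReLU activation are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(6)

def hidden_layer(x, w, theta, p=0.5, training=True):
    """ReLU hidden layer with drop out. During training a random fraction p of the
    neurons is ignored; after training all neurons are active, scaled by 1 - p."""
    V = np.maximum(0.0, w @ x - theta)
    if training:
        mask = rng.random(V.shape) >= p      # keep each neuron with probability 1 - p
        return V * mask                      # dropped neurons output zero
    return V * (1.0 - p)                     # after training: rescale all outputs

x = rng.normal(size=20)
w = rng.normal(size=(50, 20)) / np.sqrt(20)
theta = np.zeros(50)
print(hidden_layer(x, w, theta, training=True)[:5])
print(hidden_layer(x, w, theta, training=False)[:5])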
Srivastava et al. [50] motivate this method by remarking that the performance
of machine-learning algorithms is usually improved by combining the results of
several learning attempts. In our case this corresponds to separately training several
n 10 8 6 4 2
Training success
without pruning 98.5 96.8 92.5 78.3 49.1
pruned network - - - 97.9 83.3

Figure 7.14: Boolean XOR problem. The network has one hidden layer with n ReLU
neurons. The output neuron has a sigmoid activation function. The network is trained
with stochastic gradient descent for 10 000 iterations. The initial weights were Gaussian
random numbers with mean zero, standard deviation 0.1, and max-norm regularisation
|wi j | < 2. The thresholds were initially zero. Training success was measured in an
ensemble of 1000 independent training realisations. Data from Ref. [51].

networks with different layouts on different inputs, and then to average over their
outputs. However, for deep networks this is computationally very expensive. Drop
out is an attempt to achieve the same goal more efficiently. The idea is that dropout
corresponds to effectively training a large number of different networks. If there
are k hidden neurons, then there are 2^k different combinations of neurons that
are turned on or off. The hope is that the network learns more robust features
of the input data in this way, and that this reduces overfitting. In practice the
method is usually applied together with another regularisation scheme, max-norm
regularisation. This means that weights are not allowed to grow larger than a given
constant: |wi j | ≤ c .

Pruning

Pruning (Section 6.3.5) is also a regularisation method: by removing unnecessary


weights one reduces the risk of overfitting. As opposed to drop out, where hidden
neurons are only temporarily ignored, pruning refers to permanently removing
hidden neurons. The idea is to train a large network, and then to prune a large
fraction of neurons to obtain a much smaller network. It is usually found that
such pruned nets generalise much better than small nets that were trained without
pruning. Up to 90% of the hidden neurons can be removed. This method is applied
with success to deep networks, but here I want to discuss a simple example: Frankle
& Carbin [51] used the Boolean XOR function to illustrate the effectiveness of pruning.

Figure 5.14 shows that the XOR function can be represented by a hidden layer
with two neurons. Suitable weights and thresholds are given in this Figure. Frankle &
Carbin [51] point out that backpropagation takes a long time to find a valid solution,

for random initial weights. They observe that a network with many more neurons
in the hidden layer usually learns better. Figure 7.14 lists the fraction of successful
trainings for networks with different numbers of neurons in the hidden layer. With
two hidden neurons, only 49.1% of the networks learned the task in 10 000 training
steps of stochastic gradient descent. Networks with more neurons in the hidden
layer ensure better training success. The Figure also shows the training success of
pruned networks that were initially trained with n = 10 neurons. The networks
were pruned iteratively during training, removing the neurons whose weights had the smallest
average magnitude. After training, the weights and thresholds were reset to their
initial values, the values they had before training began. One can draw three conclusions
from this data (from Ref. [51]). First, iterative pruning during training singles out
neurons in the hidden layer that had initial weights and thresholds resulting in the
correct decision boundaries. Second, the pruned network with two hidden neurons
has much better training success than the network that was trained with only two
hidden neurons. Third, despite pruning more than 50% of the hidden neurons, the
network with n = 4 hidden neurons performs almost as well as the one with n = 10
hidden neurons. When training deep networks it is common to start with many
neurons in the hidden layers, and to prune up to 90% of them. This results in small
trained networks that can efficiently and reliably classify.
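
As a rough illustration, the following MATLAB sketch removes, from a trained hidden layer,
the neuron whose incoming weights have the smallest average magnitude, and resets the
surviving parameters to their initial values before retraining. Variable names, sizes, and the
omitted training step are assumptions for illustration only, not the procedure of Ref. [51].

    % Sketch of one iteration of magnitude-based pruning with reset
    % (illustrative only; the training step is omitted).
    rng(1);
    nIn = 2; nH = 10;
    w0 = 0.1*randn(nH, nIn);         % initial weights, kept for the reset
    theta0 = zeros(nH, 1);           % initial thresholds
    w = w0; theta = theta0;
    active = true(nH, 1);            % hidden neurons that are still present

    % ... train (w, theta) on the task for a number of steps here ...

    score = mean(abs(w), 2);         % average weight magnitude per neuron
    score(~active) = Inf;            % ignore neurons that were already pruned
    [~, k] = min(score);             % neuron with the smallest average magnitude
    active(k) = false;               % prune it

    % Reset the surviving neurons to their initial parameters, then retrain
    % the smaller network.
    w = w0;  theta = theta0;
    w(~active, :) = 0;  theta(~active) = 0;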

Expanding the training set

If one trains a network with a fixed number of hidden neurons on larger training sets,
one observes that the network generalises with higher accuracy (better classification
success). The reason is that overfitting is reduced when the training set is larger.
Thus, a way of avoiding overfitting is to expand or augment the training set. It
is sometimes argued that the recent success of deep neural networks in image
recognition and object recognition is in large part due to larger training sets. One
example is ImageNet, a database of more than 10^7 hand-classified images, sorted into more
than 20 000 categories [52]. Naturally it is expensive to improve training sets in this
way. Instead, one can expand a training set artificially. For digit recognition (Figure
2.1), for example, one can expand the training set by randomly shifting, rotating,
and shearing the digits.
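
A minimal MATLAB sketch of such data-set augmentation is shown below: a training image is
duplicated with a small random shift and a small random rotation. The image and parameter
values are placeholders, and imrotate requires the Image Processing Toolbox.

    % Sketch of expanding a digit training set by random shifts and rotations
    % (illustrative only; imrotate requires the Image Processing Toolbox).
    rng(1);
    img = rand(28, 28);                        % stand-in for one training image
    dr = randi([-2 2]);  dc = randi([-2 2]);   % random shift in pixels
    shifted = circshift(img, [dr dc]);         % shifted copy of the digit
    angle = 10*(2*rand - 1);                   % random rotation angle, +/- 10 degrees
    rotated = imrotate(img, angle, 'bilinear', 'crop');
    % Both copies are added to the training set with the same target as img.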

7.2.6 Batch normalisation


Batch normalisation [53] can significantly speed up the training of deep networks
with backpropagation. The idea is to shift and normalise the input data for each
hidden layer, not only the input patterns (Section 6.3.1). This is done separately for
each mini batch (Section 6.2), and for each component of the inputs $V_j^{(\mu)}$, $j = 1,\ldots$

(Algorithm 4). One calculates the average and variance over each mini batch
\[
\overline{V}_j = \frac{1}{m_B}\sum_{\mu=1}^{m_B} V_j^{(\mu)}
\quad\text{and}\quad
\sigma_B^2 = \frac{1}{m_B}\sum_{\mu=1}^{m_B}\bigl(V_j^{(\mu)} - \overline{V}_j\bigr)^2 ,
\tag{7.38}
\]

subtracts the mean from the $V_j^{(\mu)}$, and divides by $\sqrt{\sigma_B^2 + \epsilon}$. The parameter $\epsilon > 0$
is added to the denominator to avoid division by zero. There are two additional
parameters in Algorithm 4, namely $\gamma_j$ and $\beta_j$. If one were to set $\beta_j = \overline{V}_j$ and
$\gamma_j = \sqrt{\sigma_B^2 + \epsilon}$ (Algorithm 4), then batch normalisation would leave the $V_j^{(\mu)}$ unchanged.
But instead these two parameters are learnt by backpropagation, just like the weights
and thresholds. In general the new parameters are allowed to differ from layer to
layer, $\gamma_j^{(\ell)}$ and $\beta_j^{(\ell)}$.
Batch normalisation was originally motivated by arguing that it reduces possible
covariate shifts faced by hidden neurons in layer `: as the parameters of the neurons
in the preceding layer ` − 1 change, their outputs shift thus forcing the neurons
in layer ` to adapt. However in Ref. [54] it was argued that batch normalisation
does not reduce the internal covariate shift. It speeds up the training by effectively
smoothing the energy landscape.
Batch normalisation helps to combat the vanishing-gradient problem because
it prevents the local fields of hidden neurons from growing. This makes it possible to use
sigmoid functions in deep networks, because the distribution of inputs remains
normalised. It is sometimes argued that batch normalisation has a regularising
effect, and it has been suggested [53] that batch normalisation can replace drop out
(Section 7.2.5). It is also argued that batch normalisation can help the network to
generalise better, in particular if each mini batch contains randomly picked inputs.
Then batch normalisation corresponds to randomly transforming the inputs to each
hidden neuron (by the randomly changing means and variances). This may help to
make the learning more robust. There is no theory that proves either of these claims,
but it is an empirical fact that batch normalisation often speeds up the training.

Algorithm 4 batch normalisation

1: for $j = 1, \ldots$ do
2:     calculate mean $\overline{V}_j \leftarrow \frac{1}{m_B}\sum_{\mu=1}^{m_B} V_j^{(\mu)}$
3:     calculate variance $\sigma_B^2 \leftarrow \frac{1}{m_B}\sum_{\mu=1}^{m_B}\bigl(V_j^{(\mu)} - \overline{V}_j\bigr)^2$
4:     normalise $\hat{V}_j^{(\mu)} \leftarrow \bigl(V_j^{(\mu)} - \overline{V}_j\bigr)/\sqrt{\sigma_B^2 + \epsilon}$
5:     calculate outputs as $g\bigl(\gamma_j \hat{V}_j^{(\mu)} + \beta_j\bigr)$
6: end for
7: end;
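
The following MATLAB sketch spells out the forward pass of Algorithm 4 for one hidden
layer and one mini batch. Variable names and sizes are assumptions made for illustration,
and the element-wise expansion used here requires MATLAB R2016b or later.

    % Sketch of the batch-normalisation forward pass (Algorithm 4).
    % V is an mB-by-n matrix: one row per pattern in the mini batch,
    % one column per neuron of the layer.
    rng(1);
    mB = 32; n = 10; epsilon = 1e-8;
    V = randn(mB, n);                         % inputs to the layer for one mini batch
    gamma = ones(1, n);  beta = zeros(1, n);  % learnable parameters, one pair per neuron

    Vmean = mean(V, 1);                       % mean over the mini batch, Equation (7.38)
    Vvar  = var(V, 1, 1);                     % variance over the mini batch (1/mB normalisation)
    Vhat  = (V - Vmean) ./ sqrt(Vvar + epsilon);   % normalised inputs
    B     = gamma .* Vhat + beta;             % scaled and shifted local fields
    O     = max(B, 0);                        % activation g, here ReLU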

Figure 7.15: (a) Layout of a convolution layer: inputs 10 × 10, hidden layer 8 × 8; the
neurons V11 and V12 are marked. (b) Several convolution layers (hidden 8 × 8 × 4) are
connected to the input layer to detect different features.

7.3 Convolutional networks


Convolutional networks have been around since the 1980s. They became widely
used after Krizhevsky et al. [55] won the ImageNet challenge (Section 7.5) with a
convolutional net. One reason for the recent success of convolutional networks is
that they have fewer neurons. This has two advantages. Firstly, such networks are
obviously cheaper to train. Secondly, as pointed out above, reducing the number of
neurons regularises the network: it reduces the risk of overfitting.
Convolutional neural networks are designed for object recognition and pattern
detection. They take images as inputs (Figure 7.1), not just a list of attributes (Figure
5.1). Convolutional networks have important properties in common with networks
of neurons in the visual cortex of the Human brain [4]. First, there is a spatial array
of input terminals. For image analysis this is the two-dimensional array of bits.
Second, neurons are designed to detect local features of the image (such as edges or
corners for instance). The maps learned by such neurons, from inputs to output, are
referred to as feature maps. Since these features occur in different parts of the image,
one uses the same feature map (with the same weights and thresholds) for different
parts of the image. Since feature maps are local, and since they act in a translational-
invariant way, the number of neurons from the two-dimensional input array is
greatly reduced, compared with usual fully connected networks. Feature maps act
like the mathematical convolution operation. Therefore, layers with feature maps
are also referred to as convolution layers.
Convolutional networks can have hierarchies of convolution layers. The idea is
that the additional layers can learn more abstract features. Apart from feature maps,
convolutional networks contain other types of layers. Pooling layers connect directly
to the convolution layer(s), their task is to simplify the output of the convolution

layers. Connected to the pooling layers, convolutional networks may also contain
several fully connected layers.

7.3.1 Feature maps


Figure 7.15(a) illustrates the layout of a convolution layer. Neuron V11 connects
to a 3 × 3 area of pixels in the input layer. In analogy with the terminology used
in neuroscience, this area is called the local receptive field of this hidden neuron.
Neuron V12 connects to a shifted local receptive field, as illustrated in the Figure.
Since the input has 10 × 10 pixels, the dimension of the convolution layer is 8 × 8.
The important point is that the neurons V11 and V12 , and all other neurons in this
convolution layer, share their weights and the threshold. In the example shown in
Figure 7.15(a) there are thus only 9 independent weights, and one threshold. Since
the different neurons in the convolution layer share weights and thresholds, their
computation rule takes the form of a discrete convolution:
\[
V_{ij} = g\Bigl(\sum_{p=1}^{3}\sum_{q=1}^{3} w_{pq}\, x_{p+i-1,\,q+j-1} - \theta\Bigr) .
\tag{7.39}
\]

The activation function g can be the sigmoid function. Usually one connects several
convolution layers to the input layer, as shown in Figure 7.15(b). Different layers
contain different feature maps, one that detects edges for example, and another
one that detects corners, and so forth.
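
A direct MATLAB transcription of Equation (7.39), for one 3 × 3 feature map applied to a
10 × 10 input, reads as follows. The input, weights, and threshold are random placeholders.

    % Sketch of a single 3x3 feature map, Equation (7.39).
    rng(1);
    x = rand(10, 10);                 % input image
    w = 0.1*randn(3, 3);              % shared weights of the feature map
    theta = 0;                        % shared threshold
    g = @(b) 1./(1 + exp(-b));        % sigmoid activation function

    V = zeros(8, 8);                  % 10x10 input and 3x3 field give an 8x8 layer
    for i = 1:8
        for j = 1:8
            patch = x(i:i+2, j:j+2);  % local receptive field of neuron V(i,j)
            V(i, j) = g(sum(sum(w .* patch)) - theta);
        end
    end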
Figure 7.15 depicts a two-dimensional input array. For colour images there are
usually three colour channels, in this case the input array is three-dimensional, and
the input bits are labeled by three indices: two for position and the last one for
colour, xi j k . If one has several convolution layers that connect to the inputs, one
groups the weights (and thresholds) into still higher-dimensional arrays (tensors).
In this case the convolution takes the form:
\[
V_{ijk\cdots} = g\Bigl(\sum_{pqr\cdots} w_{pqkr\cdots}\, x_{p+i-1,\,q+j-1,\,r\cdots} - \theta_{k\cdots}\Bigr) .
\tag{7.40}
\]

The software package TensorFlow [56] is designed to efficiently perform tensor


operations as in Equation (7.40).
In Figure 7.15 the local receptive field is shifted by one pixel at a time. Sometimes
it is useful to use a different stride, to shift the receptive field by s pixels. Also, the
local receptive regions need not have size 3 × 3. If we assume that their size is Q × P ,
the rule (7.39) takes the form
\[
V_{ij} = g\Bigl(\sum_{p=1}^{P}\sum_{q=1}^{Q} w_{pq}\, x_{p+s(i-1),\,q+s(j-1)} - \theta\Bigr) .
\tag{7.41}
\]

Figure 7.16: Layout of a convolutional neural network for object recognition and image
classification. The inputs are in a 10 × 10 array. They feed into four convolution layers
representing four different 3 × 3 feature maps (hidden layer 8 × 8 × 4). Each convolution
layer feeds into its own max-pooling layer (4 × 4 × 4). Between these and the output layer
are a couple of fully connected hidden layers.

If one couples several convolution layers together, the number of neurons in these
layers decreases rapidly as one moves to the right. In this case one can pad the
image (and the convolution layers) by adding rows and columns of bits set to zero. In
Figure 7.15(a), for example, one obtains a convolution layer of the same dimension
as the original image if one pads the image with two rows and columns of bits.
Convolution layers are trained with backpropagation. Consider the simplest
case, Equation (7.39). As usual, we use the chain rule to evaluate the gradients:
\[
\frac{\partial V_{ij}}{\partial w_{mn}} = \sum_{rs} \frac{\partial V_{ij}}{\partial b_{rs}}\,\frac{\partial b_{rs}}{\partial w_{mn}} .
\tag{7.42}
\]
The derivatives of the local fields are evaluated by applying rule (5.27) to Equation
(7.39):
\[
\frac{\partial b_{rs}}{\partial w_{mn}} = \sum_{pq} \delta_{mp}\,\delta_{nq}\, x_{p+r-1,\,q+s-1} .
\tag{7.43}
\]

In this way one can train several stacked convolution layers too. It is important to
keep track of the summation boundaries. To that end it helps to pad out the image
and the convolution layers, so that the upper bounds remain the same in different
layers.
Details aside, the fundamental principle of feature maps is that the map is applied
in the same form to different parts of the image (translational invariance). In this
way the learning of parameters is shared between pixels, each weight in a given
feature map is trained on different parts of the image. This effectively increases the
training set for the feature map and combats overfitting.

7.3.2 Pooling layers


Pooling layers process the output of convolution layers. A neuron in a pooling
layer takes the outputs of several neighbouring feature maps and summarises their

Figure 7.17: Examples of digits from the MNIST data set of handwritten digits [57]. The
images were produced using MATLAB. Note that by default MATLAB displays the digits
in white on a black background. Copyright for the data set: Y. LeCun and C. Cortes.

outputs into a single number. Max-pooling units, for example, summarise the
outputs of nearby feature maps (in a 2×2 square for instance) by taking the maximum
over the feature-map outputs. Instead, one may compute the root-mean square of
the map values (L 2 -pooling). There are no weights or thresholds associated with
the pooling layers, they compute the output from the inputs using a pre-defined
prescription. Other ways of pooling are discussed in Ref. [4].
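
A corresponding MATLAB sketch of 2 × 2 max pooling with stride 2 is given below. The
8 × 8 input is a random placeholder for the output of one feature map.

    % Sketch of 2x2 max pooling with stride 2; no weights or thresholds.
    rng(1);
    V = rand(8, 8);                   % stand-in for the output of one feature map
    P = zeros(4, 4);
    for i = 1:4
        for j = 1:4
            block = V(2*i-1:2*i, 2*j-1:2*j);   % 2x2 block of feature-map outputs
            P(i, j) = max(block(:));           % maximum over the block
            % root-mean-square of the block would give L2-pooling instead
        end
    end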
Usually several feature maps are connected to the input. Pooling is performed
separately on each of them. The network layout looks like the one shown schemat-
ically in Figure 7.16. In this Figure, the pooling layers feed into a number of fully
connected hidden layers that connect to the output neurons. There are as many
output neurons as there are classes to be recognised. This layout is qualitatively
similar to the layout used by Krizhevsky et al. [55] in the ImageNet challenge (see
Section 7.5 below).

7.4 Learning to read handwritten digits


Figure 7.17 shows patterns from the MNIST data set of handwritten digits [57]. The
data set derives from a data set compiled by the National Institute of Standards and
Technology (NIST), of digits handwritten by high-school students and employees of
the United States Census Bureau. The data comprises a set of 60 000 images of
digits with 28 × 28 pixels, and a test set of 10 000 digits. The images are grayscale with
8-bit resolution, so each pixel contains a value ranging from 0 to 255. The images in
the database were preprocessed. The procedure is described on the MNIST home
page. Each original binary image from the National Institute of Standards and
Technology was represented as a 20×20 gray-scale image, preserving the aspect
ratio of the digit. The resulting image was placed in a 28×28 image so that the
centre-of-mass of the image coincided with its geometrical centre. These steps can
make a crucial difference (Section 7.4.3).
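
To illustrate the centring step, here is a MATLAB sketch that places a 20 × 20 digit into a
28 × 28 image and shifts it so that its centre of mass coincides with the geometric centre.
The digit is a random placeholder, and circshift wraps around at the borders, which is
harmless here because the border pixels are essentially zero.

    % Sketch of centring a 20x20 digit in a 28x28 image by its centre of mass.
    digit20 = rand(20, 20);                 % stand-in for a 20x20 grayscale digit
    canvas = zeros(28, 28);
    canvas(5:24, 5:24) = digit20;           % place the digit roughly centrally

    [r, c] = ndgrid(1:28, 1:28);            % pixel coordinates
    m  = sum(canvas(:));                    % total intensity
    rc = sum(sum(r .* canvas)) / m;         % centre of mass, row coordinate
    cc = sum(sum(c .* canvas)) / m;         % centre of mass, column coordinate
    shift = round([14.5 - rc, 14.5 - cc]);  % shift that centres the digit
    centred = circshift(canvas, shift);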
We divide the data set into a training set with 50 000 digits and a validation set

Algorithm 5 network layout and training options: no hidden layers, softmax output
layer with 10 units. Here net is the network object containing the training data set,
the network layout, and the training options.
layers = [imageInputLayer([28 28 1])
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 8192, ...
    'ValidationData', {xValid, tValid}, ...
    'ValidationFrequency', 30, ...
    'MaxEpochs', 200, ...
    'Plots', 'training-progress', ...
    'L2Regularization', 0, ...
    'Momentum', 0.9, ...
    'ValidationPatience', 5, ...
    'Shuffle', 'every-epoch', ...
    'InitialLearnRate', 0.001);
net = trainNetwork(xTrain, tTrain, layers, options);

with 10 000 digits. The latter is used for cross-validation and early stopping. The test
data is used for measuring the classification error after training. For this purpose
one should use a data set that was not involved in the training.
The goal of this Section is to show how the principles described in Chapters 6 and
7 allow one to learn the MNIST data with low classification error, as outlined in Ref. [5].
You can follow the steps described below with your own computer program, using
MATLAB 2017b which is available at StuDAT. But if you prefer you can also use other
software packages such as Keras [58], an interface for TensorFlow [56], Theano [59],
or PyTorch [60]. The networks described below use ReLU units (Section 7.2.2) in the
hidden layers and a softmax output layer (Section 7.2.3) with ten output units Oi
and energy function (7.26), so that output Oi is the probability that the pattern fed
to the network falls into category i .

7.4.1 Fully connected layout


The simplest network has no hidden layers at all, just one softmax output layer with
10 neurons. The representation of this network and its training algorithm in MATLAB
is summarised in Algorithm 5. There are three parts. The layout of the network is

defined in the array layers=[...]. Here imageInputLayer([28 28 1]) reads in the
28×28 inputs, and preprocesses the inputs by subtracting the mean image averaged
over the whole training set from each input image [Equation (6.17)]. The three array
elements fullyConnectedLayer, softmaxLayer and classificationLayer
define a softmax layer with 10 output units. First, fullyConnectedLayer(10)
computes $b_j = \sum_k w_{jk} V_k + \theta_j$ for $j = 1,\ldots,10$. Note that the sign of the threshold
differs from the convention we use (Algorithm 2). Second, softmaxLayer computes
the softmax function (7.24), where $O_i^{(\mu)}$ is the probability that input pattern
$x^{(\mu)}$ belongs to class $i$. Finally, classificationLayer computes the negative log
likelihood (7.26).
The options variable defines the training options for the backpropagation al-
gorithm. This includes choices for the learning parameters, such as mini-batch
size m B , learning rate η, momentum constant α, and so forth. To find appropriate
parameter values and network layouts is one of the main difficulties when training
a neural network, and it usually requires a fair deal of experimenting. There are
recipes for finding certain parameters [61], but the general approach is still trial and
error [5].
In options, ’sgdm’ means stochastic gradient descent with momentum, Equa-
tion (6.29). The momentum constant Momentum is set to α = 0.9. The mini-batch
size [Equation (6.14)] is set to 8192. The validation set is specified by {xValid,
tValid}. The algorithm computes the validation error during training, in intervals
specified by ValidationFrequency. This allows for cross validation and early
stopping. Roughly speaking training stops when the validation error begins to in-
crease. During training, the algorithm keeps track of the smallest validation error
observed so far. Training stops when the validation error was larger than the mini-
mum for a specified number of times, ValidationPatience. This variable is set to
5 in Algorithm 5. The variable Shuffle determines at which intervals the sequence
of patterns in the training set is randomised. The parameter InitialLearnRate
defines the initial learning rate, η = 0.001. By default it does not change as a func-
tion of time. One Epoch corresponds to applying p patterns or p /m B mini batches
(Section 6.1).
The resulting classification accuracy is about 90%. It can be improved by adding
a hidden layer with 30 ReLU units (Algorithm 6), giving a classification accuracy
of about 96%. The accuracy can be further improved by increasing the number of
neurons in the hidden layer. For 100 hidden ReLU units the accuracy becomes about
97.2% after training for 200 epochs (early stopping occurred after 135 epochs). Figure
7.18 shows how the training and the validation energies decrease during training,
for both networks. You see that the energies are a little lower for the network with
100 hidden neurons. But we observe overfitting in both cases, because after many
training steps the validation energy is much higher than the training energy. As



Figure 7.18: Energy functions for the MNIST training set (solid lines) and for the
validation set (dashed lines) for Algorithm 6 (red lines) and for a similar algorithm, but
with 100 neurons in the hidden layer, green lines. The data was smoothed and the plot
is schematic. The x -axis shows iterations. One iteration corresponds to feeding one
minibatch of patterns. One epoch consists of 50000/8192 ≈ 6 iterations.

mentioned above, early stopping caused the training of the larger network to abort
after 135 epochs, this corresponds to 824 iterations.

Now let us add more hidden layers. Experimenting shows that it is better to use a
slightly higher learning rate, η = 0.01. For two hidden layers we obtain classification
accuracies that are only slightly higher, 97.3%. Adding a third hidden layer does not
help much either. Try adding even more neurons, and/or more layers. You will see
that it is difficult to increase the classification accuracy further. Adding more fully
connected hidden layers does not necessarily improve the classification accuracy,
even if you train for more epochs. One possible reason is that the network overfits
the data (Section 6.3.2). This problem becomes more acute as you add more hidden
neurons. The tendency of the network to overfit is reduced by regularisation (Section
7.2.5). For the network with one hidden layer with 100 ReLU units, L 2 -regularisation
improves the classification accuracy to almost 98%. Here L2Regularization was
set to 0.03 and the learning rate to η = 0.03.

Algorithm 6 one fully connected hidden layer, softmax outputs.


layers = [imageInputLayer([28 28 1])
    fullyConnectedLayer(30)
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];

Figure 7.19: Some hand-written digits from the MNIST test set, misclassified by a
convolutional net that achieved an overall classification accuracy of 98%. Correct clas-
sification (top right), misclassification (bottom right). Data from Oleksandr Balabanov.

7.4.2 Convolutional networks


Deep convolutional networks can yield still higher classification accuracies than
those obtained in the previous Section. The layout of Algorithm 7 corresponds to a
network with one convolution layer with 20 feature maps, a max-pooling layer, and
a fully connected hidden layer with 100 ReLU units, similar to the network shown in
Figure 7.16. The classification accuracy obtained after 60 epochs is slightly above
98%. It can be further improved by including a second convolution layer, and batch
normalisation (Section 7.2.6). See Algorithm 8, a slightly modified version of an
example from MathWorks. The classification accuracy is 98.99% after 30 epochs.
The accuracy can be improved further by tuning parameters and network layout,
and by using ensembles of convolutional neural networks [57]. The best classifi-
cation accuracy found in this way is 99.77% [62]. Several of the MNIST digits are
difficult to classify for Humans too (Figure 7.19), so we conclude that convolutional
nets really work very well. Yet the above examples show also that it takes much ex-
perimenting to find the right parameters and network layout as well as long training
times to reach the best classification accuracies. It could be argued that one reaches
a stage of diminishing returns as the classification error falls below a few percent.

Algorithm 7 convolutional network, one convolution layer.


layers = [imageInputLayer([28 28 1])
    convolution2dLayer(5, 20, 'Padding', 1)
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(100)
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 8192, ...
    'ValidationData', {xValid, tValid}, ...
    'MaxEpochs', 60, ...
    'Plots', 'training-progress', ...
    'L2Regularization', 0, ...
    'InitialLearnRate', 0.001);
net = trainNetwork(xTrain, tTrain, layers, options);

Algorithm 8 several convolution layers, batch normalisation. After MathWorks.


layers = [imageInputLayer([28 28 1])
    convolution2dLayer(3, 20, 'Padding', 1)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 30, 'Padding', 1)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 50, 'Padding', 1)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];

Figure 7.20: Examples of digits drawn on an iPad. Data from Oleksandr Balabanov.

Figure 7.21: Same digits as in Figure 7.20, but preprocessed like the MNIST digits. Data
from Oleksandr Balabanov.

7.4.3 Reading your own hand-written digits


In this Section I outline how you can test the convolutional networks described
above on your own data set of hand-written digits, and which conclusions you can
draw from such an experiment. Make your own data set by drawing the digits on
an iPad with GoodNotes or a similar program. Draw as many digits as possible,
save them in a PNG file, and extract the individual digits using an image-processing
program such as Paint. Figure 7.20 shows digits obtained in this way. Now try out
one of your convolutional networks (Algorithm 7 or 8) trained on the MNIST data
set. You will see that the network has great difficulties recognising your digits.
One way of solving this problem is to add digits like the ones from Figure 7.20 to
the training set. A second possibility is to try to preprocess the digits from Figure
7.20 in the same way as the MNIST data was preprocessed. The result is shown
in Figure 7.21. Using a MNIST-trained convolutional net on these digits yields a
classification accuracy of about 90%. So the algorithm does not work very well at all.
What is going on?
Compare Figures 7.17 and 7.21. The digits in Figure 7.21 have a much more slen-
der stroke. It was suggested in Ref. [63] that this may be the reason, since it is known
that the line thickness of hand-written text can make a difference for algorithms
that read hand-written text [64]. There are different methods for normalising the
line thickness of hand-written text. Applying the method proposed in Ref. [64] to
our digits results in Figure 7.22. The algorithm of Ref. [64] has a free parameter, T ,

Figure 7.22: Same digits as in Figure 7.21. The difference is that the thickness of the
stroke was normalised (see text). Data from Oleksandr Balabanov.

that specifies the resulting line thickness. In Figure 7.22 it was taken to be T = 10,
close to the line thickness of the MNIST digits; we measured the latter to be T ≈ 9.7
using the method described in Ref. [64].
If we run a MNIST-trained convolutional net (Algorithm 8) on a data set of 60
digits with normalised line thickness, it fails on only two digits. This corresponds to
a classification accuracy of roughly 97%, not so bad – but not as good as the best
results in Section 7.4.2. Note that we can only make a rough comparison. In order to
obtain a better estimate of the classification accuracy we need to test many more
than 60 digits. A question is of course whether there are perhaps other differences
between our own hand-written digits and those in the MNIST data. It would also be
of interest to try digits that were drawn using Paint, or a similar program. How
do MNIST-trained convolutional nets perform on computer-drawn digits?
At any rate, the results of this Section show that the way the input data are pro-
cessed can make a big difference. This raises a point of fundamental importance.
We have seen that convolutional nets can be trained to represent a distribution of
input patterns with very high accuracy. But if you test the network on a data set that
has a slightly different distribution, perhaps because it was preprocessed differently,
the network may not work as well.

Figure 7.23: Object recognition using a deep convolutional network. Shown is a frame
from a movie recorded on a telephone. The network was trained on the Pascal VOC
data set [65] using YOLO [66]. Details on how to obtain the weights and how to install
the software are given on the YOLO website.


Figure 7.24: Smallest classification error for the ImageNet challenge [67]. The data up
to 2014 comes from Ref. [67]. The data for 2015 comes from Ref. [68], for 2016 from
Ref. [69], and for 2017 from Ref. [70]. From 2012 onwards the smallest error was achieved
by convolutional neural networks (CNN). After Goodfellow et al. [4].

7.5 Deep learning for object recognition


Deep learning has become so popular in the last few years because deep convolu-
tional networks are good at recognising objects in images. Figure 7.23 shows a frame
from a movie taken from a car with my mobile telephone. A deep convolutional
network trained on the Pascal VOC training set [65] recognises objects in the movie
by putting bounding boxes around the objects and classifying them. The Pascal
VOC data set is a training set for object-class recognition in images. It contains circa
20 000 images, each annotated with one of 20 classes. The people behind this data

set ran image classification challenges from 2005 to 2012. A more recent challenge
is the ImageNet large-scale visual recognition challenge (ILSVRC) [67], a competition
for image classification and object recognition using the ImageNet database [52].
The challenge is based on a subset of ImageNet. The training set contains more than
10^6 images manually classified into one of 1000 classes. There are approximately
1000 images for each class. The validation set contains 50 000 images.
The ILSVRC challenge consists of several tasks. One task is image classification,
to list the object classes found in the image. A common measure for accuracy is the
so-called top-5 error for this classification task. The algorithm lists the five object
classes it identified with highest probabilities. The result is considered correct if
the annotated class is among these five. The error equals the fraction of incorrectly
classified images. Why does one not simply judge whether the most probable class
is the correct one? The reason is that the images in the ImageNet database are
annotated by a single-class identifier. Often this is not unique. The image in Figure
7.23, for example, shows not only a car but also trees, yet the image is annotated
with the class label car. This is ambiguous. The ambiguity is significantly smaller
if one considers the top five classes the algorithm gives, and checks whether the
annotated class is among them.
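
The top-5 error is straightforward to compute from the output probabilities of a classifier.
The following MATLAB sketch shows one way of doing it; the probabilities and labels are
random placeholders.

    % Sketch of computing the top-5 classification error.
    rng(1);
    p = 5; nClasses = 1000;
    O = rand(p, nClasses);  O = O ./ sum(O, 2);  % stand-in for softmax outputs
    labels = randi(nClasses, p, 1);              % annotated single-class labels

    [~, idx] = sort(O, 2, 'descend');            % classes ordered by probability
    top5 = idx(:, 1:5);                          % five most probable classes
    correct = any(top5 == labels, 2);            % annotated class among the five?
    top5error = 1 - mean(correct);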
The tasks in the ILSVRC challenge are significantly more difficult than the digit
recognition described in Section 7.4, and also more difficult than the VOC challenges.
One reason is that the ImageNet classes are organised into a deep hierarchy of
subclasses. This results in highly specific sub classes that can be very difficult to
distinguish. The algorithm must be very sensitive to small differences between
similar sub classes. We say that the algorithm must have high inter-class variability
[71]. Different images in the same sub class, on the other hand, may look quite
different. The algorithm should nevertheless recognise them as similar, belonging
to the same class, the algorithm should have small intra-class variability [71].
Since 2012, algorithms based on deep convolutional networks won the ILSVRC
challenge. Figure 7.24 shows that the error has significantly decreased until 2017, the
last year of the challenge in the form described above. We saw in previous Sections
that deep networks are difficult to train. So how can these algorithms work so well?
It is generally argued that the recent success of deep convolutional networks is
mainly due to three factors.
First, there are now much larger and better annotated training sets available.
ImageNet is an example. Excellent training data is now recognised as one of the
most important factors, and companies developing software for self-driving cars
and accident-avoidance systems know that good training sets are important yet
difficult to achieve: to obtain reliable training
data one must manually collect and annotate the data (Figure 7.25). This is costly,
but at the same time it is important to have as large data sets as possible, to reduce

Figure 7.25: Reproduced from xkcd.com/1897 under the creative commons attribution-
noncommercial 2.5 license.

overfitting. In addition one must aim for a large variability in the collected data.
Second, the hardware is much better today. Deep networks are nowadays im-
plemented on single or multiple GPUs. There are also dedicated chips, such as the
tensor processing unit [72].
Third, improved regularisation techniques (Section 7.2.5) and weight sharing in
convolution layers help to fight overfitting, and ReLU units (Section 7.2.2) render
the networks less susceptible to the vanishing-gradient problem (Section 7.2.1).
The winning algorithm for 2012 was based on a network with five convolution
layers and three fully connected layers, using drop out, ReLU units, and data-set
augmentation [55]. The algorithm was implemented on GPU processors. The 2013
ILSVRC challenge was also won by a convolutional network [73], with 22 layers.
Nevertheless, the network has substantially fewer free parameters (weights and
thresholds) than the 2012 network: 4 × 10^6 instead of 60 × 10^6. In 2015, the winning
algorithm [68] had 152 layers. One significant new element in the layout were
connections that skip layers (residual networks, Section 7.6). The 2016 [74] and 2017
[70] winning algorithms used ensembles of convolutional networks.

7.6 Residual networks


The network that won the 2015 ILSVRC challenge had connections that skip layers
[68]. Empirical evidence shows that skipping layers makes deep networks easier to
train. This is usually motivated by saying that skipping layers reduces the vanishing-
gradient problem.
The layout is illustrated schematically in Figure 7.26. Black arrows stand for usual
feed-forward connections. The notation differs somewhat from that of Algorithm
2. Here the weights from layer $\ell-1$ to $\ell$ are denoted by $w_{jk}^{(\ell,\ell-1)}$, and those from
layer $\ell-2$ to $\ell$ by $w_{ij}^{(\ell,\ell-2)}$ (red arrow in Figure 7.26). Note that the superscripts are

Figure 7.26: Schematic illustration of a network with skipping connections: three consecutive
layers $\ell-2$, $\ell-1$, $\ell$ of ReLU neurons, with weights $w^{(\ell,\ell-1)}$ between consecutive
layers and weights $w^{(\ell,\ell-2)}$ (red arrow) skipping one layer.

Figure 7.27: ‘Network’ with connections that skip layers: a single chain of states
$x, V^{(1)}, \ldots, V^{(L-4)}, V^{(L-3)}, V^{(L-2)}, V^{(L-1)}, V^{(L)}$, with weights $w^{(\ell,\ell-1)}$ between
consecutive layers and weights $w^{(\ell,\ell-2)}$ skipping one layer.

ordered in the same way as the subscripts: the right index refers to the layer on the
left. Neuron j in layer ` computes

(`) (`,`−1) (`−1) (`) (`,`−2)


€X X Š
Vj = g wjk Vk − θj + wjn Vn(`−2) . (7.44)
k n

The weights of connections that skip layers are trained in the usual fashion, by
stochastic gradient descent. To illustrate the structure of the resulting formulae
consider a ‘network’ with just one neuron per layer (Figure 7.27). To begin with we
calculate the increments of the weights w (`,`−1) . To update w (L ,L −1) we require

\[
\frac{\partial V^{(L)}}{\partial w^{(L,L-1)}} = g'(b^{(L)})\, V^{(L-1)} .
\tag{7.45}
\]

This gives

\[
\delta w^{(L,L-1)} = \eta\,\delta^{(L)} V^{(L-1)} \quad\text{with}\quad \delta^{(L)} = \bigl(t - V^{(L)}\bigr)\, g'(b^{(L)}) ,
\tag{7.46}
\]

as in Algorithm 2. The factor (t − V (L ) ) comes from the outer derivative of the energy
function (6.4). The outputs are O = V (L ) . As in Algorithm 2, I have omitted the sum
over µ (stochastic gradient descent, page 86).
Also the update for w (L −1,L −2) is the same as in Algorithm 2:

\[
\delta w^{(L-1,L-2)} = \eta\,\delta^{(L-1)} V^{(L-2)} \quad\text{with}\quad \delta^{(L-1)} = \delta^{(L)} w^{(L,L-1)} g'(b^{(L-1)}) .
\tag{7.47}
\]

But the update for w (L −2,L −3) is different because the short cuts come into play. The
extra connection from layer L − 2 to L gives rise to an extra term:
\[
\frac{\partial V^{(L)}}{\partial w^{(L-2,L-3)}} = \Bigl(\frac{\partial V^{(L)}}{\partial V^{(L-1)}}\frac{\partial V^{(L-1)}}{\partial V^{(L-2)}} + \frac{\partial V^{(L)}}{\partial V^{(L-2)}}\Bigr)\frac{\partial V^{(L-2)}}{\partial w^{(L-2,L-3)}} .
\tag{7.48}
\]
Evaluating the partial derivatives we find
\[
\frac{\partial V^{(L)}}{\partial w^{(L-2,L-3)}} = \Bigl(g'(b^{(L)})\, w^{(L,L-1)} g'(b^{(L-1)})\, w^{(L-1,L-2)} + g'(b^{(L)})\, w^{(L,L-2)}\Bigr)\, g'(b^{(L-2)})\, V^{(L-3)} .
\]
This implies
\[
\delta^{(L-2)} = \delta^{(L-1)} w^{(L-1,L-2)} g'(b^{(L-2)}) + \delta^{(L)} w^{(L,L-2)} g'(b^{(L-2)}) .
\tag{7.49}
\]
In general, the error-backpropagation rule reads
\[
\delta^{(\ell-1)} = \delta^{(\ell)} w^{(\ell,\ell-1)} g'(b^{(\ell-1)}) + \delta^{(\ell+1)} w^{(\ell+1,\ell-1)} g'(b^{(\ell-1)})
\tag{7.50}
\]
for $\ell = L-1, L-2, \ldots$. The first term is the same as in step 9 of Algorithm 2. The
second term is due to the skipping connections.
The update formula for $w^{(\ell,\ell-1)}$ is
\[
\delta w^{(\ell,\ell-1)} = \eta\,\delta^{(\ell)} V^{(\ell-1)} .
\tag{7.51}
\]
The updates of the weights $w^{(\ell+1,\ell-1)}$ are given by
\[
\delta w^{(\ell+1,\ell-1)} = \eta\,\delta^{(\ell+1)} V^{(\ell-1)} ,
\tag{7.52}
\]
with the same errors as in Equation (7.51).
Skipping connections reduce the vanishing-gradient problem. To see this, note
that we can write the error $\delta^{(\ell)}$ as
\[
\delta^{(\ell)} = \sum_{\ell_1,\ell_2,\ldots,\ell_n} \delta^{(L)}\, w^{(L,\ell_n)} g'(b^{(\ell_n)}) \cdots w^{(\ell_2,\ell_1)} g'(b^{(\ell_1)})\, w^{(\ell_1,\ell)} g'(b^{(\ell)})
\tag{7.53}
\]
where the sum is over all paths $L > \ell_n > \ell_{n-1} > \cdots > \ell_1 > \ell$ back through the network.
The smallest gradients are dominated by the product corresponding to the path with
the smallest number of steps (factors), so that very small gradients become less likely.
Introducing connections that skip more than one layer tends to increase
the small gradients, as Equation (7.53) shows. Recently it has been suggested to
randomise the layout by randomly short-circuiting the network. Equation (7.53)
remains valid for this case too.
The network described in Ref. [68] used unit weights for the skipping connections,

\[
V_j^{(\ell)} = g\Bigl(\sum_k w_{jk}^{(\ell,\ell-1)} V_k^{(\ell-1)} - \theta_j^{(\ell)} + V_j^{(\ell-2)}\Bigr) ,
\tag{7.54}
\]
so that the hidden layer $V_k^{(\ell-1)}$ learns the difference between the input $V_j^{(\ell-2)}$ and the
output $V_j^{(\ell)}$. Therefore such networks are called residual networks.
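
A MATLAB sketch of the forward pass through a stack of such residual layers, Equation
(7.54), with one skipping connection per layer, looks as follows. Layer sizes, weights, and
the use of ReLU as activation function are assumptions made only for illustration.

    % Sketch of the forward pass through residual layers, Equation (7.54).
    rng(1);
    n = 16; L = 6;
    g = @(b) max(b, 0);                    % ReLU activation function
    V = randn(n, 1);                       % output of layer 0 (the input)
    Vprev = zeros(n, 1);                   % no skipping term into the first layer here
    for ell = 1:L
        w = 0.1*randn(n, n);               % weights w^(ell,ell-1)
        theta = zeros(n, 1);               % thresholds theta^(ell)
        Vnew = g(w*V - theta + Vprev);     % residual update, Equation (7.54)
        Vprev = V;  V = Vnew;              % shift the stored layer outputs by one
    end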

7.7 Summary
Networks with many hidden layers are called deep networks. It has recently been
shown that such networks can be trained to recognise objects in images with high
accuracy. It is sometimes stated that convolutional networks are now better than
Humans, in that they recognise objects with lower classification errors than Humans
[75]. This statement is problematic for several reasons. To start with, the article
refers to the 2015 ILSVRC competition, and the company mentioned in the Guardian
article was later caught out cheating. At any rate, this and similar statements refer to
an experiment showing that the Human classification error in recognising objects in
the ImageNet database is about 5.1% [76], worse than the most recent convolutional
neural-network algorithms (Figure 7.24).
Yet it is clear that these algorithms learn in quite a different way from Humans.
They can detect local features, but since these convolutional networks rely on trans-
lational invariance, they do not easily understand global features, and can mistake
a leopard-patterned sofa for a leopard [77]. It may help to include more sofas in
the training data set, but the essential difficulty remains: translational invariance
imposes constraints on what convolutional networks can learn [77].
More fundamentally one may argue that Humans learn differently, by abstraction
instead of going through vast training sets. Just try it out for yourself, this website
[78] allows you to learn like a convolutional network. Nevertheless, the examples
described in this Chapter illustrate the tremendous success of deep convolutional
networks.
We have also seen that training deep networks suffers from a number of fun-
damental problems. First, networks with many hidden neurons have many free
parameters (their weights and thresholds). This increases the risk of overfitting.
Overfitting reduces the power of the network to generalise. The tendency of deep
networks to overfit can be reduced by cross-validation (Section 6.3.2) and by reg-
ularisation (weight decay, drop out, pruning, and data set augmentation, Section
7.2.5). In this regard convolutional nets have an advantage because they have fewer
weights, and the weights of a given feature map are trained on different parts of the
input images, effectively increasing the training set.
Second, the examples described in Section 7.4 show that convolutional nets are
sensitive to differences in how the input data are preprocessed. You may run into
problems if you train a network on given training and validation sets, but apply it to
a test set that was preprocessed in a different way – so that the test set corresponds
to a different input distribution. Convolutional nets excel at learning the properties
of a given input distribution, but they may have difficulties in recognising patterns
sampled from a slightly different distribution, even if the two distributions appear
very similar to the Human eye. Note also that this problem cannot be solved by

cross-validation, because training and validation sets are drawn from the same
input distribution, but here we are concerned with what happens when the network
is applied to an input distribution different from the one it was trained on. Here is
another example illustrating this point: the authors of Ref. [79] trained a convolu-
tional network on perturbed grayscale images from the ImageNet data base, adding
a little bit of noise independently to each pixel (white noise) before training. This
network failed to recognise images that were weakly perturbed in a different way, by
setting a small number of pixels to white or black. When we look at the images we
have no difficulties seeing through the noise.
Third, error backpropagation in deep networks suffers from the vanishing-gradient
problem. This is more difficult to combat. It can be reduced by using ReLU units,
by initialising the weights in certain ways, and by networks with connections that
skip layers. Yet vanishing or exploding gradients remain a fundamental difficulty,
slowing learning down in the initial phase of training. Brute force (computer power)
helps to alleviate the problem. As a consequence, convolutional neural networks
have become immensely successful in object recognition, outperforming other
algorithms significantly.
Fourth, Refs. [80, 81] illustrate intriguing failures of convolutional networks. Sze-
gedy et al. [80] show that the way convolutional nets partition input space can lead
to surprising results. The authors took an image that the network classifies correctly
with high confidence, and perturbed it slightly. The difference between the original
and perturbed images (adversarial images) is undetectable to the Human eye, yet
the network misclassifies the perturbed image with high confidence [80]. This
indicates that decision boundaries are always close by in input space, which is not intuitive but
possible in high dimensions. Figure 1 in Ref. [81] shows images that are completely
unrecognisable to the Human eye. Yet a convolutional network classifies these
images with high confidence. This illustrates that there is no telling what a network
may do if the input is far away from the training distribution. Unfortunately the
network can sometimes be highly confident yet wrong.
To conclude, convolutional networks are very good at recognising objects in
images. But we should not imagine that they understand what they see in the same
way as Humans. The theory of deep learning has somewhat lagged behind the
performance in practice. But some progress has been made in recent years, and
there are many interesting open questions.

7.8 Further reading


What do the hidden layers in a convolutional layer actually compute? Feature maps
that are directly coupled to the inputs detect local features, such as edges or corners.

Yet it is unclear precisely how hidden convolutional layers help the network to learn.
Therefore it is interesting to visualise the activity of deep layers by asking: which
input patterns maximise the outputs of the neurons in a certain layer [82]?
Another question concerns the structure of the energy landscape. It seems that
local minima are perhaps less important for deep networks, because their energy
functions tend to have more saddle points than minima [83].
Deep networks suffer from catastrophic forgetting: when you train a network on a
new input distribution that is quite different from the one the network was originally
trained on, then the network tends to forget what it learned initially. Recently there
has been much interest in this question. A good starting point is Ref. [84].
The stochastic-gradient descent algorithm (with or without minibatches) samples
the input-data distribution uniformly randomly. As mentioned in Section 6.3.1, it
may be advantageous to sample those inputs more frequently that initially cause
larger output errors. More generally, the algorithm may use other criteria to choose
certain input data more often, with the goal to speed up learning. It may even
suggest how to augment a given training set most efficiently, by asking to specifically
label certain types of input data (active learning) [85].
For connections to Mathematical Statistics (multinomial and binary logistic
regression), start with Ref. [4].

7.9 Exercises
Decision boundaries for XOR problem. Figure 7.5 shows the layout of a network
that solves the Boolean XOR problem. Draw the decision boundaries for the four
hidden neurons in the input plane, and label the boundaries and the regions as in
Figure 5.12.

Vanishing-gradient problem. Train the network shown in Figure 7.7 on the iris data
set, available from the Machine learning repository of the University of California
Irvine. Measure how strongly training affects the neurons in the different layers, by calculating
the derivative of the energy function H w.r.t. the thresholds of the neurons in
question.

7.10 Exam questions


7.10.1 Parity function
The parity function outputs 1 if and only if the input sequence of n binary numbers
has an odd number of ones, and zero otherwise. The parity function for n = 2 is also

known as the Boolean XOR function.


(a) The XOR function can be represented using a multilayer perceptron with two
inputs, a fully connected hidden layer with two hidden neurons, and one output
unit. The activation function is the Heaviside function:
\[
\theta_H(b) = \begin{cases} 1 & \text{for } b > 0 ,\\ 0 & \text{for } b \leq 0 \end{cases}
\tag{7.55}
\]
for all layers. Determine suitable weight vectors w j and thresholds θ j for the two
hidden units ( j = 1, 2), as well as the weight vector W 1 and the threshold Θ1 for the
output unit. (0.5p)
(b) Illustrate the problem graphically in the input space, and indicate the planes
determined by the weight vectors w j and thresholds θ j that you determined in (a).
In a separate graph, illustrate the transformed input data in the hidden space and
draw the line determined by the weight vector W 1 and the threshold Θ1 . (0.5p)
(c) Describe how you can combine several of the small XOR multilayer perceptrons
analysed in (a)-(b) to create a deep network that computes the parity function for
n > 2. Explain how the total number of nodes in the network grows with the input
dimension n . (1p)

7.10.2 Softmax outputs


Consider a perceptron with L layers and softmax output units. For pattern µ, the
state of the i th output neuron is given by
\[
O_i^{(\mu)} = \frac{e^{b_i^{(L,\mu)}}}{\sum_m e^{b_m^{(L,\mu)}}} ,
\tag{7.56}
\]
where $b_m^{(L,\mu)}$ denotes the local field of the $m$th output neuron:
\[
b_m^{(L,\mu)} = -\theta_m^{(L)} + \sum_k w_{mk}^{(L)} V_k^{(L-1,\mu)} .
\tag{7.57}
\]
Here $\theta_m^{(L)}$ and $w_{mk}^{(L)}$ are thresholds and weights, and $V_k^{(L-1,\mu)}$ is the state of the $k$th
neuron in layer $L-1$, evaluated for pattern $\mu$.
(a) Compute the derivative of output $O_i^{(\mu)}$ with respect to the local field $b_j^{(L,\mu)}$ of the
$j$th output neuron. (1p).
(b) The network is trained by gradient descent on the negative log-likelihood func-
tion,
\[
H = -\sum_{i\mu} t_i^{(\mu)} \log O_i^{(\mu)} .
\tag{7.58}
\]


The summation is over all patterns in the training set and over all output neurons,
the logarithm is the natural logarithm, and $t_i^{(\mu)}$ denote targets. The targets satisfy
the constraint
\[
\sum_i t_i^{(\mu)} = 1
\tag{7.59}
\]
for all patterns $\mu$. When updating, the increment of a weight $w_{nq}^{(\ell)}$ in layer $\ell$ is given
by
\[
\delta w_{nq}^{(\ell)} = -\eta\, \frac{\partial H}{\partial w_{nq}^{(\ell)}} ,
\tag{7.60}
\]
where $\eta$ denotes the learning rate. Derive the increment for weight $w_{nq}^{(L)}$ in layer $L$.
(1p).
Figure 8.1: Network with a feedback connection. Neurons 1 and 2 are hidden neurons.
The weights from the input $x_k$ to the neurons $V_i$ are denoted by $w_{ik}^{(vx)}$, the weight from
neuron $V_j$ to neuron $V_i$ is $w_{ij}^{(vv)}$. Neurons 3 and 4 are output neurons, with prescribed
target values $y_i$. To avoid confusion with the iteration index $t$, the targets are denoted
by $y$ in this Chapter.

8 Recurrent networks
The layout of the perceptrons analysed in the previous Chapters is special. All
connections are one way, and only to the layer immediately to the right, so that the
update rule for the $i$-th neuron in layer $\ell$ becomes
\[
V_i^{(\ell)} = g\Bigl(\sum_j w_{ij}^{(\ell)} V_j^{(\ell-1)} - \theta_i^{(\ell)}\Bigr) .
\tag{8.1}
\]
The backpropagation algorithm relies on this feed-forward layout. It means that
the derivatives $\partial V_j^{(\ell-1)}/\partial w_{mn}^{(\ell)}$ vanish. This ensures that the outputs are nested
functions of the inputs, which in turn implies the simple iterative structure of the
backpropagation algorithm on page 89.
In some cases it is necessary or convenient to use networks that do not have this
simple layout. The Hopfield networks discussed in part I are examples where all
connections are symmetric. More general networks may have a feed-forward layout
with feedbacks, as shown in Figure 8.1. Such networks are called recurrent networks.
There are many different ways in which the feedbacks can act: from the output
layer to hidden neurons for example (Figure 8.1), or there could be connections
between the neurons in a given layer. Neurons 3 and 4 in Figure 8.1 are output units,
they are associated with targets just as in Chapters 5 to 7. The layout of recurrent
networks is very general, but because of the feedback links we must consider how
such networks can be trained.
Unlike the multi-layer perceptrons, recurrent networks are commonly used as
dynamical networks, just as in Chapters 1 and 2 [c.f. Equation (1.4)]. The dynamics

can either be discrete,
\[
V_i(t) = g\Bigl(\sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Bigr) \quad\text{for } t = 1, 2, \ldots ,
\tag{8.2}
\]
or continuous,
\[
\tau \frac{\mathrm{d}V_i}{\mathrm{d}t} = -V_i + g\Bigl(\sum_j w_{ij}^{(vv)} V_j(t) + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Bigr) ,
\tag{8.3}
\]
with time constant $\tau$. The parameters $\theta_i^{(v)}$ are thresholds. We shall see in a moment
why it can be advantageous to use a dynamical network.
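
As an illustration, the discrete dynamics (8.2) of a small recurrent network can be iterated
in MATLAB as follows. The weights, sizes, and the use of tanh as activation function are
assumptions made for illustration; convergence to a steady state is not guaranteed in general.

    % Sketch of the discrete dynamics (8.2), iterated towards a steady state.
    rng(1);
    N = 4; nIn = 2;
    wvv = 0.2*randn(N, N);             % recurrent weights w^(vv)
    wvx = 0.5*randn(N, nIn);           % input weights w^(vx)
    theta = zeros(N, 1);               % thresholds
    x = randn(nIn, 1);                 % one input pattern
    V = zeros(N, 1);                   % initial neuron states

    for t = 1:200
        Vnew = tanh(wvv*V + wvx*x - theta);
        if max(abs(Vnew - V)) < 1e-9   % states no longer change: steady state V*
            V = Vnew; break
        end
        V = Vnew;
    end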
Recurrent networks can learn in different ways. One possibility is to use a training
set of pairs (x (µ) , y (µ) ) with µ = 1, . . . , p . To avoid confusion with the iteration index
t , the targets are denoted by y in this Chapter. One feeds a pattern from this set
and runs the dynamics (8.2) or (8.3) for the given x (µ) until the dynamics reaches a
steady state V ∗ (if this does not happen the training fails). Then one updates the
weights by gradient descent using the energy function
\[
H = \frac{1}{2}\sum_k E_k^2 \quad\text{where}\quad
E_k = \begin{cases} y_k - V_k & \text{if } V_k \text{ is an output unit,}\\ 0 & \text{otherwise,} \end{cases}
\tag{8.4}
\]
evaluated at $V = V^*$, that is $H^* = \tfrac{1}{2}\sum_k (E_k^*)^2$ with $E_k^* = y_k - V_k^*$. Instead of defining

the energy function in terms of the mean-squared output errors, one could also use
the negative log-likelihood function (7.32). These steps are repeated until the steady-
state outputs yield the correct targets for all input patterns. This is reminiscent of the
algorithms discussed in Chapters 5 to 7, and we shall see that the backpropagation
algorithm can be modified (recurrent backpropagation) to make the networks learn
as described earlier.
Another possibility is that inputs and targets change as functions of time t while
the network dynamics runs. In this way the network can solve temporal association
tasks where it learns to output certain targets in response to the sequence x (t )
of input patterns, and targets y (t ). In this way recurrent networks can translate
written text or recognise speech. Such networks can be trained by unfolding their
dynamics in time as explained in Section 8.2 (backpropagation in time), although
this algorithm suffers from the vanishing-gradient problem discussed in Chapter 7.

8.1 Recurrent backpropagation


Recall Figure 8.1. We want to train a network with N real-valued units Vi with
sigmoid activation functions, and weights wi j from V j to Vi . Several of the units

may be connected to inputs $x_k^{(\mu)}$. Other units are output units with associated target
values $y_i^{(\mu)}$. We take the dynamics to be continuous in time, Equation (8.3), and
assume that the dynamics runs into a steady state,
\[
V(t) \to V^* \quad\text{so that}\quad \frac{\mathrm{d}V_i^*}{\mathrm{d}t} = 0 .
\tag{8.5}
\]
From Equation (8.3) we deduce
\[
V_i^* = g\Bigl(\sum_j w_{ij}^{(vv)} V_j^* + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Bigr) .
\tag{8.6}
\]

In other words we assume that the dynamics (8.3) has a stable steady state, so that
small perturbations δVi away from Vi ∗ decay with time. Equation (8.6) is a nonlinear
self-consistent Equation for Vi ∗ , in general difficult to solve. However, if the fixed
points Vi ∗ are stable then we can use the dynamics (8.3) to automatically pick out the
steady-state solution V ∗ . This solution depends on the pattern x (µ) , but in Equations
(8.5) and (8.6) and also in the following I have left out the superscript (µ).
The goal is to find weights so that the outputs give the correct target values in
the steady state, those associated with $x^{(\mu)}$. To this end we use stochastic gradient
descent on the energy function (8.4). Consider first how to update the weights $w_{ij}^{(vv)}$.
We must evaluate
\[
\delta w_{mn}^{(vv)} = -\eta\, \frac{\partial H}{\partial w_{mn}^{(vv)}} = \eta \sum_k E_k^*\, \frac{\partial V_k^*}{\partial w_{mn}^{(vv)}} .
\tag{8.7}
\]
To calculate the gradients of $V^*$ we use Equation (8.6):
\[
\frac{\partial V_i^*}{\partial w_{mn}^{(vv)}} = g'(b_i^*)\, \frac{\partial b_i^*}{\partial w_{mn}^{(vv)}}
= g'(b_i^*)\Bigl(\delta_{im} V_n^* + \sum_j w_{ij}^{(vv)} \frac{\partial V_j^*}{\partial w_{mn}^{(vv)}}\Bigr) ,
\tag{8.8}
\]
where $b_i^* = \sum_j w_{ij}^{(vv)} V_j^* + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}$. Equation (8.8) is a self-consistent
equation for the gradient, as opposed to the explicit equations we found in Chapters 5 to
7. The reason for the difference is that the recurrent network has feedbacks. Since
Equation (8.8) is linear in the gradients, we can solve it by matrix inversion. To this
end, define the matrix L with elements $L_{ij} = \delta_{ij} - g'(b_i^*)\, w_{ij}^{(vv)}$. Equation (8.8) can
be written as
\[
\sum_j L_{ij}\, \frac{\partial V_j^*}{\partial w_{mn}^{(vv)}} = \delta_{im}\, g'(b_i^*)\, V_n^* .
\tag{8.9}
\]
Applying $\sum_i \bigl[L^{-1}\bigr]_{ki}$ to both sides we find
\[
\frac{\partial V_k^*}{\partial w_{mn}^{(vv)}} = \bigl[L^{-1}\bigr]_{km}\, g'(b_m^*)\, V_n^* .
\tag{8.10}
\]

Inserting this result into (8.7) we finally obtain for the weight increments:
\[
\delta w_{mn}^{(vv)} = \eta \sum_k E_k^* \bigl[L^{-1}\bigr]_{km}\, g'(b_m^*)\, V_n^* .
\tag{8.11}
\]
This learning rule can be written in the form of the backpropagation rule by introducing
the error
\[
\Delta_m^* = g'(b_m^*) \sum_k E_k^* \bigl[L^{-1}\bigr]_{km} .
\tag{8.12}
\]
Then the learning rule (8.11) takes the form
\[
\delta w_{mn}^{(vv)} = \eta\, \Delta_m^* V_n^* ,
\tag{8.13}
\]

compare Equation (6.11). A problem is that a matrix inversion is required to compute
the errors $\Delta_m^*$, an expensive operation. But as outlined in Chapter 5, we can try to find
the inverse iteratively. We can find a linear differential equation for
\[
\Delta_i^* = g'(b_i^*) \sum_k E_k^* \bigl[L^{-1}\bigr]_{ki}
\tag{8.14}
\]

that does not involve the inverse of L. I used the index i here because it makes the
following calculation a bit easier to follow. The first step is to multiply both sides of
Equation (8.14) with $L_{ij}/g'(b_i^*)$ and to sum over $i$. This gives
\[
\sum_i \Delta_i^*\, L_{ij}/g'(b_i^*) = \sum_{ki} E_k^* \bigl[L^{-1}\bigr]_{ki} L_{ij} = E_j^* .
\tag{8.15}
\]

Using $L_{ij}/g'(b_i^*) = \delta_{ij}/g'(b_j^*) - w_{ij}^{(vv)}$ we find
\[
\sum_i \Delta_i^* \bigl(\delta_{ij} - w_{ij}^{(vv)} g'(b_j^*)\bigr) = g'(b_j^*)\, E_j^* .
\tag{8.16}
\]

The trick is now to write down a dynamical equation for ∆i that has a steady state
at the solution of Equation (8.16):
\[
\tau \frac{\mathrm{d}}{\mathrm{d}t}\Delta_j = -\Delta_j + \sum_i \Delta_i\, w_{ij}^{(vv)} g'(b_j^*) + g'(b_j^*)\, E_j^* .
\tag{8.17}
\]

Compare this with the dynamical rule (8.3). Equations (8.3) and (8.17) exhibit the
same duality as Algorithm 2, between forward propagation of states of neurons
(step 5) and backpropagation of errors (step 9). The sum in Equation (8.17) has the
same form as the recursion for the errors in Algorithm 2 (step 9), except that there
are no layer indices ` here.

The solution of (8.16) is a fixed point of this Equation. But is it stable? To decide
this we linearise the dynamical equations (8.3) and (8.17). To this end we write

\[
V_i(t) = V_i^* + \delta V_i(t) \quad\text{and}\quad \Delta_i(t) = \Delta_i^* + \delta\Delta_i(t) ,
\tag{8.18}
\]
and insert this ansatz into (8.3) and (8.17). To leading order we find:
\[
\tau \frac{\mathrm{d}}{\mathrm{d}t}\,\delta V_i = -\delta V_i + g'(b_i^*) \sum_j w_{ij}^{(vv)}\, \delta V_j = -\sum_j L_{ij}\, \delta V_j ,
\tag{8.19}
\]
\[
\tau \frac{\mathrm{d}}{\mathrm{d}t}\,\delta\Delta_j = -\delta\Delta_j + \sum_i \delta\Delta_i\, w_{ij}^{(vv)} g'(b_j^*) = -\sum_i \delta\Delta_i\, L_{ij}\, g'(b_j^*)/g'(b_i^*) .
\tag{8.20}
\]
Since the matrices with elements $L_{ij}$ and $L_{ij}\, g'(b_j^*)/g'(b_i^*)$ have the same eigenvalues,
$\Delta_i^*$ is a stable fixed point of (8.17) if $V_n^*$ is a stable fixed point of (8.3). This was
assumed in the beginning, Equation (8.5). If this assumption does not hold, the
algorithm does not converge.
Now consider the update formula for the weights $w_{mn}^{(vx)}$ from the inputs:
\[
\delta w_{mn}^{(vx)} = -\eta\, \frac{\partial H}{\partial w_{mn}^{(vx)}} = \eta \sum_k E_k^*\, \frac{\partial V_k^*}{\partial w_{mn}^{(vx)}} ,
\tag{8.21a}
\]
\[
\frac{\partial V_i^*}{\partial w_{mn}^{(vx)}} = g'(b_i^*)\Bigl(\delta_{im}\, x_n + \sum_j w_{ij}^{(vv)} \frac{\partial V_j^*}{\partial w_{mn}^{(vx)}}\Bigr) .
\tag{8.21b}
\]
Equation (8.21) is analogous to Equation (8.8). Consequently
\[
\delta w_{mn}^{(vx)} = \eta\, \Delta_m^* x_n .
\tag{8.22}
\]

The algorithm for recurrent backpropagation is summarised in Algorithm 9.

Algorithm 9 recurrent backpropagation

1: initialise all weights;
2: for t = 1, . . . , T do
3: choose a value of µ and apply x(µ) to the inputs;
4: find V_n∗ by relaxing τ dV_n/dt = −V_n + g(Σ_j w_nj^(vv) V_j + Σ_k w_nk^(vx) x_k − θ_n^(v));
5: compute E_k∗ = y_k − V_k∗ for all output units;
6: find ∆_m∗ by relaxing τ d∆_m/dt = −∆_m + Σ_j ∆_j w_jm^(vv) g′(b_m∗) + g′(b_m∗) E_m∗;
7: update all weights: w_mn^(vv) ← w_mn^(vv) + δw_mn^(vv) with δw_mn^(vv) = η∆_m∗ V_n∗, and w_mn^(vx) ← w_mn^(vx) + δw_mn^(vx) with δw_mn^(vx) = η∆_m∗ x_n;
8: end for
9: end;
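To make Algorithm 9 concrete, here is a minimal NumPy sketch for a small network. Everything that is not specified in the text — the network size, the toy pattern and targets, the learning rate, and the explicit Euler scheme used for the relaxation steps — is an assumption chosen for illustration, not part of the algorithm itself.

import numpy as np

# Minimal sketch of Algorithm 9 (recurrent backpropagation) for one pattern.
# N hidden neurons V, M inputs x; the first K neurons are read out as outputs.
rng = np.random.default_rng(0)
N, M, K = 4, 3, 2
wvv = rng.normal(0, 0.1, (N, N))      # recurrent weights w^(vv)
wvx = rng.normal(0, 0.1, (N, M))      # input weights w^(vx)
theta = np.zeros(N)                   # thresholds theta^(v)
g = np.tanh
def gprime(b):                        # derivative of the activation function
    return 1.0 - np.tanh(b)**2

x = np.array([1.0, 0.5, -1.0])        # one input pattern (assumption)
y = np.array([0.5, -0.5])             # targets for the K output neurons
eta, tau, dt, steps = 0.05, 1.0, 0.1, 500

for epoch in range(100):
    # step 4: relax V to its steady state V*
    V = np.zeros(N)
    for _ in range(steps):
        b = wvv @ V + wvx @ x - theta
        V += (dt / tau) * (-V + g(b))
    b = wvv @ V + wvx @ x - theta

    # step 5: output errors E_k* (E = 0 for the neurons that are not outputs)
    E = np.zeros(N)
    E[:K] = y - V[:K]

    # step 6: relax the errors Delta to their steady state Delta*
    Delta = np.zeros(N)
    for _ in range(steps):
        Delta += (dt / tau) * (-Delta + (Delta @ wvv) * gprime(b) + gprime(b) * E)

    # step 7: weight increments, Equations (8.13) and (8.22)
    wvv += eta * np.outer(Delta, V)
    wvx += eta * np.outer(Delta, x)

The two relaxation loops mirror steps 4 and 6 of Algorithm 9; if the fixed point V∗ is unstable, neither loop converges, in line with the stability discussion above.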
Figure 8.2: Left: recurrent network with one hidden neuron (green) and one output neuron (blue). The input terminal is drawn red. Right: the same network, unfolded in time. Here the time arguments are written as subscripts (see Section 8.2). The weights w^(vv) remain unchanged as drawn; the weights w^(vx) and w^(ov) also remain unchanged (not drawn).

8.2 Backpropagation through time


Recurrent networks can be used to learn sequential inputs, as in speech recognition
and machine translation. The training set is a time sequence of inputs and targets
[x (t ), y (t )]. The network is trained on the sequence and learns to predict the targets.
In this context the layout is changed a little bit compared with the one described in
the previous Section. There are two main differences. Firstly, the inputs and targets
depend on t and one uses a discrete-time update rule. Secondly, separate output
units Oi (t ) are added to the layout. The update rule takes the form
\[
V_i(t) = g\Big( \sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k(t) - \theta_i^{(v)} \Big) \,, \tag{8.23a}
\]
\[
O_i(t) = g\Big( \sum_j w_{ij}^{(ov)} V_j(t) - \theta_i^{(o)} \Big) \,. \tag{8.23b}
\]

The activation function of the outputs Oi can be different from that of the hidden
neurons V j . Often the softmax function is used for the outputs [86, 87].
To train recurrent networks with time-dependent inputs and targets and with
the dynamics (8.23) one uses backpropagation through time. The idea is to unfold
the network in time to get rid of the feedbacks, at the expense of as many copies
of the original neurons as there are time steps. This is illustrated in Figure 8.2 for
a recurrent network with one hidden neuron, one input, and one output. The
unfolded network has T inputs and outputs. It can be trained in the usual way with
stochastic gradient descent. The errors are calculated using backpropagation as in
Algorithm 2, but here the error is propagated back in time, not from layer to layer.
The energy function is the squared error summed over all time steps

\[
H = \frac{1}{2} \sum_{t=1}^{T} E_t^2 \quad\text{with}\quad E_t = y_t - O_t \,. \tag{8.24}
\]

One can also use the negative log-likelihood function (7.26).


Note that I have written the time argument as a subscript. Consider first how to
update the weight w (v v ) . The gradient-descent rule (5.26) gives a result that is of the
same form as Equation (8.7):

\[
\delta w^{(vv)} = \eta \sum_{t=1}^{T} E_t \frac{\partial O_t}{\partial w^{(vv)}} = \eta \sum_{t=1}^{T} \Delta_t\, w^{(ov)} \frac{\partial V_t}{\partial w^{(vv)}} \,. \tag{8.25}
\]

Here ∆t = E t g 0 (Bt ) is an output error, Bt = w (o v ) Vt − θ (o ) is the local field of the


output neuron at time t [Equation (8.23)], and ∂ Vt /∂ w (v v ) is evaluated with the
chain rule, as usual. Equation (8.23a) yields the recursion

∂ Vt €
(v v ) ∂ Vt −1
Š
= g 0
(b t ) Vt −1 + w (8.26)
∂ w (v v ) ∂ w (v v )

for t ≥ 1. Since ∂ V0 /∂ w (v v ) = 0 we have:

\[
\begin{aligned}
\frac{\partial V_1}{\partial w^{(vv)}} &= g'(b_1) V_0 \,,\\
\frac{\partial V_2}{\partial w^{(vv)}} &= g'(b_2) V_1 + g'(b_2) w^{(vv)} g'(b_1) V_0 \,,\\
\frac{\partial V_3}{\partial w^{(vv)}} &= g'(b_3) V_2 + g'(b_3) w^{(vv)} g'(b_2) V_1 + g'(b_3) w^{(vv)} g'(b_2) w^{(vv)} g'(b_1) V_0 \,,\\
&\;\;\vdots\\
\frac{\partial V_{T-1}}{\partial w^{(vv)}} &= g'(b_{T-1}) V_{T-2} + g'(b_{T-1}) w^{(vv)} g'(b_{T-2}) V_{T-3} + \ldots \,,\\
\frac{\partial V_T}{\partial w^{(vv)}} &= g'(b_T) V_{T-1} + g'(b_T) w^{(vv)} g'(b_{T-1}) V_{T-2} + \ldots \,.
\end{aligned}
\]

Equation (8.25) says that we must sum over t . Regrouping the terms in this sum
yields:
\[
\begin{aligned}
\Delta_1 \frac{\partial V_1}{\partial w^{(vv)}} &+ \Delta_2 \frac{\partial V_2}{\partial w^{(vv)}} + \Delta_3 \frac{\partial V_3}{\partial w^{(vv)}} + \ldots\\
&= \big[\Delta_1 g'(b_1) + \Delta_2 g'(b_2) w^{(vv)} g'(b_1) + \Delta_3 g'(b_3) w^{(vv)} g'(b_2) w^{(vv)} g'(b_1) + \ldots\big]\, V_0\\
&\;+ \big[\Delta_2 g'(b_2) + \Delta_3 g'(b_3) w^{(vv)} g'(b_2) + \Delta_4 g'(b_4) w^{(vv)} g'(b_3) w^{(vv)} g'(b_2) + \ldots\big]\, V_1\\
&\;+ \big[\Delta_3 g'(b_3) + \Delta_4 g'(b_4) w^{(vv)} g'(b_3) + \Delta_5 g'(b_5) w^{(vv)} g'(b_4) w^{(vv)} g'(b_3) + \ldots\big]\, V_2\\
&\;\;\vdots\\
&\;+ \big[\Delta_{T-1} g'(b_{T-1}) + \Delta_T g'(b_T) w^{(vv)} g'(b_{T-1})\big]\, V_{T-2}\\
&\;+ \big[\Delta_T g'(b_T)\big]\, V_{T-1} \,.
\end{aligned}
\]

To write the learning rule in the usual form, we define errors δ_t recursively:
\[
\delta_t = \begin{cases}
\Delta_T\, w^{(ov)} g'(b_T) & \text{for } t = T \,,\\
\Delta_t\, w^{(ov)} g'(b_t) + \delta_{t+1}\, w^{(vv)} g'(b_t) & \text{for } 0 < t < T \,.
\end{cases} \tag{8.27}
\]

Then the learning rule takes the form


\[
\delta w^{(vv)} = \eta \sum_{t=1}^{T} \delta_t\, V_{t-1} \,, \tag{8.28}
\]

just like Equation (6.10), or like the recursion in step 9 of Algorithm 2.


The factor w (v v ) g 0 (bt −1 ) in the recursion (8.27) gives rise to a product of many
such factors in δt when T is large, exactly as described in Section 7.2.1 for multilayer
perceptrons. This means that the training of recurrent nets suffers from unstable
gradients, as backpropagation of multilayer perceptrons does (Section 7.2.1). If the
factors |w (v v ) g 0 (bp )| are smaller than unity then the errors δt become very small
when t becomes small (vanishing-gradient problem). This means that the early
states of the hidden neuron no longer contribute to the learning, causing the network
to forget what it has learned about early inputs. When |w (v v ) g 0 (bp )| > 1, on the other
hand, exploding gradients make learning impossible. In summary, unstable gradients arise in recurrent neural networks in much the same way as in multilayer
perceptrons (Section 7.2.1). The resulting difficulties for training recurrent neural
networks are discussed in more detail in Ref. [88].
A slight variation of the above algorithm (truncated backpropagation through
time) suffers less from the exploding-gradient problem. The idea is that the explod-
ing gradients are tamed by truncating the memory. This is achieved by limiting the
error propagation backwards in time: errors are computed back to T − τ and not
further, where τ is the truncation time [2]. Naturally this implies that long-time
correlations cannot be learnt.
Finally, the update formulae for the weights w (v x ) are obtained in a similar fashion.
Equation (8.23a) yields the recursion

\[
\frac{\partial V_t}{\partial w^{(vx)}} = g'(b_t)\Big( x_t + w^{(vv)} \frac{\partial V_{t-1}}{\partial w^{(vx)}} \Big) \,. \tag{8.29}
\]
This looks just like Equation (8.26), except that Vt −1 is replaced by x t . As a conse-
quence we have
\[
\delta w^{(vx)} = \eta \sum_{t=1}^{T} \delta_t\, x_t \,. \tag{8.30}
\]
The update formula for w^(ov) is simpler to derive. From Equation (8.23b) we find, by differentiation with respect to w^(ov):
\[
\delta w^{(ov)} = \eta \sum_{t=1}^{T} E_t\, g'(B_t)\, V_t \,. \tag{8.31}
\]

How are the thresholds updated? Going through the above derivation we see that
we must replace Vt −1 and x t in Equations (8.28) and (8.30) by −1. It works in the
same way for the output threshold.
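The forward and backward passes above can be written out directly in code. The following NumPy sketch implements Equations (8.23)–(8.31) for the single-neuron network of this Section; the toy task (predicting the previous input), the zero thresholds, and all hyperparameters are assumptions for illustration only.

import numpy as np

# Backpropagation through time for one hidden and one output neuron,
# Equations (8.23)-(8.31). Toy data and parameters are assumptions.
rng = np.random.default_rng(1)
T = 20
x = rng.normal(size=T + 1)            # inputs x[1..T]; index 0 unused
y = np.roll(x, 1)                     # toy targets: predict the previous input
wvv, wvx, wov = 0.1, 0.5, 0.5         # weights w^(vv), w^(vx), w^(ov)
g = np.tanh
gp = lambda b: 1.0 - np.tanh(b)**2
eta = 0.01

for epoch in range(200):
    # forward pass, Equation (8.23): unfold the network in time
    V, b, B, O = (np.zeros(T + 1) for _ in range(4))
    for t in range(1, T + 1):
        b[t] = wvv * V[t - 1] + wvx * x[t]    # zero thresholds assumed
        V[t] = g(b[t])
        B[t] = wov * V[t]
        O[t] = g(B[t])
    E = y - O                          # errors E_t, Equation (8.24)
    Delta = E * gp(B)                  # output errors Delta_t

    # backward pass: errors delta_t from the recursion (8.27)
    delta = np.zeros(T + 2)
    for t in range(T, 0, -1):
        delta[t] = Delta[t] * wov * gp(b[t]) + delta[t + 1] * wvv * gp(b[t])

    # weight increments, Equations (8.28), (8.30), and (8.31)
    wvv += eta * np.sum(delta[1:T + 1] * V[0:T])
    wvx += eta * np.sum(delta[1:T + 1] * x[1:T + 1])
    wov += eta * np.sum(E[1:T + 1] * gp(B[1:T + 1]) * V[1:T + 1])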
In order to keep the formulae simple, I only described the algorithm for a single
hidden and a single output neuron, so that I could leave out the indices referring to
different hidden neurons and/or different output components. You can add those
indices yourself; the structure of the Equations remains exactly the same, save for a
number of extra sums over those indices:
\[
\delta w_{mn}^{(vv)} = \eta \sum_{t=1}^{T} \delta_m^{(t)}\, V_n^{(t-1)} \,, \tag{8.32}
\]
\[
\delta_j^{(t)} = \begin{cases}
\sum_i \Delta_i^{(t)}\, w_{ij}^{(ov)}\, g'(b_j^{(t)}) & \text{for } t = T \,,\\[1ex]
\sum_i \Delta_i^{(t)}\, w_{ij}^{(ov)}\, g'(b_j^{(t)}) + \sum_i \delta_i^{(t+1)}\, w_{ij}^{(vv)}\, g'(b_j^{(t)}) & \text{for } 0 < t < T \,.
\end{cases}
\]

The second term in the recursion for δ_j^(t) is analogous to the recursion in step 9 of Algorithm 2. The time index t here plays the role of the layer index ℓ in Algorithm 2.
A difference is that the weights in Equation (8.32) are the same for all time steps.
In summary you see that backpropagation through time for recurrent networks is
similar to backpropagation for multilayer perceptrons. After the recurrent network is
unfolded to get rid of the feedback connections it can be trained by backpropagation.
The time index t takes the role of the layer index ℓ. Backpropagation through time is
the standard approach for training recurrent nets, despite the fact that it suffers from
the vanishing-gradient problem. The next Section describes how improvements to
the layout make it possible to efficiently train recurrent networks.
Figure 8.3: Schematic illustration of unfolded recurrent network for machine transla-
tion, after Refs. [86, 87]. The green rectangular boxes represent the hidden states in the
form of long short term memory units (LSTM). Otherwise the network layout is like the
one shown in Figure 8.2. Sutskever et al. [86] found that the network translates much
better if the sentence is read in reverse order, from the end. The tag <EOS> denotes the
end-of-sentence tag. Here it denotes the beginning of the sentence.

8.3 Recurrent networks for machine translation


Recurrent networks are used for machine translation [87]. How does this work?

Basic network layout

The networks are trained using backpropagation through time. The vanishing-
gradient problem is dealt with by improved network layouts. Hochreiter and Schmid-
huber [89] suggested to replace the hidden neurons of the recurrent network with
computation units that are specially designed to eliminate the vanishing-gradient
problem. The method is referred to as long short-term memory (LSTM). The basic
ingredient is the same as in residual networks (Section 7.6): short cuts reduce the
vanishing-gradient problem. For our purposes we can think of LSTMs as units that
replace the hidden neurons.

Representation of inputs and outputs

How are the network inputs and outputs represented? For machine translation
one must represent words in the dictionary in terms of a code. The simplest code
is a binary code where 100. . . represents the first word in the dictionary, 010. . . the
second word, and so forth. Each input is a vector with as many components as there
are words in the dictionary. A sentence corresponds to a sequence x 1 , x 2 , . . . , x T .
Each sentence ends with an end-of-sentence tag, <EOS>. Softmax outputs give
the probability p (O 1 , . . . ,O T 0 |x 1 , . . . , x T ) of an output sequence conditional on the
input sequence. The translated sentence is the one with the highest probability
(it also contains the end-of-sentence tag <EOS>). So both inputs and outputs are
Figure 8.4: Schematic illustration of a bidirectional recurrent network . The net con-
sists of two hidden states that are unfolded in different ways. The hidden states are
represented by LSTMs.

represented by high-dimensional vectors x t and O t . Other encoding schemes are


described in Ref. [87].
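As a concrete illustration of this encoding, here is a small Python sketch. The tiny dictionary and the example sentence are assumptions; real translation systems use dictionaries with tens of thousands of words.

import numpy as np

# One-hot ("binary") code for words, as described above. The dictionary and
# the example sentence are assumptions for illustration only.
dictionary = ["<EOS>", "the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(dictionary)}

def encode(sentence):
    """Return a sequence x_1, ..., x_T of one-hot vectors, ending with <EOS>."""
    vectors = []
    for word in sentence.split() + ["<EOS>"]:
        x = np.zeros(len(dictionary))
        x[index[word]] = 1.0
        vectors.append(x)
    return np.array(vectors)

X = encode("the cat sat on the mat")
print(X.shape)            # (7, 6): seven time steps, six-dimensional inputs
X_reversed = X[::-1]      # Sutskever et al. [86]: feeding the sentence in
                          # reverse order improves the translation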
What is the role of the hidden states, represented in terms of an LSTM? The net-
work encodes the input sequence x 1 , x 2 , . . . , x T in these states. Upon encountering
the <EOS> tag in the input sequence, the network outputs the first word of the
translated sentence using the information about the input sequence stored in V T
as shown in Figure 8.3. The first output is fed into the next input, and the network
continues to translate until it produces an <EOS> tag for the output sequence. In
short, the network calculates the probabilities

\[
p(O_1, \ldots, O_{T'} \,|\, \boldsymbol{x}_1, \ldots, \boldsymbol{x}_T) = \prod_{t=1}^{T'} p(O_t \,|\, O_1, \ldots, O_{t-1};\, \boldsymbol{x}_1, \ldots, \boldsymbol{x}_T) \,, \tag{8.33}
\]
where p(O_t | O_1, . . . , O_{t−1}; x_1, . . . , x_T) is the probability of the next word in the output sequence given the inputs and the output sequence up to O_{t−1} [7].

Advanced layouts

There is a large number of recent papers on machine translation with recurrent neu-
ral nets. Most studies are based on the training algorithm described in Section 8.2,
backpropagation through time. Different algorithms mainly differ in their network
layouts. Google’s machine translation system uses a deep network with layers of
LSTMs [7]. Different hidden states are unfolded forward as well as backwards in
time, as illustrated in Figure 8.4. In this Figure the hidden states are represented by
LSTMs. In the simplest case the hidden states are just encoded in hidden neurons,
as in Figure 8.2 and Equation (8.23). If we represent the hidden states by neurons,
as in Section 8.2, then the corresponding bidirectional network has the dynamics

\[
\begin{aligned}
V_i(t) &= g\Big( \sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k(t) - \theta_i^{(v)} \Big) \,,\\
U_i(t) &= g\Big( \sum_j w_{ij}^{(uu)} U_j(t+1) + \sum_k w_{ik}^{(ux)} x_k(t) - \theta_i^{(u)} \Big) \,,\\
O_i(t) &= g\Big( \sum_j w_{ij}^{(ov)} V_j(t) + \sum_j w_{ij}^{(ou)} U_j(t) - \theta_i^{(o)} \Big) \,.
\end{aligned} \tag{8.34}
\]

It is natural to use bidirectional nets for machine translation because correlations


go either way in a sentence, forward and backwards. In German, for example, the
finite verb form is usually at the end of the sentence.
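The following NumPy sketch evaluates one forward and one backward sweep of Equation (8.34), with plain hidden neurons instead of LSTMs. The sizes, the random weights, and the zero thresholds are assumptions chosen for illustration.

import numpy as np

# One evaluation of the bidirectional dynamics (8.34), with plain hidden
# neurons instead of LSTMs. Sizes and random weights are assumptions.
rng = np.random.default_rng(2)
T, Nin, Nh, Nout = 5, 3, 4, 2
x = rng.normal(size=(T + 2, Nin))            # inputs x(1..T); padding at 0, T+1
wvv, wvx = rng.normal(0, 0.3, (Nh, Nh)), rng.normal(0, 0.3, (Nh, Nin))
wuu, wux = rng.normal(0, 0.3, (Nh, Nh)), rng.normal(0, 0.3, (Nh, Nin))
wov, wou = rng.normal(0, 0.3, (Nout, Nh)), rng.normal(0, 0.3, (Nout, Nh))
g = np.tanh

V = np.zeros((T + 2, Nh))                    # forward hidden states V(t)
U = np.zeros((T + 2, Nh))                    # backward hidden states U(t)
O = np.zeros((T + 2, Nout))
for t in range(1, T + 1):                    # forward sweep
    V[t] = g(wvv @ V[t - 1] + wvx @ x[t])
for t in range(T, 0, -1):                    # backward sweep
    U[t] = g(wuu @ U[t + 1] + wux @ x[t])
for t in range(1, T + 1):                    # outputs combine both directions
    O[t] = g(wov @ V[t] + wou @ U[t])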

Scores

Different schemes for scoring the accuracy of a translation are described by Lipton
et al. [87]. One difficulty is that there are often several different valid translations of
a given sentence, and the score must compare the machine translation with all of
them. Recent papers on machine translation usually use the so-called BLEU score
to evaluate the translation accuracy. The acronym stands for bilingual evaluation
understudy. The scheme was proposed by Papineni et al. [90], and it is commonly judged to score not too differently from how humans would score.

8.4 Summary
It is sometimes said that recurrent networks learn dynamical systems while multi-
layer perceptrons learn input-output maps. This notion refers to backpropagation
in time. I would emphasise, by contrast, that both networks are trained in similar
ways, by backpropagation. Neither is it given that the tasks must differ: recurrent
networks are also used to learn time-independent data. It is true though that tools
from dynamical-systems theory have been used with success to analyse the dynamics
of recurrent networks [88, 91].
Recurrent neural networks are trained by stochastic gradient descent after unfold-
ing the network in time to get rid of feedback connections. This algorithm suffers
from the vanishing-gradient problem. To overcome this difficulty, the hidden states
in the recurrent network are usually represented by LSTMs. Recent layouts for
machine translation use deep bidirectional networks with layers of LSTMs.
Figure 8.5: Recurrent network with one input unit x (t ) (red), one hidden neuron V (t )
(green) and one output neuron O (t ) (blue).

8.5 Further reading


The training of recurrent networks is discussed in Chapter 15 of Ref. [2]. Recurrent
backpropagation is described by Hertz, Krogh and Palmer [1], for a slightly differ-
ent network layout. For a recent review see Ref. [87]. This page [92] gives a very
enthusiastic overview of what recurrent nets can do. A more pessimistic view is
expressed in this blog.

8.6 Exercises
Recurrent backpropagation. Show that recurrent backpropagation is a special case
of the backpropagation algorithm for layered feed-forward networks.

8.7 Exam questions

8.7.1 Recurrent network


Figure 8.5 shows a simple recurrent network with one hidden neuron V (t ), one
input x (t ) and one output O (t ). The network learns a time series of input-output
pairs [x (t ), y (t )] for t = 1, 2, 3, . . . , T . Here t is a discrete time index and y (t ) is the
target value at time t (the targets are denoted by y to avoid confusion with the time
index t ). The hidden unit is initialised to a value V (0) at t = 0. This network can be
trained by backpropagation by unfolding it in time.
(a) Draw the unfolded network, label the connections using the labels shown in
Figure 8.5, and discuss the layout (max half an A4 page). (0.5p).
(b) Write down the dynamical rules for this network, the rules that determine V (t )
in terms of V (t − 1) and x (t ), and O (t ) in terms of V (t ). Assume that both V (t )
and O (t ) have the same activation function g (b ). (0.5p).


(c) Derive the update rule for w (o v ) for gradient descent on the energy function
\[
H = \frac{1}{2} \sum_{t=1}^{T} E(t)^2 \quad\text{where}\quad E(t) = y(t) - O(t) \,. \tag{8.35}
\]

Denote the learning rate by η. Hint: the update rule for w (o v ) is much simpler to
derive than those for w (v x ) and w (v v ) . (1p).
(d) Explain how recurrent networks are used for machine translation. Draw the
layout, describe how the inputs are encoded. How is the unstable-gradient problem
overcome? (Max one A4 page). (1p).
PART III
UNSUPERVISED LEARNING
Figure 9.1: Supervised learning finds decision boundaries (left). Unsupervised learning
can find clusters in the input data (right).

Chapters 5, 6, 7, and 8 described supervised learning where the networks are


trained to produce the correct outputs. The remaining Chapters discuss unsuper-
vised learning. In this case there is no feedback telling the network whether it has
learnt correctly or not. The learning goal is not defined, so that the network must
discover relevant ways of organising the input data. This requires redundancy in
the input data. Possible tasks are to determine the familiarity of input patterns, or
to find clusters (Figure 9.1) in high-dimensional input data. A further application
is to determine spatial maps of spatially distributed inputs, so that nearby inputs
activate nearby output units. Such unsupervised-learning algorithms are explained
in Chapter 9.
Chapter 10 introduces radial-basis function networks, they learn using a hybrid
algorithm with supervised and unsupervised learning. A different hybrid algorithm
is discussed in Chapter 11, reinforcement learning. Here the idea is that the network
receives only partial feedback on its performance, it cannot access the full set of
target values. For instance, the feedback may just be +1 (good solution) or −1 (not
so good). In this case the network can learn by building up its own training set from
its outputs with +1 feedback.

9 Unsupervised Hebbian learning


The material in this Chapter comes from the book by Hertz, Krogh, and Palmer [1].
The simplest example for unsupervised learning is given by a distribution P (x ) of
input patterns
 
\[
\boldsymbol{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} \tag{9.1}
\]

with continuous-valued components xi . Patterns are drawn from this distribution


and fed one after another to the network shown in Figure 9.2. It has one linear
output unit y = w · x with weight vector
Figure 9.2: Network for unsupervised Hebbian learning, with a single linear output
unit that has weight vector w . The network output is denoted by y in this Chapter.

 
\[
\boldsymbol{w} = \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} \,. \tag{9.2}
\]
The network can detect how familiar certain input patterns are. The idea is that
the output is the larger the more frequently the input pattern occurs in P (x ). This
learning goal is achieved by Hebb’s rule:

\[
\boldsymbol{w}' = \boldsymbol{w} + \delta\boldsymbol{w} \quad\text{with}\quad \delta\boldsymbol{w} = \eta\, y\, \boldsymbol{x} \,, \tag{9.3}
\]

where y = w · x is the output. The rule (9.3) is also called Hebbian unsupervised learning. As usual, η > 0 is a small learning rate. How does this learning rule work? Since we keep adding multiples of the pattern vectors x to the weights, the magnitude of the output |y| becomes larger the more often the input pattern occurs in the distribution P(x). So the most familiar pattern produces the largest output. A potential problem is that the weight vector may continue to grow as we keep on adding increments. This usually happens, and it means that the simple Hebbian learning rule (9.3) does not converge to a steady state. To achieve definite learn-
ing outcomes we require the network to approach a steady state. Therefore the
learning rule (9.3) must be modified. One possibility is to introduce weight decay as
described in Section 6.3.3. This is discussed in the next Section.

9.1 Oja’s rule


Adding a weight-decay term with coefficient proportional to y 2 to Equation (9.3)

δw = ηy (x − y w ) = η x x T w − [w · (x x T )w ]w

(9.4)

ensures that the weights remain normalised. For the second equality I used that the
output is given by y = w · x = w T x = x T w . To see why Equation (9.4) does the trick,
Algorithm 10 Oja’s rule


1: initialise weights randomly;
2: for t = 1, . . . , T do
3: draw an input pattern x from P (x ) and apply it to the network;
4: update all weights using δw = ηy (x − y w );
5: end for
6: end;

consider an analogy: a vector q that obeys the differential equation


\[
\frac{\mathrm{d}}{\mathrm{d}t} \boldsymbol{q} = \mathbb{A}(t)\, \boldsymbol{q} \,. \tag{9.5}
\]
For a general matrix A(t), the norm |q| may grow or shrink. We can ensure that q remains normalised by adding a term to Equation (9.5):
\[
\frac{\mathrm{d}}{\mathrm{d}t} \boldsymbol{w} = \mathbb{A}(t)\, \boldsymbol{w} - [\boldsymbol{w} \cdot \mathbb{A}(t)\boldsymbol{w}]\, \boldsymbol{w} \,. \tag{9.6}
\]
The vector w turns in the same way as q, and if we set |w| = 1 initially, then w remains normalised (w = q/|q|). You can see this by noting that d/dt |w|² = 2w · dw/dt = 0. Equation (9.6) describes the dynamics of the normalised orientation vector of a
small rod in turbulence [93], where A(t ) is the matrix of fluid-velocity gradients.
But let us return to Equation (9.4). It is called Oja’s rule [94]. Oja’s learning algo-
rithm is summarised in Algorithm 10. One draws a pattern x from the distribution
P (x ) of input patterns, applies it to the network, and updates the weights as pre-
scribed in Equation (9.4). This is repeated many times. In the following we denote the average over T input patterns as 〈···〉 = (1/T) Σ_{t=1}^{T} ···.
Now we show that a steady state w∗ of Algorithm 10 has the following properties:
1. |w∗| = 1,
2. w∗ is the eigenvector of C′ = 〈x x^T〉 with maximal eigenvalue,
3. w∗ maximises 〈y²〉 over all w with |w| = 1.
In particular, the weight vector remains normalised. We first prove statement 1, as-
suming that a steady state w ∗ has been reached. In a steady state the increments δw
must average to zero because the weights would either grow or decrease otherwise:

\[
0 = \langle \delta\boldsymbol{w} \rangle_{\boldsymbol{w}^*} \,. \tag{9.7}
\]

Here 〈· · · 〉w ∗ is an average at fixed w ∗ (the presumed steady state). So w ∗ is not


averaged over, just x . Equation (9.7) is a condition upon w ∗ . Using the learning rule
(9.4), Equation (9.7) implies

\[
0 = \mathbb{C}' \boldsymbol{w}^* - (\boldsymbol{w}^* \cdot \mathbb{C}' \boldsymbol{w}^*)\, \boldsymbol{w}^* \,. \tag{9.8}
\]
It follows that w∗ must obey C′w∗ = λw∗. In other words, w∗ must be an eigenvector of C′. We denote the eigenvalues and eigenvectors of C′ by λ_α and u_α. Since C′ is symmetric, its eigenvalues are real, the eigenvectors can be chosen orthonormal, u_α · u_β = δ_αβ, and they form a basis. Moreover, C′ is positive semidefinite. This means that the eigenvalues cannot be negative. It also follows from Equation (9.8) that |w∗| = 1.
Second, we prove statement 2: only the eigenvector corresponding to λmax repre-
sents a stable steady state. To demonstrate this, we investigate the linear stability of
w ∗ (in the same way as in Section 8.1). We use the ansatz:

w =w∗+ε (9.9)

where ε is a small initial displacement from w ∗ . Now we determine the average


change of ε after one iteration step:

\[
\langle \delta\boldsymbol{\varepsilon} \rangle = \langle \delta\boldsymbol{w} \rangle_{\boldsymbol{w}^* + \boldsymbol{\varepsilon}} \,. \tag{9.10}
\]
To this end we expand Equation (9.10) in ε to leading order:
\[
\langle \delta\boldsymbol{\varepsilon} \rangle \approx \eta \big[\, \mathbb{C}' \boldsymbol{\varepsilon} - 2 (\boldsymbol{\varepsilon} \cdot \mathbb{C}' \boldsymbol{w}^*)\, \boldsymbol{w}^* - (\boldsymbol{w}^* \cdot \mathbb{C}' \boldsymbol{w}^*)\, \boldsymbol{\varepsilon} \,\big] \,. \tag{9.11}
\]

We know that w∗ must be an eigenvector of C′. So we choose a value of α, put w∗ = u_α, and multiply with u_β from the left, to determine the β-th component of 〈δε〉:
\[
\boldsymbol{u}_\beta \cdot \langle \delta\boldsymbol{\varepsilon} \rangle \approx \eta \big[ (\lambda_\beta - \lambda_\alpha) - 2 \lambda_\alpha \delta_{\alpha\beta} \big]\, (\boldsymbol{u}_\beta \cdot \boldsymbol{\varepsilon}) \,. \tag{9.12}
\]
Here we used that δαβ (u α · ε) = δαβ (u β · ε). We conclude: if λβ > λα the component
of the displacement along u β must grow, on average. So if λα is not the largest
eigenvalue, the corresponding eigenvector u α cannot be a steady-state direction.
The eigenvector corresponding to λmax , on the other hand, represents a steady state.
So only the choice λα = λmax leads to a steady state.
Third, we prove that these statements imply property 3. We need to demonstrate
that
\[
\langle y^2 \rangle = \boldsymbol{w}^* \cdot \mathbb{C}' \boldsymbol{w}^* \tag{9.13}
\]
is maximised when w∗ points along the maximal-eigenvalue direction. To this end we compute
\[
\boldsymbol{w}^* \cdot \mathbb{C}' \boldsymbol{w}^* = \sum_\alpha \lambda_\alpha\, (\boldsymbol{w}^* \cdot \boldsymbol{u}_\alpha)^2 \,. \tag{9.14}
\]

We also need that Σ_α (w∗ · u_α)² = 1. This follows from |w∗| = 1 (statement 1). Together,
the two expressions show that no other direction w gives a larger value of 〈y 2 〉. In
other words: w ∗ maximises 〈y 2 〉. This completes the proof.
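The three properties are easy to check numerically. The following Python sketch runs Algorithm 10 on synthetic zero-mean data and compares the resulting weight vector with the leading eigenvector of C′; the input distribution and the learning rate are assumptions chosen for illustration.

import numpy as np

# Oja's rule (Algorithm 10) on synthetic data, checking the steady-state
# properties derived above. Input distribution and learning rate are assumptions.
rng = np.random.default_rng(3)
N, T, eta = 2, 20000, 0.01
A = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.normal(size=(T, N)) @ A          # zero-mean input patterns

w = rng.normal(size=N)
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)           # Oja's rule, Equation (9.4)

Cprime = (X.T @ X) / T                   # C' = <x x^T>
eigvals, eigvecs = np.linalg.eigh(Cprime)
u_max = eigvecs[:, -1]                   # eigenvector with maximal eigenvalue
print(np.linalg.norm(w))                 # close to 1 (property 1)
print(abs(w @ u_max))                    # close to 1 (properties 2 and 3)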
Figure 9.3: Maximal eigenvalue direction u_1 of the matrix C′ for input data with non-zero mean.

For zero-mean inputs, Oja's rule finds the maximal principal direction of the input data by maximising 〈y²〉 (note that 〈y〉 = 0 for zero-mean input data). For inputs with non-zero means, maximising 〈y²〉 still finds the maximal-eigenvalue direction of C′. But this direction then differs from the maximal principal direction (Section 6.3.1). Figure 9.3 illustrates this difference. The Figure
shows three data points in a two-dimensional input plane. The elements of C′ = 〈x x^T〉 are
\[
\mathbb{C}' = \frac{1}{3}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \,, \tag{9.15}
\]
with eigenvalues and eigenvectors
\[
\lambda_1 = 1\,,\;\; \boldsymbol{u}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}
\quad\text{and}\quad
\lambda_2 = \frac{1}{3}\,,\;\; \boldsymbol{u}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} -1 \\ 1 \end{pmatrix} \,. \tag{9.16}
\]

Thus the maximal eigenvalue direction is u 1 . To compute the principal direction of


the data we find the data-covariance matrix C with elements (6.19). The maximal-
eigenvalue direction of C is u 2 . This is the maximal principal component of the data
shown in Figure 9.3.
Oja’s rule can be generalised in different ways to compute M principal compo-
nents of zero-mean input data using M output neurons that compute yi = w i · x for
i = 1, . . . , M :
\[
\delta w_{ij} = \eta\, y_i \Big( x_j - \sum_{k=1}^{i} y_k\, w_{kj} \Big) \qquad \text{Sanger's rule} \,, \tag{9.17}
\]
or
\[
\delta w_{ij} = \eta\, y_i \Big( x_j - \sum_{k=1}^{M} y_k\, w_{kj} \Big) \qquad \text{Oja's } M\text{-rule} \,. \tag{9.18}
\]

For M = 1 both rules reduce to Oja’s rule.


Figure 9.4: Detection of clusters by unsupervised learning.

9.2 Competitive learning


In Equations (9.17) and (9.18) several outputs can be active (non-zero) at the same
time. An alternative approach is competitive learning where only one output is
active at a time, as in Section 7.1.
Such algorithms can categorise or cluster input data: similar inputs are classified
to belong to the same category, and activate the same output unit. Figure 9.4 shows
input patterns on the unit circle that cluster into two distinct clusters. The idea is to
find weight vectors w i that point into the direction of the clusters. To this end we
take M linear output units i with weight vectors w i , i = 1, . . . , M . We feed a pattern
x from the distribution P (x ) into the units and define the winning unit i 0 as the
one that has minimal angle between its weight and the pattern vector x . This is
illustrated in Figure 9.4, where i 0 = 2. Then only this weight vector is updated by
adding a little bit of the difference x −w i 0 between the pattern vector and the weight
of the winning unit. The other weights remain unchanged:
\[
\delta\boldsymbol{w}_i = \begin{cases}
\eta\, (\boldsymbol{x} - \boldsymbol{w}_i) & \text{for } i = i_0(\boldsymbol{x}, \boldsymbol{w}_1, \ldots, \boldsymbol{w}_M) \,,\\
0 & \text{otherwise} \,.
\end{cases} \tag{9.19}
\]
In other words, only the winning unit is updated, w′_{i₀} = w_{i₀} + δw_{i₀}. Equation (9.19) is called the competitive-learning rule.
The learning rule (9.19) has the following geometrical interpretation: the weight
of the winning unit is drawn towards the pattern x . Upon iterating (9.19), the weight
vectors are drawn to clusters of inputs. So if the patterns are normalised as in Figure
9.4, the weights end up normalised on average, even though |w i 0 | = 1 does not
imply that |w i 0 + δw i 0 | = 1, in general. The algorithm for competitive learning is
summarised in Algorithm 11.
When weight and input vectors are normalised, then the winning unit i 0 is the
one with the largest scalar product w i · x . For linear output units yi = w i · x (Figure
9.2) this is simply the unit with the largest output. Equivalently, the winning unit
is the one with the smallest distance |w i − x |. Output units with w i that are very
far away from any pattern may never be updated (dead units). There are several
strategies to avoid this problem [1]. One possibility is to initialise the weights to
directions found in the inputs.
The solution of the clustering problem is not uniquely defined. One possibility is
to monitor progress by an energy function
\[
H = \frac{1}{2T} \sum_{ijt} M_{it}\, \big( x_j^{(t)} - w_{ij} \big)^2 \,. \tag{9.20}
\]
Here x^(t) is the pattern fed in iteration number t, T is the total number of iterations, and
\[
M_{it} = \begin{cases}
1 & \text{for } i = i_0(\boldsymbol{x}^{(t)}, \boldsymbol{w}_1, \ldots, \boldsymbol{w}_M) \,,\\
0 & \text{otherwise} \,.
\end{cases} \tag{9.21}
\]
Note that i_0 is a function of the patterns x^(t) and of the weights w_1, . . . , w_M. For given patterns, the indicator M_it is a piecewise constant function of the weights.
Gradient descent on the energy function (9.20) gives
\[
\langle \delta w_{ij} \rangle = -\eta \frac{\partial H}{\partial w_{ij}} = \frac{\eta}{T} \sum_t M_{it}\, \big( x_j^{(t)} - w_{ij} \big) \,. \tag{9.22}
\]

Apart from the sum over patterns this is the same as the competitive learning rule
(9.19). The angular brackets on the l.h.s. of this Equation indicate that the weight
increments are summed over patterns.
If we define
\[
y_i = \delta_{i i_0} = \begin{cases}
1 & \text{for } i = i_0 \,,\\
0 & \text{otherwise} \,,
\end{cases} \tag{9.23}
\]
then the rule (9.19) can be written in the form of Oja's M-rule:
\[
\delta w_{ij} = \eta\, y_i \Big( x_j - \sum_{k=1}^{M} y_k\, w_{kj} \Big) \,. \tag{9.24}
\]

This is again Hebb’s rule with weight decay.

Algorithm 11 competitive learning


1: initialise weights to vectors with random angles and norm |w i | = 1;
2: for t = 1, . . . , T do
3: draw a pattern x from P (x ) and feed it to the network:
4: find the winning unit i 0 (smallest angle between w i 0 and x );
5: update only the winning unit δw i 0 = η(x − w i 0 );
6: end for
7: end;
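Here is a minimal NumPy sketch of Algorithm 11 for the two-cluster example of Figure 9.4. The cluster centres, the number of patterns, and the learning rate are assumptions for illustration.

import numpy as np

# Competitive learning (Algorithm 11) on two clusters of normalised patterns,
# as in Figure 9.4. Cluster centres, M, and the learning rate are assumptions.
rng = np.random.default_rng(4)
def unit(v):
    return v / np.linalg.norm(v)

centres = [unit(np.array([1.0, 0.2])), unit(np.array([-0.3, 1.0]))]
patterns = np.array([unit(c + 0.1 * rng.normal(size=2))
                     for c in centres for _ in range(200)])
rng.shuffle(patterns)

M, eta = 2, 0.05
w = np.array([unit(rng.normal(size=2)) for _ in range(M)])   # random unit weights
for x in patterns:
    i0 = np.argmax(w @ x)          # winning unit: largest w_i . x (smallest angle)
    w[i0] += eta * (x - w[i0])     # update only the winning unit, Equation (9.19)
print(w)                            # the weight vectors point towards the clusters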
Figure 9.5: Spatial map. If patterns x (1) and x (2) are close in input space, then the two
patterns activate neighbouring outputs, r 1 ≈ r 2 .

9.3 Kohonen’s algorithm


Kohonen’s algorithm can learn spatial maps. To this end one arranges the output
units geometrically. The idea is that close inputs activate nearby outputs (Figure 9.5).
Note that input and output space need not have the same dimension. The spatial map is learned with a competitive learning rule (9.24), similar to the previous Section.
But an important difference is that the rule must be modified to incorporate spatial
information. In Kohonen’s rule this is done by updating not only the winning unit,
but also its neighbours in the output array.

\[
\delta w_{ij} = \eta\, \Lambda(i, i_0)\, (x_j - w_{ij}) \,. \tag{9.25}
\]

Here η > 0 is the learning rate, as before. The function Λ(i , i 0 ) is called the neigh-
bourhood function. A common choice is

\[
\Lambda(i, i_0) = \exp\Big( -\frac{|\boldsymbol{r}_i - \boldsymbol{r}_{i_0}|^2}{2\sigma^2} \Big) \,. \tag{9.26}
\]
As a result, nearby output units respond to inputs that are close in input space. Ko-
honen’s rule drags the winning weight vector w i 0 towards x , just as the competitive
learning rule (9.19), but it also drags the neighbouring weight vectors along.
Figure 9.6 illustrates a geometrical interpretation of Kohonen’s rule. We can think
of the weight vectors as pointing to the nodes of an elastic net that has the same
layout as the output array. As one feeds patterns from the input distribution, the
weights are updated, causing the nodes of the network to move. This changes the
shape of the elastic net. In the steady state, this shape resembles the shape defined
by the distribution of input patterns.
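A minimal NumPy sketch of Kohonen's rule, Equations (9.25) and (9.26), looks as follows. The one-dimensional output array, the uniform input distribution, and the decreasing schedules for η and σ are assumptions; typical parameter schedules are discussed below.

import numpy as np

# Kohonen's rule, Equations (9.25) and (9.26): a one-dimensional array of
# M output units learns a uniform input distribution on the unit square.
# The schedules for eta and sigma are assumptions (typical choices).
rng = np.random.default_rng(5)
M, T = 30, 10000
r = np.arange(M, dtype=float)                # positions r_i in the output array
w = rng.uniform(size=(M, 2))                 # weight vectors w_i

for t in range(T):
    x = rng.uniform(size=2)                  # draw a pattern from P(x)
    eta = 0.1 * (1.0 - t / T) + 0.01         # learning rate, slowly decreased
    sigma = 5.0 * (1.0 - t / T) + 0.5        # neighbourhood width, decreased
    i0 = np.argmin(np.linalg.norm(w - x, axis=1))          # winning unit
    Lam = np.exp(-(r - r[i0])**2 / (2 * sigma**2))         # Lambda(i, i0)
    w += eta * Lam[:, None] * (x - w)        # update all units, Equation (9.25)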
Kohonen’s rule has two parameters: the learning rate η, and the width σ of the
neighbourhood function. Usually one adjusts these parameters as the learning
Figure 9.6: Learning a shape with Kohonen’s algorithm. (a) Input-pattern distribution.
P (x ) is unity within a parallelogram with unit area, and zero outside. (b) to (d) Illustra-
tion of the dynamics in terms of an elastic net. (b) Initial condition. (c) Intermediate
stage (note the kink). (d) In the steady-state the elastic net resembles the shape defined
by the input-pattern distribution.

proceeds. Typically one begins with large values for η and σ (ordering phase), and
then reduces these parameters as the elastic net evolves (convergence phase): quickly
at first and then in smaller steps, until the algorithm converges. Details are given by
Hertz, Krogh and Palmer [1]. As for competitive learning one can monitor progress
of the learning with an energy function
\[
H = \frac{1}{2T} \sum_{it} \Lambda(i, i_0)\, \big| \boldsymbol{x}^{(t)} - \boldsymbol{w}_i \big|^2 \,. \tag{9.27}
\]
Gradient descent yields
\[
\langle \delta w_{ij} \rangle = -\eta \frac{\partial H}{\partial w_{ij}} = \frac{\eta}{T} \sum_t \Lambda(i, i_0)\, \big( x_j^{(t)} - w_{ij} \big) \,. \tag{9.28}
\]

Figure 9.6 shows how Kohonen’s network learns by unfolding the elastic net of weight
vectors until the shape of the net resembles the form of the input distribution
\[
P(\boldsymbol{x}) = \begin{cases}
1 & \text{for } \boldsymbol{x} \text{ in the parallelogram in Figure 9.6(a)} \,,\\
0 & \text{otherwise} \,.
\end{cases} \tag{9.29}
\]

In other words, Kohonen’s algorithm learns by distributing the weight vectors to


reflect the distribution of input patterns. For the distribution (9.29) of inputs one
may hope that the weights end up distributed uniformly in the parallelogram. This
is roughly how it works (Figure 9.6), but there are problems at the boundaries. Why
this happens is quite clear: for the distribution (9.29) there are no patterns outside
the parallelogram that can draw the elastic net very close to the boundary.
To analyse how the boundaries affect learning for Kohonen’s rule, we consider
the steady-state condition
\[
\langle \delta\boldsymbol{w}_i \rangle = \frac{\eta}{T} \sum_t \Lambda(i, i_0)\, \big( \boldsymbol{x}^{(t)} - \boldsymbol{w}_i^* \big) = 0 \,. \tag{9.30}
\]
Figure 9.7: To find out how w∗ varies near r, we expand w∗ in δr around r. This gives w∗(r) + (dw∗/dr) δr + ½ (d²w∗/dr²) δr² + …

This is a condition for the steady state w ∗i . The condition is more complicated
than it looks at first sight, because i 0 depends on the weights and on the patterns,
as mentioned above. The steady-state condition (9.30) is very difficult to analyse
in general. One of the reasons is that global geometric information is difficult
to learn. It is usually much easier to learn local structures. This is particularly
true in the continuum limit where we can analyse local learning progress using
Taylor expansions. For this reason we assume now that we have a very dense  net
of weights, so that we can i → r , i 0 → r 0 , w i → w (r ), Λ(i , i 0 ) → Λ r − r 0 (x ) , and
R
1
d x P (x ). In this continuum approximation, Equation (9.30) reads
P
T t →

Z
 
dx P (x ) Λ r − r 0 (x ) x − w ∗ (r ) = 0 . (9.31)

This is an Equation for the spatial map w ∗ (r ). Equation (9.31) is still quite difficult
to analyse. So we specialise to one input and one output dimension, with spatial
output coordinate r . This has the added advantage that we can easily draw the
spatial map w ∗ (r ). It is the solution of
\[
\int \mathrm{d}x\, P(x)\, \Lambda\big( r - r_0(x) \big)\, \big[ x - w^*(r) \big] = 0 \,. \tag{9.32}
\]

The neighbourhood function is sharply peaked at r = r0 (x ). This means that the con-
dition (9.32) yields the local properties of w ∗ (r ) around r0 , where r0 is the coordinate
of the winning unit, x = w ∗ (r0 ). Equation (9.32) involves an integral over patterns
x . Using x = w ∗ (r0 ), this integral is expressed as an integral over r0 . Specifically we
consider how w (r0 ) changes in the vicinity of a given point r , as r0 (x ) changes. To
this end we expand w ∗ around r :

\[
w(r_0) = w(r) + w'(r)\, \delta r + \tfrac{1}{2} w''(r)\, \delta r^2 + \ldots \quad\text{with}\quad \delta r = r_0(x) - r \,. \tag{9.33}
\]


Here w′ denotes the derivative dw/dr evaluated at r, and I have dropped the asterisk. Using this expansion and x = w(r_0) we express dx in Equation (9.32) in terms of dδr:
\[
\mathrm{d}x = \mathrm{d}w(r_0) = \mathrm{d}w(r + \delta r) \approx (w' + \delta r\, w'')\, \mathrm{d}\delta r \,. \tag{9.34a}
\]
To perform the integration we express the integrand in (9.32) in terms of δr:
\[
P(x) = P\big(w(r_0)\big) \approx P(w) + \delta r\, w'\, \tfrac{\mathrm{d}}{\mathrm{d}w} P(w) \,, \tag{9.34b}
\]
and
\[
x - w(r) = w'\, \delta r + \tfrac{1}{2} w''(r)\, \delta r^2 + \ldots \,. \tag{9.34c}
\]
Inserting these expressions into Equation (9.32) we find
\[
\begin{aligned}
0 &= \int \mathrm{d}\delta r\, (w' + \delta r\, w'')\, \big[ P + \delta r\, w'\, \tfrac{\mathrm{d}}{\mathrm{d}w} P \big]\, \Lambda(\delta r)\, \big( \delta r\, w' + \tfrac{1}{2}\delta r^2 w'' \big)\\
&= w' \Big[ \tfrac{3}{2} w'' P(w) + (w')^2\, \tfrac{\mathrm{d}}{\mathrm{d}w} P(w) \Big] \int_{-\infty}^{\infty} \mathrm{d}\delta r\, \delta r^2\, \Lambda(\delta r) \,. \tag{9.35}
\end{aligned}
\]

Since the last integral in Equation (9.35) is non-zero, we must either have

\[
w' = 0 \qquad\text{or}\qquad \tfrac{3}{2} w''\, P(w) + (w')^2\, \tfrac{\mathrm{d}}{\mathrm{d}w} P(w) = 0 \,. \tag{9.36}
\]
The first solution can be excluded because it corresponds to a singular weight
distribution [see Equation (9.38)] that does not contain any geometrical information
about the input distribution P (x ). The second solution gives

\[
\frac{w''}{w'} = -\frac{2}{3}\, \frac{w'\, \tfrac{\mathrm{d}}{\mathrm{d}w} P(w)}{P(w)} \,. \tag{9.37}
\]
In other words, (d/dr) log |w′| = −(2/3) (d/dr) log P(w). This means that |w′| ∝ P(w)^{−2/3}. So the distribution ϱ of output weights is
\[
\varrho(w) \equiv \frac{\mathrm{d}r}{\mathrm{d}w} = \frac{1}{|w'|} \propto P(w)^{2/3} \,. \tag{9.38}
\]

This tells us that the Kohonen net learns the input distribution in the following
way: the distribution of output weights in the steady state reflects the distribution
of input patterns. Equation (9.38) tells us that the two distributions are not equal
(equality would have been a perfect outcome). The distribution of weights is instead
proportional to P(w)^{2/3}. This is a consequence of the fact that the elastic net has dif-
ficulties reaching the corners and edges of the domain where the input distribution
is non-zero.
Let us finally discuss the convergence of Kohonen’s algorithm. The update rule
(9.25) can be rewritten as

(wi − x ) ← (1 − ηΛ)(wi − x ) . (9.39)

For small enough η the factor [1 − ηΛ] is positive. Then it follows from Equation
(9.39) that the order of weights in a monotonically increasing (decreasing) sequence
does not change under an update. What happens at the boundary between two
such regions, where a kink is formed? It turns out that kinks can only disappear
in two ways. Either they move to one of the boundaries, or two kinks (a minimum
and a maximum) annihilate each other if they collide. Both processes are slow.
This means that convergence to the steady state can be very slow. Therefore one
usually starts with a larger learning rate, to get rid of kinks. After this ordering phase,
one continues with a smaller step size to get the details of the distribution right
(convergence phase).

9.4 Summary
The unsupervised learning algorithms described above are based on Hebb’s rule:
certainly the Hebbian unsupervised learning rule (9.3) and Oja’s rule (9.4). Also the
competitive learning rule can be written in this form [Equation (9.24)]. Kohonen’s
algorithm is closely related to competitive learning, although the way in which
Kohonen’s rule learns spatial maps is better described by the notion of an elastic
net that represents the values of the output weights, as well as their spatial location
in the output array. Unsupervised learning rules can learn different features of
the distribution of input patterns. They can discover which patterns occur most
frequently, they can help to reduce the dimensionality of input space by finding
the principal directions of the input distribution, detect clusters in the input data,
compress data, and learn spatial input-output maps. The important point is that
the algorithms learn without training, unlike the algorithms in Chapters 5 to 8.
Supervised-learning algorithms are now widely used for different applications.
This is not really the case yet for unsupervised-learning algorithms, except that
similar (sometimes equivalent) algorithms are used in Mathematical Statistics (k -
means clustering) and Bioinformatics (structure [95]) where large data sets must
be analysed, such as Human sequence data (HGDP) [96]. But the simple algorithms
described in this Chapter provide a proof of concept: how machines can learn
without feedback. In addition there is one significant application of unsupervised
learning: where the network learns from incomplete feedback. This reinforcement
learning is introduced in the next Chapter.

9.5 Exercises
Kohonen net. Write a computer program that implements Kohonen’s algorithm
with a two-dimensional output array, to learn the properties of a two-dimensional
input distribution that is uniform inside an equilateral triangle with sides of unit
length, and zero outside. Hint: to generate this distribution, sample at least 1000
points uniformly distributed over the smallest square that contains the triangle, and
then accept only points that fall inside the triangle. Increase the number of weights
and study how the two-dimensional density of weights near the boundary depends
on the distance from the boundary.

9.6 Exam questions


9.6.1 Oja’s rule
The aim of unsupervised learning is to construct a network that learns the properties
of a distribution P(x) of input patterns x = (x_1, . . . , x_N)^T. Consider a network with one linear output that computes y = Σ_{j=1}^{N} w_j x_j. Under Oja's learning rule δw_j = ηy(x_j − y w_j) the weight vector w converges to a steady state w∗ with components
w j∗ . Show that the steady state has the following properties:
1. |w∗|² ≡ Σ_{j=1}^{N} (w_j∗)² = 1.

2. w∗ is the leading eigenvector of the matrix C′ with elements C′_ij = 〈x_i x_j〉. Here 〈···〉 denotes the average over P(x).

3. w ∗ maximises 〈y 2 〉.

All correct gives in total 3p.

9.6.2 Oja’s rule


The output of Oja’s rule for the input pattern x (µ) is
\[
y^{(\mu)} = \sum_i w_i\, x_i^{(\mu)} \,, \tag{9.40}
\]
and the update rule based on this pattern is w_i → w_i + δw_i^(µ) with
\[
\delta w_i^{(\mu)} = \eta\, y^{(\mu)} \big( x_i^{(\mu)} - y^{(\mu)} w_i \big) \tag{9.41}
\]

where y^(µ) is the output (9.40). Let 〈δw_i〉 denote the update of w_i averaged over the input patterns.
(a) Show that 〈δwi 〉 = 0 implies that the weight vector in the steady state is nor-
malised to unity. (1 p).
(b) Calculate the principal component of the patterns in Figure 9.8. (1 p).

Figure 9.8: Calculate the principal component of this data set. Question 9.6.2.

9.6.3 Covariance matrices


A covariance matrix C has eigenvectors
   
\[
\boldsymbol{u} = \begin{pmatrix} 4 \\ -1 \end{pmatrix} \quad\text{and}\quad \boldsymbol{v} = \begin{pmatrix} 1 \\ 4 \end{pmatrix} \tag{9.42}
\]

and eigenvalues λu = 4 and λv = 1.


(a) Write down the matrix C. (0.5 p).
(b) Illustrate a distribution with this covariance matrix, and indicate the principal
component in your illustration. (0.5 p).
(c) P (x ) is a Gaussian distribution of two-dimensional patterns x = [x1 , x2 ]T . The
distribution is determined by its mean 〈x 〉 and its covariance matrix
 
\[
\mathbb{C} = \begin{pmatrix} C_{11} & C_{21} \\ C_{21} & C_{22} \end{pmatrix} \,. \tag{9.43}
\]

Show how to draw an input pattern from this distribution using the following steps.
1. Draw a random pattern z = (z 1 , z 2 )T , where z 1 and z 2 are two independent


random numbers drawn from a Gaussian distribution with mean zero and unit
variance.

2. Compute x = 〈x 〉 + Lz , where
 
\[
\mathbb{L} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \,. \tag{9.44}
\]

Express L 11 , L 21 and L 22 in terms of C11 , C21 and C22 . (1p).

9.6.4 Kohonen net


The update rule for a Kohonen network reads: δw_ij = ηΛ(i, i_0)(x_j − w_ij). Here i_0 labels the winning unit for pattern x = (x_1, . . . , x_N)^T. The neighbourhood function Λ(i, i_0) = exp(−|r_i − r_{i_0}|²/(2σ²)) is Gaussian with width σ, and r_i denotes the position of the i-th output neuron in the output array.
(a) Explain the meaning of the parameter σ in Kohonen’s algorithm. Discuss the
nature of the update rule in the limit of σ → 0. (0.5p).
(b) Discuss and explain the implementation of Kohonen’s algorithm in a computer
program. In the discussion, refer to and explain the following terms: output array,
neighbourhood function, ordering phase, convergence phase, kinks. Your answer
must not be longer than one A4 page.
10 Radial basis-function networks


Problems that are not linearly separable can be solved by perceptrons with hidden
layers, as we saw in Chapter 5. Figure 5.10, for example, shows a piecewise linear
decision boundary that can be parameterised by hidden neurons.
Another approach is to map the coordinates of input space non-linearly so that
the problem becomes linearly separable. It is usually easier to separate patterns in
higher dimensions. To see this consider the XOR problem (Figure 10.1). It is not
linearly separable in two-dimensional input space. The problem becomes separable
when we embed the points in a three-dimensional space, for instance by assigning
x3 = 0 to the t = +1 patterns and x3 = 1 to the t = −1 patterns. This example
illustrates why it is often helpful to map input space to a higher-dimensional space
– because it is more likely that the resulting problem is linearly separable. But it
may also be possible to achieve separability by a non-linear transformation of input
space to a space of the same dimension. The example below shows how this works
for the XOR problem. Either way we can apply a single perceptron to classify the
data if they are linearly separable in the new coordinates.
Figure 10.1 shows how the XOR problem can be transformed into a linearly
separable problem by the transformation

u 1 (x ) = (x2 − x1 )2 and u 2 (x ) = x2 . (10.1)

The Figure shows the non-separable problem in input space (in the x1 -x2 plane),
and in the new coordinates u 1 and u 2 . Since the problem is linearly separable in
the u 1 -u 2 plane we can solve it by a single McCulloch-Pitts neuron with weights W
and threshold Θ, parameterising the decision boundary as W · u (x ) = Θ. In fact, one
does not need the threshold Θ because the function u can have a constant part. For
instance, we could choose u 1 (x ) = 2(x2 − x1 )2 − 1. In the following we therefore set
Θ = 0.
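A short check confirms that the transformed XOR problem is solved by a single McCulloch-Pitts neuron with zero threshold. The particular weight vector W used here is an assumption; many other choices work.

import numpy as np

# Check that the transformation (10.1), with the constant part included as
# suggested above (u1 = 2(x2 - x1)^2 - 1), makes XOR separable with zero
# threshold. The weight vector W is an assumption; many choices work.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, 1, 1, -1])                  # XOR targets
u = np.stack([2 * (X[:, 1] - X[:, 0])**2 - 1, X[:, 1]], axis=1)
W = np.array([1.0, 0.0])
print(np.sign(u @ W))                          # equals t: linearly separable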
We expect that it should be easier to achieve linear separability the higher the
embedding dimension is. This statement is quantified by Cover’s theorem, discussed
Figure 10.1: Left: input plane for the XOR function (Figure 5.8). The problem is not
linearly separable. Right: in the u 1 -u 2 plane the problem is linearly separable.
Figure 10.2: Left: 5 points in general position in the plane. Right: these points are not
in general position because three points lie on a straight line.

in Section 10.1. The question is of course how to find the non-linear mapping u (x ).
One possibility is to use radial basis functions. This is a way of parameterising the
functions u j (x ) in terms of weight vectors w j , and to determine suitable weight
vectors iteratively. How this works is summarised in Section 10.2.

10.1 Separating capacity of a surface


In its simplest form, Cover’s theorem [97] concerns a classification problem given by
p points with coordinate vectors u (µ) in m -dimensional space. It is assumed that
the points are in general position (Figure 10.2). We assign random target values to
these points, ¨
+1 with probability 21 ,
t (µ) = (10.2)
−1 with probability 12 .
This random classification problem is homogeneously linearly separable if we can
find an m -dimensional weight vector W with components Wj ,
 
\[
\boldsymbol{W} = \begin{pmatrix} W_1 \\ \vdots \\ W_m \end{pmatrix} \,, \tag{10.3}
\]

so that W · u = 0 is a valid decision boundary that goes through the origin:

W · u (µ) > 0 if t (µ) = 1 and W · u (µ) < 0 if t (µ) = −1 . (10.4)

So homogeneously linearly separable problems are classification problems that are


linearly separable by a hyperplane that contains the origin (zero threshold,
Chapter 5).
Cover’s theorem states the probability that the random classification problem of
p patterns in dimension m is homogeneously linearly separable:
\[
P(p, m) = \begin{cases}
\big(\tfrac{1}{2}\big)^{p-1} \displaystyle\sum_{k=0}^{m-1} \binom{p-1}{k} & \text{for } p > m \,,\\[2ex]
1 & \text{otherwise} \,.
\end{cases} \tag{10.5}
\]
Figure 10.3: Probability (10.5) of separability for p = λm as a function of λ for three
different values of the embedding dimension m. Note the pronounced threshold near
λ = 2, for large values of m .

Here $\binom{l}{k} = \frac{l!}{(l-k)!\,k!}$ are the binomial coefficients. Equation (10.5) is proven by recursion,


starting from a set of p − 1 points in general position. Assume that the number
C (p −1, m ) of homogeneously linearly separable classification problems given these
points is known. After adding one more point, one can compute the C (p , m ) in
terms of C (p − 1, m ). Recursion yields Equation (10.5).
To connect the result (10.5) to the discussion at the beginning of this Chapter, we
take u (µ) = u (x (µ) ) where x (µ) are p patterns in N -dimensional input space
\[
\boldsymbol{x}^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{pmatrix} \quad\text{for } \mu = 1, \ldots, p \,, \tag{10.6}
\]

and we assume that u is a set of m polynomial functions of finite order. Then the
probability that the problem of the p points x (µ) in N -dimensional input space is
separable by a polynomial decision boundary is given by Equation (10.5) [2, 97].
Note that the probability P (p , m ) is independent of the dimension N of the input
space. Figure 10.3 shows this probability for p = λm as a function of λ for different
values of m. Note that P(2m, m) = 1/2. In the limit of large m, the function P(λm, m)
approaches a step function. In this limit one can separate at most 2m patterns
(separability threshold).
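Equation (10.5) is straightforward to evaluate numerically. The short Python sketch below reproduces the behaviour shown in Figure 10.3, in particular P(2m, m) = 1/2 and the sharp drop beyond the threshold.

from math import comb

# Evaluate Equation (10.5) and check the separability threshold near p = 2m.
def P(p, m):
    """Probability that p random points in general position are
    homogeneously linearly separable in embedding dimension m."""
    if p <= m:
        return 1.0
    return 0.5**(p - 1) * sum(comb(p - 1, k) for k in range(m))

for m in (10, 50, 250):
    print(m, P(2 * m, m))          # equals 1/2 for every m
    print(m, P(3 * m, m))          # small for large m (beyond the threshold)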
Now consider a random sequence of patterns x 1 , x 2 , . . . and targets t 1 , t 2 , . . . and
ask [97]: what is the distribution of the largest integer so that the problem x 1 , x 2 , . . . , x n
is separable in embedding dimension m, but x 1 , x 2 , . . . , x n , x n+1 is not? P (n , m) is
the probability that n patterns are linearly separable in embedding dimension m .
We can write P (n + 1, m ) = q (n + 1|n )P (n, m ) where q (n + 1|n ) is the conditional
probability that n + 1 patterns are linearly separable if the n patterns were. Then
the probability that n + 1 patterns are not separable (but n patterns are) reads
(1 − q )P (n , m ) = P (n , m ) − P (n + 1, m ). We can interpret the right-hand side of this
Figure 10.4: Radial basis-function network for N = 2 inputs and m = 4 radial basis
functions (10.8). The output neuron has weights W and zero threshold.

Equation as a distribution pn of the random variable n , the maximal number of


separable patterns in embedding dimension m:
 ‹n  
\[
p_n = P(n, m) - P(n+1, m) = \Big(\frac{1}{2}\Big)^{\!n} \binom{n-1}{m-1} \quad \text{for } n = 0, 1, 2, \ldots \,.
\]
It follows that the expected maximal number of separable patterns is
\[
\langle n \rangle = \sum_{n=0}^{\infty} n\, p_n = 2m \,. \tag{10.7}
\]

So the expected maximal number of separable patterns is twice the embedding


dimension. This quantifies the notion that it is easier to separate patterns in higher
embedding dimensions.
Comparing with the discussion of linear separability in Chapter 5 we see that
Cover’s theorem determines the separation capacity of a single-layer perceptron [1].

10.2 Radial basis-function networks


The idea behind radial basis-function networks is to parameterise the functions
u j (x ) in terms of weight vectors w j , and to use an unsupervised-learning algorithm
(Chapter 9) to find weights that separate the input data. A common choice [2] are
radial basis functions of the form:
\[
u_j(\boldsymbol{x}) = \exp\Big( -\frac{1}{2 s_j^2} \big| \boldsymbol{x} - \boldsymbol{w}_j \big|^2 \Big) \,. \tag{10.8}
\]

These functions are not of the finite-order polynomial form that was assumed
in Cover’s theorem. This means that the theorem does not strictly apply. The
parameters s j parameterise the widths of the radial basis functions. In the simplest
Figure 10.5: Comparison between radial-basis function network and perceptron. Left:
the output of a radial basis function is localised in input space. Right: to achieve a
localised output with sigmoid units one needs two hidden layers (Section 7.1). One
layer determines the lightly shaded cross, the second layer localises the output to the
darker square.

version of the algorithm they are set to unity. Other choices for radial basis functions
are given by Haykin [2].
Figure 10.4 shows a radial basis-function network for N = 2 and m = 4. The
four neurons in the hidden layer stand for the four radial basis functions (10.8) that
map the inputs to four-dimensional u -space. The network looks like a perceptron
(Chapter 5). But here the hidden layers work in a different way. Perceptrons have
hidden McCulloch-Pitts neurons that compute non-local outputs σ(w j · x − θ ).
The output of radial basis functions u j (x ), by contrast, is localised in input space
[Figure 10.5(left)]. We saw in Section 7.1 how to make localised basis functions out
of McCulloch-Pitts neurons with sigmoid activation functions σ(b ), but one needs
two hidden layers to do that [Figure 10.5(right)].
Radial basis functions produce localised outputs with a single hidden layer, and
this makes it possible to divide up input space into localised regions, each corre-
sponding to one radial basis function. Imagine for a moment that we have as many
radial basis functions as input patterns. In this case we can simply take w µ = x (µ)
for µ = 1, . . . , p . Then the classification problem can be written as

UW = t , (10.9)

where U is the symmetric p × p matrix with entries Ui j = u j (x i ). Here we used that


the output unit is linear. If all patterns are pairwise different, x^(µ) ≠ x^(ν) for µ ≠ ν,
then the matrix U is invertible [2], and the solution of the classification problem
reads W = U−1 t .
In practice one can get away with fewer radial basis functions by choosing their
weights to point in the directions of clusters of input data. To this end one can
use unsupervised competitive learning (Algorithm 12), where the winning unit
is defined to be the one with largest u j . How are the widths s j determined? The
width s j of radial basis function u j (x ) is taken to be equal to the minimum distance
Algorithm 12 radial basis functions


1: initialise the weights w j k independently randomly from [−1, 1];
2: set all widths to s j = 0;
3: for t = 1, . . . , T do
4: feed randomly chosen pattern x (µ) ;
5: determine winning unit j0 : u j0 ≥ u j for all values of j ;
6: update widths: s_j = min_{k≠j} |w_j − w_k|;
7: update only winning unit: δw j0 = η(x (µ) − w j0 );
8: end for
9: end;

between w j and the centers of the surrounding radial basis functions. Once weights
and widths of the radial basis functions are found, the weights of the output neuron
are determined by minimising

\[
H = \frac{1}{2} \sum_\mu \big( t^{(\mu)} - O^{(\mu)} \big)^2 \tag{10.10}
\]

with respect to W . If the problem is linearly separable in the u -plane, then the
minimum is given by Equation (5.23). An approximate solution can be obtained
by stochastic gradient descent on H keeping the parameters of the radial basis
functions fixed. Cover’s theorem indicates that the problem is more likely to be
separable if the embedding dimension m is higher. Figure 10.6 shows an example
for a non-linear decision boundary found by this algorithm.
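The following NumPy sketch puts the pieces together: competitive learning places the centres (a simplified version of Algorithm 12), the widths are set to the nearest-centre distance, and the output weights are then trained by gradient descent on Equation (10.10). The toy data set and all parameters are assumptions chosen for illustration.

import numpy as np

# Minimal radial basis-function network: unsupervised placement of the
# centres, followed by supervised training of the output weights on (10.10).
rng = np.random.default_rng(6)
p, N, m = 200, 2, 20
x = rng.uniform(-1, 1, size=(p, N))
t = np.where(np.linalg.norm(x, axis=1) < 0.7, 1.0, -1.0)   # toy targets

w = x[rng.choice(p, size=m, replace=False)].copy()   # initialise centres on data
eta = 0.05
for _ in range(2000):                                 # competitive learning
    xi = x[rng.integers(p)]
    j0 = np.argmin(np.linalg.norm(w - xi, axis=1))    # winning unit
    w[j0] += eta * (xi - w[j0])
d = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)
s = d.min(axis=1)                                     # widths: nearest-centre distance

def u(x):                                             # radial basis functions (10.8)
    return np.exp(-np.linalg.norm(x[:, None, :] - w[None, :, :], axis=2)**2
                  / (2 * s**2))

U = u(x)
W = np.zeros(m)
for _ in range(2000):                                 # gradient descent on (10.10)
    O = U @ W                                         # linear output unit
    W += 0.01 * U.T @ (t - O) / p
print(np.mean(np.sign(U @ W) == t))                   # training accuracy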

m     classification error
5     0.26
20    0.06
Figure 10.6: Left: non-linear decision boundary found by a radial basis-function


network for m = 20 (schematic). Right: classification error for different embedding
dimensions. The decision boundary is not yet quite perfect for m = 20, it still cuts the
corners a little bit.
10.3 Summary
Radial basis-function networks are similar to the perceptrons described in Chap-
ters 5 to 7, in that they are feed-forward networks designed to solve classification
problems. The outputs work in similar ways.
A fundamental difference is that the parameters of the radial basis functions
are determined by unsupervised learning, whereas perceptrons are trained using
supervised learning for all units.
While the weights of McCulloch-Pitts neurons are adjusted to minimise the deviation of their outputs from given targets, the weights of the radial basis functions are adjusted by a competitive rule, updating the unit that maximises u_j as a function of j. The algorithm for finding the weights of the radial basis functions is
summarised in Algorithm 12.
Further, as opposed to the deep networks from Chapter 7, radial basis-function
networks have only one hidden layer, and a linear output neuron. Radial-basis
function nets learn using a hybrid scheme: unsupervised learning for the parameters
of the radial basis functions, and supervised learning for the weights of the output
neuron.

10.4 Further reading


For a proof of Cover’s theorem see Ref. [98]. Radial-basis function networks are
discussed by Haykin in Chapter 5 of his book [2]. Hertz, Krogh and Palmer [1]
have a brief Section on radial basis-function nets too (their Section 9.7). They use
normalised radial basis functions
\[
u_j(\boldsymbol{x}) = \frac{\exp\big( -\frac{1}{2 s_j^2} |\boldsymbol{x} - \boldsymbol{w}_j|^2 \big)}{\sum_{k=1}^{m} \exp\big( -\frac{1}{2 s_k^2} |\boldsymbol{x} - \boldsymbol{w}_k|^2 \big)} \,, \tag{10.11}
\]

instead of Equation (10.8).


It has been argued that radial-basis function nets do not generalise as well as
perceptrons do [99]. To solve this problem, Poggio and Girosi [100] suggested
determining the parameters w_j of the radial basis functions by supervised learning,
using stochastic gradient descent.

10.5 Exercises
Expected maximal number of separable patterns. Show that the sum in Equation
(10.7) evaluates to 2m.

10.6 Exam questions


10.6.1 Radial basis functions for XOR
(a) Show that the Boolean XOR problem (Table 10.1) cannot be solved by a simple
perceptron.
(b) The problem can be solved by transforming input space using radial-basis
functions, and applying a simple perceptron with linear activation function to
the transformed input data. Show that the two-dimensional Boolean XOR problem
can be solved using the two radial basis functions g 1 (x (µ) ) = exp(−|x (µ) − w 1 |2 ) and
g 2 (x (µ) ) = exp(−|x (µ) − w 2 |2 ) with w 1 = (1, 1)T and w 2 = (0, 0)T . Draw the positions
of the four input patterns in the transformed space with coordinates g 1 and g 2 ,
encoding the different target values. Hint: to compute the states of input patterns
in the transformed space (g 1 , g 2 )T , use the following approximations: exp(−1) ≈ 0.37,
exp(−2) ≈ 0.14. Explain the term decision boundary, draw a decision boundary for
the XOR problem, and give the corresponding weights and thresholds for the simple
perceptron. (1p).

x1 x2 t
0 0 0
0 1 1
1 0 1
1 1 0
Table 10.1: Inputs and target values for the XOR problem. Question 10.6.1.

10.6.2 Radial basis functions


(a) Illustrate the problem given in Table 10.2 graphically. Explain whether or not it
can be solved by a simple perceptron with three input units, and one output unit
O^(µ) = sgn( Σ_{j=1}^3 W_j x_j^(µ) − Θ ), where W_j is the weight from unit j to the output with
threshold Θ. (0.5 p).
(b) Show that this problem can be solved following the three steps below.
1. Transform the inputs (x1 , x2 , x3 ) to two-dimensional coordinates (g 1 , g 2 ) using
radial basis functions:

g 1 (x (µ) ) = exp(−|x (µ) − w 1 |2 ), with w 1 = (0, 0, 0)T ,


g 2 (x (µ) ) = exp(−|x (µ) − w 2 |2 ), with w 2 = (1, 0, 1)T .

Here x^(µ) = (x1^(µ), x2^(µ), x3^(µ))^T, and | ··· | denotes the norm of a vector. Plot the
positions of the eight input patterns listed in Table 10.2 in (g1, g2)-coordinates,
encoding the different target outputs. Hint: to compute g_i(x^(µ)) use the following
approximations: exp(−1) ≈ 0.37, exp(−2) ≈ 0.14, exp(−3) ≈ 0.05. (0.5 p).

2. Use the transformed input data as inputs to a simple perceptron with sgn(· · · )
activation function. In the plot you drew in the previous step, draw also
a decision boundary that solves the problem when a simple perceptron is
applied to the transformed data. (0.5 p).
3. Compute a weight vector and a threshold for the simple perceptron corre-
sponding to the decision boundary you drew in the previous step. (0.5p)

µ   x1^(µ)   x2^(µ)   x3^(µ)   t^(µ)
1 0 1 0 −1
2 1 0 1 +1
3 0 0 0 +1
4 1 0 0 −1
5 1 1 0 −1
6 0 0 1 −1
7 0 1 1 −1
8 1 1 1 −1

Table 10.2: Inputs and target values. Question 10.6.2.



11 Reinforcement learning
Reinforcement learning is a modified version of supervised learning. In the latter
case, target patterns t (µ) are available for all input patterns x (µ) . In practice, less
information may be needed. For instance, the feedback may consist only of a single
bit of information (output O1 , . . . , OM right or wrong).
To show how neural networks can learn from incomplete feedback, I explain the
associative reward-penalty algorithm in the following. This algorithm is formulated
in terms of stochastic output units. So we must briefly discuss these first.

11.1 Stochastic output units


Consider a set of output units S_i with targets t_i^(µ) = ±1. As energy function we take
Equation (6.4):

H = ½ Σ_{iµ} ( t_i^(µ) − O_i^(µ) )² .   (11.1)

In Chapters 5 to 8 the output neurons were deterministic functions of their inputs,


either sigmoid, tanh, or softmax functions. Reinforcement learning, by contrast,
works with stochastic output units. The idea is the same as in Chapters 3 and 4:
stochastic output units can explore a wider range of possible states which may in
the end lead to a better solution. Usually only the output units are stochastic. The
weights of hidden neurons are trained as usual by backpropagation.
For the output neurons we use the stochastic update rule (3.2):

O_i^(µ) = +1  with probability P(b_i^(µ)) ,
O_i^(µ) = −1  with probability 1 − P(b_i^(µ)) ,   (11.2a)

where b_i is the local field with activation function equal to the identity function:

b_i^(µ) = Σ_j W_ij V_j^(µ) ,   (11.2b)

and P(b) = (1 + e^(−2βb))^(−1). The parameter β^(−1) is the noise level, as in Chapter 3. The
V_j represent hidden neurons or inputs.
How does one train the weights Wi j of the stochastic neurons? To formulate a
definite learning rule, we average over the stochastic dynamics of the network, just
as in Chapter 3. But how can one learn on average? One possibility is to compare

the average output with the desired output. This corresponds to replacing O_i^(µ) by
⟨O_i^(µ)⟩ in the energy function:

H = ½ Σ_{iµ} ( t_i^(µ) − ⟨O_i^(µ)⟩ )²   with   ⟨O_i^(µ)⟩ = tanh( β b_i^(µ) ) .   (11.3)

Here we used Equation (3.11). Gradient descent gives for the weight changes in the
output layer:

δW_mn = −η ∂H/∂W_mn = η Σ_{iµ} ( t_i^(µ) − ⟨O_i^(µ)⟩ ) β ( 1 − ⟨O_i^(µ)⟩² ) δ_im V_n^(µ) .   (11.4)

This update rule contains a sum over patterns (batch training). The corresponding
sequential rule can be written in the usual form:

δW_mn = η δ_m^(µ) V_n^(µ)   with   δ_m^(µ) = ( t_m^(µ) − ⟨O_m^(µ)⟩ ) β ( 1 − ⟨O_m^(µ)⟩² ) .   (11.5)

The factor β(1 − ⟨O_m^(µ)⟩²) corresponds to the factor g′(B_m^(µ)) in Equation (6.7). One
can show that this rule decreases the energy, on average. To this end we use that
O_i^(µ) = ±1 and t_i^(µ) = ±1. This means that the energy function can be rewritten as

H = Σ_{iµ} ( 1 − t_i^(µ) O_i^(µ) ) .   (11.6)

The change in the average energy due to the weight increment (11.5) is

δ⟨H⟩ = Σ_{mn} (∂⟨H⟩/∂W_mn) δW_mn = − Σ_{mn,iµ} t_i^(µ) (∂⟨O_i^(µ)⟩/∂W_mn) δW_mn .   (11.7)

Using

∂⟨O_i^(µ)⟩/∂W_mn = β [ 1 − tanh²( β b_i^(µ) ) ] δ_im V_n^(µ)   (11.8)
one finds

δ⟨H⟩ = −ηβ² Σ_{mn,µ} [ 1 − tanh²( β b_m^(µ) ) ]² [ 1 − t_m^(µ) tanh( β b_m^(µ) ) ] ( V_n^(µ) )² < 0 .   (11.9)

Since |tanh(β b_m^(µ))| < 1, all factors in the sum are non-negative, so the average energy decreases.

Returning to the learning rule (11.5), note that the formula for the error is sometimes
written without the factor β(1 − ⟨O_m^(µ)⟩²) (in Ref. [1] for instance):

δW_mn = η δ_m^(µ) V_n^(µ)   with   δ_m^(µ) = t_m^(µ) − ⟨O_m^(µ)⟩ .   (11.10)

This choice corresponds to using the energy function (7.32), see Section 7.2.3.
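
A minimal sketch of a layer of stochastic output units together with the averaged learning rule (11.5); the network size, inputs, and targets are illustrative assumptions.

# One update step for stochastic output units, Equations (11.2) and (11.5).
# The output O_i is +1 with probability P(b_i) = 1/(1 + exp(-2*beta*b_i)) and
# -1 otherwise, and <O_i> = tanh(beta*b_i).  Names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
beta, eta = 1.0, 0.05
n_in, n_out = 5, 3
W = rng.normal(0.0, 0.1, size=(n_out, n_in))    # weights W_ij

def stochastic_output(V):
    b = W @ V                                    # local fields, Eq. (11.2b)
    P = 1.0 / (1.0 + np.exp(-2.0 * beta * b))    # P(b), Eq. (11.2a)
    O = np.where(rng.random(n_out) < P, 1.0, -1.0)
    return O, b

V = rng.choice([-1.0, 1.0], size=n_in)           # inputs (or hidden neurons)
t = np.array([1.0, -1.0, 1.0])                   # targets

O, b = stochastic_output(V)
avg_O = np.tanh(beta * b)                        # <O_i>, Eq. (11.3)
delta = (t - avg_O) * beta * (1.0 - avg_O ** 2)  # error delta_m, Eq. (11.5)
W += eta * np.outer(delta, V)                    # delta W_mn = eta delta_m V_n

Summing such sequential steps over all patterns reproduces the batch rule (11.4).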

11.2 Associative reward-penalty algorithm


Now consider a learning problem where the network receives incomplete informa-
tion about the target patterns. In the simplest case it might just receive a reinforce-
ment signal,
r = +1 (reward),   r = −1 (penalty).   (11.11)
The network gets positive feedback (r = +1) when its outputs match the targets well
(but perhaps not perfectly), and it receives negative feedback (r = −1) if the match is
not good. The idea of reinforcement learning is to use outputs that receive positive
feedback instead of the target values that are not known:

t_i^(µ) = O_i^(µ)    if r^(µ) = +1 ,
t_i^(µ) = −O_i^(µ)   if r^(µ) = −1 .   (11.12)

Substituting Equation (11.12) into the stochastic learning rule (11.10) yields the
learning rule for the associative reward-penalty algorithm:

δW_ij = η^(+) ( O_i^(µ) − ⟨O_i^(µ)⟩ ) V_j^(µ)    for reward (r^(µ) = +1) ,
δW_ij = η^(−) ( −O_i^(µ) − ⟨O_i^(µ)⟩ ) V_j^(µ)   for penalty (r^(µ) = −1) .   (11.13)

Often η^(+) is taken to be larger than η^(−), for example η^(+)/η^(−) ≈ 10 or 100 [1].
A special case is a network with only one binary output neuron. In this case
reinforcement learning determines the correct output pattern (−O^(µ) is correct if O^(µ) is
wrong), so the target patterns t^(µ) are precisely reconstructed in this way. But there
is no general convergence proof for the algorithm for more than one binary output
unit.
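
A minimal sketch of the associative reward-penalty rule (11.13) for a single stochastic output unit; the task (a hidden target function used only to compute the reinforcement signal) and the parameter values are illustrative assumptions.

# Associative reward-penalty rule (11.13) for one stochastic output unit.
# The network only receives r = +1 (reward) when its output equals a hidden
# target and r = -1 (penalty) otherwise; the targets are never shown directly.
import numpy as np

rng = np.random.default_rng(2)
beta, eta_plus, eta_minus = 2.0, 0.1, 0.01      # often eta(+) >> eta(-)
n_in, p = 4, 8
X = rng.choice([-1.0, 1.0], size=(p, n_in))     # input patterns
t = np.sign(X[:, 0] + 0.5 * X[:, 1])            # hidden targets (only for r)
W = rng.normal(0.0, 0.1, size=n_in)

for step in range(5000):
    mu = rng.integers(p)
    V = X[mu]
    b = W @ V
    P = 1.0 / (1.0 + np.exp(-2.0 * beta * b))
    O = 1.0 if rng.random() < P else -1.0       # stochastic output, Eq. (11.2)
    avg_O = np.tanh(beta * b)
    r = 1.0 if O == t[mu] else -1.0             # reinforcement signal (11.11)
    if r > 0:                                   # reward: move <O> towards O
        W += eta_plus * (O - avg_O) * V
    else:                                       # penalty: move <O> towards -O
        W += eta_minus * (-O - avg_O) * V

print("learned outputs:", np.sign(np.tanh(beta * (X @ W))))
print("targets:        ", t)

The asymmetric learning rates follow the remark above that η^(+) is usually chosen much larger than η^(−).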

11.3 Summary
In Chapters 5 to 8 we discussed how to train networks to correctly associate in-
put/output pairs (x (µ) , t (µ) ) in a training set (supervised learning). Chapters 9 and 10
described unsupervised learning where the network learns without any feedback.
The reinforcement algorithm introduced in the present Chapter is in between. It
learns from incomplete feedback. Not all targets are known to the network. The feed-
back may simply be r = ±1 (good or bad solution). The associative reward-penalty
algorithm uses this feedback to build up a training set that allows the network to
learn.

11.4 Further reading


Reinforcement learning has developed substantially in recent years [101–103]. Many
recent studies on reinforcement learning use the Q-learning algorithm [102, 103]. A
recent application of reinforcement learning is AlphaGo, an algorithm that learnt to
play the game of Go. How it works is described in this blog.

Bibliography
[1] HERTZ, J, KROGH, A & PALMER, R 1991 Introduction to the Theory of Neural
Computation. Addison-Wesley.

[2] HAYKIN, S 1999 Neural Networks: a comprehensive foundation. New Jersey:


Prentice Hall.

[3] HORNER, H, Neuronale Netze, [Online; accessed 8-November-2018].

[4] GOODFELLOW, I, BENGIO, Y & COURVILLE, A, Deep Learning, [Online; accessed


5-September-2018].

[5] NIELSEN, M, Neural Networks and Deep Learning, [Online; accessed 13-
August-2018].

[6] LECUN, Y, BENGIO, Y & HINTON, G 2015 Deep learning. Nature 521, 436.

[7] MACHEREY, W, KRIKUN, M, CAO, Y, GAO, Q, MACHEREY, K, KLINGNER, J, SHAH,


A, JOHNSON, M, LIU, X, KAISER, L, GOUWS, S, KATO, Y, KUDO, T, KAZAWA, H,
STEVENS, K, KURIAN, G, PATIL, N, WANG, W, YOUNG, C, SMITH, J, RIESA, J,
RUDNICK, A, VINYALS, O, CORRADO, G, HUGHES, M & DEAN, J 2016 Google’s
neural machine translation system: bridging the gap between Human and
machine translation arxiv:1609.08144.

[8] DEEPMIND, AlphaGo, [Online; accessed: 20-August-2018].

[9] BRAIN MAPS, An interactive multiresolution brain atlas; SMI32-


immunoreactive pyramidal neuron in medial prefrontal cortex of macaque,
brainmaps.org, [Online; accessed 14-August-2018].

[10] GABBIANI, F & METZNER, W 1999 Encoding and processing of sensory infor-
mation in neuronal spike trains. Journal of Experimental Biology 202 (10),
1267.

[11] MCCULLOCH, W & PITTS, W 1943 A logical calculus of the ideas immanent in
nervous activity. Bull. Math. Biophys. 5, 115.

[12] ROSENBLATT, F 1958 A probabilistic model for information storage and orga-
nization in the brain. Psychological Review 65, 386.

[13] ROSENBLATT, F 1958 The perceptron: A probabilistic model for information


storage and organization in the brain. Psychological Rev. 65, 386.

[14] HOPFIELD, J. J, Hopfield network, [Online; accessed 14-August-2018].

[15] MÜLLER, B, REINHARDT, J & STRICKLAND, M. T 1999 Neural Networks: An Introduction. Heidelberg: Springer.

[16] AMIT, D. J, GUTFREUND, H & SOMPOLINSKY, H 1985 Spin-glass models of neural


networks. Phys. Rev. A 32, 1007.

[17] AMIT, D. J & GUTFREUND, H 1987 Statistical mechanics of neural networks


near saturation. Ann. Phys. 173, 30.

[18] ZIRNBAUER, M 1999 Another critique of the replica trick arxiv:cond-mat/9903338.

[19] STEFFAN, H & KUEHN, R 1994 Replica symmetry breaking in attractor neural
network models arxiv:cond-mat/9404036.

[20] VOLK, D 1998 On the phase transition of Hopfield networks – another Monte
Carlo study. Int. J. Mod. Phys. C 9, 693.

[21] POTVIN, J. Y & SMITH, K. A 1999 Artificial neural networks for combinatorial
optimization. In Handbook of Metaheuristics (ed. G F. & G Kochenberger).
Heidelberg: Springer.

[22] MANDZIUK, J 2002 Neural networks for the n -Queens problem: a review. Con-
trol and Cybernetics 31, 217, Special issue on neural networks for optimization
and control.

[23] WATERMAN, M 1995 Introduction to Bioinformatics. Prentice Hall.

[24] HOPFIELD, J. J & TANK, D. W 1985 Neural computation of decisions in optimi-


sation problems. Biol. Cybern. 52, 141.

[25] PRESS, W, TEUKOLSKY, S, VETTERLING, W & FLANNERY, W 1992 Numerical


Recipes in C: The Art of Scientific Computing, second edition. New York: Cam-
bridge University Press.

[26] KAMPEN, N. V 2007 Stochastic processes in physics and chemistry. North Hol-
land.

[27] MEHLIG, B, HEERMANN, D. W & FORREST, B. M 1992 Hybrid Monte Carlo


method for condensed-matter systems. Phys. Rev. B 45, 679–685.

[28] BINDER, K, ed. 1986 Monte Carlo Methods in Statistical Physics. Heidelberg:
Springer.

[29] NEWMAN, M. E. J & BARKEMA, G. T 1999 Monte Carlo Methods in Statistical


Physics. Oxford: Clarendon Press.

[30] SALAMON, P, SIBANI, P & FROST, R 2002 Facts, conjectures, and improvements
for simulated annealing. SIAM.

[31] MACHINE LEARNING REPOSITORY UNIVERSITY OF CALIFORNIA IRVINE,


archive.ics.uci.edu/ml, [Online; accessed 18-August-2018].

[32] FISHER, R. A 1936 The use of multiple measurements in taxonomic problems.


Ann. Eugenics 7, 179.

[33] LÖWE, M 1998 On the storage capacity of Hopfield models with correlated
patterns. Ann. Prob. 8, 1216.

[34] MINSKY, M & PAPERT, S 1969 Perceptrons. An Introduction to Computational


Geometry. MIT Press.

[35] KANAL, L. N 2001 Perceptrons. In International Encyclopedia of the Social and


Behavioral Sciences.

[36] LECUN, Y, BOTTOU, L, ORR, G. B & MÜLLER, K.-R 1998 Efficient back prop. In
Neural networks: tricks of the trade (ed. G. B Orr & K.-R Müller). Springer.

[37] NESTEROV, Y 1983 A method of solving a convex programming problem with
convergence rate O(1/k²). Soviet Mathematics Doklady 27, 372.

[38] SUTSKEVER, I 2013 Training recurrent neural networks. PhD thesis, University
of Toronto, [Online; accessed 27-October-2018].

[39] HANSON, S. J & PRATT, L. Y 1989 Comparing biases for minimal network con-
struction with backpropagation. In Advances in Neural Information Processing
Systems 1.

[40] KROGH, A & HERTZ, J. A 1992 A simple weight decay can improve generalization.
In Advances in Neural Information Processing Systems 4.

[41] HASSIBI, B & G.STORK, D 1993 Second order derivatives for network pruning:
Optimal brain surgeon. In Advances in Neural Information Processing Systems
5.

[42] LECUN, Y, DENKER, J. S & SOLLA, S 1990 Optimal brain damage. In Advances
in Neural Information Processing Systems 2 (ed. D. S Touretzky), p. 598.

[43] RUMELHART, D. E, HINTON, G. E & WILLIAMS, R. J 1986 Learning internal


representations by error propagation. In Parallel distributed processing: ex-
plorations in the microstructure of cognition (ed. D. E Rumelhart & J. L MC-
Clelland).

[44] HORNIK, K, STINCHCOMBE, M & WHITE, H 1989 Multilayer feedforward networks are universal
approximators. Neural Networks 2, 359.

[45] CVITANOVIC, P, ARTUSO, G, MAINIERI, R, TANNER, G & VATTAY, G, chaos-


book.org/version15 (Niels Bohr Institute, Copenhagen 2015), Lyapunov expo-
nents, [Online; accessed 30-September-2018].

[46] PENNINGTON, J, SCHOENHOLZ, S. S & GANGULI, S 2017 Resurrecting the sig-


moid in deep learning through dynamical isometry: theory and practice. In
Advances in Neural Information Processing Systems 30.

[47] GLOROT, X, BORDES, A & BENGIO, Y 2011 Deep sparse rectifier neural networks.
In Proceedings of Machine Learning Research.

[48] SUTSKEVER, I, MARTENS, J, DAHL, G & HINTON, G 2013 On the importance of


initialization and momentum in deep learning ACM Digital Library.

[49] GLOROT, X & BENGIO, Y 2010 Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of Machine Learning Research.

[50] SRIVASTAVA, N, HINTON, G, KRIZHEVSKY, A, SUTSKEVER, I & SALAKHUTDINOV, R
2014 Dropout: a simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research.

[51] FRANKLE, J & CARBIN, M 2018 The lottery ticket hypothesis: Finding small,
trainable neural networks arxiv:1803.03635.

[52] IMAGENET, image-net.org, [Online; accessed 3-September-2018].

[53] IOFFE, S & SZEGEDY, C 2015 Batch normalization: Accelerating deep network
training by reducing internal covariate shift arxiv:1502.03167.

[54] SANTURKAR, S, TSIPRAS, D, ILYAS, A & MADRY, A 2018 How does batch nor-
malization help optimization? (No, it is not about internal covariate shift)
arxiv:1805.11604.

[55] KRIZHEVSKY, A, SUTSKEVER, I & HINTON, G. E 2012 Imagenet classification


with deep convolutional neural networks. In Advances in Neural Information
Processing Systems 25.

[56] TENSORFLOW, tensorflow.org, [Online; accessed 3-September-2018].

[57] LECUN, Y & CORTES, C, MNIST, [Online; accessed 3-September-2018].

[58] KERAS, Keras, [Online; accessed 26-September-2018].

[59] THEANO, Theano, [Online; accessed 26-September-2018].

[60] PY TORCH, PyTorch, [Online; accessed 28-October-2018].

[61] SMITH, L. N 2015 Cyclical learning rates for training neural networks
arxiv:1506.01186.

[62] CIRESAN, D, MEIER, U & SCHMIDHUBER, J 2012 Multi-column deep neural


networks for image classification arxiv:1202.2745.

[63] PICASSO, J. P, Pre-processing before digit recognition for nn and cnn trained
with mnist dataset, stackexchange, [Online; accessed 26-September-2018].

[64] KOZIELSKI, M, FORSTER, J & NEY, H 2012 Moment-based image normalization


for handwritten text recognition. In Proceedings of the 2012 International
Conference on Frontiers in Handwriting Recognition.

[65] PASCAL VOC DATA SET, Pascal VOC, [Online; accessed 6-September-2018].

[66] REDMON, J, DIVVALA, S, GIRSHICK, R & FARHADI, A, You only look once: Unified,
real-time object detection, arxiv:1506.02640.

[67] RUSSAKOVSKY, O, DENG, J, SU, H, KRAUSE, J, SATHEESH, S, MA, S, HUANG, Z,


KARPATHY, A, KHOSLA, A, BERNSTEIN, M, BERG, A. C & FEI-FEI, L 2014 Imagenet
large scale visual recognition challenge arxiv:1409.0575.

[68] HE, K, ZHANG, X, REN, S & SUN, J 2015 Deep residual learning for image
recognition arxiv:1512.03385.

[69] LI, F, JOHNSON, J & YEUNG, S, CNN architectures, CNN architectures, [Online;
accessed 23-September-2018].

[70] HU, J, SHEN, L & SUN, G 2018 Squeeze-and-excitation networks


arxiv:1709.01507.

[71] SEIF, G, Deep learning for image recognition: why it’s challenging, where we’ve
been, and what’s next, Towards Data Science, [Online; accessed 26-September-
2018].

[72] Tensor processing unit, github.com/tensorflow/tpu, [Online; accessed 23-


September-2018].

[73] SZEGEDY, C, LIU, W, JIA, Y, SERMANET, P, REED, S, ANGUELOV, D, ERHAN,


D, VANHOUCKE, V & RABINOVICH, A 2014 Going deeper with convolutions
arxiv:1409.4842.

[74] ZENG, X, OUYANG, W, YAN, J, LI, H, XIAO, T, WANG, K, LIU, Y, ZHOU, Y, YANG,
B, WANG, Z, ZHOU, H & WANG, X 2016 Crafting GBD-net for object detection
arxiv:1610.02579.

[75] HERN, A, Computers now better than humans at recognising and sorting
images, The Guardian, [Online; accessed 26-September-2018].

[76] KARPATHY, A, What I learned from competing against a convnet on imagenet,


blog, [Online; accessed 26-September-2018].

[77] KHURSHUDOV, A, Suddenly, a leopard print sofa appears, [Online; accessed


23-August-2018].

[78] KARPATHY, A, ILRSVC labeling interface, [Online; accessed 26-September-


2018].

[79] GEIRHOS, R, TEMME, C. R. M, RAUBER, J, SCHÜTT, H. H, BETHGE, M &


WICHMANN, F. A 2018 Generalisation in Humans and deep neural networks
arxiv:1808.08750.

[80] SZEGEDY, C, ZAREMBA, W, SUTSKEVER, I, BRUNA, J, ERHAN, D, GOODFELLOW, I &
FERGUS, R 2013 Intriguing properties of neural networks
arxiv:1312.6199.

[81] NGUYEN, A, YOSINSKI, J & CLUNE, J 2015 Deep neural networks are easily fooled:
high confidence predictions for unrecognisable images arxiv:1412.1897.

[82] YOSINSKI, J, CLUNE, J, NGUYEN, A, FUCHS, T & LIPSON, H 2015 Understanding


neural networks through deep visualization arxiv:1506.06579.

[83] CHOROMANSKA, A, HENAFF, M, MATHIEU, M, BEN AROUS, G & LECUN, Y 2014


The loss surfaces of multilayer networks arxiv:1412.0233.

[84] KIRKPATRICK, J, PASCANU, R, RABINOWITZ, N, VENESS, J, DESJARDINS, G, RUSU,


A. A, MILAN, K, QUAN, J, RAMALHO, T, GRABSKA-BARWINSKA, A, HASSABIS,
D, CLOPATH, C, KUMARAN, D & HADSELL, R 2016 Overcoming catastrophic
forgetting in neural networks arxiv:1612.00796.

[85] SETTLES, B 2009 Active Learning Literature Survey. Tech. Rep. 1648. University
of Wisconsin–Madison.

[86] SUTSKEVER, I, VINYALS, O & LE, Q. V 2014 Sequence to sequence learning with
neural networks arxiv:1409.3215.

[87] LIPTON, Z. C, BERKOWITZ, J & ELKAN, C 2015 A critical review of recurrent


neural networks for sequence learning arxiv:1506.00019.

[88] PASCANU, R, MIKOLOV, T & BENGIO, Y 2012 On the difficulty of training recur-
rent neural networks arxiv:1211.5063.

[89] HOCHREITER, S & SCHMIDHUBER, J 1997 Long short-term memory. Neural


Computation 9, 1735.

[90] PAPINENI, K, ROUKOS, S, WARD, T & ZHU, W.-J 2002 Bleu: a method for auto-
matic evaluation of machine translation. In Proceedings of the 40th annual
meeting on association for computational linguistics, p. 311.

[91] DOYA, K 1993 Bifurcations of recurrent neural networks in gradient descent


learning. IEEE Transactions on Neural Networks 1, 75.

[92] KARPATHY, A, The unreasonable effectiveness of recurrent neural networks,


webpage, [Online; accessed 4-October-2018].

[93] WILKINSON, M, BEZUGLYY, V & MEHLIG, B 2009 Fingerprints of random flows?


Phys. Fluids 21, 043304.

[94] OJA, E 1982 A simplified neuron model as a principal component analyzer. J.


Math. Biol. 15, 267.

[95] PRITCHARD, J. K, STEPHENS, M & DONNELLY, P 2000 Inference of population


structure using multilocus genotype data. Genetics 155, 945.

[96] Human genome diversity project, HGDP, [Online; accessed 10-October-2018].

[97] COVER, T. M 1965 Geometrical and statistical properties of systems of linear


inequalities with applications in pattern recognition. IEEE Trans. on electronic
computers p. 326.

[98] SOMPOLINSKY, H, Introduction: the perceptron, The perceptron, [Online;


accessed 9-October-2018].

[99] WETTSCHERECK, D & DIETTERICH, T 1992 Improving the performance of radial


basis function networks by learning center locations. In Advances in Neural
Information Processing Systems 4, pp. 1133–1140. Morgan Kaufmann.

[100] POGGIO, T & GIROSI, F 1990 Networks for approximation and learning. In
Proceedings of the IEEE, vol. 78, p. 1481.

[101] SUTTON, R. S, ed. 1992 A special issue of machine learning on reinforcement


learning, reprinted from Machine Learning Vol. 8, Nos. 3–4. Springer.

[102] WIERING, M & VAN OTTERLO, M 1999 Reinforcement learning. Heidelberg:


Springer.

[103] SZEPESVARI, C 2010 Algorithms for reinforcement learning. In Synthesis Lec-


tures on Artificial Intelligence and Machine Learning (ed. R. J Brachmann &
T Dietterich). Morgan and Claypool Publishers.
