Artificial Neural Networks: Introduction, or How The Brain Works
An artificial neural network consists of a number of very simple processors, also called neurons, which are analogous to the biological neurons in the brain. The neurons are connected by weighted links passing signals from one neuron to another.

The output signal is transmitted through the neuron's outgoing connection. The outgoing connection splits into a number of branches that transmit the same signal. The outgoing branches terminate at the incoming connections of other neurons in the network.

[Figure: architecture of a typical artificial neural network. Input signals enter the input layer, pass through a middle layer, and leave as output signals.]

Analogy between biological and artificial neural networks:

  Biological Neural Network | Artificial Neural Network
  Soma                      | Neuron
  Dendrite                  | Input
  Axon                      | Output
  Synapse                   | Weight

© Negnevitsky, Pearson Education, 2005
The neuron as a simple computing element

The neuron computes the weighted sum of the input signals and compares the result with a threshold value, θ. If the net input is less than the threshold, the neuron output is -1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains +1.

[Figure: diagram of a neuron. Input signals arrive over weighted connections; the neuron produces output signals.]

Activation functions of a neuron: the step function, sign function, sigmoid function and linear function.
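The four activation functions can be written down directly. The following is an illustrative Python sketch (the function names are ours, not from the text), with the threshold comparison folded in as sign(net - θ):

```python
import math

# Illustrative sketches of the four activation functions described above.

def step(x):
    """Step function: 1 if the input is non-negative, else 0."""
    return 1 if x >= 0 else 0

def sign(x):
    """Sign function: +1 if the input is non-negative, else -1."""
    return 1 if x >= 0 else -1

def sigmoid(x):
    """Sigmoid function: squashes any input into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    """Linear function: the output equals the net input."""
    return x

def neuron_output(inputs, weights, theta):
    """Weighted sum compared with the threshold theta, as in the text:
    the output is -1 below the threshold and +1 at or above it."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return sign(net - theta)
```

For example, neuron_output([1, 1], [0.5, 0.5], 0.9) gives +1, since the weighted sum 1.0 meets the threshold 0.9.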
Can a single neuron learn a task?

In 1958, Frank Rosenblatt introduced a training algorithm that provided the first procedure for training a simple ANN: a perceptron. The perceptron is the simplest form of a neural network. It consists of a single neuron with adjustable synaptic weights and a hard limiter.

The Perceptron

The operation of Rosenblatt's perceptron is based on the McCulloch and Pitts neuron model. The model consists of a linear combiner followed by a hard limiter. The weighted sum of the inputs is applied to the hard limiter, which produces an output equal to +1 if its input is positive and -1 if it is negative.

[Figure: single-layer two-input perceptron. Inputs x1 and x2, with weights w1 and w2, feed a linear combiner with threshold θ; a hard limiter then produces the output Y.]
Linear separability in the perceptrons

The aim of the perceptron is to classify inputs, x1, x2, ..., xn, into one of two classes, say A1 and A2. In the case of an elementary perceptron, the n-dimensional space is divided by a hyperplane into two decision regions. The hyperplane is defined by the linearly separable function:

    sum_{i=1..n} x_i w_i - θ = 0

For a two-input perceptron the boundary is x1 w1 + x2 w2 - θ = 0; for a three-input perceptron it is x1 w1 + x2 w2 + x3 w3 - θ = 0.

[Figure: (a) the two-input perceptron's boundary line in the (x1, x2) plane and (b) the three-input perceptron's boundary plane in (x1, x2, x3) space, each separating Class A1 from Class A2.]

How does the perceptron learn its classification tasks?

This is done by making small adjustments in the weights to reduce the difference between the actual and desired outputs of the perceptron. The initial weights are randomly assigned, usually in the range [-0.5, 0.5], and then updated to obtain the output consistent with the training examples.
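The boundary equation can be used directly as a classifier. A minimal sketch (the names and example numbers are our own, not from the text) that assigns a point to A1 or A2 according to which side of the hyperplane it lies on:

```python
def classify(x, w, theta):
    """Assign a point to class A1 if sum(x_i * w_i) - theta >= 0,
    i.e. it lies on the positive side of the hyperplane, else A2."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return "A1" if net - theta >= 0 else "A2"

# Example boundary x1 + x2 - 1.5 = 0 in the two-input case:
print(classify([1, 1], [1, 1], 1.5))  # A1: the point (1, 1) is above the line
print(classify([0, 0], [1, 1], 1.5))  # A2: the origin is below the line
```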
If at iteration p, the actual output is Y(p) and the desired output is Yd(p), then the error is given by:

    e(p) = Yd(p) - Y(p),   where p = 1, 2, 3, ...

Iteration p here refers to the pth training example presented to the perceptron. If the error, e(p), is positive, we need to increase perceptron output Y(p), but if it is negative, we need to decrease Y(p).

The perceptron learning rule

    w_i(p + 1) = w_i(p) + α · x_i(p) · e(p)

where p = 1, 2, 3, ... and α is the learning rate, a positive constant less than unity. The perceptron learning rule was first proposed by Rosenblatt in 1960. Using this rule we can derive the perceptron training algorithm for classification tasks.

Perceptron's training algorithm

Step 1: Initialisation
Set initial weights w1, w2, ..., wn and threshold θ to random numbers in the range [-0.5, 0.5].
Perceptron's training algorithm (continued)

Step 2: Activation
Activate the perceptron by applying inputs x1(p), x2(p), ..., xn(p) and desired output Yd(p). Calculate the actual output at iteration p = 1:

    Y(p) = step[ sum_{i=1..n} x_i(p) w_i(p) - θ ]

where n is the number of the perceptron inputs, and step is the step activation function.

Step 3: Weight training
Update the weights of the perceptron

    w_i(p + 1) = w_i(p) + Δw_i(p)

where Δw_i(p) is the weight correction at iteration p. The weight correction is computed by the delta rule:

    Δw_i(p) = α · x_i(p) · e(p)

Example of perceptron learning: the logical operation AND

  Epoch | Inputs | Desired   | Initial     | Actual   | Error | Final
        | x1  x2 | output Yd | weights     | output Y | e     | weights
        |        |           | w1     w2   |          |       | w1     w2
  1     | 0   0  | 0         | 0.3   -0.1  | 0        |  0    | 0.3   -0.1
        | 0   1  | 0         | 0.3   -0.1  | 0        |  0    | 0.3   -0.1
        | 1   0  | 0         | 0.3   -0.1  | 1        | -1    | 0.2   -0.1
        | 1   1  | 1         | 0.2   -0.1  | 0        |  1    | 0.3    0.0
  2     | 0   0  | 0         | 0.3    0.0  | 0        |  0    | 0.3    0.0
        | 0   1  | 0         | 0.3    0.0  | 0        |  0    | 0.3    0.0
        | 1   0  | 0         | 0.3    0.0  | 1        | -1    | 0.2    0.0
        | 1   1  | 1         | 0.2    0.0  | 1        |  0    | 0.2    0.0
  3     | 0   0  | 0         | 0.2    0.0  | 0        |  0    | 0.2    0.0
        | 0   1  | 0         | 0.2    0.0  | 0        |  0    | 0.2    0.0
        | 1   0  | 0         | 0.2    0.0  | 1        | -1    | 0.1    0.0
        | 1   1  | 1         | 0.1    0.0  | 0        |  1    | 0.2    0.1
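Steps 1-3 can be put together into a short trainer. The sketch below reproduces the AND example, assuming a threshold θ = 0.2 and learning rate α = 0.1 (values consistent with the table's arithmetic, though not stated in this excerpt); weights are rounded after each update so the decimal arithmetic stays exact:

```python
def step(x):
    """Step activation: 1 once the net input reaches the threshold."""
    return 1 if x >= 0 else 0

def train_perceptron(examples, w, theta=0.2, alpha=0.1, max_epochs=50):
    """Repeat epochs of the delta rule until one full epoch has no
    errors. theta and alpha are assumed values consistent with the table."""
    for _ in range(max_epochs):
        errors = 0
        for x, yd in examples:
            y = step(sum(xi * wi for xi, wi in zip(x, w)) - theta)
            e = yd - y
            if e != 0:
                errors += 1
                # Delta rule: w_i(p+1) = w_i(p) + alpha * x_i(p) * e(p)
                w = [round(wi + alpha * xi * e, 10) for wi, xi in zip(w, x)]
        if errors == 0:
            break
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND, [0.3, -0.1])  # initial weights as in the table
print(w)  # [0.1, 0.1]
```

The first three epochs of this run match the table row by row; training settles at w1 = w2 = 0.1 after two further epochs.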
Two-dimensional plots of basic logical operations

[Figure: decision regions in the (x1, x2) plane for (a) AND (x1 ∧ x2), (b) OR (x1 ∨ x2) and (c) Exclusive-OR (x1 ⊕ x2).]

A perceptron can learn the operations AND and OR, but not Exclusive-OR.

Multilayer neural networks

A multilayer perceptron is a feedforward neural network with one or more hidden layers. The network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons. The input signals are propagated in a forward direction on a layer-by-layer basis.

[Figure: multilayer perceptron with two hidden layers. Input signals enter the input layer, pass through the first and second hidden layers, and leave the output layer as output signals.]
Three-layer back-propagation neural network

[Figure: inputs x1, x2, ..., xn enter the input layer; weights w_ij connect input neuron i to hidden neuron j, and weights w_jk connect hidden neuron j to output neuron k, producing outputs y1, y2, ..., yl. Input signals flow forward through the input, hidden and output layers; error signals flow backward.]

The back-propagation training algorithm

Step 1: Initialisation
Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range:

    ( -2.4 / Fi , +2.4 / Fi )

where Fi is the total number of inputs of neuron i in the network. The weight initialisation is done on a neuron-by-neuron basis.

Step 2: Activation
Activate the back-propagation neural network by applying inputs x1(p), x2(p), ..., xn(p) and desired outputs yd,1(p), yd,2(p), ..., yd,n(p).

(a) Calculate the actual outputs of the neurons in the hidden layer:

    y_j(p) = sigmoid[ sum_{i=1..n} x_i(p) w_ij(p) - θ_j ]

where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function.
Step 2: Activation (continued)

(b) Calculate the actual outputs of the neurons in the output layer:

    y_k(p) = sigmoid[ sum_{j=1..m} x_jk(p) w_jk(p) - θ_k ]

where m is the number of inputs of neuron k in the output layer.

Step 3: Weight training
Update the weights in the back-propagation network propagating backward the errors associated with output neurons.

(a) Calculate the error gradient for the neurons in the output layer:

    δ_k(p) = y_k(p) · [1 - y_k(p)] · e_k(p)

where e_k(p) = y_d,k(p) - y_k(p). Calculate the weight corrections:

    Δw_jk(p) = α · y_j(p) · δ_k(p)

Step 3: Weight training (continued)

(b) Calculate the error gradient for the neurons in the hidden layer:

    δ_j(p) = y_j(p) · [1 - y_j(p)] · sum_{k=1..l} δ_k(p) w_jk(p)

Calculate the weight corrections:

    Δw_ij(p) = α · x_i(p) · δ_j(p)
Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.

As an example, we may consider the three-layer back-propagation network. Suppose that the network is required to perform the logical operation Exclusive-OR. Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net.

Three-layer network for solving the Exclusive-OR operation

[Figure: inputs x1 and x2 feed hidden neurons 3 and 4 through weights w13, w14, w23 and w24; the hidden neurons feed output neuron 5 through weights w35 and w45, producing output y5. Each threshold θ3, θ4, θ5 is connected to a fixed input of -1.]

The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1.

The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.
We consider a training set where inputs x1 and x2 are equal to 1 and desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

    y3 = sigmoid(x1 w13 + x2 w23 - θ3) = 1 / [1 + e^-(1·0.5 + 1·0.4 - 1·0.8)] = 0.5250
    y4 = sigmoid(x1 w14 + x2 w24 - θ4) = 1 / [1 + e^-(1·0.9 + 1·1.0 + 1·0.1)] = 0.8808

Now the actual output of neuron 5 in the output layer is determined as:

    y5 = sigmoid(y3 w35 + y4 w45 - θ5) = 1 / [1 + e^-(-0.5250·1.2 + 0.8808·1.1 - 1·0.3)] = 0.5097

Thus, the following error is obtained:

    e = yd,5 - y5 = 0 - 0.5097 = -0.5097

The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer.

First, we calculate the error gradient for neuron 5 in the output layer:

    δ5 = y5 (1 - y5) e = 0.5097 · (1 - 0.5097) · (-0.5097) = -0.1274

Then we determine the weight corrections assuming that the learning rate parameter, α, is equal to 0.1:

    Δw35 = α · y3 · δ5 = 0.1 · 0.5250 · (-0.1274) = -0.0067
    Δw45 = α · y4 · δ5 = 0.1 · 0.8808 · (-0.1274) = -0.0112
    Δθ5  = α · (-1) · δ5 = 0.1 · (-1) · (-0.1274) = 0.0127

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

    δ3 = y3 (1 - y3) · δ5 · w35 = 0.5250 · (1 - 0.5250) · (-0.1274) · (-1.2) = 0.0381
    δ4 = y4 (1 - y4) · δ5 · w45 = 0.8808 · (1 - 0.8808) · (-0.1274) · 1.1 = -0.0147

We then determine the weight corrections:

    Δw13 = α · x1 · δ3 = 0.1 · 1 · 0.0381 = 0.0038
    Δw23 = α · x2 · δ3 = 0.1 · 1 · 0.0381 = 0.0038
    Δθ3  = α · (-1) · δ3 = 0.1 · (-1) · 0.0381 = -0.0038
    Δw14 = α · x1 · δ4 = 0.1 · 1 · (-0.0147) = -0.0015
    Δw24 = α · x2 · δ4 = 0.1 · 1 · (-0.0147) = -0.0015
    Δθ4  = α · (-1) · δ4 = 0.1 · (-1) · (-0.0147) = 0.0015
At last, we update all weights and thresholds:

    w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
    w14 = w14 + Δw14 = 0.9 - 0.0015 = 0.8985
    w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
    w24 = w24 + Δw24 = 1.0 - 0.0015 = 0.9985
    w35 = w35 + Δw35 = -1.2 - 0.0067 = -1.2067
    w45 = w45 + Δw45 = 1.1 - 0.0112 = 1.0888
    θ3  = θ3 + Δθ3  = 0.8 - 0.0038 = 0.7962
    θ4  = θ4 + Δθ4  = -0.1 + 0.0015 = -0.0985
    θ5  = θ5 + Δθ5  = 0.3 + 0.0127 = 0.3127

[Figure: learning curve for operation Exclusive-OR. The sum-squared network error falls from about 1 to below 10^-3 over 224 epochs.]

Final results of three-layer network learning:

  Inputs  | Desired   | Actual    | Error   | Sum of
  x1  x2  | output yd | output y5 | e       | squared errors
  1   1   | 0         | 0.0155    | -0.0155 | 0.0010
  0   1   | 1         | 0.9849    |  0.0151 |
  1   0   | 1         | 0.9849    |  0.0151 |
  0   0   | 0         | 0.0175    | -0.0175 |
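The whole worked pass can be checked in a few lines of Python. This sketch simply transcribes the calculation above (the variable names mirror the weight labels; θ is written t):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Initial weights and thresholds of the Exclusive-OR network
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3
x1, x2, yd5, alpha = 1, 1, 0, 0.1

# Forward pass
y3 = sigmoid(x1 * w13 + x2 * w23 - t3)   # 0.5250
y4 = sigmoid(x1 * w14 + x2 * w24 - t4)   # 0.8808
y5 = sigmoid(y3 * w35 + y4 * w45 - t5)   # 0.5097
e = yd5 - y5                             # -0.5097

# Backward pass: error gradients
d5 = y5 * (1 - y5) * e                   # -0.1274
d3 = y3 * (1 - y3) * d5 * w35            #  0.0381
d4 = y4 * (1 - y4) * d5 * w45            # -0.0147

# Weight and threshold updates (each threshold has a fixed input of -1)
w35 += alpha * y3 * d5;  w45 += alpha * y4 * d5;  t5 += alpha * -1 * d5
w13 += alpha * x1 * d3;  w23 += alpha * x2 * d3;  t3 += alpha * -1 * d3
w14 += alpha * x1 * d4;  w24 += alpha * x2 * d4;  t4 += alpha * -1 * d4

print(round(w35, 4), round(t5, 4))  # -1.2067 0.3127
```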
Learning with momentum

We also can accelerate training by including a momentum term in the delta rule:

    Δw_jk(p) = β · Δw_jk(p - 1) + α · y_j(p) · δ_k(p)

where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95. This equation is called the generalised delta rule.

[Figure: learning with momentum for operation Exclusive-OR. Training converges in 126 epochs; the sum-squared error drops below 10^-3, and a lower panel traces the learning process over the epochs.]

Learning with adaptive learning rate

To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics:

Heuristic 1: If the change of the sum of squared errors has the same algebraic sign for several consequent epochs, then the learning rate parameter, α, should be increased.

Heuristic 2: If the algebraic sign of the change of the sum of squared errors alternates for several consequent epochs, then the learning rate parameter, α, should be decreased.
Learning with adaptive learning rate (continued)

Adapting the learning rate requires some changes in the back-propagation algorithm. If the sum of squared errors at the current epoch grows beyond a predefined ratio of its previous value, the learning rate parameter is decreased; if the error is less than the previous one, the learning rate is increased (typically by 1.05).

[Figure: learning with adaptive learning rate (training for 103 epochs) and learning with momentum and adaptive learning rate (training for 85 epochs). Each plot shows the sum-squared error and the learning rate against the epoch number.]
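Both ideas fit in a few lines. The sketch below is an illustration under stated assumptions: the three-epoch window and the 0.7 shrink factor are our choices; the text fixes only the 1.05 growth factor and β = 0.95.

```python
beta = 0.95   # typical momentum constant, as in the text

def generalised_delta(dw_prev, alpha, y_j, delta_k):
    """Generalised delta rule: the new correction keeps a fraction
    beta of the previous correction as momentum."""
    return beta * dw_prev + alpha * y_j * delta_k

def adapt_rate(alpha, sse_history, window=3, grow=1.05, shrink=0.7):
    """Heuristics 1 and 2 applied to a history of sum-squared errors.
    window=3 and shrink=0.7 are illustrative assumptions."""
    changes = [b - a for a, b in zip(sse_history, sse_history[1:])]
    recent = changes[-window:]
    if len(recent) < window:
        return alpha
    signs = [c > 0 for c in recent]
    if all(signs) or not any(signs):
        return alpha * grow      # Heuristic 1: same sign for several epochs
    if all(a != b for a, b in zip(signs, signs[1:])):
        return alpha * shrink    # Heuristic 2: alternating sign
    return alpha

print(round(adapt_rate(0.1, [1.0, 0.9, 0.8, 0.7]), 4))  # 0.105: error keeps falling
print(round(adapt_rate(0.1, [1.0, 0.9, 1.0, 0.9]), 4))  # 0.07: error oscillates
```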
The Hopfield Network

Neural networks were designed on analogy with the brain. The brain's memory, however, works by association. For example, we can recognise a familiar face even in an unfamiliar environment within 100-200 ms. We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music. The brain routinely associates one thing with another.

Multilayer neural networks trained with the back-propagation algorithm are used for pattern recognition problems. However, to emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network. A recurrent neural network has feedback loops from its outputs to its inputs. The presence of such loops has a profound impact on the learning capability of the network.

The stability of recurrent networks intrigued several researchers in the 1960s and 1970s. However, none was able to predict which network would be stable, and some researchers were pessimistic about finding a solution at all. The problem was solved only in 1982, when John Hopfield formulated the physical principle of storing information in a dynamically stable network.
Single-layer n-neuron Hopfield network

[Figure: inputs x1, x2, ..., xn feed neurons 1, 2, ..., n, which produce outputs y1, y2, ..., yn; each output is fed back to the inputs.]

The Hopfield network uses McCulloch and Pitts neurons with the sign activation function as its computing element:

    Y^sign = { +1, if X > 0
               -1, if X < 0
                Y, if X = 0 }

The current state of the Hopfield network is determined by the current outputs of all neurons, y1, y2, ..., yn. Thus, for a single-layer n-neuron network, the state can be defined by the state vector as:

    Y = [ y1  y2  ...  yn ]^T
10
In the Hopfield network, synaptic weights between Possible states for the three-neuron The stable state-vertex is determined by the weight
Hopfield network matrix W, the current input vector X, and the
neurons are usually represented in matrix form as
follows: y2 threshold matrix q. If the input vector is partially
incorrect or incomplete, the initial state will converge
M (1,1, 1) (1, 1, 1)
into the stable state-vertex after a few iterations.
W YmYmT M I Suppose, for instance, that our network is required to
m1
(1, 1, 1) (1, 1, 1) memorise two opposite states, (1, 1, 1) and (1, 1, 1).
where M is the number of states to be memorised y1
Thus,
by the network, Ym is the n-dimensional binary 0
1 1
vector, I is n n identity matrix, and superscript T Y2 1 or Y1T 1 1 1 Y2T 1 1 1
(1,1,1) (1,1,1) Y1 1
denotes matrix transposition.
1 1
61 62 63
The 3 × 3 identity matrix I is

    I = [ 1  0  0
          0  1  0
          0  0  1 ]

Thus, we can now determine the weight matrix as follows:

    W = Y1 Y1^T + Y2 Y2^T - 2 I = [ 0  2  2
                                    2  0  2
                                    2  2  0 ]

Next, the network is tested by the sequence of input vectors, X1 and X2, which are equal to the output (or target) vectors Y1 and Y2, respectively. First, we activate the Hopfield network by applying the input vector X. Then, we calculate the actual output vector Y, and finally, we compare the result with the initial input vector X:

    Y1 = sign(W X1 - θ) = [ 1  1  1 ]^T
    Y2 = sign(W X2 - θ) = [ -1  -1  -1 ]^T

The remaining six states are all unstable. However, stable states (also called fundamental memories) are capable of attracting states that are close to them. The fundamental memory (1, 1, 1) attracts unstable states (-1, 1, 1), (1, -1, 1) and (1, 1, -1). Each of these unstable states represents a single error, compared to the fundamental memory (1, 1, 1). The fundamental memory (-1, -1, -1) attracts unstable states (-1, -1, 1), (-1, 1, -1) and (1, -1, -1). Thus, the Hopfield network can act as an error correction network.
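The three-neuron example can be verified with a short sketch (plain Python lists instead of matrix notation; the thresholds are taken as zero, as in the calculation above):

```python
def sign(x, prev):
    """Sign activation: keep the previous output when the net input is 0."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return prev

def hopfield_weights(memories):
    """W = sum_m Ym Ym^T - M I for bipolar memory vectors."""
    n, M = len(memories[0]), len(memories)
    W = [[sum(m[i] * m[j] for m in memories) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        W[i][i] -= M          # subtracting M I zeroes the diagonal
    return W

def update(W, state):
    """One synchronous update of all neurons (thresholds are zero)."""
    n = len(state)
    return [sign(sum(W[i][j] * state[j] for j in range(n)), state[i])
            for i in range(n)]

W = hopfield_weights([(1, 1, 1), (-1, -1, -1)])
print(W)                      # [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
print(update(W, [-1, 1, 1]))  # [1, 1, 1]: the single error is corrected
```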
Storage capacity of the Hopfield network

Storage capacity is the largest number of fundamental memories that can be stored and retrieved correctly. The maximum number of fundamental memories Mmax that can be stored in the n-neuron recurrent network is limited by

    Mmax = 0.15 n

Bidirectional associative memory (BAM)

The Hopfield network represents an autoassociative type of memory: it can retrieve a corrupted or incomplete memory, but cannot associate this memory with another different memory.

Human memory is essentially associative. One thing may remind us of another, and that of another, and so on. We use a chain of mental associations to recover a lost memory. If we forget where we left an umbrella, we try to recall where we last had it, what we were doing, and who we were talking to. We attempt to establish a chain of associations, and thereby to restore a lost memory.

To associate one memory with another, we need a recurrent neural network capable of accepting an input pattern on one set of neurons and producing a related, but different, output pattern on another set of neurons. Bidirectional associative memory (BAM), first proposed by Bart Kosko, is a heteroassociative network. It associates patterns from one set, set A, to patterns from another set, set B, and vice versa. Like a Hopfield network, the BAM can generalise and also produce correct outputs despite corrupted or incomplete inputs.
[Figure: BAM operation. (a) Forward direction: an input pattern on the input layer produces the pattern ym(p) on the output layer. (b) Backward direction: the output pattern is passed back through the network to give xn(p + 1) on the input layer.]

To store pattern pairs, the BAM correlation matrix is computed as:

    W = sum_{m=1..M} Xm Ym^T

where M is the number of pattern pairs to be stored in the BAM.
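A sketch of BAM storage and recall under this formula. The four-bit/two-bit pattern pair below is our own illustration, not from the text:

```python
def sign_vec(net, prev):
    """Elementwise sign activation; on a zero net input each neuron
    keeps its previous output."""
    return [1 if x > 0 else -1 if x < 0 else p for x, p in zip(net, prev)]

def bam_weights(pairs):
    """Correlation matrix W = sum_m Xm Ym^T (an n-by-m matrix)."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(x[i] * y[j] for x, y in pairs) for j in range(m)]
            for i in range(n)]

def forward(W, x, y_prev):
    """Layer A to layer B: y = sign(W^T x)."""
    net = [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]
    return sign_vec(net, y_prev)

def backward(W, y, x_prev):
    """Layer B to layer A: x = sign(W y)."""
    net = [sum(W[i][j] * y[j] for j in range(len(y))) for i in range(len(W))]
    return sign_vec(net, x_prev)

# Store one pair of opposite associations (illustrative patterns):
pairs = [([1, -1, 1, -1], [1, -1]), ([-1, 1, -1, 1], [-1, 1])]
W = bam_weights(pairs)

print(forward(W, [1, -1, 1, -1], [1, 1]))   # [1, -1]: exact recall
print(forward(W, [-1, -1, 1, -1], [1, 1]))  # [1, -1]: recall despite one flipped bit
```

Calling backward(W, [1, -1], ...) recovers the A-layer pattern, illustrating the bidirectional recall.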
Stability and storage capacity of the BAM

The BAM is unconditionally stable. This means that any set of associations can be learned without risk of instability.

The maximum number of associations to be stored in the BAM should not exceed the number of neurons in the smaller layer.

The more serious problem with the BAM is incorrect convergence. The BAM may not always produce the closest association. In fact, a stable association may be only slightly related to the initial input vector.