CS407 Neural Computation

Lecture 3: Neural Network

Learning Rules

Lecturer: A/Prof. M. Bennamoun

■ Learning is a process by which the free parameters of a
neural network (NN) are adapted through stimulation from its environment
■ Sequence of events
– the NN is stimulated by an environment
– the NN undergoes changes in its free parameters
– the NN responds in a new way to the environment
■ Learning algorithm
– a prescribed set of steps by which a system learns
• the algorithms differ in how they adjust the synaptic weights of a neuron
– There is no unique learning algorithm; there is a kit of tools
■ This lecture covers
– five learning rules and the learning paradigms
– probabilistic and statistical aspects of learning

Gradients and Derivatives

Gradient Descent Minimization
Gradients and Derivatives.
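Differential calculus is the branch of mathematics concerned
with computing gradients. Consider a function y = f(x).

The gradient, or rate of change, of f(x) at a particular value of x
can be approximated by ∆y/∆x, the change in y divided by the
change in x. Or we can write it exactly as

$$\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

which is known as the partial derivative of f(x) with respect to x
(for a function of a single variable it is the ordinary derivative df/dx).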

Examples of Computing Derivatives
Some simple examples should make this clearer:
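As a hedged illustration, two textbook cases can be worked directly from the limit definition of the derivative. The functions f(x) = ax and f(x) = x² are chosen here for simplicity, with a an arbitrary constant.

```latex
\begin{align*}
f(x) = a x: \quad
  \frac{\partial f}{\partial x}
  &= \lim_{\Delta x \to 0} \frac{a (x + \Delta x) - a x}{\Delta x} = a \\
f(x) = x^2: \quad
  \frac{\partial f}{\partial x}
  &= \lim_{\Delta x \to 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x}
   = \lim_{\Delta x \to 0} (2x + \Delta x) = 2x
\end{align*}
```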

Other derivatives can be computed in the same way. Some
useful ones are:
Gradient Descent Minimisation
Suppose we have a function f(x) and we want to change the
value of x to minimise f(x). What we need to do depends on the
derivative of f(x). There are three cases to consider:
If ∂f/∂x > 0, then f(x) increases as x increases, so we should decrease x
If ∂f/∂x < 0, then f(x) decreases as x increases, so we should increase x
If ∂f/∂x = 0, then f(x) is at a maximum or minimum, so we should not change x

In summary, we can decrease f(x) by changing x by the amount:

$$\Delta x = x_{\text{new}} - x_{\text{old}} = -\eta \frac{\partial f}{\partial x}$$

where η is a small positive constant specifying how much we
change x by, and the derivative ∂f/∂x tells us which direction to
go in. If we repeatedly use this equation, f(x) will (assuming η
is sufficiently small) keep descending towards its minimum,
and hence this procedure is known as gradient descent
minimisation.
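The repeated update x ← x − η ∂f/∂x can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `gradient_descent`, the test function f(x) = (x − 3)², the step size and the step count are all assumptions chosen for the example.

```python
def gradient_descent(df, x0, eta=0.1, steps=100):
    """Repeatedly apply x <- x - eta * df(x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)   # step against the sign of the derivative
    return x


# Illustrative function (an assumption): f(x) = (x - 3)^2,
# whose derivative is 2 * (x - 3) and whose minimum lies at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

With η = 0.1 each step shrinks the distance to the minimum by a constant factor; a much larger η (above 1 for this particular function) would make the iteration diverge, which is why η must be sufficiently small.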
Types of Learning
Learning with Teacher
■ Supervised learning
■ The teacher has knowledge of the environment to be learned
■ Input and desired-output pairs are given as a
training set
■ Parameters are adjusted based on the error signal
– The desired response of the system is provided
by a teacher; e.g., the distance ρ[d,o] between the desired
response d and the actual output o serves as an error measure
– Estimate the negative error gradient direction and reduce the
error accordingly
• Modify the synaptic weights to carry out a stochastic
minimization of the error in multidimensional weight space
■ Move toward a minimum point of the error surface
– it may not be a global minimum
– use the gradient of the error surface: the negative gradient is the direction of steepest descent
■ Good for pattern recognition and function approximation
Unsupervised Learning
■ Self-organized learning
– The desired response is unknown, so no explicit error
information can be used to improve the network
• E.g. finding the cluster boundaries of the input
– Suitable weight self-adaptation mechanisms have
to be embedded in the trained network
– No external teacher or critic
– A task-independent measure of quality is required
for learning
– Network parameters are optimized with respect to
that measure
– The competitive learning rule is a case of unsupervised
learning
Learning without Teacher
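■ Reinforcement learning
– No teacher provides the direct (desired) response at
each step
• the feedback is only evaluative, for example: good/bad, win/lose

[Figure: reinforcement learning block diagram; the labels that survive are "Environment", "Critic" and "Primary reinforcement"]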
■ Training set: the ensemble of “inputs” used to train
the system. For a supervised network, it is the
ensemble of “input-desired response” pairs used to
train the system.
■ Validation set: the ensemble of samples used to
validate the parameters chosen during training
(not to be confused with the test set, which assesses
the performance of the classifier).
■ Test set: the ensemble of “input-desired response”
data used to verify the performance of a trained
system. This data is not used for training.
■ Training epoch: one cycle through the set of training patterns.
■ Generalization: the ability of a NN to produce
reasonable responses to input patterns that are
similar, but not identical, to training patterns.
■ Asynchronous updates: a process in which weights or
activations are updated one at a time, rather than all
being updated simultaneously.
■ Synchronous updates: all weights are adjusted at the
same time.
■ Inhibitory connection: a connection link between two
neurons such that a signal sent over this link will
reduce the activation of the neuron that receives the
signal. This may result from the connection having a
negative weight, or from the signal received being
used to reduce the activation of a neuron by scaling
the net input the neuron receives from other neurons.
■ Activation: a node’s level of activity; the result of
applying the activation function to the net input to the
node. Typically this is also the value the node transmits.

Vectors- Overview
Vectors- A Brief review

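Vector with respect to Cartesian axes: a 2-D vector v = (v1, v2) has length

$$\|\vec{v}\| = \sqrt{v_1^2 + v_2^2}$$

Inner product- A Brief review…

$$\vec{v} \cdot \vec{w} = v_1 w_1 + v_2 w_2 = \sum_{i=1}^{2} v_i w_i$$

$$\vec{v} \cdot \vec{w} = \|\vec{v}\| \, \|\vec{w}\| \cos(\varphi)$$

The projection of v onto w is given by:

$$v_w = \|\vec{v}\| \cos(\varphi) = \frac{\vec{v} \cdot \vec{w}}{\|\vec{w}\|}$$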
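The vector length, the inner product and the projection of v onto w can be checked numerically. The following is a small sketch; the helper names `dot`, `norm` and `projection_length` and the example vectors are assumptions made for illustration.

```python
import math


def dot(v, w):
    """Inner product: sum of the element-wise products."""
    return sum(vi * wi for vi, wi in zip(v, w))


def norm(v):
    """Euclidean length of v."""
    return math.sqrt(dot(v, v))


def projection_length(v, w):
    """Scalar projection of v onto w: |v| cos(phi) = v.w / |w|."""
    return dot(v, w) / norm(w)


v = [3.0, 4.0]   # example vectors, chosen arbitrarily
w = [1.0, 0.0]
length_v = norm(v)
cos_phi = dot(v, w) / (norm(v) * norm(w))
v_on_w = projection_length(v, w)
print(length_v, cos_phi, v_on_w)
```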
Learning Rules (LR)
The General Learning Rule
■ The weight adjustment is proportional to the
product of the input x and the learning signal r

$$\Delta \vec{w}_i(t) = c \, r[\vec{w}_i(t), \vec{x}(t), d(t)] \, \vec{x}(t)$$

■ c is a positive learning constant.

$$\vec{w}_i(t+1) = \vec{w}_i(t) + \Delta \vec{w}_i(t) = \vec{w}_i(t) + c \, r[\vec{w}_i(t), \vec{x}(t), d(t)] \, \vec{x}(t)$$
Learning Rule 1

Error Correction Learning Rule

LR1: Error Correction Learning
■ Error signal, ek(n)
ek(n) = dk(n) - yk(n)
where n denotes the time step
■ The error signal activates a control mechanism for
corrective adjustment of the synaptic weights
■ Minimizing a cost function, E(n), or index of performance

$$E(n) = \frac{1}{2} e_k^2(n)$$

■ E(n) is also called the instantaneous value of the error energy
■ Step-by-step adjustment until
– the system reaches a steady state; the synaptic weights are stabilized
■ Also called the delta rule or Widrow-Hoff rule
Error Correction Learning…
∆wkj(n) = η ek(n) xj(n)
■ η: rate of learning (learning-rate parameter)
wkj(n+1) = wkj(n) + ∆wkj(n)
wkj(n) = z⁻¹[wkj(n+1)]
■ z⁻¹ is the unit-delay operator
■ The adjustment is proportional to the product of the
error signal and the input signal
■ Error-correction learning is local
■ The learning rate η determines the stability or
convergence of the iterative learning process
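The update ∆wkj(n) = η ek(n) xj(n) can be sketched for a single linear neuron. This is a minimal illustration under stated assumptions: the neuron output is taken as the plain weighted sum y = Σ wj xj, and the function name `error_correction_step`, the toy training pairs, η = 0.1 and the number of passes are all hypothetical choices.

```python
def error_correction_step(w, x, d, eta):
    """One update: e = d - y, then w_j <- w_j + eta * e * x_j."""
    y = sum(wj * xj for wj, xj in zip(w, x))   # linear output (assumption)
    e = d - y                                  # error signal
    w_new = [wj + eta * e * xj for wj, xj in zip(w, x)]
    return w_new, e


# Toy data generated by the target weights [2, -1] (an assumption).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
for _ in range(200):                # repeated passes over the training set
    for x, d in samples:
        w, e = error_correction_step(w, x, d, eta=0.1)
print(w)
```

Because only ek(n) and xj(n) enter the update for wkj, each weight can be adjusted from information available at its own synapse, which is the sense in which the rule is local.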
E.g 1: Perceptron Learning Rule
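■ Supervised learning, only applicable for binary neuron
response (e.g. [-1,1])
■ The learning signal is equal to the difference between the desired and the actual neuron response:

$$r = d_i - o_i, \qquad o_i = \operatorname{sgn}(\vec{w}_i^{T} \vec{x})$$

E.g., in a classification task, the weight is adapted only
when a classification error occurs
■ The weights are initialised at random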
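A minimal sketch of perceptron learning with the learning signal r = d − o, where o = sgn(w·x). Several details are assumptions made for the example: the weights start at zero rather than at random values so that the run is repeatable, sgn(0) is taken as +1, and the training set is the logical AND function with a constant bias input appended to each pattern.

```python
def sgn(v):
    """Bipolar threshold; treating sgn(0) as +1 is an assumption."""
    return 1 if v >= 0 else -1


def train_perceptron(samples, c=0.5, epochs=20):
    """Apply w <- w + c * (d - o) * x over several passes of the data."""
    w = [0.0] * len(samples[0][0])   # zeros instead of random (assumption)
    for _ in range(epochs):
        for x, d in samples:
            o = sgn(sum(wj * xj for wj, xj in zip(w, x)))
            r = d - o                # zero when the pattern is classified correctly
            w = [wj + c * r * xj for wj, xj in zip(w, x)]
    return w


# Logical AND with targets in {-1, +1}; the third input is a constant bias of 1.
and_samples = [([0, 0, 1], -1), ([0, 1, 1], -1), ([1, 0, 1], -1), ([1, 1, 1], 1)]
w = train_perceptron(and_samples)
predictions = [sgn(sum(wj * xj for wj, xj in zip(w, x))) for x, _ in and_samples]
print(w, predictions)
```

Since r is zero for every correctly classified pattern, the weights change only when a classification error occurs.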
E.g2:Delta Learning Rule
■ Supervised learning, only applicable for continuous
activation functions
■ The learning signal r is called delta and is defined as:

$$r = [d_i - f(\vec{w}_i^{T}\vec{x})] \, f'(\vec{w}_i^{T}\vec{x})$$

– Derived by calculating the gradient vector with
respect to wi of the squared error E = ½ (di − oi)², where oi = f(wiᵀx).
■ The weights are initialised at random
■ Also called the continuous perceptron training rule
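The gradient calculation behind the delta rule can be written out in three lines. This is a sketch under two assumptions: the squared error is defined with the usual factor ½, and the learning constant c of the general learning rule plays the role of the gradient-descent step size.

```latex
\begin{align*}
E &= \tfrac{1}{2}\,(d_i - o_i)^2, \qquad
     o_i = f(\mathrm{net}_i), \quad \mathrm{net}_i = \vec{w}_i^{T}\vec{x} \\
\nabla_{\vec{w}_i} E &= -(d_i - o_i)\, f'(\mathrm{net}_i)\, \vec{x} \\
\Delta \vec{w}_i &= -c\, \nabla_{\vec{w}_i} E
   = c\, (d_i - o_i)\, f'(\mathrm{net}_i)\, \vec{x}
\end{align*}
```

Comparing the last line with the general form ∆wi = c r x gives the learning signal r = (di − oi) f'(neti).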
E.g3: Widrow-Hoff LR (Widrow, 1962)
■ Supervised learning, independent of the activation
function of the neuron
■ Minimizes the squared error between the desired output
value and the neuron's activation value neti = wiᵀx
– Sometimes called the LMS (Least Mean Square)
learning rule
■ The learning signal r is:

$$r = d_i - \vec{w}_i^{T}\vec{x}$$

■ Considered a special case of the delta learning rule, obtained for the identity activation f(net) = net

Learning Rule 2

Memory-based Learning Rule

LR2: Memory-based Learning
■ In memory-based learning, all (or most) of the
past experiences are explicitly stored in a
large memory of correctly classified input-output
examples

$$\{(\vec{x}_i, d_i)\}_{i=1}^{N}$$

– where xi denotes an input vector and di
denotes the corresponding desired response
■ When classification of a test vector xtest (not
seen before) is required, the algorithm
responds by retrieving and analyzing the
training data in a “local neighborhood” of xtest
■ All memory-based learning algorithms involve
2 essential ingredients (which make them
differ from one another)
– The criterion used for defining the local neighborhood of xtest
– The learning rule applied to the training
examples in the local neighborhood of xtest
■ Nearest Neighbor Rule (NNR)
– the vector X’N ∈ { X1, X2, …, XN } is the
nearest neighbor of Xtest if

$$\min_i d(\vec{X}_i, \vec{X}_{test}) = d(\vec{X}'_N, \vec{X}_{test})$$

where d(·,·) is the distance between two vectors
– Xtest is assigned to the class of X’N
LR2: Nearest Neighbor Rule (NNR)
■ Cover and Hart (1967) analysed the NNR under two assumptions:
– The examples (xi,di) are independent and
identically distributed (iid), according to
the joint pdf of the example (x,d)
– The sample size N is infinitely large
■ Results
– works well if there is no feature or class noise
– as the number of training cases grows
large, the error rate of 1-NN is at most 2
times the Bayes optimal rate
– ⇒ Half of the “classification information”
in a training set of infinite size is
contained in the nearest neighbor
LR2: k-Nearest Neighbor Rule
■ k-nearest neighbor rule (a variant of the NNR)
– Identify the k classified patterns that lie
nearest to Xtest, for some integer k
– Assign Xtest to the class that is most frequently
represented in the k nearest neighbors of Xtest
■ KNN: find the k nearest neighbors of an input pattern
■ A radial-basis function network is also a memory-based classifier

K nearest neighbors
Data are represented as
high-dimensional vectors
KNN requires:
• Distance metric
• Choice of K
• Potentially a choice of element weighting
Given a new example (new item):
• Compute distances to each known example
• Pick the K best (smallest) distances
• Assign to the new item the class that is most common among these K neighbors
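The three steps (compute distances, pick the K best, assign the majority class) can be sketched as follows. This is an illustration only: the function name `knn_classify`, the Euclidean distance metric, the unweighted vote and the toy data are assumptions. Setting k = 1 gives the nearest neighbor rule.

```python
import math
from collections import Counter


def knn_classify(train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training examples.

    train is a list of (vector, label) pairs; Euclidean distance is assumed.
    """
    # Sort the stored examples by distance to x_test and keep the k closest.
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x_test))[:k]
    # Assign the class that occurs most often among those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy memory of correctly classified examples (illustrative values).
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label_near_origin = knn_classify(train, (0.5, 0.5), k=3)
label_far = knn_classify(train, (5.5, 5.5), k=3)
label_1nn = knn_classify(train, (4.0, 4.0), k=1)
print(label_near_origin, label_far, label_1nn)
```

Nothing is fitted in advance: the whole training set is kept in memory and all the work happens at query time, which is what makes the method memory-based.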
Example: image search
■ Images are represented as features (color histogram,
texture moments, etc.)
■ Similarity search uses these features
■ “Find the 10 most similar images for the query image”
Other Applications
■ Web-page search
– “Find the 100 most similar pages for a given page”
– A page is represented as a word-frequency vector
– Similarity: vector distance
■ GIS: “find the 5 cities closest to Brisbane”…
Learning Rule 3

Hebbian Learning Rule

D. Hebb
LR3: Hebbian Learning
“When an axon of cell A is near enough to excite a cell B and
repeatedly or persistently takes part in firing it, some growth
process or metabolic change takes place in one or both cells such
that A’s efficiency, as one of the cells firing B, is increased” (Hebb, 1949)
In other words:
1. If two neurons on either side of a synapse (connection) are activated
simultaneously (i.e. synchronously), then the strength of that synapse is
selectively increased.
This rule is often supplemented by:
2. If two neurons on either side of a synapse are activated
asynchronously, then that synapse is selectively weakened or
eliminated, so that chance coincidences do not build up connection strength.
LR3: Hebbian Learning…
■ A purely feed-forward, unsupervised learning rule
■ The learning signal is equal to the neuron’s output:

$$r = f(\vec{w}_i^{T}\vec{x}) = o_i, \qquad \Delta \vec{w}_i = c \, o_i \, \vec{x}$$

■ The weights are initialised at small random values around
wi=0 prior to learning
■ If the product of output and input (the correlation term oi xj) is
positive, the weight increases; otherwise
the weight decreases
■ It can be seen that the output is strengthened in turn for
each input presented. Therefore, frequent input patterns will have the most influence
on the neuron’s weight vector and will eventually produce
the largest output.
■ In some cases, the Hebbian rule needs to be modified to
counteract unconstrained growth of weight values, which
takes place when excitations and responses consistently
agree in sign.
■ This corresponds to the Hebbian learning rule with
saturation of the weights at a certain, preset level.
■ A single-layer network trained with the Hebb rule on a set
of input-output training vectors is called a HEBB NET
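The Hebbian update and its saturated variant can be sketched as below. This is an illustration under assumptions: the activation is taken as linear (o = w·x), saturation is modelled by clipping each weight to the interval [−w_max, w_max], and the function name `hebb_step` and all numeric values are hypothetical.

```python
def hebb_step(w, x, c=0.1, w_max=None):
    """One Hebbian update w <- w + c * o * x, with o = w.x (linear activation).

    If w_max is given, each weight is clipped to [-w_max, w_max],
    which models saturation at a preset level.
    """
    o = sum(wj * xj for wj, xj in zip(w, x))          # learning signal r = o
    w_new = [wj + c * o * xj for wj, xj in zip(w, x)]
    if w_max is not None:
        w_new = [max(-w_max, min(w_max, wj)) for wj in w_new]
    return w_new


# A single update from small initial weights.
w_once = hebb_step([0.1, 0.1], [1.0, 1.0])

# Presenting the same input repeatedly: input and response always agree in
# sign, so the weights keep growing until the saturation level stops them.
w_sat = [0.1, 0.1]
for _ in range(100):
    w_sat = hebb_step(w_sat, [1.0, 1.0], w_max=1.0)
print(w_once, w_sat)
```

Without the `w_max` clip the weights in this example grow by a factor of 1.2 per presentation, which illustrates the unconstrained growth that the saturated rule is meant to counteract.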
LR3: Hebbian Learning…
■ If the two neurons of a connection are activated
– simultaneously (synchronously), then its strength is increased
– asynchronously, then its strength is weakened or eliminated
■ Hebbian synapse
– time dependent
• depends on the exact time of occurrence of the presynaptic and postsynaptic signals
– local
• locally available information is used
– interactive mechanism
• learning is done by the interaction of the two signals
– conjunctional or correlational mechanism
• co-occurrence of the two signals
■ Hebbian learning is found in the hippocampus
Special case: Correlation LR
■ Supervised learning, applicable for recording data
in memory networks with binary response
■ The learning signal r is simply equal to the
desired output di:

$$r = d_i, \qquad \Delta \vec{w}_i = c \, d_i \, \vec{x}$$

■ A special case of the Hebbian learning rule with a binary
activation function and with oi=di
■ The weights are initialised at small random values around
wi=0 prior to learning (just like the Hebbian rule)
Learning Rule 4

Competitive Learning Rule = Winner-Take-All LR
LR4: Competitive Learning
■ Unsupervised network training, applicable to an
ensemble of neurons (e.g. a layer of p neurons), not
to a single neuron.
■ The output neurons of the NN compete to become active
■ Adapt the neuron m which has the maximum
response to the input x
■ Only a single neuron is active at any one time
– a salient feature for pattern classification
– neurons learn to specialize on ensembles of
similar patterns; therefore,
– they become feature detectors
LR4: Competitive Learning…
■ Basic elements
– A set of neurons that are all the same except for their
synaptic weight distribution
• they therefore respond differently to a given set of input patterns
– A mechanism that lets the neurons compete to respond to
a given input
• the neuron that wins the competition is
called a “winner-takes-all” neuron
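LR4: Competitive NN…
[Figure: competitive network in which a layer of source (input) nodes feeds a single layer of output neurons. Feed-forward connections are excitatory; lateral connections between output neurons are inhibitory (lateral inhibition).]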
LR4: Competitive Learning…
■ Competitive learning rule: adapt the neuron m
which has the maximum response to the input x

$$\vec{w}_m^{T}\vec{x} = \max_{i = 1, \dots, p} \vec{w}_i^{T}\vec{x}, \qquad \Delta w_{mj} = \alpha (x_j - w_{mj})$$

■ Weights are typically initialised at random values and
their strengths are normalized during learning:

$$\sum_j w_{mj} = 1 \quad \text{for all } m$$

■ If a neuron does not respond to a particular input, no
learning takes place in that neuron
■ x has some constant Euclidean length and $\sum_j w_{mj} = 1$ for all m
■ Competitive learning then performs clustering of the input patterns
LR4: Competitive Learning…
■ What is required for the net to encode the training set is
that the weight vectors become aligned with any clusters
present in this set and that each cluster is represented by at
least one node. Then, when a vector is presented to the net,
there will be a node, or group of nodes, which responds
maximally to the input and which responds in this way only
when this vector is shown at the input.
■ If the net can learn a weight vector configuration like this,
without being told explicitly of the existence of clusters at
the input, then it is said to undergo a process of self-organised
or unsupervised learning. This is to be contrasted
with nets trained with, for example, the delta rule,
where a target vector or output has to be supplied.
■ In order to achieve this goal, the weight vectors must be
rotated around the sphere so that they line up with the
training set.
■ The first thing to notice is that this may be achieved in a
gradual and efficient way by moving the weight vector
which is closest (in an angular sense) to the current input
vector slightly towards that vector.
■ The node m with the closest vector is the one which gives the
greatest input excitation v=w.x, since this is just the dot
product of the weight and input vectors. The weight vector
of node m may be aligned more closely
with the input if a change is made according to

$$\Delta w_{mj} = \alpha (x_j - w_{mj})$$
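A minimal winner-take-all sketch: the node with the largest dot product w·x wins, and only its weight vector is moved by ∆wmj = α (xj − wmj). Several details are assumptions made for the illustration: weight vectors are renormalised to unit Euclidean length after each update (one way of keeping them on the sphere), the initial weights are fixed rather than random, and the two unit-length inputs stand in for two clusters.

```python
import math


def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))


def normalize(v):
    """Rescale v to unit Euclidean length."""
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]


def wta_step(W, x, alpha=0.1):
    """Update only the winning node's weights in place; return its index."""
    m = max(range(len(W)), key=lambda i: dot(W[i], x))    # greatest excitation
    moved = [wj + alpha * (xj - wj) for wj, xj in zip(W[m], x)]
    W[m] = normalize(moved)    # renormalisation is an assumption of this sketch
    return m


# Two nodes and two "clusters", each represented by a single unit vector.
W = [normalize([0.8, 0.6]), normalize([0.6, 0.8])]
inputs = [[1.0, 0.0], [0.0, 1.0]]
winners = []
for _ in range(50):
    for x in inputs:
        winners.append(wta_step(W, x))
print(W, winners[:4])
```

Each weight vector rotates towards the input it keeps winning, so after training each node responds maximally to one of the two clusters. No target output is supplied at any point.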
LR4: Winner-Take-All learning…
The winner neighbourhood is sometimes extended
beyond the single winning neuron to include the
neighbouring neurons
Learning Rule 5

Boltzmann Learning Rule

LR5: Boltzmann Learning
■ Rooted in statistical mechanics
■ Boltzmann machine: a NN designed on the basis of Boltzmann learning
■ The neurons constitute a recurrent structure (see
the architecture figure)
– They are stochastic neurons
– They operate in a binary manner: “on”: +1 and “off”: -1
– There are visible neurons and hidden neurons
– Energy function of the machine (xj = state of
neuron j):

$$E = -\frac{1}{2} \sum_j \sum_{k,\, k \neq j} w_{kj} x_k x_j$$

– j≠k means no self-feedback
Boltzmann Machine
[Figure: Architecture of the Boltzmann machine. K is the number of visible neurons and L is the number of hidden neurons.]
Boltzmann Machine Operation
■ Choose a neuron k at random, then flip the state of the
neuron from state xk to state -xk (random perturbation)
with probability

$$P(x_k \to -x_k) = \frac{1}{1 + \exp(-\Delta E_k / T)}$$

where ∆Ek is the energy change of the machine resulting
from such a flip (from state xk to state –xk) and T is a pseudo-temperature
■ If this rule is applied repeatedly, the machine reaches
thermal equilibrium.
■ Two modes of operation
– Clamped condition: the visible neurons are clamped onto
specific states determined by the environment (i.e. under the
influence of the training set).
– Free-running condition: all neurons (visible and hidden)
are allowed to operate freely (i.e. with no environmental input)
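The energy function and the flip probability can be evaluated directly. The sketch below only computes the two formulas; it does not simulate a full machine. The function names `energy` and `flip_probability` and the two-neuron weight matrix are assumptions, and the sign convention for ∆Ek is taken as given, since the notes do not spell out whether it measures an energy increase or decrease.

```python
import math


def energy(W, x):
    """E = -1/2 * sum over j != k of w_kj * x_k * x_j (no self-feedback)."""
    n = len(x)
    return -0.5 * sum(W[k][j] * x[k] * x[j]
                      for j in range(n) for k in range(n) if j != k)


def flip_probability(delta_E, T):
    """P(x_k -> -x_k) = 1 / (1 + exp(-delta_E / T)), T a pseudo-temperature."""
    return 1.0 / (1.0 + math.exp(-delta_E / T))


# Two neurons joined by a symmetric weight of 1 (illustrative values).
W = [[0.0, 1.0], [1.0, 0.0]]
e_aligned = energy(W, [1, 1])     # both neurons "on"
e_opposed = energy(W, [1, -1])    # neurons in opposite states
p_neutral = flip_probability(0.0, T=1.0)
p_hot = flip_probability(2.0, T=100.0)
print(e_aligned, e_opposed, p_neutral, p_hot)
```

At a high pseudo-temperature the probability approaches 1/2 whatever the energy change, so flips become nearly random; at a low T the energy change dominates.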
Boltzmann Machine operation…
■ Such a network can be used for pattern completion.
■ The goal of Boltzmann learning is to maximize the log-likelihood
function (by gradient ascent on L(w), i.e. gradient descent on −L(w))

$$L(\mathbf{w}) = \log \prod_{\mathbf{x}_\alpha \in \Im} P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha \in \Im} \log P(\mathbf{X}_\alpha = \mathbf{x}_\alpha)$$

■ ℑ denotes the set of training examples drawn from the pdf of interest
■ xα represents the state of the visible neurons
■ xβ represents the state of the hidden neurons
■ A set of synaptic weights is called a model of the environment
if it leads to the same probability distribution of the states of the
visible units as the environment
LR5: Boltzmann Learning Rule…
■ Let ρ+kj denote the correlation between the states of
neurons j and k with the network in the clamped condition

$$\rho_{kj}^{+} = \sum_{x_\alpha \in \Im} \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha) \, x_k x_j$$

■ Let ρ−kj denote the correlation between the states of
neurons j and k with the network in the free-running condition

$$\rho_{kj}^{-} = \sum_{x_\alpha \in \Im} \sum_{x} P(X = x) \, x_k x_j$$

■ Boltzmann learning rule (Hinton and Sejnowski, 1986)

$$\Delta w_{kj} = \eta (\rho_{kj}^{+} - \rho_{kj}^{-}), \quad j \neq k$$

where η is a learning-rate parameter
■ ρ+kj and ρ−kj range in value from –1 to +1.
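The weight update ∆wkj = η (ρ+kj − ρ−kj) can be sketched once the two correlations are available. In this illustration they are estimated as plain averages of xk xj over lists of sampled ±1 state vectors, which is an assumption standing in for the probability-weighted sums. The function names and the hand-written state lists are hypothetical, and no sampling of a real machine is performed.

```python
def correlations(states):
    """Average of x_k * x_j over a list of +/-1 state vectors."""
    n = len(states[0])
    return [[sum(s[k] * s[j] for s in states) / len(states) for j in range(n)]
            for k in range(n)]


def boltzmann_delta_w(clamped_states, free_states, eta=0.1):
    """delta_w[k][j] = eta * (rho_plus[k][j] - rho_minus[k][j]) for j != k."""
    rho_plus = correlations(clamped_states)
    rho_minus = correlations(free_states)
    n = len(rho_plus)
    return [[eta * (rho_plus[k][j] - rho_minus[k][j]) if j != k else 0.0
             for j in range(n)] for k in range(n)]


# Hand-written states: clamped states always agree, free-running states
# cover all four combinations equally often.
clamped = [[1, 1], [-1, -1]]
free = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
dW = boltzmann_delta_w(clamped, free)
print(dW)
```

Here the clamped correlation between the two neurons is +1 and the free-running correlation is 0, so the connecting weight is increased. The update vanishes once the free-running statistics match the clamped ones.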

Note: DON’T PANIC. The Boltzmann machine will be presented in detail in future lectures.
End of Learning Rules (LR)
Network complexity
■ No formal methods exist for determining the
network architecture, e.g. the number of
layers in a feed-forward network, the number
of nodes in each layer…
■ The next lectures will focus on specific networks.
Suggested Reading.
■ S. Haykin, “Neural Networks”, Prentice-Hall, 1999,
chapter 2, and section 11.7, chapter 11 (for Boltzmann machine).
■ L. Fausett, “Fundamentals of Neural Networks”,
Prentice-Hall, 1994, chapter 2, and section 7.2.2 of
chapter 7 (for Boltzmann machine).
■ R.P. Lippmann, “An Introduction to Computing with
Neural Nets”, IEEE ASSP Magazine, April 1987: 4-22.
■ B. Widrow, “Generalization and Information Storage in
Networks of Adaline ‘Neurons’”, Self-Organizing
Systems 1962, ed. M.C. Yovits, G.T. Jacobi, G.
Goldstein, Spartan Books, 435-461.
In addition to the references of the previous slide, the
following references were also used to prepare these
lecture notes.

1. Berlin Chen, lecture notes, Normal University, Taipei, Taiwan,
2. Jin Hyung Kim, KAIST Computer Science Dept., CS679
Neural Network lecture notes
3. Kevin Gurney, lecture notes, “Neural Nets”, Univ. of Sheffield,
4. Dr John A. Bullinaria, course material, Introduction to
Neural Networks,
5. Richard Caruana, lecture notes, Cornell Univ.
7. Rothrock-Ling, Wright State Univ., lecture notes:
8. L. Jin, N. Koudas, C. Li, “NNH: Improving Performance of
Nearest-Neighbor Searches Using Histograms”:
9. Ajay Jain, UCSF:

