
Multilayer Neural Networks
Agenda

 Some historical notes
 Multilayer neural networks
 Backpropagation
 Error surfaces and feature mapping
 Speed-ups
 Conclusions
Some historical notes
Rosenblatt’s perceptron 1958: a single neuron
with adjustable synaptic weights and bias
Perceptron convergence theorem 1962
(Rosenblatt): If patterns used to train the
perceptron are drawn from two linearly
separable classes, then the algorithm
converges and positions the decision surface in
the form of a hyperplane between the two classes.
The limitations of networks implementing linear
discriminants were well known in the 1950s and
1960s.
Some historical notes
 A single neuron -> the class of solutions that
can be obtained is not general enough ->
multilayer perceptron

 Widrow & Hoff (1960): least-mean-square
algorithm (LMS) = delta rule: three-layer
networks designed by hand (fixed input-to-hidden
weights, trained hidden-to-output weights).
 The development of backpropagation: Kalman
filtering (Kalman 1960; Bryson, Denham, Dreyfus
1963, 1969).
Some historical notes
 Werbos: Beyond Regression: New
Tools for Prediction and Analysis in
the Behavioral Sciences, Ph.D. thesis,
Harvard University, 1974
 Parker 1982, 1985
 Rumelhart, Hinton, Williams: Learning
representations by back-propagating
errors, Nature 323, pp. 533-536, 1986
Multilayer Neural Networks
 Input layer
 Hidden layer
 Output layer
 Bias unit = neuron
 net_j = w_j^T x
 y_j = f(net_j)
 f(net) = sgn(net)
 z_k = f(net_k)
Multilayer Neural Networks
 The XOR problem:
(x1 OR x2)
AND NOT (x1 AND
x2)
 y1: decision boundary x1 + x2 + 0.5 = 0;
x1 + x2 + 0.5 > 0 -> y1 = +1, otherwise -1
 y2: decision boundary x1 + x2 - 1.5 = 0;
x1 + x2 - 1.5 > 0 -> y2 = +1, otherwise -1
(see the sketch below)
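A minimal Python sketch of this hand-built XOR network, assuming bipolar inputs coded as -1/+1. The output weights are not given on the slide, so z = sgn(y1 - y2 - 1) is one illustrative choice that realizes (x1 OR x2) AND NOT (x1 AND x2).

import numpy as np

def sgn(net):
    # Threshold unit: +1 if net > 0, otherwise -1
    return np.where(net > 0, 1, -1)

# Hidden units from the slide: y1 fires for x1 OR x2, y2 fires for x1 AND x2
W_hidden = np.array([[1.0, 1.0],    # weights of y1
                     [1.0, 1.0]])   # weights of y2
b_hidden = np.array([0.5, -1.5])    # bias terms from the two boundaries above

# Illustrative output unit: z = sgn(y1 - y2 - 1), i.e. y1 AND NOT y2
w_out, b_out = np.array([1.0, -1.0]), -1.0

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    y = sgn(W_hidden @ np.array(x, dtype=float) + b_hidden)  # hidden layer
    z = sgn(w_out @ y + b_out)                               # output layer
    print(x, "->", int(z))                                   # +1 only for the two mixed inputs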
Multilayer Neural Networks
 Any desired
continuous function
can be implemented by
a three-layer network,
given a sufficient number
of hidden units, proper
nonlinearities, and
weights (Kolmogorov)
 Feedforward
 Learning
 (Demo chapter 11)
Multilayer Neural Networks
 Nonlinear multilayer
networks
 How to set the
weights of a three-layer
neural network
based on training
patterns and the
desired output?
Backpropagation Algorithm

 The LMS
algorithm exists
for linear
systems
 Training error

 Learning rule
Backpropagation Algorithm
 Learning rule
 Hidden-to-output weights
 Input-to-hidden weights
 Note that the weights w_ij are
initialized with random values
 Demo Chapter 11
(A sketch of one update is given below.)
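A minimal sketch of one stochastic backpropagation update in Python. It assumes a tanh activation and the sum-of-squared-error criterion J(w) = 1/2 ||t - z||^2; the learning rate, layer sizes and initialization range are only illustrative.

import numpy as np

def f(net):            # tanh: a zero-centered, antisymmetric sigmoid
    return np.tanh(net)

def f_prime(net):
    return 1.0 - np.tanh(net) ** 2

def backprop_step(x, t, W_ji, W_kj, eta=0.1):
    # One stochastic update; bias units are handled by appending 1 to x and y.
    x = np.append(x, 1.0)                    # input plus bias unit
    net_j = W_ji @ x                         # input-to-hidden
    y = np.append(f(net_j), 1.0)             # hidden outputs plus bias unit
    net_k = W_kj @ y                         # hidden-to-output
    z = f(net_k)                             # network outputs

    # Criterion: J(w) = 1/2 * ||t - z||^2
    delta_k = (t - z) * f_prime(net_k)                      # output sensitivities
    delta_j = f_prime(net_j) * (W_kj[:, :-1].T @ delta_k)   # backpropagated to hidden units

    W_kj += eta * np.outer(delta_k, y)       # hidden-to-output update
    W_ji += eta * np.outer(delta_j, x)       # input-to-hidden update
    return 0.5 * np.sum((t - z) ** 2)        # training error for this pattern

# Weights are initialized with small random values, as the slide notes.
rng = np.random.default_rng(0)
W_ji = rng.uniform(-0.5, 0.5, size=(2, 3))   # 2 hidden units, 2 inputs + bias
W_kj = rng.uniform(-0.5, 0.5, size=(1, 3))   # 1 output unit, 2 hidden units + bias
err = backprop_step(np.array([1.0, -1.0]), np.array([1.0]), W_ji, W_kj)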
Backpropagation Algorithm
 Compare with the LMS
algorithm
 1) Method of Steepest
Descent
 The direction of
steepest descent is
opposite to the
gradient vector g = ∇E(w)
 w(n+1) = w(n) - η g(n)
 η is the step size or
learning-rate parameter
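A minimal sketch of the steepest-descent rule on a toy quadratic cost; the cost function and the step size are purely illustrative.

import numpy as np

def steepest_descent_step(w, grad_E, eta=0.01):
    # w(n+1) = w(n) - eta * g(n), with g(n) = grad E(w(n))
    return w - eta * grad_E(w)

# Toy cost E(w) = 1/2 ||w||^2, so grad E(w) = w and the minimum is at the origin.
w = np.array([1.0, -2.0])
for _ in range(5):
    w = steepest_descent_step(w, grad_E=lambda w: w, eta=0.5)
print(w)   # each step halves w, moving it toward the minimum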
Backpropagation Algorithm
 Training set = a set of patterns with known
labels
 Stochastic training = patterns are chosen
randomly from the training set
 Batch training = all patterns are presented to
the network before learning takes place
 On-line protocol = each pattern is presented
once and only once, no memory in use
 Epoch = a single presentation of all patterns in
the training set. The number of epochs is an
indication of the relative amount of learning.
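The three protocols differ only in when the weight update is applied. A minimal sketch, where update, gradient and apply_update are hypothetical callbacks standing in for the network-specific code:

def stochastic_epoch(patterns, labels, update, rng):
    # Stochastic training: patterns chosen at random, weights updated after each one.
    for i in rng.permutation(len(patterns)):
        update(patterns[i], labels[i])

def batch_epoch(patterns, labels, gradient, apply_update):
    # Batch training: all patterns are presented before a single update takes place.
    total_grad = sum(gradient(x, t) for x, t in zip(patterns, labels))
    apply_update(total_grad)

def online_pass(stream, update):
    # On-line protocol: each pattern is seen once and only once, nothing is stored.
    for x, t in stream:
        update(x, t)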
Backpropagation Algorithm
 Stopping criterion
Backpropagation Algorithm

 Learning set
 Validation set
 Test set
 Stopping criterion
 Learning curve = the
average error per
pattern
 Cross-validation
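A minimal sketch of a stopping criterion based on the validation-set learning curve; train_one_epoch and validation_error are hypothetical callbacks, and the patience value is illustrative.

def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # Stop when the average validation error per pattern has not improved
    # for `patience` consecutive epochs.
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                  # one pass over the learning set
        err = validation_error()           # average error per pattern on the validation set
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, best_err            # the test set is reserved for the final estimate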
Error Surfaces and Feature
Mapping
 Note: error backpropagation is based
on gradient descent in a criterion
function J(w), which can be pictured
as an error surface over the weights.
Error Surfaces and Feature
Mapping
 The total training
error is minimized.
 It usually
decreases
monotonically,
even though this is
not the case for the
error on each
individual pattern.
Error Surfaces and Feature
Mapping
 Hidden-to-output
weights ~ a linear
discriminant
 Input-to-hidden
weights ~ ”matched
filter”
Practical Techniques for
Improving Backpropagation
 How to improve
convergence,
performance, and
results?
 Neuron: the sigmoid
function should be
centered at zero and
antisymmetric (e.g. tanh)
Practical Techniques for
Improving Backpropagation
Scaling input variables
= the input patterns should be shifted so
that the average over the training set of
each feature is zero.
= the full data set should be scaled to have
the same variance in each feature
component
Note, this standardization can only be done
for stochastic and batch learning!
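A minimal sketch of this standardization over a training set; the feature values are made up for illustration.

import numpy as np

def standardize(X):
    # Shift each feature to zero mean and scale it to unit variance over the training set.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

X_train = np.array([[170.0, 60.0],
                    [180.0, 80.0],
                    [165.0, 55.0]])
X_scaled, mean, std = standardize(X_train)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # approximately 0 and 1 per feature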
Practical Techniques for
Improving Backpropagation
 When the training set is small one can generate
surrogate training patterns.
 In the absence of problem-specific information,
the surrogate patterns should be made by
adding d-dimensional Gaussian noise to true
training points, the category label should be left
unchanged.
 If we know about the source of variation among
patterns we can manufacture training data.
 The number of hidden units should be less than
the total number of training points n, say roughly
n/10.
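A minimal sketch of generating surrogate patterns by adding Gaussian noise; the noise scale sigma and the number of copies are not specified on the slide and are illustrative.

import numpy as np

def surrogate_patterns(X, y, n_copies=5, sigma=0.1, rng=None):
    # Add d-dimensional Gaussian noise to the true training points;
    # the category labels are left unchanged.
    rng = rng or np.random.default_rng()
    noisy = [X + rng.normal(scale=sigma, size=X.shape) for _ in range(n_copies)]
    return np.vstack([X] + noisy), np.concatenate([y] * (n_copies + 1))

X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1])
X_aug, y_aug = surrogate_patterns(X, y, rng=np.random.default_rng(0))
print(X_aug.shape, y_aug.shape)   # (12, 2) and (12,)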
Practical Techniques for
Improving Backpropagation
 We cannot initialize the weights to 0.
 Initializing weights = uniform learning ->
choose weights randomly from a
single distribution
 Input-to-hidden weights: -1/√d < w_ji < +1/√d,
where d is the number of input units
 Hidden-to-output weights: -1/√n_H < w_kj < +1/√n_H,
where n_H is the number of hidden units
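A minimal sketch of this initialization; the layer sizes are illustrative, and the extra column in each matrix is assumed to hold the bias weight of the corresponding unit.

import numpy as np

rng = np.random.default_rng(0)
d, n_H, c = 4, 10, 3     # number of input, hidden and output units (example sizes)

# Uniform initialization in the ranges given on the slide.
W_ji = rng.uniform(-1 / np.sqrt(d),   1 / np.sqrt(d),   size=(n_H, d + 1))
W_kj = rng.uniform(-1 / np.sqrt(n_H), 1 / np.sqrt(n_H), size=(c, n_H + 1))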
Practical Techniques for
Improving Backpropagation
 Learning Rates
 Demo Chapter 9
 The optimal rate
Practical Techniques for
Improving Backpropagation
 Momentum: allows
the network to learn
more quickly when
plateaus in the
error surface exist.
Demo chapter 12.
Typically α ≈ 0.9
(see the sketch below).
Practical Techniques for
Improving Backpropagation
 Weight Decay to avoid overfitting = to
start with a network with too many
weights and decay all the weights
during training: w_new = w_old(1 - ε),
where 0 < ε < 1
 The weights that are not needed for
reducing the error function become
smaller and smaller -> eliminated
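A minimal sketch of the decay step applied after each weight update; the value of epsilon is illustrative.

def decay_weights(W, eps=0.001):
    # w_new = w_old * (1 - eps), with 0 < eps < 1: weights that the error term does not
    # keep reinforcing shrink toward zero and are effectively eliminated.
    return W * (1.0 - eps)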
Practical Techniques for
Improving Backpropagation
 If we have insufficient
training data for the
desired classification
accuracy, learning with
hints can help.
 Learning with hints =
add output units for
an ancillary
problem, one different
from but related to the
specific classification
problem at hand.
Practical Techniques for
Improving Backpropagation
 Stopped Training
 Number of hidden layers: typically 2-3.
 Criterion Function: cross entropy,
Minkowski error (common forms of both
are sketched below)
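The slide does not spell these criteria out; this is a sketch of their usual forms, with targets t and network outputs z (for cross entropy the outputs are assumed positive, e.g. softmax probabilities).

import numpy as np

def cross_entropy(t, z, eps=1e-12):
    # J = -sum_k t_k * ln(z_k); eps guards against log(0)
    return -np.sum(t * np.log(z + eps))

def minkowski_error(t, z, R=2.0):
    # J = sum_k |z_k - t_k|^R; R = 2 recovers the squared-error criterion
    return np.sum(np.abs(z - t) ** R)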
Practical Techniques for
Improving Backpropagation
 Second-Order Methods
to speed up the
learning rate:
 Newton’s Method,
demo chapter 9
 Quickprop
 Conjugate Gradient
Descent requires batch
training, demo chapter
9
Practical Techniques for
Improving Backpropagation
 2) Newton’s method
 The idea is to minimize the quadratic
approximation of the cost function E(w) around
the current point w(n).
 Using a second-order Taylor series expansion
of the cost function around the point w(n):
 ΔE(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n)
 g(n) is the m-by-1 gradient vector of the cost
function E(w) evaluated at the point w(n). The
matrix H(n) is the m-by-m Hessian matrix of
E(w) (second derivatives), H = ∇²E(w)
Practical Techniques for
Improving Backpropagation
 H = ∇²E(w) requires the cost function E(w) to
be twice continuously differentiable with
respect to the elements of w.
 Differentiating ΔE(w(n)) ≈ g^T(n) Δw(n) +
½ Δw^T(n) H(n) Δw(n) with respect to Δw, the
change ΔE(n) is minimized when
 g(n) + H(n) Δw(n) = 0 -> Δw(n) = -H^-1(n) g(n)
 w(n+1) = w(n) + Δw(n)
 w(n+1) = w(n) - H^-1(n) g(n)
 where H^-1(n) is the inverse of the Hessian of
E(w).
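A minimal sketch of one Newton step on a toy quadratic cost, where a single step reaches the minimum; solving H Δw = -g is preferred to forming H^-1 explicitly.

import numpy as np

def newton_step(w, grad_E, hess_E):
    # w(n+1) = w(n) - H^-1(n) g(n), computed by solving H * delta = -g
    g, H = grad_E(w), hess_E(w)
    return w + np.linalg.solve(H, -g)

# Toy quadratic cost E(w) = 1/2 w^T A w - b^T w, with positive definite A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = newton_step(np.zeros(2), grad_E=lambda w: A @ w - b, hess_E=lambda w: A)
print(w, np.linalg.solve(A, b))   # one Newton step lands on the exact minimizer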
Practical Techniques for
Improving Backpropagation
 Newton's method converges quickly
asymptotically and does not exhibit the
zigzagging behavior of steepest descent.
 Newton's method requires the
Hessian H(n) to be a positive
definite matrix for all n!
Practical Techniques for
Improving Backpropagation
 3) Gauss-Newton Method
 It is applicable to a cost function that is expressed
as the sum of error squares.
 E(w) = ½ Σ_{i=1..n} e²(i); note that all the error
terms are calculated on the basis of a weight vector w
that is fixed over the entire observation interval 1 ≤ i ≤ n.
 The error signal e(i) is a function of the adjustable
weight vector w. Given an operating point w(n), we
linearize the dependence of e(i) on w by writing
 e'(i,w) = e(i) + [∂e(i)/∂w]^T |_(w=w(n)) (w - w(n)), i = 1,2,...,n
Practical Techniques for
Improving Backpropagation
 e'(n,w) = e(n) + J(n)(w - w(n))
 where e(n) is the error vector e(n) =
[e(1), e(2), ..., e(n)]^T and J(n) is the n-by-m
Jacobian matrix of e(n). (The Jacobian J(n) is
the transpose of the m-by-n gradient matrix
∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)].)
 w(n+1) = arg min_w { ½ ||e'(n,w)||² }, where
 ½ ||e'(n,w)||² = ½ ||e(n)||² + e^T(n) J(n)(w - w(n)) +
½ (w - w(n))^T J^T(n) J(n)(w - w(n))
 Differentiating this expression with respect to w and
setting the result equal to zero gives
Practical Techniques for
Improving Backpropagation
 J^T(n) e(n) + J^T(n) J(n)(w - w(n)) = 0
 w(n+1) = w(n) - (J^T(n) J(n))^-1 J^T(n) e(n)
 The Gauss-Newton method requires only the
Jacobian matrix of the error vector e(n).
 For the Gauss-Newton iteration to be
computable, the matrix product J^T(n) J(n) must
be nonsingular. J^T(n) J(n) is always nonnegative
definite, but to ensure that it is nonsingular, the
Jacobian J(n) must have row rank n. -> Add
the diagonal matrix δI to the matrix J^T(n) J(n),
where the parameter δ is a small positive constant.
Practical Techniques for
Improving Backpropagation
 J^T(n) J(n) + δI is positive definite for all n.
 -> The Gauss-Newton method is
implemented in the following form:
 w(n+1) = w(n) - (J^T(n) J(n) + δI)^-1 J^T(n) e(n)
(a sketch of this update follows below)
 This is the solution to the modified cost
function:
 E(w) = ½ { δ ||w - w(0)||² + Σ_{i=1..n} e²(i) }
 where w(0) is the initial value of w.
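A minimal sketch of the regularized Gauss-Newton update on a toy linear least-squares problem; the residual function, the data points and the value of delta are illustrative.

import numpy as np

def gauss_newton_step(w, residuals, jacobian, delta=1e-3):
    # w(n+1) = w(n) - (J^T J + delta*I)^-1 J^T e, with delta a small positive constant
    e = residuals(w)                              # n-vector of errors e(i)
    J = jacobian(w)                               # n-by-m Jacobian of e with respect to w
    A = J.T @ J + delta * np.eye(len(w))
    return w - np.linalg.solve(A, J.T @ e)

# Toy problem: fit y ~ w0 + w1*x by minimizing 1/2 * sum e(i)^2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])
residuals = lambda w: (w[0] + w[1] * x) - y
jacobian = lambda w: np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
for _ in range(5):
    w = gauss_newton_step(w, residuals, jacobian)
print(w)   # close to the least-squares fit: intercept near 1, slope near 2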
Regularization, Complexity
Adjustment and Pruning
 If we have too many
weights and train too
long -> overfitting
 Wald statistic: we can
estimate the
importance of a
parameter in a model
 Optimal Brain
Damage and Optimal
Brain Surgeon:
eliminate the
parameters having the
least importance
Summary
 Multilayer nonlinear neural networks trained
by gradient descent methods such as
backpropagation perform a maximum-
likelihood estimation of the weight values in
the model defined by the network topology.
 The nonlinearity f(net) at the hidden units allows
the network to form an arbitrary decision
boundary, so long as there are sufficiently
many hidden units.
