Lecture 7: Perceptron Modifications

Adaline Schematic

[Figure: Adaline schematic. The inputs i1, i2, …, in feed a linear unit whose output is the net input $w_0 + w_1 i_1 + \dots + w_n i_n$. This output is compared with the desired value class(i) (1 or -1), and the weights are adjusted accordingly.]


The Adaline Learning Algorithm
The Adaline uses gradient descent to determine the
weight vector that leads to minimal error.
The error is defined as the mean squared error (MSE) between the neuron's net input $\mathrm{net}_j$ and its desired output $d_j$ (= class($i_j$)) across all training samples $i_j$.
The idea is to pick samples in random order and perform (slow) gradient descent on their individual error functions.
This technique allows incremental learning, i.e., refining the weights as more training samples are added.
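A minimal sketch of this incremental scheme, assuming NumPy, ±1 class labels, and the update rule $\Delta\mathbf{w} = \eta\,(d_j - \mathrm{net}_j)\,\mathbf{i}_j$ derived on the following slides (all names are illustrative):

```python
import numpy as np

def train_adaline(samples, classes, eta=0.01, epochs=50, seed=0):
    """Incremental Adaline training: visit samples in random order and
    take one gradient-descent step on each sample's own error function."""
    rng = np.random.default_rng(seed)
    w = np.zeros(len(samples[0]) + 1)                 # w[0] is the bias weight w0
    for _ in range(epochs):
        for j in rng.permutation(len(samples)):
            i_j = np.concatenate(([1.0], samples[j]))  # constant input i_0 = 1
            net_j = np.dot(w, i_j)
            w += eta * (classes[j] - net_j) * i_j      # Δw = η (d_j - net_j) i_j
    return w
```

Because each step uses only one sample, newly added samples can simply be included in later passes, which is the incremental-learning property mentioned above.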



For a single training sample $i_j$, the error is

$E = (d_j - \mathrm{net}_j)^2$

and its partial derivative with respect to weight $w_k$ is

$\dfrac{\partial E}{\partial w_k} = -2\,(d_j - \mathrm{net}_j)\,\dfrac{\partial \mathrm{net}_j}{\partial w_k} = -2\,(d_j - \mathrm{net}_j)\,\dfrac{\partial}{\partial w_k}\left(\sum_{l=0}^{n} w_l\, i_{l,j}\right) = -2\,(d_j - \mathrm{net}_j)\, i_{k,j}$
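As a quick sanity check of this derivative, one can compare it against finite differences (the numbers below are arbitrary illustration values):

```python
import numpy as np

w   = np.array([0.1, -0.4, 0.7])     # weights w0, w1, w2
i_j = np.array([1.0, 0.5, -2.0])     # inputs i_{0,j} = 1 (bias), i_{1,j}, i_{2,j}
d_j = 1.0

E = lambda w: (d_j - np.dot(w, i_j)) ** 2

analytic = -2 * (d_j - np.dot(w, i_j)) * i_j        # -2 (d_j - net_j) i_{k,j}
eps = 1e-6
numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])
print(np.allclose(analytic, numeric))               # True (up to rounding)
```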


The gradient is then given by

$\nabla E = \begin{pmatrix} \dfrac{\partial E}{\partial w_0} \\ \vdots \\ \dfrac{\partial E}{\partial w_n} \end{pmatrix} = -2\,(d_j - \mathrm{net}_j)\,\mathbf{i}_j$
For gradient descent, $\Delta\mathbf{w}$ should be a negative multiple of the gradient:

$\Delta\mathbf{w} = \eta\,(d_j - \mathrm{net}_j)\,\mathbf{i}_j$, with positive step-size parameter $\eta$.
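Because $\Delta\mathbf{w}$ points opposite to the gradient, a sufficiently small step reduces the sample error; a small numerical illustration (arbitrary values):

```python
import numpy as np

eta = 0.05                            # positive step-size parameter η
w   = np.array([0.1, -0.4, 0.7])
i_j = np.array([1.0, 0.5, -2.0])      # i_{0,j} = 1 is the constant bias input
d_j = 1.0

E = lambda w: (d_j - np.dot(w, i_j)) ** 2
delta_w = eta * (d_j - np.dot(w, i_j)) * i_j    # Δw = η (d_j - net_j) i_j
print(E(w + delta_w) < E(w))                    # True: the step reduces the error
```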



The Widrow-Hoff Delta Rule
In the original learning rule

$\Delta\mathbf{w} = \eta\,(d_j - \mathrm{net}_j)\,\mathbf{i}_j,$

longer input vectors result in greater weight changes, which can cause problems if there are extreme differences in vector length in the training set.
Widrow and Hoff (1960) suggested the following modification of the learning rule:

$\Delta\mathbf{w} = \eta\,(d_j - \mathrm{net}_j)\,\dfrac{\mathbf{i}_j}{\lVert \mathbf{i}_j \rVert}$
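A sketch of the modified update in NumPy (the function name is an illustrative choice):

```python
import numpy as np

def widrow_hoff_delta(eta, d_j, w, i_j):
    """Weight change η (d_j - net_j) i_j / ||i_j||: normalizing by the input
    length keeps long input vectors from dominating the weight updates."""
    net_j = np.dot(w, i_j)
    return eta * (d_j - net_j) * i_j / np.linalg.norm(i_j)
```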



Multiclass Discrimination

Often, our classification problems involve more than two classes.
For example, character recognition requires at least
26 different classes.
We can perform such tasks using layers of
perceptrons or Adalines.



[Figure: A four-node perceptron for a four-class problem in n-dimensional input space. The inputs i1, i2, …, in are connected to four output nodes o1, o2, o3, o4 through weights w11, w12, …, w4n.]
Each perceptron learns to recognize one particular class, i.e., to output 1 if the input is in that class and 0 otherwise.
The units can be trained separately and in parallel.
In production mode, the network decides that its current input is in the k-th class if and only if $o_k = 1$ and $o_j = 0$ for all $j \neq k$; otherwise, the input is misclassified.
For units with real-valued output, the neuron with
maximal output can be picked to indicate the class of
the input.
This maximum should be significantly greater than all other outputs; otherwise, the input is considered misclassified.
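A sketch of both decision procedures (the margin in the real-valued case is an assumed threshold; the slides only require the maximum to be "significantly greater"):

```python
import numpy as np

def decide_binary(o):
    """Return k if o_k = 1 and all other outputs are 0; None means misclassified."""
    o = np.asarray(o)
    winners = np.flatnonzero(o == 1)
    if len(winners) == 1 and np.all(np.delete(o, winners[0]) == 0):
        return int(winners[0])
    return None

def decide_real(o, margin=0.1):
    """Return the index of the maximal output if it exceeds the runner-up
    by `margin`; None means the input is considered misclassified."""
    o = np.asarray(o, dtype=float)
    order = np.argsort(o)
    if o[order[-1]] - o[order[-2]] > margin:
        return int(order[-1])
    return None
```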
Multilayer Networks
Although single-layer perceptron networks can
distinguish between any number of classes, they still
require linear separability of inputs.
To overcome this serious limitation, we can use
multiple layers of neurons.
Rosenblatt first suggested this idea in 1961, but he used perceptrons, whose non-differentiable output function led to an inefficient and weak learning algorithm.
The idea that eventually led to a breakthrough was the
use of continuous output functions and gradient
descent.
The resulting backpropagation algorithm was
popularized by Rumelhart, Hinton, and Williams
(1986).
This algorithm solved the “credit assignment”
problem, i.e., crediting or blaming individual neurons
across layers for particular outputs.
The error at the output layer is propagated backwards
to units at lower layers, so that the weights of all
neurons can be adapted appropriately.
The gradient descent technique is similar to the Adaline's, but propagating the error requires some additional computations.
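A minimal sketch of one such step for a two-layer network, assuming sigmoid units and omitting bias weights for brevity (the lecture only requires a continuous, differentiable output function; none of the names below come from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, eta=0.1):
    """One gradient-descent step on the squared output error of a two-layer
    network: W1 maps inputs to the hidden layer, W2 maps the hidden layer
    to the outputs."""
    # Forward pass
    h = sigmoid(W1 @ x)                        # hidden-layer activations
    o = sigmoid(W2 @ h)                        # network outputs

    # Backward pass: output error, then error propagated to the hidden layer
    delta_out = (d - o) * o * (1 - o)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)

    # Adapt the weights of all neurons (cf. the Adaline update, per layer)
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2
```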
Terminology
Example: network function $f: \mathbb{R}^3 \rightarrow \mathbb{R}^2$

[Figure: A layered network. The input layer receives the input vector $(x_1, x_2, x_3)$, a hidden layer follows, and the output layer produces the output vector $(o_1, o_2)$. Weights are labeled by layer pair and node pair, e.g., $w_{1,1}^{(1,0)}$ and $w_{4,3}^{(1,0)}$ between the input and hidden layers, and $w_{1,1}^{(2,1)}$ and $w_{2,4}^{(2,1)}$ between the hidden and output layers.]
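A small sketch of how the layer sizes and weight labels of this example could map onto arrays (the hidden-layer size of four is inferred from the labels $w_{2,4}^{(2,1)}$ and $w_{4,3}^{(1,0)}$ and is an assumption about the figure):

```python
import numpy as np

n_in, n_hidden, n_out = 3, 4, 2    # f: R^3 -> R^2 with a four-unit hidden layer (assumed)

# W[(to_layer, from_layer)][to_node - 1, from_node - 1] mirrors the figure's
# label w^{(to_layer, from_layer)}_{to_node, from_node}.
W = {
    (1, 0): np.zeros((n_hidden, n_in)),    # input layer (0)  -> hidden layer (1)
    (2, 1): np.zeros((n_out, n_hidden)),   # hidden layer (1) -> output layer (2)
}

w_2_4 = W[(2, 1)][1, 3]    # the weight labeled w^{(2,1)}_{2,4} in the figure
```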