
Basic Concepts

A neural network maps a set of inputs to a set of outputs.
The number of inputs/outputs is variable.
The network itself is composed of an arbitrary number of nodes or units, connected by links, with an arbitrary topology.
A link from unit i to unit j serves to propagate the activation a_i to j, and it has a weight W_{i,j}.

(Diagram: Input 0, Input 1, ..., Input n → Neural Network → Output 0, Output 1, ..., Output m)

What can a neural network do?
Compute a known function / Approximate an unknown function
Pattern Recognition / Signal Processing
Learn to do any of the above
Different types of nodes
An Artificial Neuron
Node or Unit: a mathematical abstraction
(artificial neuron, node or unit, processing unit i)

Input edges, each with a weight (positive or negative; the weights can change over time through learning).

Input function (in_i): the weighted sum of the unit's inputs, including the fixed input a_0:

  in_i = Σ_{j=0}^{n} W_{j,i} a_j

Activation function g (typically non-linear), applied to the input function:

  a_i = g(in_i) = g( Σ_{j=0}^{n} W_{j,i} a_j )

Output edges, each with a weight (positive or negative; the weights can change over time through learning).

A unit is a processing element producing an output based on a function of its inputs.

Note: the fixed input and bias weight are a matter of convention; some authors use a_0 = -1 with bias weight W_{0,i}, others use, e.g., a_0 = 1 and -W_{0,i}.
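A minimal sketch of such a unit in Python (the sigmoid choice and the weight/activation values here are illustrative, not from the slides):

import math

def unit_output(weights, activations, g):
    # in_i = sum_j W_{j,i} * a_j, including the fixed input a_0
    in_i = sum(w * a for w, a in zip(weights, activations))
    return g(in_i)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# a_0 = -1 is the fixed input; the rest are activations a_1, a_2 arriving on the input edges
activations = [-1, 0.5, 0.9]
weights = [0.2, 0.6, -0.4]   # W_{0,i} (bias weight), W_{1,i}, W_{2,i}
print(unit_output(weights, activations, sigmoid))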
Activation Functions
(a) Threshold (step) activation function: outputs 1 when the input is positive, 0 otherwise.
(b) Sigmoid (logistic) activation function; key advantage: differentiable.
(c) Sign function: +1 if the input is positive, otherwise -1.

These functions have a threshold (either hard or soft) at zero.
Changing the bias weight W_{0,i} moves the threshold location.
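For reference, a quick sketch of the three activation functions in Python (not part of the slides):

import math

def threshold(x):
    # (a) step function: 1 when the input is positive, 0 otherwise
    return 1 if x > 0 else 0

def sigmoid(x):
    # (b) logistic function; differentiable
    return 1.0 / (1.0 + math.exp(-x))

def sign(x):
    # (c) +1 if the input is positive, otherwise -1
    return 1 if x > 0 else -1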
Threshold Activation Function
A threshold unit fires (outputs 1) when its input function is non-negative:

  in_i = Σ_{j=0}^{n} W_{j,i} a_j ≥ 0,  i.e.,  Σ_{j=1}^{n} W_{j,i} a_j + W_{0,i} a_0 ≥ 0

Defining a_0 = -1 we get  Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i},  i.e., threshold θ_i = W_{0,i}.
Defining a_0 = +1 we get  Σ_{j=1}^{n} W_{j,i} a_j ≥ -W_{0,i},  i.e., threshold θ_i = -W_{0,i}.

θ_i is the threshold value associated with unit i.
(Figure: the step function with threshold at θ_i = 0 and shifted to θ_i = t; input edges each carry a weight, positive or negative, that can change over time through learning.)
Implementing Boolean Functions
Units with a threshold activation function can act as logic gates; we can use these units to compute Boolean functions of their inputs.

A threshold unit activates (outputs 1) when:

  Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}
Boolean AND
Truth table:

  input x1   input x2   output
  0          0          0
  0          1          0
  1          0          0
  1          1          1

(Unit: bias input -1 with weight W0 = 1.5; inputs x1 and x2 with weights w1 = 1 and w2 = 1.)

The threshold unit activates when Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}; here x1 + x2 ≥ 1.5, which holds only when both inputs are 1.
Boolean OR
Truth table:

  input x1   input x2   output
  0          0          0
  0          1          1
  1          0          1
  1          1          1

(Unit: bias input -1 with weight w0 = 0.5; inputs x1 and x2 with weights w1 = 1 and w2 = 1.)

The threshold unit activates when Σ_{j=1}^{n} W_{j,i} a_j ≥ W_{0,i}; here x1 + x2 ≥ 0.5, which holds when at least one input is 1.
Inverter
Truth table:

  input x1   output
  0          1
  1          0

(Unit: bias input -1 with bias weight w0 and input weight w1 chosen so that w1 < w0 ≤ 0, e.g., w0 = -0.5 and w1 = -1: then x1 = 0 gives 0 ≥ w0, so the unit outputs 1, while x1 = 1 gives w1 < w0, so it outputs 0.)

So, units with a threshold activation function can act as logic gates given the appropriate input and bias weights.
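A quick sketch of these threshold-unit gates in Python (the NOT weights follow the w1 < w0 ≤ 0 condition above and are illustrative):

def threshold_unit(w0, weights, inputs):
    # fires (returns 1) when the weighted sum of the inputs reaches the bias weight w0
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= w0 else 0

def AND(x1, x2): return threshold_unit(1.5, [1, 1], [x1, x2])
def OR(x1, x2):  return threshold_unit(0.5, [1, 1], [x1, x2])
def NOT(x1):     return threshold_unit(-0.5, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
    print(x1, "NOT:", NOT(x1))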
Network Structures
Acyclic or feed-forward networks (our focus)
– Activation flows from the input layer to the output layer
– single-layer perceptrons
– multi-layer perceptrons
Feed-forward networks implement functions and have no internal state (only weights).

Recurrent networks
– Feed the outputs back into their own inputs
– The network is a dynamical system (stable states, oscillations, chaotic behavior)
– The response of the network depends on its initial state
– Can support short-term memory
– More difficult to understand
Feed-forward Network:
Represents a function of Its Input

Two input units, two hidden units, one output unit.
(Bias unit omitted for simplicity.)
Each unit receives input only from units in the immediately preceding layer.

Given an input vector x = (x1,x2), the activations of the input units are set to values of the
input vector, i.e., (a1,a2)=(x1,x2), and the network computes:
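The equation itself appears as an image on the original slide; assuming the usual textbook numbering (hidden units 3 and 4, output unit 5), the computed function is:

  a_5 = g( W_{3,5} a_3 + W_{4,5} a_4 )
      = g( W_{3,5} g( W_{1,3} a_1 + W_{2,3} a_2 ) + W_{4,5} g( W_{1,4} a_1 + W_{2,4} a_2 ) )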
Weights are the parameters of the function

A feed-forward network computes a parameterized family of functions hW(x)


By adjusting the weights we get different functions:
that is how learning is done in neural networks!
Note: the input layer in general does not include computing units.
Perceptron

Cornell Aeronautical Laboratory

Perceptron
– Invented by Frank Rosenblatt in 1957 in an attempt to understand human memory, learning, and cognitive processes.
– The first computational neural network model, with a remarkable learning algorithm:
  • If a function can be represented by a perceptron, the learning algorithm is guaranteed to converge quickly to the hidden function!
– Became the foundation of pattern recognition research.
Rosenblatt & Mark I Perceptron: the first machine that could "learn" to recognize and identify optical patterns.
One of the earliest and most influential neural networks: an important milestone in AI.
Perceptron

Rosenblatt, Frank (Cornell Aeronautical Laboratory at Cornell University). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review, Vol. 65, No. 6, pp. 386-408, November 1958.
Single Layer Feed-forward Neural Networks
Perceptrons

Single-layer neural network (perceptron network):
a network with all the inputs connected directly to the outputs.
– Output units all operate separately: no shared weights.

Since each output unit is independent of the others, we can limit our study to single-output perceptrons.
Perceptron to Learn to Identify Digits
(From Pat. Winston, MIT)

Seven line segments are enough to produce all 10 digits.
(Figure: a seven-segment display with the segments labeled x0 through x6.)

  Digit   x0  x1  x2  x3  x4  x5  x6
  0       0   1   1   1   1   1   1
  9       1   1   1   1   1   1   0
  8       1   1   1   1   1   1   1
  7       0   0   1   1   1   0   0
  6       1   1   1   0   1   1   1
  5       1   1   1   0   1   1   0
  4       1   1   0   1   1   0   0
  3       1   0   1   1   1   1   0
  2       1   0   1   1   0   1   1
  1       0   0   0   1   1   0   0
Perceptron to Learn to Identify Digits
(From Pat. Winston, MIT)

Seven line segments are enough to produce all 10 digits.
A vision system reports which of the seven segments in the display are on, thereby producing the inputs for the perceptron.
(Figure: the seven-segment display with numbered segments.)
Perceptron to Learn to Identify Digit 0
  Digit   x0  x1  x2  x3  x4  x5  x6  x7 (fixed input)
  0       0   1   1   1   1   1   1   1

When the input digit is 0, what is the value of the sum?
(Figure: a perceptron unit whose inputs are the seven segment signals plus the fixed input x7, each input with its own weight.)

Sum > 0 → output = 1; else output = 0.

Seven line segments are enough to produce all 10 digits. A vision system reports which of the seven segments in the display are on, thereby producing the inputs for the perceptron.
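As a concrete illustration of what such a unit could learn, here is a hand-picked set of weights that fires only for the segment pattern of the digit 0, using the encoding from the table above; these particular weights are hypothetical, not from the slides:

# segment encodings x0..x6 from the table above
DIGITS = {
    0: [0, 1, 1, 1, 1, 1, 1],
    9: [1, 1, 1, 1, 1, 1, 0],
    8: [1, 1, 1, 1, 1, 1, 1],
    7: [0, 0, 1, 1, 1, 0, 0],
    6: [1, 1, 1, 0, 1, 1, 1],
    5: [1, 1, 1, 0, 1, 1, 0],
    4: [1, 1, 0, 1, 1, 0, 0],
    3: [1, 0, 1, 1, 1, 1, 0],
    2: [1, 0, 1, 1, 0, 1, 1],
    1: [0, 0, 0, 1, 1, 0, 0],
}

# hypothetical weights: +1 for segments that are on in "0", -1 for x0 (off in "0"),
# and -5.5 on the fixed input x7 = 1, so only a near-perfect match clears Sum > 0
weights = [-1, 1, 1, 1, 1, 1, 1, -5.5]

def detects_zero(segments):
    s = sum(w * x for w, x in zip(weights, segments + [1]))   # append the fixed input x7 = 1
    return 1 if s > 0 else 0

for digit, segs in sorted(DIGITS.items()):
    print(digit, detects_zero(segs))   # prints 1 only for digit 0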
Single Layer Feed-forward Neural Networks
Perceptrons

Two-input perceptron unit with a sigmoid (logistic) activation function:
adjusting the weights moves the location, orientation, and steepness of the cliff.
Perceptron Learning:
Intuition
Weight Update
Input I_j (j = 1, 2, ..., n); single output O; target output T.

Consider some initial weights and define the example error: Err = T – O.
Now just move the weights in the right direction!

  Err > 0 → need to increase O;
  Err < 0 → need to decrease O.

Each input unit j contributes W_j I_j to the total input:
  if I_j is positive, increasing W_j tends to increase O;
  if I_j is negative, decreasing W_j tends to increase O.

So, use:  W_j ← W_j + α · I_j · Err

Perceptron Learning Rule (Rosenblatt, 1960), where α is the learning rate.
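A minimal sketch of this update in Python (threshold output assumed; the learning rate and data are up to the caller):

def predict(weights, inputs):
    # threshold unit: output 1 when the weighted sum is positive
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def perceptron_update(weights, inputs, target, alpha=1.0):
    err = target - predict(weights, inputs)                            # Err = T - O
    return [w + alpha * x * err for w, x in zip(weights, inputs)]      # Wj <- Wj + alpha * Ij * Err

With alpha = 1 and 0/1 inputs this reduces to adding or subtracting the input vector, which is exactly the error-correcting method used in the worked example below.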
Perceptron Learning:
Simple Example
Let's consider an example (adapted from Patrick Winston's book, MIT).

Framework and notation:
– 0/1 signals
– Input vector: X = <x0, x1, x2, ..., xn>
– Weight vector: W = <w0, w1, w2, ..., wn>
– x0 = 1 and θ = -w0 simulate the threshold.
– O is the output (0 or 1); single output.
– Learning rate α = 1.

Threshold function:  S = Σ_{k=0}^{n} w_k x_k;  if S > 0 then O = 1, else O = 0.
Recall: Err = T – O and W_j ← W_j + α · I_j · Err.

The set of examples: each example is a pair (x_i, y_i), i.e., an input vector and a label y (0 or 1).

Learning procedure, called the "error-correcting method":
• Start with the all-zero weight vector.
• Cycle (repeatedly) through the examples and for each example do:
  – If the perceptron outputs 0 while it should be 1, add the input vector to the weight vector.
  – If the perceptron outputs 1 while it should be 0, subtract the input vector from the weight vector.
  – Otherwise do nothing.

This is intuitively correct (e.g., if the output is 0 but it should be 1, the weights are increased), and the procedure provably converges (in a polynomial number of steps) if the function is representable by a perceptron, i.e., linearly separable. A sketch of the procedure in code follows.
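A short sketch of the error-correcting method in Python, run on the OR examples used below (learning rate 1 and all-zero initial weights, as in the slides):

# examples: (input vector <x0, x1, x2>, label); x0 = 1 is the fixed input
examples = [([1, 0, 0], 0),
            ([1, 0, 1], 1),
            ([1, 1, 0], 1),
            ([1, 1, 1], 1)]

w = [0, 0, 0]                          # start with the all-zero weight vector

def output(w, x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0           # S > 0 -> O = 1, else O = 0

for epoch in range(10):                # cycle repeatedly through the examples
    changed = False
    for x, label in examples:
        o = output(w, x)
        if o == 0 and label == 1:      # output too low: add the input vector
            w = [wi + xi for wi, xi in zip(w, x)]
            changed = True
        elif o == 1 and label == 0:    # output too high: subtract the input vector
            w = [wi - xi for wi, xi in zip(w, x)]
            changed = True
    if not changed:                    # every example classified correctly
        break

print(w)                               # reproduces the walk-through below, ending at <0, 1, 1>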
Perceptron Learning:
Simple Example

Consider learning the logical OR function. Our examples are:

  Sample   x0  x1  x2   label
  1        1   0   0    0
  2        1   0   1    1
  3        1   1   0    1
  4        1   1   1    1

Activation function:  S = Σ_{k=0}^{n} w_k x_k;  if S > 0 then O = 1, else O = 0.
k n
S   wk xk S  0 then O  1 else O  0
Perceptron Learning:
k 0
Error correcting method
If perceptron is 0 while it should be 1,
add the input vector to the weight vector
If perceptron is 1 while it should be 0, Simple Example
subtract the input vector to the weight vector
Otherwise do nothing.

1
We’ll use a single perceptron with three inputs. I0 w0
We’ll start with all weights 0 W= <0,0,0>
I1 w O
1
Example 1 I= < 1 0 0> label=0 W= <0,0,0>
Perceptron (10+ 00+ 00 =0, S=0) output  0 I2 w2
it classifies it as 0, so correct, do nothing

Example 2 I=<1 0 1> label=1 W= <0,0,0>


Perceptron (10+ 00+ 10 = 0) output 0
it classifies it as 0, while it should be 1, so we add input to weights
W = <0,0,0> + <1,0,1>= <1,0,1>

Carla P. Gomes
CS4700
Example 3: I = <1,1,0>, label = 1, W = <1,0,1>
Perceptron: 1·1 + 1·0 + 0·1 = 1 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing. W = <1,0,1>

Example 4: I = <1,1,1>, label = 1, W = <1,0,1>
Perceptron: 1·1 + 1·0 + 1·1 = 2 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing. W = <1,0,1>
Perceptron Learning: Simple Example (epoch 2)

Epoch 2, through the examples, W = <1,0,1>.

Example 1: I = <1,0,0>, label = 0, W = <1,0,1>
Perceptron: 1·1 + 0·0 + 0·1 = 1 > 0, output = 1.
It classifies the example as 1 while it should be 0, so subtract the input from the weights:
W = <1,0,1> - <1,0,0> = <0,0,1>

Example 2: I = <1,0,1>, label = 1, W = <0,0,1>
Perceptron: 1·0 + 0·0 + 1·1 = 1 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing.
Example 3: I = <1,1,0>, label = 1, W = <0,0,1>
Perceptron: 1·0 + 1·0 + 0·1 = 0, output = 0.
It classifies the example as 0 while it should be 1, so add the input to the weights:
W = <0,0,1> + <1,1,0> = <1,1,1>

Example 4: I = <1,1,1>, label = 1, W = <1,1,1>
Perceptron: 1·1 + 1·1 + 1·1 = 3 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing. W = <1,1,1>
Perceptron Learning: Simple Example (epoch 3)

Epoch 3, through the examples, W = <1,1,1>.

Example 1: I = <1,0,0>, label = 0, W = <1,1,1>
Perceptron: 1·1 + 0·1 + 0·1 = 1 > 0, output = 1.
It classifies the example as 1 while it should be 0, so subtract the input from the weights:
W = <1,1,1> - <1,0,0> = <0,1,1>

Example 2: I = <1,0,1>, label = 1, W = <0,1,1>
Perceptron: 1·0 + 0·1 + 1·1 = 1 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing.
Example 3: I = <1,1,0>, label = 1, W = <0,1,1>
Perceptron: 1·0 + 1·1 + 0·1 = 1 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing.

Example 4: I = <1,1,1>, label = 1, W = <0,1,1>
Perceptron: 1·0 + 1·1 + 1·1 = 2 > 0, output = 1.
It classifies the example as 1, which is correct, so do nothing. W = <0,1,1>
Perceptron Learning: Simple Example (epoch 4)

Epoch 4, through the examples, W = <0,1,1>.
(Diagram: inputs I0, I1, I2 with weights W0 = 0, W1 = 1, W2 = 1 feeding output O.)

Example 1: I = <1,0,0>, label = 0, W = <0,1,1>
Perceptron: 1·0 + 0·1 + 0·1 = 0, output = 0.
It classifies the example as 0, which is correct, so do nothing.

So the final weight vector W = <0,1,1> classifies all OR examples correctly, and the perceptron has learned the function!

Aside: in more realistic cases the bias weight W0 will not be 0 (this was just a toy example!), and in general there are many more inputs (100 to 1000).
  Epoch            x0  x1  x2  Target  w0  w1  w2  Output  Error  New w0  New w1  New w2
  1   example 1    1   0   0   0       0   0   0   0        0     0       0       0
      example 2    1   0   1   1       0   0   0   0        1     1       0       1
      example 3    1   1   0   1       1   0   1   1        0     1       0       1
      example 4    1   1   1   1       1   0   1   1        0     1       0       1
  2   example 1    1   0   0   0       1   0   1   1       -1     0       0       1
      example 2    1   0   1   1       0   0   1   1        0     0       0       1
      example 3    1   1   0   1       0   0   1   0        1     1       1       1
      example 4    1   1   1   1       1   1   1   1        0     1       1       1
  3   example 1    1   0   0   0       1   1   1   1       -1     0       1       1
      example 2    1   0   1   1       0   1   1   1        0     0       1       1
      example 3    1   1   0   1       0   1   1   1        0     0       1       1
      example 4    1   1   1   1       0   1   1   1        0     0       1       1
  4   example 1    1   0   0   0       0   1   1   0        0     0       1       1
Derivation of a learning rule for
Perceptrons Minimizing Squared Errors

Threshold perceptrons have some advantages; in particular, there is a simple learning algorithm that fits a threshold perceptron to any linearly separable training set.

Key idea: learn by adjusting the weights to reduce the error on the training set; the weights are updated repeatedly (over epochs) for each example.

We'll use the sum of squared errors (e.g., as used in linear regression), a classical error measure.

Learning is an optimization search problem in weight space.
Derivation of a learning rule for
Perceptrons Minimizing Squared Errors

Let S = {(x_i, y_i): i = 1, 2, ..., m} be a training set. (Note: x_i is a vector of inputs and y_i is the corresponding true output.)

Let h_w be the perceptron classifier represented by the weight vector w.

Definition:

  E(x) = SquaredError(x) = ½ (y - h_w(x))²
Derivation of a learning rule for
Perceptrons Minimizing Squared Errors

The squared error for a single training example with input x and true output y is E(x) as defined above, where h_w(x) is the output of the perceptron on the example and y is the true output value.

We can use gradient descent to reduce the squared error by calculating the partial derivative of E with respect to each weight.

Note: g'(in) is the derivative of the activation function. For the sigmoid, g' = g(1 - g). For threshold perceptrons, where g'(in) is undefined, the original perceptron learning rule simply omits it.
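The derivative itself is shown as an image on the slide; a sketch of the standard chain-rule computation, writing in = Σ_j w_j x_j and h_w(x) = g(in):

  ∂E/∂w_j = ∂/∂w_j [ ½ (y - g(in))² ]
          = -(y - g(in)) · g'(in) · ∂in/∂w_j
          = -(y - g(in)) · g'(in) · x_j

Moving against the gradient then gives the weight update  w_j ← w_j + α (y - g(in)) g'(in) x_j.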
Gradient descent algorithm: to reduce E, change each weight w_i in the direction of steepest descent, with α the learning rate.

Intuitively, this is the rule  W_j ← W_j + α · I_j · Err:

  Err = y – h_W(x) positive: the output is too small, so weights are increased for positive inputs and decreased for negative inputs.
  Err = y – h_W(x) negative: the opposite.
Perceptron Learning:
Intuition

The rule is intuitively correct!

Greedy search: gradient descent through weight space!

Surprising proof of convergence: weight space has no local minima!
With enough examples, it will find the target function (provided α is not too large).
Gradient descent in weight space
From T. M. Mitchell, Machine Learning.
Perceptron learning rule:  W_j ← W_j + α · I_j · Err

1. Start with random weights, w = (w1, w2, ..., wn).
2. Select a training example (x, y) ∈ S.
   (An epoch is one cycle through the examples.)
3. Run the perceptron with input x and weights w to obtain g(in).
4. Let α be the training rate (a user-set parameter). Update each weight:
   w_i ← w_i + Δw_i,  where  Δw_i = α (y - g(in)) g'(in) x_i
5. Go to 2.

Epochs are repeated until some stopping criterion is reached; typically, that the weight changes have become very small.

The stochastic gradient method selects examples randomly from the training set rather than cycling through them.
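A compact sketch of this loop in Python with a sigmoid unit (the learning rate, stopping tolerance, and the illustrative OR data are our choices, not from the slides):

import math, random

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def train(examples, alpha=0.5, max_epochs=10000, tol=1e-6):
    n = len(examples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]     # 1. start with random weights
    for _ in range(max_epochs):                           # an epoch cycles through the examples
        max_change = 0.0
        for x, y in examples:                             # 2. select a training example
            in_ = sum(wi * xi for wi, xi in zip(w, x))    # 3. run the perceptron
            out = g(in_)
            g_prime = out * (1.0 - out)                   # for the sigmoid, g' = g(1 - g)
            for i in range(n):                            # 4. delta w_i = alpha (y - g(in)) g'(in) x_i
                dw = alpha * (y - out) * g_prime * x[i]
                w[i] += dw
                max_change = max(max_change, abs(dw))
        if max_change < tol:                              # stop when weight changes are very small
            break
    return w

# illustrative data: OR, with a fixed input x0 = 1
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(train(data))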
Perceptron Learning:
Gradient Descent Learning Algorithm

Expressiveness of Perceptrons

What hypothesis space can a perceptron represent?

It can represent even more complex Boolean functions, such as the majority function (output 1 when more than half of its inputs are 1).
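For instance, a sketch of a threshold unit computing the majority of n Boolean inputs, with the weights and threshold chosen in the standard way (all weights 1, bias n/2); this example is ours, not from the slides:

def majority_unit(inputs):
    # all weights 1; fire when the weighted sum exceeds the bias weight n/2
    n = len(inputs)
    return 1 if sum(inputs) > n / 2 else 0

print(majority_unit([1, 0, 1]))       # 1: two of the three inputs are on
print(majority_unit([1, 0, 0, 0]))    # 0: only one of four inputs is on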

But can it represent any arbitrary Boolean function?


Expressiveness of Perceptrons

A threshold perceptron returns 1 iff the weighted sum of its inputs (including the bias) is positive, i.e., iff W · x > 0; that is, iff the input is on one side of the hyperplane the weights define.

Perceptron ≡ linear separator (linear discriminant function, or linear decision surface).

The weights determine the slope and the bias determines the offset.
Linear Separability

Consider an example with two inputs, x1 and x2.
(Figure: positive and negative examples in the (x1, x2) plane, separated by a line.)

We can view the trained network as defining a "separation line". What is its equation?

  w1 x1 + w2 x2 = w0,  i.e.,  -w0 + w1 x1 + w2 x2 = 0

so

  x2 = -(w1 / w2) x1 + w0 / w2

Perceptron used for classification.
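As a small check in Python: the OR weights learned earlier, W = <w0, w1, w2> = <0, 1, 1>, give the separation line x2 = -x1 (this continuation of the worked example is ours, not from the slides):

w0, w1, w2 = 0, 1, 1                   # weights learned for OR above

def classify(x1, x2):
    # fire when w1*x1 + w2*x2 > w0
    return 1 if w1 * x1 + w2 * x2 > w0 else 0

print("separation line: x2 = %.1f * x1 + %.1f" % (-w1 / w2, w0 / w2))
for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, classify(*point))     # only (0, 0) falls on the 0 side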
Linear Separability

OR
(Figure: the OR examples plotted in the (x1, x2) plane; a single line separates the negative example from the three positive ones, so OR is linearly separable.)
Linear Separability

AND
(Figure: the AND examples plotted in the (x1, x2) plane; a single line separates the positive example from the three negative ones, so AND is linearly separable.)
Linear Separability

XOR
(Figure: the XOR examples plotted in the (x1, x2) plane.)
Linear Separability

Not linearly separable.
(Figure: the XOR examples in the (x1, x2) plane; no single line separates the positive examples from the negative ones.)

Minsky & Papert (1969)
Bad news: perceptrons can only represent linearly separable functions.
Linear Separability:
XOR

Consider a threshold perceptron for the logical XOR function (two inputs), which fires when w1 x1 + w2 x2 > T.

Our examples are:

       x1   x2   label
  (1)  0    0    0
  (2)  1    0    1
  (3)  0    1    1
  (4)  1    1    0

Given our examples, we have the following inequalities for the perceptron:

  From (1):  0 + 0 ≤ T   →  T ≥ 0
  From (2):  w1 + 0 > T  →  w1 > T
  From (3):  0 + w2 > T  →  w2 > T
  From (4):  w1 + w2 ≤ T

But (2) and (3) give w1 + w2 > 2T ≥ T, contradicting (4).

So, XOR is not linearly separable.
Convergence of Perceptron
Learning Algorithm

The perceptron converges to a consistent function if...

... the training data are linearly separable,
... the step size α is sufficiently small,
... there are no "hidden" units.
The perceptron learns the majority function easily, while DTL (decision-tree learning) is hopeless.
DTL learns the restaurant function easily, while the perceptron cannot represent it.
Good news: adding a hidden layer allows more target functions to be represented.

Minsky & Papert (1969)

