Lecture 2


CS7015 (Deep Learning) : Lecture 2

McCulloch Pitts Neuron, Thresholding Logic, Perceptrons, Perceptron
Learning Algorithm and Convergence, Multilayer Perceptrons (MLPs),
Representation Power of MLPs

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 2.1: Biological Neurons

[Figure: Artificial Neuron, with inputs x1, x2, x3, weights w1, w2, w3 and an activation σ producing the output y]

The most fundamental unit of a deep neural network is called an artificial neuron
Why is it called a neuron? Where does the inspiration come from?
The inspiration comes from biology (more specifically, from the brain)
biological neurons = neural cells = neural processing units
We will first see what a biological neuron looks like ...
Biological Neurons∗

dendrite: receives signals from other neurons
synapse: point of connection to other neurons
soma: processes the information
axon: transmits the output of this neuron

∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
Let us see a very cartoonish illustration of how a neuron works
Our sense organs interact with the outside world
They relay information to the neurons
The neurons (may) get activated and produce a response (laughter in this case)
Of course, in reality, it is not just a single neuron which does all this
There is a massively parallel interconnected network of neurons
The sense organs relay information to the lowest layer of neurons
Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
These neurons may also fire (again, in red) and the process continues eventually resulting in a response (laughter in this case)
An average human brain has around 10^11 (100 billion) neurons!
This massively parallel network also ensures that there is division of work
Each neuron may perform a certain role or respond to a certain stimulus

A simplified illustration
The neurons in the brain are arranged in a hierarchy
We illustrate this with the help of the visual cortex (part of the brain) which deals with processing visual information
Starting from the retina, the information is relayed to several layers (follow the arrows)
We observe that the layers V1, V2 to AIT form a hierarchy (from identifying simple visual forms to high level objects)
Sample illustration of hierarchical processing∗

∗ Idea borrowed from Hugo Larochelle’s lecture slides
Disclaimer
I understand very little about how the brain works!
What you saw so far is an overly simplified explanation of how the brain works!
But this explanation suffices for the purpose of this course!

Module 2.2: McCulloch Pitts Neuron

[Figure: McCulloch Pitts neuron, with inputs x1, x2, .., xn ∈ {0, 1} feeding an aggregation g, followed by a decision function f that produces the output y ∈ {0, 1}]

McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
g aggregates the inputs and the function f takes a decision based on this aggregation
The inputs can be excitatory or inhibitory
y = 0 if any xi is inhibitory, else

g(x1, x2, ..., xn) = g(x) = ∑_{i=1}^{n} xi
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ

θ is called the thresholding parameter
This is called Thresholding Logic
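As an aside (not part of the original slides), the thresholding logic above can be sketched in a few lines of Python; mp_neuron is a hypothetical helper name, and the inhibitory handling follows the rule stated above (the output is 0 if any inhibitory input is 1).

import numpy as np

def mp_neuron(x, theta, inhibitory=None):
    # McCulloch Pitts neuron: x is a vector of binary inputs, theta the threshold.
    # inhibitory is an optional boolean mask; if any inhibitory input is 1,
    # the output is forced to 0 (as stated above).
    x = np.asarray(x)
    if inhibitory is not None and np.any(x[np.asarray(inhibitory)] == 1):
        return 0
    g = x.sum()                      # g(x) = sum of the inputs
    return 1 if g >= theta else 0    # f thresholds the aggregation

# NOT function: a single inhibitory input with theta = 0
assert mp_neuron([0], theta=0, inhibitory=[True]) == 1
assert mp_neuron([1], theta=0, inhibitory=[True]) == 0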
Let us implement some boolean functions using this McCulloch Pitts (MP) neuron
...

[Figure: six McCulloch Pitts units]
A McCulloch Pitts unit: inputs x1, x2, x3 and threshold θ
AND function: inputs x1, x2, x3, threshold θ = 3
OR function: inputs x1, x2, x3, threshold θ = 1
x1 AND !x2∗: inputs x1, x2, threshold θ = 1
NOR function: inputs x1, x2, threshold θ = 0
NOT function: input x1, threshold θ = 0

∗ circle at the end indicates inhibitory input: if any inhibitory input is 1 the output will be 0
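A quick sanity check (a sketch, not from the slides) that the thresholds shown above do implement AND and OR for three binary inputs, assuming no inhibitory inputs:

from itertools import product

def mp_fires(x, theta):
    # thresholding logic with no inhibitory inputs: fire iff sum(x) >= theta
    return int(sum(x) >= theta)

for x in product([0, 1], repeat=3):
    assert mp_fires(x, theta=3) == int(all(x))   # AND needs theta = 3
    assert mp_fires(x, theta=1) == int(any(x))   # OR needs theta = 1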
Can any boolean function be represented using a McCulloch Pitts unit?
Before answering this question let us first see the geometric interpretation of an MP unit ...
[Figure: an OR unit with inputs x1, x2 and threshold θ = 1, shown next to the four input points (0, 0), (0, 1), (1, 0), (1, 1) and the boundary line x1 + x2 = θ = 1]

OR function: x1 + x2 = ∑_{i=1}^{2} xi ≥ 1

A single MP neuron splits the input points (4 points for 2 binary inputs) into two halves
Points lying on or above the line ∑_{i=1}^{n} xi − θ = 0 and points lying below this line
In other words, all inputs which produce an output 0 will be on one side (∑_{i=1}^{n} xi < θ) of the line and all inputs which produce an output 1 will lie on the other side (∑_{i=1}^{n} xi ≥ θ) of this line
Let us convince ourselves about this with a few more examples (if it is not already clear from the math)
[Figure: two MP units with inputs x1, x2, shown next to the four input points and their boundary lines]

AND function: threshold θ = 2, x1 + x2 = ∑_{i=1}^{2} xi ≥ 2, boundary line x1 + x2 = θ = 2
Tautology (always ON): threshold θ = 0, boundary line x1 + x2 = θ = 0
[Figure: an OR unit with inputs x1, x2, x3 and threshold θ = 1, shown next to the eight binary input points in 3D and the separating plane x1 + x2 + x3 = θ = 1]

What if we have more than 2 inputs?
Well, instead of a line we will have a plane
For the OR function, we want a plane such that the point (0, 0, 0) lies on one side and the remaining 7 points lie on the other side of the plane
The story so far ...
A single McCulloch Pitts Neuron can be used to represent boolean functions which are linearly separable
Linear separability (for boolean functions): There exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane)
Module 2.3: Perceptron

The story ahead ...
What about non-boolean (say, real) inputs?
Do we always need to hand code the threshold?
Are all inputs equal? What if we want to assign more weight (importance) to some inputs?
What about functions which are not linearly separable?
[Figure: a perceptron with inputs x1, x2, .., xn, weights w1, w2, .., wn and output y]

Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958)
A more general computational model than McCulloch–Pitts neurons
Main differences: Introduction of numerical weights for inputs and a mechanism for learning these weights
Inputs are no longer limited to boolean values
Refined and carefully analyzed by Minsky and Papert (1969) - their model is referred to as the perceptron model here
[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn producing output y]

y = 1 if ∑_{i=1}^{n} wi ∗ xi ≥ θ
  = 0 if ∑_{i=1}^{n} wi ∗ xi < θ

Rewriting the above,

y = 1 if ∑_{i=1}^{n} wi ∗ xi − θ ≥ 0
  = 0 if ∑_{i=1}^{n} wi ∗ xi − θ < 0

A more accepted convention,

y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ 0
  = 0 if ∑_{i=0}^{n} wi ∗ xi < 0

where, x0 = 1 and w0 = −θ
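The "more accepted convention" above simply folds the threshold into the weight vector; a minimal sketch (NumPy assumed, perceptron_output is a hypothetical helper name):

import numpy as np

def perceptron_output(x, w):
    # w = [w0, w1, ..., wn] with w0 = -theta; prepend x0 = 1 to the input
    x = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return 1 if np.dot(w, x) >= 0 else 0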
We will now try to answer the following questions:
Why are we trying to implement boolean functions?
Why do we need weights?
Why is w0 = −θ called the bias?
[Figure: a perceptron with inputs x0 = 1, x1, x2, x3 and weights w0 = −θ, w1, w2, w3, where x1 = isActorDamon, x2 = isGenreThriller, x3 = isDirectorNolan]

Consider the task of predicting whether we would like a movie or not
Suppose, we base our decision on 3 inputs (binary, for simplicity)
Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs
Specifically, even if the actor is not Matt Damon and the genre is not thriller we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan
w0 is called the bias as it represents the prior (prejudice)
A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, director [θ = 0]
On the other hand, a selective viewer may only watch thrillers starring Matt Damon and directed by Nolan [θ = 3]
The weights (w1, w2, ..., wn) and the bias (w0) will depend on the data (viewer history in this case)
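To make the role of the bias concrete, here is a small illustration (hypothetical weights, not from the slides) of the two viewers described above, with w1 = w2 = w3 = 1:

import numpy as np

x = np.array([1, 0, 0, 1])            # [x0 = 1, isActorDamon, isGenreThriller, isDirectorNolan]
movie_buff = np.array([0, 1, 1, 1])   # theta = 0, so w0 = 0: crosses the threshold for almost any movie
selective  = np.array([-3, 1, 1, 1])  # theta = 3, so w0 = -3: needs all three inputs to be 1
print(int(movie_buff @ x >= 0), int(selective @ x >= 0))   # prints 1 0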
What kind of functions can be implemented using the perceptron? Any difference
from McCulloch Pitts neurons?

McCulloch Pitts Neuron (assuming no inhibitory inputs)

y = 1 if ∑_{i=0}^{n} xi ≥ θ
  = 0 if ∑_{i=0}^{n} xi < θ

Perceptron

y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ θ
  = 0 if ∑_{i=0}^{n} wi ∗ xi < θ

From the equations it should be clear that even a perceptron separates the input space into two halves
All inputs which produce a 1 lie on one side and all inputs which produce a 0 lie on the other side
In other words, a single perceptron can only be used to implement linearly separable functions
Then what is the difference? The weights (including threshold) can be learned and the inputs can be real valued
We will first revisit some boolean functions and then see the perceptron learning algorithm (for learning weights)
x1  x2  OR
0   0   0    w0 + ∑_{i=1}^{2} wi xi < 0
1   0   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
0   1   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
1   1   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0

w0 + w1 · 0 + w2 · 0 < 0 =⇒ w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0 =⇒ w2 > −w0
w0 + w1 · 1 + w2 · 0 ≥ 0 =⇒ w1 > −w0
w0 + w1 · 1 + w2 · 1 ≥ 0 =⇒ w1 + w2 > −w0

One possible solution to this set of inequalities is w0 = −1, w1 = 1.1, w2 = 1.1 (and various other solutions are possible)

[Figure: the four input points with the separating line −1 + 1.1x1 + 1.1x2 = 0]

Note that we can come up with a similar set of inequalities and find the value of θ for a McCulloch Pitts neuron also (Try it!)
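The solution above can be checked mechanically; a short sketch (NumPy assumed) that runs the four rows of the truth table through w0 = −1, w1 = 1.1, w2 = 1.1:

import numpy as np

w = np.array([-1.0, 1.1, 1.1])                    # [w0, w1, w2]
for x1, x2, target in [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]:
    s = w @ np.array([1.0, x1, x2])               # w0 + w1*x1 + w2*x2
    y = 1 if s >= 0 else 0
    assert y == target                            # all four inequalities are satisfied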
Module 2.4: Errors and Error Surfaces

Let us fix the threshold (−w0 = 1) and try different values of w1, w2
Say, w1 = −1, w2 = −1
What is wrong with this line? We make an error on 1 out of the 4 inputs
Let us try some more values of w1, w2 and note how many errors we make

w1     w2     errors
-1     -1     1
1.5    0      1
0.45   0.45   3

We are interested in those values of w0, w1, w2 which result in 0 error
Let us plot the error surface corresponding to different values of w0, w1, w2

[Figure: the four input points with the lines −1 + 1.1x1 + 1.1x2 = 0, −1 + (−1)x1 + (−1)x2 = 0, −1 + (1.5)x1 + (0)x2 = 0 and −1 + (0.45)x1 + (0.45)x2 = 0]
For ease of analysis, we will keep w0 fixed (−1) and plot the error for different values of w1, w2
For a given w0, w1, w2 we will compute w0 + w1 ∗ x1 + w2 ∗ x2 for all combinations of (x1, x2) and note down how many errors we make
For the OR function, an error occurs if (x1, x2) = (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 ≥ 0, or if (x1, x2) ≠ (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 < 0
We are interested in finding an algorithm which finds the values of w1, w2 which minimize this error
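The error being plotted can be computed directly; a sketch (NumPy assumed) that counts misclassified OR inputs for a grid of (w1, w2) with w0 fixed at −1, which is exactly the error surface described above:

import numpy as np

def or_errors(w1, w2, w0=-1.0):
    # number of the 4 OR inputs misclassified by the rule w0 + w1*x1 + w2*x2 >= 0
    errors = 0
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        y = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
        target = 0 if (x1, x2) == (0, 0) else 1
        errors += int(y != target)
    return errors

grid = np.linspace(-2, 2, 41)
surface = np.array([[or_errors(w1, w2) for w1 in grid] for w2 in grid])   # values between 0 and 3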
Module 2.5: Perceptron Learning Algorithm

We will now see a more principled approach for learning these weights and
threshold but before that let us revisit our movie example again

[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn producing output y]

x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
x4 = imdbRating (scaled to 0 to 1)
... ...
xn = criticsRating (scaled to 0 to 1)

Let us reconsider our problem of deciding whether to watch a movie or not
Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: binary decision
Further, suppose we represent each movie with n features (some boolean, some real valued)
We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision
In other words, we want the perceptron to find the equation of this separating plane (or find the values of w0, w1, w2, .., wn)
Algorithm: Perceptron Learning Algorithm

P ← inputs with label 1;
N ← inputs with label 0;
Initialize w = [w0, w1, ..., wn] randomly;
while !convergence do
    Pick random x ∈ P ∪ N;
    if x ∈ P and ∑_{i=0}^{n} wi ∗ xi < 0 then
        w = w + x;
    end
    if x ∈ N and ∑_{i=0}^{n} wi ∗ xi ≥ 0 then
        w = w − x;
    end
end
//the algorithm converges when all the inputs are classified correctly

Why would this work?
To understand why this works we will have to get into a bit of Linear Algebra and a bit of geometry...
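A runnable version of this pseudocode (a sketch in Python with NumPy; the max_epochs cap is an addition so the loop terminates even when the data is not linearly separable):

import numpy as np

def train_perceptron(X, y, max_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 to every input
    w = rng.standard_normal(Xb.shape[1])            # initialize w = [w0, w1, ..., wn] randomly
    for _ in range(max_epochs):
        pred = (Xb @ w >= 0).astype(int)
        if np.array_equal(pred, y):                 # converged: all inputs classified correctly
            break
        i = rng.integers(len(y))                    # pick a random x from P ∪ N
        if y[i] == 1 and Xb[i] @ w < 0:
            w = w + Xb[i]
        elif y[i] == 0 and Xb[i] @ w >= 0:
            w = w - Xb[i]
    return w

# example: learn the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)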
Consider two vectors w and x

w = [w0, w1, w2, ..., wn]
x = [1, x1, x2, ..., xn]

w · x = wT x = Σ_{i=0}^{n} wi ∗ xi

We can thus rewrite the perceptron rule as

y = 1 if wT x ≥ 0
  = 0 if wT x < 0

We are interested in finding the line wT x = 0 which divides the input space into two halves
Every point (x) on this line satisfies the equation wT x = 0
What can you tell about the angle (α) between w and any point (x) which lies on this line ?
The angle is 90◦ (∵ cosα = wT x / (||w|| ||x||) = 0)
Since the vector w is perpendicular to every point on the line it is actually perpendicular to the line itself

36/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
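As a quick sanity check, here is a minimal Python sketch of this rewritten rule; the weights and the input below are made-up values (not from the lecture), and the leading 1 in x folds the threshold w0 into the dot product.

import numpy as np

def perceptron_output(w, x):
    # Perceptron decision rule: y = 1 if w.x >= 0 else 0
    # w = [w0, w1, ..., wn]; x = [1, x1, ..., xn] (x0 = 1 is the bias input)
    return 1 if np.dot(w, x) >= 0 else 0

w = np.array([-1.0, 0.7, 0.6])   # hypothetical weights; w0 = -1 acts as -threshold
x = np.array([1.0, 1.0, 1.0])    # leading 1 is the bias input x0
print(perceptron_output(w, x))   # -1 + 0.7 + 0.6 = 0.3 >= 0, so this prints 1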
[Figure: the line wT x = 0 in the (x1, x2) plane with w normal to it; points p1, p2, p3 lie in the positive half space and points n1, n2, n3 lie in the negative half space]

Consider some points (vectors) which lie in the positive half space of this line (i.e., wT x ≥ 0)
What will be the angle between any such vector and w ? Obviously, less than 90◦
What about points (vectors) which lie in the negative half space of this line (i.e., wT x < 0)
What will be the angle between any such vector and w ? Obviously, greater than 90◦
Of course, this also follows from the formula (cosα = wT x / (||w|| ||x||))
Keeping this picture in mind let us revisit the algorithm

37/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
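A small sketch of this picture in code, with made-up points: the sign of wT x tells us which half space a point is in, and computing the angle explicitly confirms the less-than / greater-than 90◦ claim.

import numpy as np

w = np.array([1.0, 1.0])                  # hypothetical weight vector
points = {"p1": np.array([2.0, 1.0]),     # invented points on either side of wT x = 0
          "n1": np.array([-1.0, -2.0])}

for name, x in points.items():
    cos_alpha = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
    angle = np.degrees(np.arccos(cos_alpha))
    print(name, "wT x =", np.dot(w, x), "angle =", round(angle, 1))
# p1: wT x = 3.0 > 0 and the angle is about 18°; n1: wT x = -3.0 < 0 and the angle is about 162°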
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N ;
    if x ∈ P and w.x < 0 then
        w = w + x ;
    end
    if x ∈ N and w.x ≥ 0 then
        w = w − x ;
    end
end
//the algorithm converges when all the inputs are classified correctly

cosα = wT x / (||w|| ||x||)

For x ∈ P if w.x < 0 then it means that the angle (α) between this x and the current w is greater than 90◦ (but we want α to be less than 90◦)
What happens to the new angle (αnew) when wnew = w + x

cos(αnew) ∝ wnewT x
          ∝ (w + x)T x
          ∝ wT x + xT x
          ∝ cosα + xT x
cos(αnew) > cosα

Thus αnew will be less than α and this is exactly what we want

38/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
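Here is a tiny numeric check of this claim, with invented numbers: for a positive point x that is currently misclassified (w · x < 0), applying the correction w + x does pull the angle below its old value.

import numpy as np

def angle_deg(a, b):
    return np.degrees(np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))

w = np.array([-2.0, 1.0])        # hypothetical current weights
x = np.array([2.0, 1.0])         # a positive point with w.x = -3 < 0 (misclassified)
w_new = w + x                    # the perceptron correction for a positive point

print(angle_deg(w, x))           # about 126.9 degrees, i.e. greater than 90
print(angle_deg(w_new, x))       # about 63.4 degrees: the angle has decreased as claimed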
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N ;
    if x ∈ P and w.x < 0 then
        w = w + x ;
    end
    if x ∈ N and w.x ≥ 0 then
        w = w − x ;
    end
end
//the algorithm converges when all the inputs are classified correctly

cosα = wT x / (||w|| ||x||)

For x ∈ N if w.x ≥ 0 then it means that the angle (α) between this x and the current w is less than 90◦ (but we want α to be greater than 90◦)
What happens to the new angle (αnew) when wnew = w − x

cos(αnew) ∝ wnewT x
          ∝ (w − x)T x
          ∝ wT x − xT x
          ∝ cosα − xT x
cos(αnew) < cosα

Thus αnew will be greater than α and this is exactly what we want

39/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
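For reference, here is a minimal runnable sketch of this learning rule in Python. It assumes each input has already been augmented with a leading 1 for the bias, and for simplicity it sweeps over the data each epoch instead of sampling one random point per step; the function and variable names are placeholders, not from the lecture.

import numpy as np

def train_perceptron(P, N, max_epochs=100, seed=0):
    # P, N: lists of inputs with label 1 and label 0, each already augmented with a leading 1
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(P[0]))               # initialize w randomly
    data = [(x, 1) for x in P] + [(x, 0) for x in N]
    for _ in range(max_epochs):
        converged = True
        for x, label in data:
            if label == 1 and np.dot(w, x) < 0:      # positive point on the wrong side
                w = w + x
                converged = False
            elif label == 0 and np.dot(w, x) >= 0:   # negative point on the wrong side
                w = w - x
                converged = False
        if converged:                                # all inputs classified correctly
            break
    return w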
We will now see this algorithm in action for a toy dataset

40/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
[Figure: the toy dataset in the (x1, x2) plane with positive points p1, p2, p3 and negative points n1, n2, n3; the vector w and the line wT x = 0 are redrawn after every correction]

We initialized w to a random value
We observe that currently, w · x < 0 (∵ angle > 90◦) for all the positive points and w · x ≥ 0 (∵ angle < 90◦) for all the negative points (the situation is exactly opposite of what we actually want it to be)
We now run the algorithm by randomly going over the points

Randomly pick a point (say, p1), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p2), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n1), apply correction w = w − x ∵ w · x ≥ 0 (you can check the angle visually)
Randomly pick a point (say, n3), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n2), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p3), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p1), no correction needed ∵ w · x ≥ 0 (you can check the angle visually)
Randomly pick a point (say, p2), no correction needed ∵ w · x ≥ 0 (you can check the angle visually)
Randomly pick a point (say, n1), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n3), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n2), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p3), no correction needed ∵ w · x ≥ 0 (you can check the angle visually)

The algorithm has converged

41/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
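If you want to reproduce a run like this, here is a hedged sketch that feeds a made-up version of the toy dataset (the coordinates below are invented for illustration, not the exact points on the slide) to the train_perceptron sketch given earlier.

import numpy as np

# made-up 2D points, each augmented with a leading 1 for the bias term
P = [np.array([1.0,  2.0,  2.5]),    # p1
     np.array([1.0,  1.0,  3.0]),    # p2
     np.array([1.0,  3.0,  1.5])]    # p3
N = [np.array([1.0, -1.0, -1.0]),    # n1
     np.array([1.0, -2.0, -2.0]),    # n2
     np.array([1.0, -0.5, -2.5])]    # n3

w = train_perceptron(P, N)           # defined in the sketch above
for x in P + N:
    print(x[1:], "->", int(np.dot(w, x) >= 0))
# after convergence every p prints 1 and every n prints 0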
Module 2.6: Proof of Convergence

42/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Now that we have some faith and intuition about why the algorithm works, we
will see a more formal proof of convergence ...

43/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Theorem
Definition: Two sets P and N of points in an n-dimensional space are called absolutely linearly separable if n + 1 real numbers w0, w1, ..., wn exist such that every point (x1, x2, ..., xn) ∈ P satisfies Σ_{i=1}^{n} wi ∗ xi > w0 and every point (x1, x2, ..., xn) ∈ N satisfies Σ_{i=1}^{n} wi ∗ xi < w0.

Proposition: If the sets P and N are finite and linearly separable, the perceptron learning algorithm updates the weight vector wt a finite number of times. In other words: if the vectors in P and N are tested cyclically one after the other, a weight vector wt is found after a finite number of steps t which can separate the two sets.

Proof: On the next slide

44/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
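For completeness, a tiny sketch of what this definition asks for, given candidate numbers w0, w1, ..., wn (the function name is made up):

import numpy as np

def absolutely_separates(w0, w, P, N):
    # True iff every point of P lands strictly above the threshold and every point of N strictly below it
    return all(np.dot(w, x) > w0 for x in P) and all(np.dot(w, x) < w0 for x in N)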
Setup:
If x ∈ N then -x ∈ P (∵ wT x < 0 =⇒ wT (−x) ≥ 0)
We can thus consider a single set P′ = P ∪ N− and for every element p ∈ P′ ensure that wT p ≥ 0
Further we will normalize all the p’s so that ||p|| = 1 (notice that this does not affect the solution ∵ if wT (p/||p||) ≥ 0 then wT p ≥ 0)
Let w∗ be the normalized solution vector (we know one exists as the data is linearly separable)

Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
N− contains negations of all points in N;
P′ ← P ∪ N−;
Initialize w randomly;
while !convergence do
    Pick random p ∈ P′ ;
    p ← p/||p|| (so p now becomes unit norm) ;
    if w.p < 0 then
        w = w + p ;
    end
end
//the algorithm converges when all the inputs are classified correctly
//notice that we do not need the other if condition because by construction we want all points in P′ to lie in the positive half space w.p ≥ 0

45/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
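A small sketch of this setup step (negating the points of N, normalizing, and then needing only the single-sided update); the names below are placeholders.

import numpy as np

def make_p_prime(P, N):
    # Build P' = P ∪ N− with every point scaled to unit norm,
    # so a single condition w.p >= 0 covers both classes
    pts = [np.asarray(p, dtype=float) for p in P] + \
          [-np.asarray(n, dtype=float) for n in N]    # negate the negative points
    return [p / np.linalg.norm(p) for p in pts]        # normalizing does not change the sign of w.p

def train_on_p_prime(P_prime, max_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(P_prime[0]))
    for _ in range(max_steps):
        if all(np.dot(w, p) >= 0 for p in P_prime):    # converged: every p is on the positive side
            break
        p = P_prime[rng.integers(len(P_prime))]        # pick a random p
        if np.dot(w, p) < 0:
            w = w + p                                  # the only correction we need
    return w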
Observations:
w∗ is some optimal solution which exists but we don’t know what it is
We do not make a correction at every time-step
We make a correction only if wt · pi < 0 at that time step
So at time-step t we would have made only k (≤ t) corrections
Every time we make a correction a quantity δ gets added to the numerator
So by time-step t, a quantity kδ gets added to the numerator

Proof:
Now suppose at time step t we inspected the point pi and found that wt · pi < 0
We make a correction wt+1 = wt + pi
Let β be the angle between w∗ and wt+1

cosβ = (w∗ · wt+1) / ||wt+1||

Numerator = w∗ · wt+1 = w∗ · (wt + pi)
          = w∗ · wt + w∗ · pi
          ≥ w∗ · wt + δ          (δ = min{w∗ · pi | ∀i})
          ≥ w∗ · (wt−1 + pj) + δ
          ≥ w∗ · wt−1 + w∗ · pj + δ
          ≥ w∗ · wt−1 + 2δ
          ≥ w∗ · w0 + kδ         (By induction)

46/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Proof (continued:)

So far we have, wt · pi < 0 (and hence we made the correction)

cosβ = (w∗ · wt+1) / ||wt+1||    (by definition)
Numerator ≥ w∗ · w0 + kδ        (proved by induction)
Denominator² = ||wt+1||²
             = (wt + pi) · (wt + pi)
             = ||wt||² + 2 wt · pi + ||pi||²
             < ||wt||² + ||pi||²     (∵ wt · pi < 0)
             < ||wt||² + 1           (∵ ||pi||² = 1)
             < (||wt−1||² + 1) + 1
             < ||wt−1||² + 2
             < ||w0||² + k           (By the same observation that we made about δ)

47/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Proof (continued:)

So far we have, wt · pi < 0 (and hence we made the correction)

cosβ = (w∗ · wt+1) / ||wt+1||    (by definition)
Numerator ≥ w∗ · w0 + kδ        (proved by induction)
Denominator² < ||w0||² + k      (By the same observation that we made about δ)

cosβ ≥ (w∗ · w0 + kδ) / √(||w0||² + k)

cosβ thus grows proportional to √k
As k (number of corrections) increases cosβ can become arbitrarily large
But since cosβ ≤ 1, k must be bounded by a maximum number
Thus, there can only be a finite number of corrections (k) to w and the algorithm will converge!

48/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
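To see concretely how this bounds k, here is a short worked step under the simplifying assumption that w is initialized to w0 = 0 (the slides allow any initialization; this choice just keeps the algebra clean):

cosβ ≥ (w∗ · 0 + kδ) / √(0 + k) = kδ / √k = δ√k
δ√k ≤ cosβ ≤ 1  =⇒  k ≤ 1/δ²

So with unit-norm inputs and a unit-norm solution w∗, the number of corrections is at most 1/δ², where δ = min{w∗ · pi} > 0 is the margin guaranteed by (absolute) linear separability.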
Coming back to our questions ...
What about non-boolean (say, real) inputs ? Real valued inputs are allowed in a perceptron
Do we always need to hand code the threshold ? No, we can learn the threshold
Are all inputs equal ? What if we want to assign more weight (importance) to some inputs ? A perceptron allows weights to be assigned to inputs
What about functions which are not linearly separable ? Not possible with a single perceptron but we will see how to handle this ..

49/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Module 2.7: Linearly Separable Boolean Functions

50/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
So what do we do about functions which are not linearly separable ?
Let us see one such simple boolean function first.

51/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
x1  x2  XOR   w0 + Σi wi xi
 0   0   0      < 0
 1   0   1      ≥ 0
 0   1   1      ≥ 0
 1   1   0      < 0

w0 + w1 · 0 + w2 · 0 < 0  =⇒  w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0  =⇒  w2 ≥ −w0
w0 + w1 · 1 + w2 · 0 ≥ 0  =⇒  w1 ≥ −w0
w0 + w1 · 1 + w2 · 1 < 0  =⇒  w1 + w2 < −w0

The fourth condition contradicts conditions 2 and 3
Hence we cannot have a solution to this set of inequalities

[Figure: the four points (0,0), (1,0), (0,1), (1,1) plotted in the (x1, x2) plane;
and indeed you can see that it is impossible to draw a line which separates the
red points from the blue points]
52/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
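The same impossibility can be confirmed by brute force; the sketch below (my own, with an arbitrarily chosen weight grid) scans candidate (w0, w1, w2) and finds none that reproduces the XOR column above.

import itertools
import numpy as np

# Brute-force check (illustrative sketch): no (w0, w1, w2) on a reasonably fine grid
# satisfies all four XOR conditions, i.e. no single perceptron
#   y = 1 if w0 + w1*x1 + w2*x2 >= 0, else 0
# computes XOR. (Swap in the OR truth table and solutions appear immediately.)
xor_rows = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]
grid = np.linspace(-5, 5, 41)

solutions = [(w0, w1, w2)
             for w0, w1, w2 in itertools.product(grid, repeat=3)
             if all((w0 + w1 * x1 + w2 * x2 >= 0) == bool(t) for (x1, x2), t in xor_rows)]
print("weight settings that implement XOR:", solutions)     # -> []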
Most real world data is not linearly separable
and will always contain some outliers
In fact, sometimes there may not be any outliers
but still the data may not be linearly separable
We need computational units (models) which
can deal with such data
While a single perceptron cannot deal with such
data, we will show that a network of perceptrons
can indeed deal with such data

[Figure: a ring of '+' points nested inside a ring of 'o' points; the data is not
linearly separable even though there are no outliers]
53/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Before seeing how a network of perceptrons can deal with linearly inseparable
data, we will discuss boolean functions in some more detail ...

54/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
How many boolean functions can you design from 2 inputs ?
Let us begin with some easy ones which you already know ..

x1 x2 | f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16
 0  0 |  0  0  0  0  0  0  0  0  1   1   1   1   1   1   1   1
 0  1 |  0  0  0  0  1  1  1  1  0   0   0   0   1   1   1   1
 1  0 |  0  0  1  1  0  0  1  1  0   0   1   1   0   0   1   1
 1  1 |  0  1  0  1  0  1  0  1  0   1   0   1   0   1   0   1

Of these, how many are linearly separable ? (turns out all except XOR and
!XOR - feel free to verify)
In general, how many boolean functions can you have for n inputs ? 2^(2^n)
How many of these 2^(2^n) functions are not linearly separable ? For the time being,
it suffices to know that at least some of these are not linearly separable
(I encourage you to figure out the exact answer :-) )

55/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
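For n = 2 the counting claims can be checked mechanically; the sketch below (my own code, using a coarse brute-force over weights) enumerates all 2^(2^2) = 16 truth tables and reports that 14 of them are linearly separable.

import itertools
import numpy as np

# Enumerate all 16 boolean functions of 2 inputs and count how many a single
# perceptron (y = 1 if w0 + w1*x1 + w2*x2 >= 0) can realise, by brute-forcing the
# weights over a coarse grid. (Illustrative sketch, not from the slides.)
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = np.linspace(-3, 3, 25)

def separable(table):
    return any(all((w0 + w1 * x1 + w2 * x2 >= 0) == bool(t)
                   for (x1, x2), t in zip(inputs, table))
               for w0, w1, w2 in itertools.product(grid, repeat=3))

tables = list(itertools.product([0, 1], repeat=4))            # the 16 functions f1..f16
n_sep = sum(separable(t) for t in tables)
print(n_sep, "of", len(tables), "are linearly separable")     # expect 14
# The two that are not: XOR = (0, 1, 1, 0) and !XOR = (1, 0, 0, 1).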
Module 2.8: Network of Perceptrons

56/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
We will now see how to implement any boolean function using a network of
perceptrons ...

57/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
For this discussion, we will assume True = +1 and False = -1
We consider 2 inputs and 4 perceptrons
Each input is connected to all the 4 perceptrons with specific weights
The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the
weighted sum of its input is ≥ 2)
Each of these perceptrons is connected to an output perceptron by weights (which
need to be learned)
The output of this perceptron (y) is the output of this network

[Figure: inputs x1, x2 connected to 4 hidden perceptrons (each with bias = -2), which
in turn connect to an output perceptron y with weights w1, w2, w3, w4;
red edge indicates w = -1, blue edge indicates w = +1]

58/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Terminology:
This network contains 3 layers
The layer containing the inputs (x1, x2) is called the input layer
The middle layer containing the 4 perceptrons is called the hidden layer
The final layer containing one output neuron is called the output layer
The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4
The red and blue edges are called layer 1 weights
w1, w2, w3, w4 are called layer 2 weights

[Figure: the same network: inputs x1, x2; hidden perceptrons h1, h2, h3, h4 with
bias = -2; output y connected via weights w1, w2, w3, w4;
red edge indicates w = -1, blue edge indicates w = +1]

59/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
We claim that this network can be used to implement any boolean function (linearly
separable or not)!
In other words, we can find w1, w2, w3, w4 such that the truth table of any boolean
function can be represented by this network
Astonishing claim! Well, not really, if you understand what is going on
Each perceptron in the middle layer fires only for a specific input (and no two
perceptrons fire for the same input): the first perceptron fires for {-1,-1}, the
second for {-1,1}, the third for {1,-1} and the fourth for {1,1}
Let us see why this network works by taking an example of the XOR function

[Figure: the same network, with the hidden perceptrons h1, h2, h3, h4 labelled by the
input for which each fires: (-1,-1), (-1,1), (1,-1), (1,1);
red edge indicates w = -1, blue edge indicates w = +1]

60/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Let w0 be the bias of the output neuron (i.e., it will fire if Σi wi hi ≥ w0)

x1  x2  XOR | h1 h2 h3 h4 | Σi wi hi
 0   0   0  |  1  0  0  0 |    w1
 0   1   1  |  0  1  0  0 |    w2
 1   0   1  |  0  0  1  0 |    w3
 1   1   0  |  0  0  0  1 |    w4

This results in the following four conditions to implement XOR:
w1 < w0, w2 ≥ w0, w3 ≥ w0, w4 < w0
Unlike before, there are no contradictions now and the system of inequalities can
be satisfied
Essentially each wi is now responsible for one of the 4 possible inputs and can be
adjusted to get the desired output for that input

[Figure: the same network;
red edge indicates w = -1, blue edge indicates w = +1]

61/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
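The construction can be verified directly. The sketch below hard-codes the network from the figure; the layer-1 weights and the bias of -2 follow the slides, while the particular layer-2 values w = (-1, +1, +1, -1) and the output threshold w0 = 1 are my own choice satisfying the four conditions above.

import numpy as np

# Sketch of the network on this slide. True = +1, False = -1.
# Layer 1: each hidden perceptron has the ±1 weights matching the one input pattern
# it should fire for, and bias -2, so it fires only for that pattern.
W1 = np.array([[-1, -1],      # h1 fires only for (-1, -1)
               [-1, +1],      # h2 fires only for (-1, +1)
               [+1, -1],      # h3 fires only for (+1, -1)
               [+1, +1]])     # h4 fires only for (+1, +1)
b1 = -2

w = np.array([-1, +1, +1, -1])    # layer-2 weights (my choice: w1 < w0, w2 >= w0, w3 >= w0, w4 < w0)
w0 = 1                            # threshold of the output neuron

for x in [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]:
    h = (W1 @ np.array(x) + b1 >= 0).astype(int)     # one-hot: exactly one hidden unit fires
    y = int(w @ h >= w0)
    print(x, "-> h =", h, "-> XOR =", y)

Choosing other values of w1, ..., w4 that satisfy a different set of such inequalities makes the same network compute a different boolean function, which is the point of the next slide.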
It should be clear that the same network can be used to represent the remaining 15
boolean functions also
Each boolean function will result in a different set of non-contradicting
inequalities which can be satisfied by appropriately setting w1, w2, w3, w4
Try it!

[Figure: the same network;
red edge indicates w = -1, blue edge indicates w = +1]

62/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
What if we have more than 2 inputs ?

63/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Again each of the 8 perceptrons will fire only for one of the 8 inputs
Each of the 8 weights in the second layer is responsible for one of the 8 inputs
and can be adjusted to produce the desired output for that input

[Figure: inputs x1, x2, x3 connected to 8 hidden perceptrons (each with bias = -3),
which connect to the output y via weights w1, ..., w8]

64/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
What if we have n inputs ?

65/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Theorem
Any boolean function of n inputs can be represented exactly by a network of
perceptrons containing 1 hidden layer with 2^n perceptrons and one output layer
containing 1 perceptron

Proof (informal:) We just saw how to construct such a network

Note: A network of 2^n + 1 perceptrons is not necessary but sufficient. For
example, we already saw how to represent the AND function with just 1 perceptron

Catch: As n increases, the number of perceptrons in the hidden layer obviously
increases exponentially

66/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
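The informal proof translates almost line-by-line into code. The sketch below (my own, using the True = +1 / False = -1 convention from the earlier slides) builds the 2^n hidden perceptrons for any given n-input truth table and checks the construction on 3-input parity, which is not linearly separable.

import itertools
import numpy as np

# Sketch of the construction behind the theorem (True = +1, False = -1):
# one hidden perceptron per input pattern (2^n of them, bias = -n), and the layer-2
# weight of each hidden unit is read straight off the truth table of the target function.
def build_network(truth_table, n):
    patterns = np.array(list(itertools.product([-1, +1], repeat=n)))   # 2^n patterns
    b1 = -n                               # hidden unit i fires only when x equals pattern i
    w2 = np.array([+1 if truth_table[tuple(int(v) for v in p)] else -1 for p in patterns])
    w0 = 1                                # output fires if sum_i w2[i] * h[i] >= w0
    def predict(x):
        h = (patterns @ np.asarray(x) + b1 >= 0).astype(int)
        return int(w2 @ h >= w0)
    return predict

# Example: 3-input parity (the n = 3 analogue of XOR, not linearly separable)
n = 3
parity = {p: p.count(+1) % 2 for p in itertools.product([-1, +1], repeat=n)}
f = build_network(parity, n)
print(all(f(p) == parity[p] for p in parity))     # True: the truth table is reproduced exactly

As the Catch above warns, the hidden layer here has 2^n units, so the construction is exact but exponentially large.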
Again, why do we care about boolean functions ?
How does this help us with our original problem, which was to predict whether
we like a movie or not? Let us see!

67/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
We are given this data about our past movie experience
For each movie, we are given the values of the various factors (x1, x2, ..., xn)
that we base our decision on, and we are also given the value of y (like/dislike)
The pi's are the points for which the output was 1 and the ni's are the points for
which it was 0
The data may or may not be linearly separable
But the proof that we just saw tells us that it is possible to have a network of
perceptrons and learn the weights in this network such that for any given pi or nj
the output of the network will be the same as yi or yj (i.e., we can separate the
positive and the negative points)

    p1:  x11  x12  ...  x1n   y1 = 1
    p2:  x21  x22  ...  x2n   y2 = 1
    ...
    n1:  xk1  xk2  ...  xkn   yk = 0
    n2:  xj1  xj2  ...  xjn   yj = 0
    ...

[Figure: the 3-input network from before (inputs x1, x2, x3; 8 hidden perceptrons
with bias = -3; output y via weights w1, ..., w8)]

68/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
The story so far ...
Networks of the form that we just saw (containing an input, an output and one
or more hidden layers) are called Multilayer Perceptrons (MLP, in short)
More appropriate terminology would be "Multilayered Network of Perceptrons"
but MLP is the more commonly used name
The theorem that we just saw gives us the representation power of an MLP with
a single hidden layer
Specifically, it tells us that an MLP with a single hidden layer can represent any
boolean function

69/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2