Lecture 2


CS7015 (Deep Learning) : Lecture 2

McCulloch Pitts Neuron, Thresholding Logic, Perceptrons, Perceptron
Learning Algorithm and Convergence, Multilayer Perceptrons (MLPs),
Representation Power of MLPs

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 2.1: Biological Neurons

[Figure: Artificial Neuron, with inputs x1, x2, x3, weights w1, w2, w3 and an activation σ producing the output y]

The most fundamental unit of a deep neural network is called an artificial neuron
Why is it called a neuron? Where does the inspiration come from?
The inspiration comes from biology (more specifically, from the brain)
biological neurons = neural cells = neural processing units
We will first see what a biological neuron looks like ...
Biological Neurons∗

dendrite: receives signals from other neurons
synapse: point of connection to other neurons
soma: processes the information
axon: transmits the output of this neuron

∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
Let us see a very cartoonish illustration of how a neuron works
Our sense organs interact with the outside world
They relay information to the neurons
The neurons (may) get activated and produce a response (laughter in this case)
Of course, in reality, it is not just a single neuron which does all this
There is a massively parallel interconnected network of neurons
The sense organs relay information to the lowest layer of neurons
Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
These neurons may also fire (again, in red) and the process continues eventually resulting in a response (laughter in this case)
An average human brain has around 10^11 (100 billion) neurons!
This massively parallel network also ensures that there is division of work
Each neuron may perform a certain role or respond to a certain stimulus

A simplified illustration
The neurons in the brain are arranged in a hierarchy
We illustrate this with the help of the visual cortex (part of the brain) which deals with processing visual information
Starting from the retina, the information is relayed to several layers (follow the arrows)
We observe that the layers V1, V2 to AIT form a hierarchy (from identifying simple visual forms to high level objects)
Sample illustration of hierarchical processing∗

∗ Idea borrowed from Hugo Larochelle’s lecture slides
Disclaimer
I understand very little about how the brain works!
What you saw so far is an overly simplified explanation of how the brain works!
But this explanation suffices for the purpose of this course!

Module 2.2: McCulloch Pitts Neuron

[Figure: McCulloch Pitts neuron, with inputs x1, x2, .., xn ∈ {0, 1} feeding an aggregation g, followed by a decision function f that produces the output y ∈ {0, 1}]

McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
g aggregates the inputs and the function f takes a decision based on this aggregation
The inputs can be excitatory or inhibitory
y = 0 if any xi is inhibitory, else

g(x1, x2, ..., xn) = g(x) = ∑_{i=1}^{n} xi
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ

θ is called the thresholding parameter
This is called Thresholding Logic
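As an aside (not part of the original slides), the thresholding logic above can be sketched in a few lines of Python; mp_neuron is a hypothetical helper name, and the inhibitory handling follows the rule stated above (the output is 0 if any inhibitory input is 1).

import numpy as np

def mp_neuron(x, theta, inhibitory=None):
    # McCulloch Pitts neuron: x is a vector of binary inputs, theta the threshold.
    # inhibitory is an optional boolean mask; if any inhibitory input is 1,
    # the output is forced to 0 (as stated above).
    x = np.asarray(x)
    if inhibitory is not None and np.any(x[np.asarray(inhibitory)] == 1):
        return 0
    g = x.sum()                      # g(x) = sum of the inputs
    return 1 if g >= theta else 0    # f thresholds the aggregation

# NOT function: a single inhibitory input with theta = 0
assert mp_neuron([0], theta=0, inhibitory=[True]) == 1
assert mp_neuron([1], theta=0, inhibitory=[True]) == 0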
Let us implement some boolean functions using this McCulloch Pitts (MP) neuron
...

[Figure: six McCulloch Pitts units]
A McCulloch Pitts unit: inputs x1, x2, x3 and threshold θ
AND function: inputs x1, x2, x3, threshold θ = 3
OR function: inputs x1, x2, x3, threshold θ = 1
x1 AND !x2∗: inputs x1, x2, threshold θ = 1
NOR function: inputs x1, x2, threshold θ = 0
NOT function: input x1, threshold θ = 0

∗ circle at the end indicates inhibitory input: if any inhibitory input is 1 the output will be 0
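A quick sanity check (a sketch, not from the slides) that the thresholds shown above do implement AND and OR for three binary inputs, assuming no inhibitory inputs:

from itertools import product

def mp_fires(x, theta):
    # thresholding logic with no inhibitory inputs: fire iff sum(x) >= theta
    return int(sum(x) >= theta)

for x in product([0, 1], repeat=3):
    assert mp_fires(x, theta=3) == int(all(x))   # AND needs theta = 3
    assert mp_fires(x, theta=1) == int(any(x))   # OR needs theta = 1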
Can any boolean function be represented using a McCulloch Pitts unit?
Before answering this question let us first see the geometric interpretation of an MP unit ...
[Figure: an OR unit with inputs x1, x2 and threshold θ = 1, shown next to the four input points (0, 0), (0, 1), (1, 0), (1, 1) and the boundary line x1 + x2 = θ = 1]

OR function: x1 + x2 = ∑_{i=1}^{2} xi ≥ 1

A single MP neuron splits the input points (4 points for 2 binary inputs) into two halves
Points lying on or above the line ∑_{i=1}^{n} xi − θ = 0 and points lying below this line
In other words, all inputs which produce an output 0 will be on one side (∑_{i=1}^{n} xi < θ) of the line and all inputs which produce an output 1 will lie on the other side (∑_{i=1}^{n} xi ≥ θ) of this line
Let us convince ourselves about this with a few more examples (if it is not already clear from the math)
[Figure: two MP units with inputs x1, x2, shown next to the four input points and their boundary lines]

AND function: threshold θ = 2, x1 + x2 = ∑_{i=1}^{2} xi ≥ 2, boundary line x1 + x2 = θ = 2
Tautology (always ON): threshold θ = 0, boundary line x1 + x2 = θ = 0
[Figure: an OR unit with inputs x1, x2, x3 and threshold θ = 1, shown next to the eight binary input points in 3D and the separating plane x1 + x2 + x3 = θ = 1]

What if we have more than 2 inputs?
Well, instead of a line we will have a plane
For the OR function, we want a plane such that the point (0, 0, 0) lies on one side and the remaining 7 points lie on the other side of the plane
The story so far ...
A single McCulloch Pitts Neuron can be used to represent boolean functions which are linearly separable
Linear separability (for boolean functions): There exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane)
Module 2.3: Perceptron

The story ahead ...
What about non-boolean (say, real) inputs?
Do we always need to hand code the threshold?
Are all inputs equal? What if we want to assign more weight (importance) to some inputs?
What about functions which are not linearly separable?
[Figure: a perceptron with inputs x1, x2, .., xn, weights w1, w2, .., wn and output y]

Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958)
A more general computational model than McCulloch–Pitts neurons
Main differences: Introduction of numerical weights for inputs and a mechanism for learning these weights
Inputs are no longer limited to boolean values
Refined and carefully analyzed by Minsky and Papert (1969) - their model is referred to as the perceptron model here
[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn producing output y]

y = 1 if ∑_{i=1}^{n} wi ∗ xi ≥ θ
  = 0 if ∑_{i=1}^{n} wi ∗ xi < θ

Rewriting the above,

y = 1 if ∑_{i=1}^{n} wi ∗ xi − θ ≥ 0
  = 0 if ∑_{i=1}^{n} wi ∗ xi − θ < 0

A more accepted convention,

y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ 0
  = 0 if ∑_{i=0}^{n} wi ∗ xi < 0

where, x0 = 1 and w0 = −θ
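The "more accepted convention" above simply folds the threshold into the weight vector; a minimal sketch (NumPy assumed, perceptron_output is a hypothetical helper name):

import numpy as np

def perceptron_output(x, w):
    # w = [w0, w1, ..., wn] with w0 = -theta; prepend x0 = 1 to the input
    x = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return 1 if np.dot(w, x) >= 0 else 0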
We will now try to answer the following questions:
Why are we trying to implement boolean functions?
Why do we need weights?
Why is w0 = −θ called the bias?
[Figure: a perceptron with inputs x0 = 1, x1, x2, x3 and weights w0 = −θ, w1, w2, w3, where x1 = isActorDamon, x2 = isGenreThriller, x3 = isDirectorNolan]

Consider the task of predicting whether we would like a movie or not
Suppose, we base our decision on 3 inputs (binary, for simplicity)
Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs
Specifically, even if the actor is not Matt Damon and the genre is not thriller we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan
w0 is called the bias as it represents the prior (prejudice)
A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, director [θ = 0]
On the other hand, a selective viewer may only watch thrillers starring Matt Damon and directed by Nolan [θ = 3]
The weights (w1, w2, ..., wn) and the bias (w0) will depend on the data (viewer history in this case)
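To make the role of the bias concrete, here is a small illustration (hypothetical weights, not from the slides) of the two viewers described above, with w1 = w2 = w3 = 1:

import numpy as np

x = np.array([1, 0, 0, 1])            # [x0 = 1, isActorDamon, isGenreThriller, isDirectorNolan]
movie_buff = np.array([0, 1, 1, 1])   # theta = 0, so w0 = 0: crosses the threshold for almost any movie
selective  = np.array([-3, 1, 1, 1])  # theta = 3, so w0 = -3: needs all three inputs to be 1
print(int(movie_buff @ x >= 0), int(selective @ x >= 0))   # prints 1 0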
What kind of functions can be implemented using the perceptron? Any difference
from McCulloch Pitts neurons?

McCulloch Pitts Neuron (assuming no inhibitory inputs)

y = 1 if ∑_{i=0}^{n} xi ≥ θ
  = 0 if ∑_{i=0}^{n} xi < θ

Perceptron

y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ θ
  = 0 if ∑_{i=0}^{n} wi ∗ xi < θ

From the equations it should be clear that even a perceptron separates the input space into two halves
All inputs which produce a 1 lie on one side and all inputs which produce a 0 lie on the other side
In other words, a single perceptron can only be used to implement linearly separable functions
Then what is the difference? The weights (including threshold) can be learned and the inputs can be real valued
We will first revisit some boolean functions and then see the perceptron learning algorithm (for learning weights)
x1  x2  OR
0   0   0    w0 + ∑_{i=1}^{2} wi xi < 0
1   0   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
0   1   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
1   1   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0

w0 + w1 · 0 + w2 · 0 < 0 =⇒ w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0 =⇒ w2 > −w0
w0 + w1 · 1 + w2 · 0 ≥ 0 =⇒ w1 > −w0
w0 + w1 · 1 + w2 · 1 ≥ 0 =⇒ w1 + w2 > −w0

One possible solution to this set of inequalities is w0 = −1, w1 = 1.1, w2 = 1.1 (and various other solutions are possible)

[Figure: the four input points with the separating line −1 + 1.1x1 + 1.1x2 = 0]

Note that we can come up with a similar set of inequalities and find the value of θ for a McCulloch Pitts neuron also (Try it!)
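The solution above can be checked mechanically; a short sketch (NumPy assumed) that runs the four rows of the truth table through w0 = −1, w1 = 1.1, w2 = 1.1:

import numpy as np

w = np.array([-1.0, 1.1, 1.1])                    # [w0, w1, w2]
for x1, x2, target in [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]:
    s = w @ np.array([1.0, x1, x2])               # w0 + w1*x1 + w2*x2
    y = 1 if s >= 0 else 0
    assert y == target                            # all four inequalities are satisfied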
Module 2.4: Errors and Error Surfaces

Let us fix the threshold (−w0 = 1) and try different values of w1, w2
Say, w1 = −1, w2 = −1
What is wrong with this line? We make an error on 1 out of the 4 inputs
Let us try some more values of w1, w2 and note how many errors we make

w1     w2     errors
-1     -1     1
1.5    0      1
0.45   0.45   3

We are interested in those values of w0, w1, w2 which result in 0 error
Let us plot the error surface corresponding to different values of w0, w1, w2

[Figure: the four input points with the lines −1 + 1.1x1 + 1.1x2 = 0, −1 + (−1)x1 + (−1)x2 = 0, −1 + (1.5)x1 + (0)x2 = 0 and −1 + (0.45)x1 + (0.45)x2 = 0]
For ease of analysis, we will keep w0 fixed (−1) and plot the error for different values of w1, w2
For a given w0, w1, w2 we will compute w0 + w1 ∗ x1 + w2 ∗ x2 for all combinations of (x1, x2) and note down how many errors we make
For the OR function, an error occurs if (x1, x2) = (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 ≥ 0, or if (x1, x2) ≠ (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 < 0
We are interested in finding an algorithm which finds the values of w1, w2 which minimize this error
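The error being plotted can be computed directly; a sketch (NumPy assumed) that counts misclassified OR inputs for a grid of (w1, w2) with w0 fixed at −1, which is exactly the error surface described above:

import numpy as np

def or_errors(w1, w2, w0=-1.0):
    # number of the 4 OR inputs misclassified by the rule w0 + w1*x1 + w2*x2 >= 0
    errors = 0
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        y = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
        target = 0 if (x1, x2) == (0, 0) else 1
        errors += int(y != target)
    return errors

grid = np.linspace(-2, 2, 41)
surface = np.array([[or_errors(w1, w2) for w1 in grid] for w2 in grid])   # values between 0 and 3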
Module 2.5: Perceptron Learning Algorithm

We will now see a more principled approach for learning these weights and
threshold but before that let us revisit our movie example again

[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn producing output y]

x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
x4 = imdbRating (scaled to 0 to 1)
... ...
xn = criticsRating (scaled to 0 to 1)

Let us reconsider our problem of deciding whether to watch a movie or not
Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: binary decision
Further, suppose we represent each movie with n features (some boolean, some real valued)
We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision
In other words, we want the perceptron to find the equation of this separating plane (or find the values of w0, w1, w2, .., wn)
Algorithm: Perceptron Learning Algorithm

P ← inputs with label 1;
N ← inputs with label 0;
Initialize w = [w0, w1, ..., wn] randomly;
while !convergence do
    Pick random x ∈ P ∪ N;
    if x ∈ P and ∑_{i=0}^{n} wi ∗ xi < 0 then
        w = w + x;
    end
    if x ∈ N and ∑_{i=0}^{n} wi ∗ xi ≥ 0 then
        w = w − x;
    end
end
//the algorithm converges when all the inputs are classified correctly

Why would this work?
To understand why this works we will have to get into a bit of Linear Algebra and a bit of geometry...
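A runnable version of this pseudocode (a sketch in Python with NumPy; the max_epochs cap is an addition so the loop terminates even when the data is not linearly separable):

import numpy as np

def train_perceptron(X, y, max_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 to every input
    w = rng.standard_normal(Xb.shape[1])            # initialize w = [w0, w1, ..., wn] randomly
    for _ in range(max_epochs):
        pred = (Xb @ w >= 0).astype(int)
        if np.array_equal(pred, y):                 # converged: all inputs classified correctly
            break
        i = rng.integers(len(y))                    # pick a random x from P ∪ N
        if y[i] == 1 and Xb[i] @ w < 0:
            w = w + Xb[i]
        elif y[i] == 0 and Xb[i] @ w >= 0:
            w = w - Xb[i]
    return w

# example: learn the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)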
Consider two vectors w and x

w = [w0, w1, w2, ..., wn]
x = [1, x1, x2, ..., xn]

w · x = wT x = Σ_{i=0}^{n} wi ∗ xi

We can thus rewrite the perceptron rule as

y = 1 if wT x ≥ 0
  = 0 if wT x < 0

We are interested in finding the line wT x = 0 which divides the input space into two halves
Every point (x) on this line satisfies the equation wT x = 0
What can you tell about the angle (α) between w and any point (x) which lies on this line ?
The angle is 90◦ (∵ cosα = wT x / (||w|| ||x||) = 0)
Since the vector w is perpendicular to every point on the line it is actually perpendicular to the line itself

36/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
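As a quick sanity check, here is a minimal Python sketch of this rewritten rule; the weights and the input below are made-up values (not from the lecture), and the leading 1 in x folds the threshold w0 into the dot product.

import numpy as np

def perceptron_output(w, x):
    # Perceptron decision rule: y = 1 if w.x >= 0 else 0
    # w = [w0, w1, ..., wn]; x = [1, x1, ..., xn] (x0 = 1 is the bias input)
    return 1 if np.dot(w, x) >= 0 else 0

w = np.array([-1.0, 0.7, 0.6])   # hypothetical weights; w0 = -1 acts as -threshold
x = np.array([1.0, 1.0, 1.0])    # leading 1 is the bias input x0
print(perceptron_output(w, x))   # -1 + 0.7 + 0.6 = 0.3 >= 0, so this prints 1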
[Figure: the line wT x = 0 in the (x1, x2) plane with w normal to it; points p1, p2, p3 lie in the positive half space and points n1, n2, n3 lie in the negative half space]

Consider some points (vectors) which lie in the positive half space of this line (i.e., wT x ≥ 0)
What will be the angle between any such vector and w ? Obviously, less than 90◦
What about points (vectors) which lie in the negative half space of this line (i.e., wT x < 0)
What will be the angle between any such vector and w ? Obviously, greater than 90◦
Of course, this also follows from the formula (cosα = wT x / (||w|| ||x||))
Keeping this picture in mind let us revisit the algorithm

37/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
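A small sketch of this picture in code, with made-up points: the sign of wT x tells us which half space a point is in, and computing the angle explicitly confirms the less-than / greater-than 90◦ claim.

import numpy as np

w = np.array([1.0, 1.0])                  # hypothetical weight vector
points = {"p1": np.array([2.0, 1.0]),     # invented points on either side of wT x = 0
          "n1": np.array([-1.0, -2.0])}

for name, x in points.items():
    cos_alpha = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
    angle = np.degrees(np.arccos(cos_alpha))
    print(name, "wT x =", np.dot(w, x), "angle =", round(angle, 1))
# p1: wT x = 3.0 > 0 and the angle is about 18°; n1: wT x = -3.0 < 0 and the angle is about 162°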
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N ;
    if x ∈ P and w.x < 0 then
        w = w + x ;
    end
    if x ∈ N and w.x ≥ 0 then
        w = w − x ;
    end
end
//the algorithm converges when all the inputs are classified correctly

cosα = wT x / (||w|| ||x||)

For x ∈ P if w.x < 0 then it means that the angle (α) between this x and the current w is greater than 90◦ (but we want α to be less than 90◦)
What happens to the new angle (αnew) when wnew = w + x

cos(αnew) ∝ wnewT x
          ∝ (w + x)T x
          ∝ wT x + xT x
          ∝ cosα + xT x
cos(αnew) > cosα

Thus αnew will be less than α and this is exactly what we want

38/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
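Here is a tiny numeric check of this claim, with invented numbers: for a positive point x that is currently misclassified (w · x < 0), applying the correction w + x does pull the angle below its old value.

import numpy as np

def angle_deg(a, b):
    return np.degrees(np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))

w = np.array([-2.0, 1.0])        # hypothetical current weights
x = np.array([2.0, 1.0])         # a positive point with w.x = -3 < 0 (misclassified)
w_new = w + x                    # the perceptron correction for a positive point

print(angle_deg(w, x))           # about 126.9 degrees, i.e. greater than 90
print(angle_deg(w_new, x))       # about 63.4 degrees: the angle has decreased as claimed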
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N ;
    if x ∈ P and w.x < 0 then
        w = w + x ;
    end
    if x ∈ N and w.x ≥ 0 then
        w = w − x ;
    end
end
//the algorithm converges when all the inputs are classified correctly

cosα = wT x / (||w|| ||x||)

For x ∈ N if w.x ≥ 0 then it means that the angle (α) between this x and the current w is less than 90◦ (but we want α to be greater than 90◦)
What happens to the new angle (αnew) when wnew = w − x

cos(αnew) ∝ wnewT x
          ∝ (w − x)T x
          ∝ wT x − xT x
          ∝ cosα − xT x
cos(αnew) < cosα

Thus αnew will be greater than α and this is exactly what we want

39/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
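For reference, here is a minimal runnable sketch of this learning rule in Python. It assumes each input has already been augmented with a leading 1 for the bias, and for simplicity it sweeps over the data each epoch instead of sampling one random point per step; the function and variable names are placeholders, not from the lecture.

import numpy as np

def train_perceptron(P, N, max_epochs=100, seed=0):
    # P, N: lists of inputs with label 1 and label 0, each already augmented with a leading 1
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(P[0]))               # initialize w randomly
    data = [(x, 1) for x in P] + [(x, 0) for x in N]
    for _ in range(max_epochs):
        converged = True
        for x, label in data:
            if label == 1 and np.dot(w, x) < 0:      # positive point on the wrong side
                w = w + x
                converged = False
            elif label == 0 and np.dot(w, x) >= 0:   # negative point on the wrong side
                w = w - x
                converged = False
        if converged:                                # all inputs classified correctly
            break
    return w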
We will now see this algorithm in action for a toy dataset

40/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
[Figure: the toy dataset in the (x1, x2) plane with positive points p1, p2, p3 and negative points n1, n2, n3; the vector w and the line wT x = 0 are redrawn after every correction]

We initialized w to a random value
We observe that currently, w · x < 0 (∵ angle > 90◦) for all the positive points and w · x ≥ 0 (∵ angle < 90◦) for all the negative points (the situation is exactly opposite of what we actually want it to be)
We now run the algorithm by randomly going over the points

Randomly pick a point (say, p1), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p2), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n1), apply correction w = w − x ∵ w · x ≥ 0 (you can check the angle visually)
Randomly pick a point (say, n3), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n2), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p3), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p1), no correction needed ∵ w · x ≥ 0 (you can check the angle visually)
Randomly pick a point (say, p2), no correction needed ∵ w · x ≥ 0 (you can check the angle visually)
Randomly pick a point (say, n1), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n3), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n2), no correction needed ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p3), no correction needed ∵ w · x ≥ 0 (you can check the angle visually)

The algorithm has converged

41/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
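If you want to reproduce a run like this, here is a hedged sketch that feeds a made-up version of the toy dataset (the coordinates below are invented for illustration, not the exact points on the slide) to the train_perceptron sketch given earlier.

import numpy as np

# made-up 2D points, each augmented with a leading 1 for the bias term
P = [np.array([1.0,  2.0,  2.5]),    # p1
     np.array([1.0,  1.0,  3.0]),    # p2
     np.array([1.0,  3.0,  1.5])]    # p3
N = [np.array([1.0, -1.0, -1.0]),    # n1
     np.array([1.0, -2.0, -2.0]),    # n2
     np.array([1.0, -0.5, -2.5])]    # n3

w = train_perceptron(P, N)           # defined in the sketch above
for x in P + N:
    print(x[1:], "->", int(np.dot(w, x) >= 0))
# after convergence every p prints 1 and every n prints 0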
Module 2.6: Proof of Convergence

42/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Now that we have some faith and intuition about why the algorithm works, we
will see a more formal proof of convergence ...

43/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Theorem
Definition: Two sets P and N of points in an n-dimensional space are called absolutely linearly separable if n + 1 real numbers w0, w1, ..., wn exist such that every point (x1, x2, ..., xn) ∈ P satisfies Σ_{i=1}^{n} wi ∗ xi > w0 and every point (x1, x2, ..., xn) ∈ N satisfies Σ_{i=1}^{n} wi ∗ xi < w0.

Proposition: If the sets P and N are finite and linearly separable, the perceptron learning algorithm updates the weight vector wt a finite number of times. In other words: if the vectors in P and N are tested cyclically one after the other, a weight vector wt is found after a finite number of steps t which can separate the two sets.

Proof: On the next slide

44/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
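For completeness, a tiny sketch of what this definition asks for, given candidate numbers w0, w1, ..., wn (the function name is made up):

import numpy as np

def absolutely_separates(w0, w, P, N):
    # True iff every point of P lands strictly above the threshold and every point of N strictly below it
    return all(np.dot(w, x) > w0 for x in P) and all(np.dot(w, x) < w0 for x in N)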
Setup:
If x ∈ N then -x ∈ P (∵ wT x < 0 =⇒ wT (−x) ≥ 0)
We can thus consider a single set P′ = P ∪ N− and for every element p ∈ P′ ensure that wT p ≥ 0
Further we will normalize all the p’s so that ||p|| = 1 (notice that this does not affect the solution ∵ if wT (p/||p||) ≥ 0 then wT p ≥ 0)
Let w∗ be the normalized solution vector (we know one exists as the data is linearly separable)

Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
N− contains negations of all points in N;
P′ ← P ∪ N−;
Initialize w randomly;
while !convergence do
    Pick random p ∈ P′ ;
    p ← p/||p|| (so p now becomes unit norm) ;
    if w.p < 0 then
        w = w + p ;
    end
end
//the algorithm converges when all the inputs are classified correctly
//notice that we do not need the other if condition because by construction we want all points in P′ to lie in the positive half space w.p ≥ 0

45/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
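A small sketch of this setup step (negating the points of N, normalizing, and then needing only the single-sided update); the names below are placeholders.

import numpy as np

def make_p_prime(P, N):
    # Build P' = P ∪ N− with every point scaled to unit norm,
    # so a single condition w.p >= 0 covers both classes
    pts = [np.asarray(p, dtype=float) for p in P] + \
          [-np.asarray(n, dtype=float) for n in N]    # negate the negative points
    return [p / np.linalg.norm(p) for p in pts]        # normalizing does not change the sign of w.p

def train_on_p_prime(P_prime, max_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(P_prime[0]))
    for _ in range(max_steps):
        if all(np.dot(w, p) >= 0 for p in P_prime):    # converged: every p is on the positive side
            break
        p = P_prime[rng.integers(len(P_prime))]        # pick a random p
        if np.dot(w, p) < 0:
            w = w + p                                  # the only correction we need
    return w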
Observations:
w∗ is some optimal solution which exists but we don’t know what it is
We do not make a correction at every time-step
We make a correction only if wt · pi < 0 at that time step
So at time-step t we would have made only k (≤ t) corrections
Every time we make a correction a quantity δ gets added to the numerator
So by time-step t, a quantity kδ gets added to the numerator

Proof:
Now suppose at time step t we inspected the point pi and found that wt · pi < 0
We make a correction wt+1 = wt + pi
Let β be the angle between w∗ and wt+1

cosβ = (w∗ · wt+1) / ||wt+1||

Numerator = w∗ · wt+1 = w∗ · (wt + pi)
          = w∗ · wt + w∗ · pi
          ≥ w∗ · wt + δ          (δ = min{w∗ · pi | ∀i})
          ≥ w∗ · (wt−1 + pj) + δ
          ≥ w∗ · wt−1 + w∗ · pj + δ
          ≥ w∗ · wt−1 + 2δ
          ≥ w∗ · w0 + kδ         (By induction)

46/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Proof (continued:)

So far we have, wt · pi < 0 (and hence we made the correction)

cosβ = (w∗ · wt+1) / ||wt+1||    (by definition)
Numerator ≥ w∗ · w0 + kδ        (proved by induction)
Denominator² = ||wt+1||²
             = (wt + pi) · (wt + pi)
             = ||wt||² + 2 wt · pi + ||pi||²
             < ||wt||² + ||pi||²     (∵ wt · pi < 0)
             < ||wt||² + 1           (∵ ||pi||² = 1)
             < (||wt−1||² + 1) + 1
             < ||wt−1||² + 2
             < ||w0||² + k           (By the same observation that we made about δ)

47/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Proof (continued:)

So far we have, wt · pi < 0 (and hence we made the correction)

cosβ = (w∗ · wt+1) / ||wt+1||    (by definition)
Numerator ≥ w∗ · w0 + kδ        (proved by induction)
Denominator² < ||w0||² + k      (By the same observation that we made about δ)

cosβ ≥ (w∗ · w0 + kδ) / √(||w0||² + k)

cosβ thus grows proportional to √k
As k (number of corrections) increases cosβ can become arbitrarily large
But since cosβ ≤ 1, k must be bounded by a maximum number
Thus, there can only be a finite number of corrections (k) to w and the algorithm will converge!

48/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
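To see concretely how this bounds k, here is a short worked step under the simplifying assumption that w is initialized to w0 = 0 (the slides allow any initialization; this choice just keeps the algebra clean):

cosβ ≥ (w∗ · 0 + kδ) / √(0 + k) = kδ / √k = δ√k
δ√k ≤ cosβ ≤ 1  =⇒  k ≤ 1/δ²

So with unit-norm inputs and a unit-norm solution w∗, the number of corrections is at most 1/δ², where δ = min{w∗ · pi} > 0 is the margin guaranteed by (absolute) linear separability.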
Coming back to our questions ...
What about non-boolean (say, real) inputs ? Real valued inputs are allowed in a perceptron
Do we always need to hand code the threshold ? No, we can learn the threshold
Are all inputs equal ? What if we want to assign more weight (importance) to some inputs ? A perceptron allows weights to be assigned to inputs
What about functions which are not linearly separable ? Not possible with a single perceptron but we will see how to handle this ..

49/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Module 2.7: Linearly Separable Boolean Functions

50/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
So what do we do about functions which are not linearly separable ?
Let us see one such simple boolean function first.

51/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
x1  x2  XOR   w0 + Σi wi xi
 0   0   0      < 0
 1   0   1      ≥ 0
 0   1   1      ≥ 0
 1   1   0      < 0

w0 + w1 · 0 + w2 · 0 < 0  =⇒  w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0  =⇒  w2 ≥ −w0
w0 + w1 · 1 + w2 · 0 ≥ 0  =⇒  w1 ≥ −w0
w0 + w1 · 1 + w2 · 1 < 0  =⇒  w1 + w2 < −w0

The fourth condition contradicts conditions 2 and 3
Hence we cannot have a solution to this set of inequalities

[Figure: the four points (0,0), (1,0), (0,1), (1,1) plotted in the (x1, x2) plane;
and indeed you can see that it is impossible to draw a line which separates the
red points from the blue points]
52/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
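The same impossibility can be confirmed by brute force; the sketch below (my own, with an arbitrarily chosen weight grid) scans candidate (w0, w1, w2) and finds none that reproduces the XOR column above.

import itertools
import numpy as np

# Brute-force check (illustrative sketch): no (w0, w1, w2) on a reasonably fine grid
# satisfies all four XOR conditions, i.e. no single perceptron
#   y = 1 if w0 + w1*x1 + w2*x2 >= 0, else 0
# computes XOR. (Swap in the OR truth table and solutions appear immediately.)
xor_rows = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]
grid = np.linspace(-5, 5, 41)

solutions = [(w0, w1, w2)
             for w0, w1, w2 in itertools.product(grid, repeat=3)
             if all((w0 + w1 * x1 + w2 * x2 >= 0) == bool(t) for (x1, x2), t in xor_rows)]
print("weight settings that implement XOR:", solutions)     # -> []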
Most real world data is not linearly separable
and will always contain some outliers
In fact, sometimes there may not be any outliers
but still the data may not be linearly separable
We need computational units (models) which
can deal with such data
While a single perceptron cannot deal with such
data, we will show that a network of perceptrons
can indeed deal with such data

[Figure: a ring of '+' points nested inside a ring of 'o' points; the data is not
linearly separable even though there are no outliers]
53/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Before seeing how a network of perceptrons can deal with linearly inseparable
data, we will discuss boolean functions in some more detail ...

54/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
How many boolean functions can you design from 2 inputs ?
Let us begin with some easy ones which you already know ..

x1 x2 | f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16
 0  0 |  0  0  0  0  0  0  0  0  1   1   1   1   1   1   1   1
 0  1 |  0  0  0  0  1  1  1  1  0   0   0   0   1   1   1   1
 1  0 |  0  0  1  1  0  0  1  1  0   0   1   1   0   0   1   1
 1  1 |  0  1  0  1  0  1  0  1  0   1   0   1   0   1   0   1

Of these, how many are linearly separable ? (turns out all except XOR and
!XOR - feel free to verify)
In general, how many boolean functions can you have for n inputs ? 2^(2^n)
How many of these 2^(2^n) functions are not linearly separable ? For the time being,
it suffices to know that at least some of these are not linearly separable
(I encourage you to figure out the exact answer :-) )

55/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
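For n = 2 the counting claims can be checked mechanically; the sketch below (my own code, using a coarse brute-force over weights) enumerates all 2^(2^2) = 16 truth tables and reports that 14 of them are linearly separable.

import itertools
import numpy as np

# Enumerate all 16 boolean functions of 2 inputs and count how many a single
# perceptron (y = 1 if w0 + w1*x1 + w2*x2 >= 0) can realise, by brute-forcing the
# weights over a coarse grid. (Illustrative sketch, not from the slides.)
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = np.linspace(-3, 3, 25)

def separable(table):
    return any(all((w0 + w1 * x1 + w2 * x2 >= 0) == bool(t)
                   for (x1, x2), t in zip(inputs, table))
               for w0, w1, w2 in itertools.product(grid, repeat=3))

tables = list(itertools.product([0, 1], repeat=4))            # the 16 functions f1..f16
n_sep = sum(separable(t) for t in tables)
print(n_sep, "of", len(tables), "are linearly separable")     # expect 14
# The two that are not: XOR = (0, 1, 1, 0) and !XOR = (1, 0, 0, 1).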
Module 2.8: Network of Perceptrons

56/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
We will now see how to implement any boolean function using a network of
perceptrons ...

57/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
For this discussion, we will assume True = +1 and False = -1
We consider 2 inputs and 4 perceptrons
Each input is connected to all the 4 perceptrons with specific weights
The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the
weighted sum of its input is ≥ 2)
Each of these perceptrons is connected to an output perceptron by weights (which
need to be learned)
The output of this perceptron (y) is the output of this network

[Figure: inputs x1, x2 connected to 4 hidden perceptrons (each with bias = -2), which
in turn connect to an output perceptron y with weights w1, w2, w3, w4;
red edge indicates w = -1, blue edge indicates w = +1]

58/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Terminology:
This network contains 3 layers
The layer containing the inputs (x1, x2) is called the input layer
The middle layer containing the 4 perceptrons is called the hidden layer
The final layer containing one output neuron is called the output layer
The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4
The red and blue edges are called layer 1 weights
w1, w2, w3, w4 are called layer 2 weights

[Figure: the same network: inputs x1, x2; hidden perceptrons h1, h2, h3, h4 with
bias = -2; output y connected via weights w1, w2, w3, w4;
red edge indicates w = -1, blue edge indicates w = +1]

59/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
We claim that this network can be used to implement any boolean function (linearly
separable or not)!
In other words, we can find w1, w2, w3, w4 such that the truth table of any boolean
function can be represented by this network
Astonishing claim! Well, not really, if you understand what is going on
Each perceptron in the middle layer fires only for a specific input (and no two
perceptrons fire for the same input): the first perceptron fires for {-1,-1}, the
second for {-1,1}, the third for {1,-1} and the fourth for {1,1}
Let us see why this network works by taking an example of the XOR function

[Figure: the same network, with the hidden perceptrons h1, h2, h3, h4 labelled by the
input for which each fires: (-1,-1), (-1,1), (1,-1), (1,1);
red edge indicates w = -1, blue edge indicates w = +1]

60/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Let w0 be the bias of the output neuron (i.e., it will fire if Σi wi hi ≥ w0)

x1  x2  XOR | h1 h2 h3 h4 | Σi wi hi
 0   0   0  |  1  0  0  0 |    w1
 0   1   1  |  0  1  0  0 |    w2
 1   0   1  |  0  0  1  0 |    w3
 1   1   0  |  0  0  0  1 |    w4

This results in the following four conditions to implement XOR:
w1 < w0, w2 ≥ w0, w3 ≥ w0, w4 < w0
Unlike before, there are no contradictions now and the system of inequalities can
be satisfied
Essentially each wi is now responsible for one of the 4 possible inputs and can be
adjusted to get the desired output for that input

[Figure: the same network;
red edge indicates w = -1, blue edge indicates w = +1]

61/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
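The construction can be verified directly. The sketch below hard-codes the network from the figure; the layer-1 weights and the bias of -2 follow the slides, while the particular layer-2 values w = (-1, +1, +1, -1) and the output threshold w0 = 1 are my own choice satisfying the four conditions above.

import numpy as np

# Sketch of the network on this slide. True = +1, False = -1.
# Layer 1: each hidden perceptron has the ±1 weights matching the one input pattern
# it should fire for, and bias -2, so it fires only for that pattern.
W1 = np.array([[-1, -1],      # h1 fires only for (-1, -1)
               [-1, +1],      # h2 fires only for (-1, +1)
               [+1, -1],      # h3 fires only for (+1, -1)
               [+1, +1]])     # h4 fires only for (+1, +1)
b1 = -2

w = np.array([-1, +1, +1, -1])    # layer-2 weights (my choice: w1 < w0, w2 >= w0, w3 >= w0, w4 < w0)
w0 = 1                            # threshold of the output neuron

for x in [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]:
    h = (W1 @ np.array(x) + b1 >= 0).astype(int)     # one-hot: exactly one hidden unit fires
    y = int(w @ h >= w0)
    print(x, "-> h =", h, "-> XOR =", y)

Choosing other values of w1, ..., w4 that satisfy a different set of such inequalities makes the same network compute a different boolean function, which is the point of the next slide.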
It should be clear that the same network can be used to represent the remaining 15
boolean functions also
Each boolean function will result in a different set of non-contradicting
inequalities which can be satisfied by appropriately setting w1, w2, w3, w4
Try it!

[Figure: the same network;
red edge indicates w = -1, blue edge indicates w = +1]

62/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
What if we have more than 2 inputs ?

63/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Again each of the 8 perceptrons will fire only for one of the 8 inputs
Each of the 8 weights in the second layer is responsible for one of the 8 inputs
and can be adjusted to produce the desired output for that input

[Figure: inputs x1, x2, x3 connected to 8 hidden perceptrons (each with bias = -3),
which connect to the output y via weights w1, ..., w8]

64/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
What if we have n inputs ?

65/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
Theorem
Any boolean function of n inputs can be represented exactly by a network of
perceptrons containing 1 hidden layer with 2^n perceptrons and one output layer
containing 1 perceptron

Proof (informal:) We just saw how to construct such a network

Note: A network of 2^n + 1 perceptrons is not necessary but sufficient. For
example, we already saw how to represent the AND function with just 1 perceptron

Catch: As n increases, the number of perceptrons in the hidden layer obviously
increases exponentially

66/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
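The informal proof translates almost line-by-line into code. The sketch below (my own, using the True = +1 / False = -1 convention from the earlier slides) builds the 2^n hidden perceptrons for any given n-input truth table and checks the construction on 3-input parity, which is not linearly separable.

import itertools
import numpy as np

# Sketch of the construction behind the theorem (True = +1, False = -1):
# one hidden perceptron per input pattern (2^n of them, bias = -n), and the layer-2
# weight of each hidden unit is read straight off the truth table of the target function.
def build_network(truth_table, n):
    patterns = np.array(list(itertools.product([-1, +1], repeat=n)))   # 2^n patterns
    b1 = -n                               # hidden unit i fires only when x equals pattern i
    w2 = np.array([+1 if truth_table[tuple(int(v) for v in p)] else -1 for p in patterns])
    w0 = 1                                # output fires if sum_i w2[i] * h[i] >= w0
    def predict(x):
        h = (patterns @ np.asarray(x) + b1 >= 0).astype(int)
        return int(w2 @ h >= w0)
    return predict

# Example: 3-input parity (the n = 3 analogue of XOR, not linearly separable)
n = 3
parity = {p: p.count(+1) % 2 for p in itertools.product([-1, +1], repeat=n)}
f = build_network(parity, n)
print(all(f(p) == parity[p] for p in parity))     # True: the truth table is reproduced exactly

As the Catch above warns, the hidden layer here has 2^n units, so the construction is exact but exponentially large.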
Again, why do we care about boolean functions ?
How does this help us with our original problem, which was to predict whether
we like a movie or not? Let us see!

67/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
We are given this data about our past movie experience
For each movie, we are given the values of the various factors (x1, x2, ..., xn)
that we base our decision on, and we are also given the value of y (like/dislike)
The pi's are the points for which the output was 1 and the ni's are the points for
which it was 0
The data may or may not be linearly separable
But the proof that we just saw tells us that it is possible to have a network of
perceptrons and learn the weights in this network such that for any given pi or nj
the output of the network will be the same as yi or yj (i.e., we can separate the
positive and the negative points)

    p1:  x11  x12  ...  x1n   y1 = 1
    p2:  x21  x22  ...  x2n   y2 = 1
    ...
    n1:  xk1  xk2  ...  xkn   yk = 0
    n2:  xj1  xj2  ...  xjn   yj = 0
    ...

[Figure: the 3-input network from before (inputs x1, x2, x3; 8 hidden perceptrons
with bias = -3; output y via weights w1, ..., w8)]

68/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2
The story so far ...
Networks of the form that we just saw (containing an input, an output and one
or more hidden layers) are called Multilayer Perceptrons (MLP, in short)
More appropriate terminology would be "Multilayered Network of Perceptrons"
but MLP is the more commonly used name
The theorem that we just saw gives us the representation power of an MLP with
a single hidden layer
Specifically, it tells us that an MLP with a single hidden layer can represent any
boolean function

69/69
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 2