IE 643 Lecture 3 - August 21, 2020


Deep Learning - Theory and Practice

IE 643
Lecture 3

August 21, 2020.

P. Balamurugan


Outline

1 Recap
   - Supervised Machine Learning
   - Perceptron and Learning

2 Perceptron Convergence


Recap: Supervised Machine Learning


Binary Classification
Recall: e-mail Spam Classification


Input: e-mail messages =⇒ some feature space ⊆ R^d
   - x ∈ X ⊆ R^d

Output: Spam/Not spam =⇒ {+1, −1}
   - y ∈ Y = {+1, −1}

Generally, n input/output pairs {(x_i, y_i)}_{i=1}^n ⊂ (X × Y) are given for learning the machine learning model.

D = {(x_i, y_i)}_{i=1}^n is called the training data.
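As a purely illustrative aside (not from the slides), such a training set could be represented as follows; the feature values and labels are made up, with d = 3 hypothetical features per e-mail.

```python
import numpy as np

# Hypothetical toy version of D = {(x_i, y_i)}_{i=1}^n with d = 3 made-up features
# per e-mail (e.g. counts of certain words); all numbers are for illustration only.
X = np.array([
    [4.0, 1.0, 0.0],   # x_1
    [0.0, 0.0, 2.0],   # x_2
    [3.0, 2.0, 1.0],   # x_3
    [0.0, 1.0, 3.0],   # x_4
])
y = np.array([+1, -1, +1, -1])   # y_i in {+1, -1}: +1 = spam, -1 = not spam

n, d = X.shape
print(f"n = {n} training pairs, each x_i in R^{d}")
```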


General Nature of a Supervised Machine Learning Task

Training
   Input: Training data D = {(x_i, y_i)}_{i=1}^n
   Aim: Learn a model h : X → Y

Testing
   Given x̂, predict ŷ = h(x̂)

Recap: Perceptron and Learning


Perceptron

Input is of the form x = (x_1, x_2, . . . , x_d) ∈ R^d.

We associate weights w = (w_1, w_2, . . . , w_d) ∈ R^d to the connections.

Prediction Rule:
   - ⟨w, x⟩ ≥ θ =⇒ predict 1.
   - ⟨w, x⟩ < θ =⇒ predict −1.
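A minimal sketch of this thresholding rule in Python; the weight vector and threshold below are arbitrary values chosen only to illustrate the rule.

```python
import numpy as np

def perceptron_predict(w: np.ndarray, x: np.ndarray, theta: float) -> int:
    """Prediction rule: +1 if <w, x> >= theta, else -1."""
    return 1 if np.dot(w, x) >= theta else -1

# Arbitrary illustrative values (not from the slides)
w = np.array([1.0, 1.0])
theta = 4.0
print(perceptron_predict(w, np.array([3.0, 2.0]), theta))   # <w, x> = 5 >= 4  -> prints 1
print(perceptron_predict(w, np.array([1.0, 1.0]), theta))   # <w, x> = 2 <  4  -> prints -1
```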

Perceptron - Geometric Idea

Prediction Rule
   ⟨w, x⟩ ≥ θ =⇒ predict 1.
   ⟨w, x⟩ < θ =⇒ predict −1.

Geometric Idea: To find a separating hyperplane.

Perceptron - Geometric Idea

Hyperplane in R^d
   H = {x ∈ R^d : ∃ w ≠ 0 and b ∈ R s.t. ⟨w, x⟩ = b}.


Perceptron - Geometric Idea

Open Halfspaces associated with a hyperplane ⟨w, x⟩ = b
   Q1 = {x ∈ R^d : ⟨w, x⟩ > b} and Q2 = {x ∈ R^d : ⟨w, x⟩ < b}.

Perceptron - Geometric Idea

Closed Halfspaces associated with a hyperplane ⟨w, x⟩ = b
   S1 = {x ∈ R^d : ⟨w, x⟩ ≥ b} and S2 = {x ∈ R^d : ⟨w, x⟩ ≤ b}.
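To make the picture concrete, here is a small sketch that locates a point relative to a hyperplane ⟨w, x⟩ = b; the hyperplane and test points are arbitrary examples.

```python
import numpy as np

def locate(w: np.ndarray, b: float, x: np.ndarray) -> str:
    """Report whether x lies in Q1, in Q2, or on the hyperplane H itself."""
    s = np.dot(w, x)
    if s > b:
        return "Q1: <w, x> > b (also in the closed halfspace S1)"
    if s < b:
        return "Q2: <w, x> < b (also in the closed halfspace S2)"
    return "on H: <w, x> = b (in both S1 and S2)"

w, b = np.array([1.0, 1.0]), 4.0        # example hyperplane: x1 + x2 = 4
for x in [np.array([3.0, 2.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]:
    print(x, "->", locate(w, b, x))
```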

Perceptron - Geometric Idea

Prediction Rule
   ⟨w, x⟩ ≥ θ =⇒ predict 1.
   ⟨w, x⟩ < θ =⇒ predict −1.

Geometric Idea: To find a separating hyperplane (w, θ) such that samples with class labels 1 and −1 lie on alternate sides of the hyperplane.


Perceptron

Input is of the form x̃ = (x, 1) = (x_1, x_2, . . . , x_d, 1) ∈ R^{d+1}.

We associate weights w̃ = (w, −θ) = (w_1, w_2, . . . , w_d, −θ) ∈ R^{d+1} to the connections.

Prediction Rule:
   - ⟨w̃, x̃⟩ = ⟨w, x⟩ − θ ≥ 0 =⇒ predict 1.
   - ⟨w̃, x̃⟩ = ⟨w, x⟩ − θ < 0 =⇒ predict −1.
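A short sketch checking that the augmented form reproduces the earlier thresholded rule; the particular w and θ are again arbitrary illustrative values.

```python
import numpy as np

def predict_threshold(w, theta, x):
    # Original rule: compare <w, x> against theta
    return 1 if np.dot(w, x) >= theta else -1

def predict_augmented(w_tilde, x):
    # Augmented rule: append 1 to x and compare <w~, x~> against 0
    x_tilde = np.append(x, 1.0)              # x~ = (x, 1)
    return 1 if np.dot(w_tilde, x_tilde) >= 0 else -1

w, theta = np.array([1.0, 1.0]), 4.0
w_tilde = np.append(w, -theta)                # w~ = (w, -theta)

for x in [np.array([3.0, 2.0]), np.array([1.0, 1.0]), np.array([4.0, 0.0])]:
    assert predict_threshold(w, theta, x) == predict_augmented(w_tilde, x)
print("augmented form agrees with the thresholded form on these points")
```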

Perceptron - Data Perspective

Input: data point x = (x_1, x_2, . . . , x_d), label y ∈ {+1, −1}.



Perceptron - Training

Perceptron Training Procedure

1: w̃^1 = 0
2: for t ← 1, 2, 3, . . . do
3:    receive (x^t, y^t), x^t ∈ R^d, y^t ∈ {+1, −1}.
4:    Transform x^t into x̃^t = (x^t, 1) ∈ R^{d+1}.
5:    ŷ = Perceptron(x̃^t; w̃^t)
6:    if ŷ ≠ y^t then
7:       w̃^{t+1} = w̃^t + y^t x̃^t
8:    else
9:       w̃^{t+1} = w̃^t
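A compact Python sketch of this procedure. The slides describe an endless stream of (x^t, y^t); the version below instead sweeps repeatedly over a fixed dataset and caps the number of epochs, which is my own adaptation, but the mistake-driven update (step 7 above) is unchanged.

```python
import numpy as np

def perceptron_train(X: np.ndarray, y: np.ndarray, max_epochs: int = 100) -> np.ndarray:
    """Mistake-driven perceptron training: on an error, w~ <- w~ + y_t * x~_t."""
    n, d = X.shape
    w_tilde = np.zeros(d + 1)                       # step 1: w~_1 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):                  # receive (x_t, y_t)
            x_tilde = np.append(x_t, 1.0)           # step 4: x~_t = (x_t, 1)
            y_hat = 1 if np.dot(w_tilde, x_tilde) >= 0 else -1   # step 5
            if y_hat != y_t:                        # step 6: mistake
                w_tilde = w_tilde + y_t * x_tilde   # step 7: update
                mistakes += 1
        if mistakes == 0:                           # no mistakes in a full sweep: done
            break
    return w_tilde
```

For example, with the toy (X, y) from the earlier illustrative snippet, perceptron_train(X, y) stops after a few sweeps and returns a w̃ = (w, −θ) that classifies all four points correctly.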


Perceptron Update Rule - Geometric Idea

The current weights are w̃^t = (1, 1, −4).

Suppose a new point x^t = (2, 1), y^t = 1 arrives.

With the current weights, the perceptron outputs ŷ^t = −1 for the point x^t = (2, 1).

An error occurs and w̃^t gets updated to w̃^{t+1} = w̃^t + y^t x̃^t.

After the update, the new weights become w̃^{t+1} = (3, 2, −3).
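The arithmetic of this particular update can be checked in a couple of lines:

```python
import numpy as np

w_tilde_t = np.array([1.0, 1.0, -4.0])        # current weights w~_t = (1, 1, -4)
x_t, y_t = np.array([2.0, 1.0]), 1            # new point x_t = (2, 1) with label y_t = 1

x_tilde_t = np.append(x_t, 1.0)               # x~_t = (2, 1, 1)
score = np.dot(w_tilde_t, x_tilde_t)          # 2 + 1 - 4 = -1 < 0, so the perceptron predicts -1
w_tilde_next = w_tilde_t + y_t * x_tilde_t    # mistake, so update: (1,1,-4) + (2,1,1) = (3, 2, -3)

print(score)          # -1.0
print(w_tilde_next)   # [ 3.  2. -3.]
```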


Convergence of Perceptron Training


Perceptron Convergence - Geometric Intuition

What are some natural assumptions to expect the perceptron training to converge?

Let us first motivate such assumptions through geometric intuition.


Can the data be separated by a hyperplane?

First assumption: At least the data should be such that the samples with label 1 can be separated by a hyperplane from the samples with label −1.

Is this assumption sufficient?

[A sequence of figure-only slides titled "Perceptron Convergence - Geometric Intuition" follows here in the original deck; the figures are not reproduced in this text.]
Perceptron Convergence - Geometric Intuition

Refined assumption: We not only want the data to be separated, but the separation should be good enough!

Perceptron Convergence - Separability Assumption

Linear Separability Assumption

Let D = {(x^t, y^t)}_{t=1}^∞ denote the training data, where x^t ∈ R^d, y^t ∈ {+1, −1}, ∀t = 1, 2, . . .. Then there exist w* ∈ R^d with w* ≠ 0 and γ > 0, such that:

   ⟨w*, x^t⟩ > γ    where y^t = 1,
   ⟨w*, x^t⟩ < −γ   where y^t = −1.

Equivalently, the assumption can be stated compactly as:

   y^t ⟨w*, x^t⟩ > γ,   ∀t = 1, 2, . . .
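For a finite dataset, this assumption can be checked numerically against a candidate w*: the condition holds precisely when min_t y^t ⟨w*, x^t⟩ is positive, and any 0 < γ below that minimum works. The data and w* below are made-up examples.

```python
import numpy as np

def worst_margin(X: np.ndarray, y: np.ndarray, w_star: np.ndarray) -> float:
    """Return min_t y_t <w*, x_t>; separability w.r.t. w* needs this to be positive,
    and then any 0 < gamma < worst_margin satisfies y_t <w*, x_t> > gamma for all t."""
    return float(np.min(y * (X @ w_star)))

# Made-up separable example (not from the slides)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w_star = np.array([1.0, 1.0])
print(worst_margin(X, y, w_star))   # 3.0 here, so e.g. gamma = 1 satisfies the assumption
```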


Perceptron Convergence - Mistake Bound

We will try to derive useful bounds on the number of mistakes that a perceptron can commit during its training.

Assumption on data: Linear Separability.

Assume that T rounds of perceptron training have been completed, where T is some large number.

Assume that M mistakes are made by the perceptron in these T rounds. (Obviously, M ≤ T.)

We ask if the number of mistakes M can be bounded by some suitable quantity.
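As a small empirical illustration of what this bound is about (entirely my own construction, not from the slides): stream linearly separable points through the perceptron update, count the mistakes M, and compare M to T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linearly separable stream: labels come from a fixed w_star, and points too
# close to the separating hyperplane are discarded to enforce a margin.
w_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
X = X[np.abs(X @ w_star) > 0.5]
y = np.where(X @ w_star > 0, 1, -1)

w = np.zeros(3)        # perceptron weights (no bias term, to match the assumption above)
M = 0                  # mistake counter
for x_t, y_t in zip(X, y):
    y_hat = 1 if np.dot(w, x_t) >= 0 else -1
    if y_hat != y_t:
        w = w + y_t * x_t
        M += 1

print(f"T = {len(X)} rounds, M = {M} mistakes")
```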

Perceptron Convergence - Mistake Bound

We begin the analysis by considering an arbitrary round t ∈ {1, 2, . . . , T} where a mistake is made by the perceptron.

Recall: During this t-th round:
   - Perceptron computes ŷ^t = sign(⟨w^t, x^t⟩).
   - ŷ^t ≠ y^t.
   - Perceptron update: w^{t+1} = w^t + y^t x^t.

Now from the linear separability assumption, we have w* ≠ 0 such that y^t ⟨w*, x^t⟩ > γ.

First step: To bound the difference ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩.

Now we can write

   ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ = ⟨w*, w^t + y^t x^t⟩ − ⟨w*, w^t⟩
                             = ⟨w*, w^t⟩ + ⟨w*, y^t x^t⟩ − ⟨w*, w^t⟩   (how?)
                             = ⟨w*, y^t x^t⟩
                             = y^t ⟨w*, x^t⟩   (how?)
                             > γ.

Hence, for any round t where a mistake is made,

   ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ > γ.
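For reference, the two "(how?)" steps above are just the linearity of the inner product in its second argument; a brief sketch:

```latex
% Additivity: used in the second line of the chain above.
\langle w^*,\, w^t + y^t x^t \rangle \;=\; \langle w^*, w^t \rangle + \langle w^*, y^t x^t \rangle.

% Homogeneity: y^t is a scalar, so it can be pulled out of the inner product.
\langle w^*,\, y^t x^t \rangle \;=\; y^t \,\langle w^*, x^t \rangle.
```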


Now when no mistake is made in round t, we have w^{t+1} = w^t. Hence ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ = 0.


Recall our assumptions:

   - T rounds of perceptron training have been completed, where T is some large number.
   - M mistakes are made by the perceptron in these T rounds. (Obviously, M ≤ T.)


Now sum the per-round differences over all T rounds and split the sum according to whether a mistake was made:

   Σ_{t=1}^{T} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩)
      = Σ_{t ∈ {1,...,T}: mistake at round t} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩)
        + Σ_{t ∈ {1,...,T}: no mistake at round t} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩)
      = Σ_{t ∈ {1,...,T}: mistake at round t} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩) + 0   (how?)
      > Mγ   (how?)


Also note:

   Σ_{t=1}^{T} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩) = ⟨w*, w^{T+1}⟩   (homework!)


Hence we have:

   Σ_{t=1}^{T} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩) > Mγ
   =⇒ ⟨w*, w^{T+1}⟩ > Mγ.
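For the "homework" identity used in the last step, a sketch (assuming the weights are initialized to zero, as in the training procedure): the sum telescopes, and only the first and last terms survive.

```latex
\sum_{t=1}^{T}\bigl(\langle w^*, w^{t+1}\rangle - \langle w^*, w^{t}\rangle\bigr)
  = \langle w^*, w^{T+1}\rangle - \langle w^*, w^{1}\rangle
  = \langle w^*, w^{T+1}\rangle
  \qquad \text{(since } w^{1} = 0 \text{).}
```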
