IE643 Lecture3 2020aug21

Deep Learning - Theory and Practice

IE 643
Lecture 3

August 21, 2020.

1 Recap
Supervised Machine Learning
Perceptron and Learning

2 Perceptron Convergence

Recap Supervised Machine Learning

Recap: Supervised Machine Learning

Binary Classification
Recall: e-mail Spam Classification

Input: e-mail messages ⟹ some feature space ⊆ R^d.
  - x ∈ X ⊆ R^d

Output: Spam/Not spam ⟹ {+1, −1}
  - y ∈ Y = {+1, −1}

Generally n input/output pairs {(x_i, y_i)}_{i=1}^n ⊂ (X × Y) are given for learning the machine learning model.

D = {(x_i, y_i)}_{i=1}^n is called the training data.

General Nature of a Supervised Machine Learning Task

Input: Training data D = {(x_i, y_i)}_{i=1}^n
Aim: Learn a model h : X → Y
Given x̂, predict ŷ = h(x̂)
Recap Perceptron and Learning

Recap: Perceptron and Learning

Input is of the form x = (x_1, x_2, . . . , x_d) ∈ R^d.

We associate weights w = (w_1, w_2, . . . , w_d) ∈ R^d to the connections.
Prediction Rule:
  - ⟨w, x⟩ ≥ θ ⟹ predict 1.
  - ⟨w, x⟩ < θ ⟹ predict −1.
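
As a rough illustration of the prediction rule above, here is a minimal Python sketch. The function names `dot` and `predict` and the sample weights and threshold are assumptions made for this example; they are not part of the lecture.

```python
def dot(w, x):
    """Inner product <w, x> of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, x, theta):
    """Perceptron prediction rule: 1 if <w, x> >= theta, otherwise -1."""
    return 1 if dot(w, x) >= theta else -1


# Illustrative (assumed) weights and threshold.
w, theta = [1.0, 1.0], 4.0
print(predict(w, [2.0, 1.0], theta))  # <w, x> = 3, which is below theta
print(predict(w, [3.0, 2.0], theta))  # <w, x> = 5, which is at least theta
```

Note that a point lying exactly on the hyperplane (⟨w, x⟩ = θ) is assigned the label 1 under this rule.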

Perceptron - Geometric Idea

Prediction Rule
  ⟨w, x⟩ ≥ θ ⟹ predict 1.
  ⟨w, x⟩ < θ ⟹ predict −1.
Geometric Idea: To find a separating hyperplane.

Hyperplane in R^d
Given w ∈ R^d with w ≠ 0 and b ∈ R, the hyperplane determined by (w, b) is
H = {x ∈ R^d : ⟨w, x⟩ = b}.

Open Halfspaces associated with a hyperplane ⟨w, x⟩ = b

Q1 = {x ∈ R^d : ⟨w, x⟩ > b} and Q2 = {x ∈ R^d : ⟨w, x⟩ < b}.

Closed Halfspaces associated with a hyperplane ⟨w, x⟩ = b

S1 = {x ∈ R^d : ⟨w, x⟩ ≥ b} and S2 = {x ∈ R^d : ⟨w, x⟩ ≤ b}.
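
The halfspace definitions can be checked mechanically. The sketch below is only an illustration: the helper names `side`, `in_S1`, `in_S2` and the sample hyperplane are assumptions introduced here. It locates a point relative to the hyperplane ⟨w, x⟩ = b.

```python
def dot(w, x):
    """Inner product <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def side(w, b, x):
    """Return 'Q1', 'Q2' or 'H' depending on where x lies relative to <w, x> = b."""
    s = dot(w, x) - b
    if s > 0:
        return "Q1"  # open halfspace <w, x> > b
    if s < 0:
        return "Q2"  # open halfspace <w, x> < b
    return "H"       # on the hyperplane itself


def in_S1(w, b, x):
    """Closed halfspace S1 is Q1 together with the hyperplane H."""
    return dot(w, x) >= b


def in_S2(w, b, x):
    """Closed halfspace S2 is Q2 together with the hyperplane H."""
    return dot(w, x) <= b


# Assumed example hyperplane x1 + x2 = 4 in R^2.
w, b = [1, 1], 4
for point in ([3, 2], [2, 1], [3, 1]):
    print(point, side(w, b, point), in_S1(w, b, point), in_S2(w, b, point))
```

A point on the hyperplane belongs to both closed halfspaces but to neither open halfspace.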

Prediction Rule
  ⟨w, x⟩ ≥ θ ⟹ predict 1.
  ⟨w, x⟩ < θ ⟹ predict −1.
Geometric Idea: To find a separating hyperplane (w, θ) such that samples with class labels 1 and −1 lie on alternate sides of the hyperplane.

Input is of the form x̃ = (x, 1) = (x_1, x_2, . . . , x_d, 1) ∈ R^{d+1}.

We associate weights w̃ = (w, −θ) = (w_1, w_2, . . . , w_d, −θ) ∈ R^{d+1} to the connections.
Prediction Rule:
  - ⟨w̃, x̃⟩ = ⟨w, x⟩ − θ ≥ 0 ⟹ predict 1.
  - ⟨w̃, x̃⟩ = ⟨w, x⟩ − θ < 0 ⟹ predict −1.
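
To see that absorbing the threshold into the weight vector does not change the prediction, one can compare both forms numerically. This is a small sketch; the helper names `augment_input`, `augment_weights`, `predict_theta`, `predict_aug` and the sample values are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def augment_input(x):
    """x -> x~ = (x, 1)."""
    return list(x) + [1]


def augment_weights(w, theta):
    """(w, theta) -> w~ = (w, -theta)."""
    return list(w) + [-theta]


def predict_theta(w, x, theta):
    """Original rule: compare <w, x> against theta."""
    return 1 if dot(w, x) >= theta else -1


def predict_aug(w_tilde, x_tilde):
    """Augmented rule: compare <w~, x~> against 0."""
    return 1 if dot(w_tilde, x_tilde) >= 0 else -1


# Assumed sample weights, threshold and points.
w, theta = [1, 1], 4
w_tilde = augment_weights(w, theta)
for x in ([2, 1], [3, 1], [3, 2]):
    x_tilde = augment_input(x)
    # <w~, x~> equals <w, x> - theta, so both rules agree.
    print(x, predict_theta(w, x, theta), predict_aug(w_tilde, x_tilde))
```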

Perceptron - Data Perspective

Input: data point x = (x_1, x_2, . . . , x_d), label y ∈ {+1, −1}.

Perceptron - Training

Perceptron Training Procedure

w̃^1 = 0
for t ← 1, 2, 3, . . . do
    receive (x^t, y^t), x^t ∈ R^d, y^t ∈ {+1, −1}.
    Transform x^t into x̃^t = (x^t, 1) ∈ R^{d+1}.
    ŷ = Perceptron(x̃^t; w̃^t)
    if ŷ ≠ y^t then
        w̃^{t+1} = w̃^t + y^t x̃^t
    else
        w̃^{t+1} = w̃^t
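
The training procedure above could be written in Python roughly as follows. This is a minimal sketch under stated assumptions: the function name `perceptron_train`, the mistake counter, and the toy data are illustrative additions, and the data arrive as a finite list rather than an endless stream.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def perceptron_train(stream):
    """Run the perceptron update over (x, y) pairs with y in {+1, -1}.

    Returns the final augmented weight vector and the number of mistakes.
    """
    w = None
    mistakes = 0
    for x, y in stream:
        x_tilde = list(x) + [1.0]           # x~ = (x, 1)
        if w is None:
            w = [0.0] * len(x_tilde)        # w~^1 = 0
        y_hat = 1 if dot(w, x_tilde) >= 0 else -1
        if y_hat != y:                      # mistake: w~ <- w~ + y * x~
            w = [wi + y * xi for wi, xi in zip(w, x_tilde)]
            mistakes += 1
        # otherwise the weights are left unchanged
    return w, mistakes


# Assumed toy data: two points, presented three times in the same order.
data = [((2.0, 1.0), 1), ((0.0, 0.0), -1)]
w_final, n_mistakes = perceptron_train(data * 3)
print(w_final, n_mistakes)
```

In this toy run all mistakes happen in the first two passes; the third pass leaves the weights unchanged.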

Perceptron Update Rule - Geometric Idea

The current weights are w̃^t = (1, 1, −4).

Suppose a new point x^t = (2, 1), y^t = 1 arrives.

With the current weights, perceptron outputs ŷ^t = −1 for the point x^t = (2, 1), since ⟨w̃^t, x̃^t⟩ = 2 + 1 − 4 = −1 < 0.
An error occurs and w̃^t gets updated to w̃^{t+1} = w̃^t + y^t x̃^t.
After update, the new weights become w̃^{t+1} = (3, 2, −3).
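
The numbers in this example can be reproduced with a few lines of Python. The variable names below are chosen for this sketch only.

```python
# Values taken from the example: current weights, new point and its label.
w_t = [1, 1, -4]
x_t, y_t = [2, 1], 1

x_tilde = x_t + [1]                                  # x~ = (2, 1, 1)
score = sum(wi * xi for wi, xi in zip(w_t, x_tilde))
y_hat = 1 if score >= 0 else -1                      # prediction with w~^t

if y_hat != y_t:
    # Mistake: apply the update w~^{t+1} = w~^t + y^t x~^t.
    w_next = [wi + y_t * xi for wi, xi in zip(w_t, x_tilde)]
else:
    w_next = w_t

new_score = sum(wi * xi for wi, xi in zip(w_next, x_tilde))
print(score, y_hat, w_next, new_score)
```

With the updated weights the score of the same point becomes positive, so this point would now be classified correctly.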

Perceptron Convergence

Convergence of Perceptron Training

Perceptron Convergence - Geometric Intuition

What are some natural assumptions to expect the perceptron training to converge?

Let us first motivate such assumptions through geometric intuition.

Can the data be separated by a hyperplane?

First assumption: At least the data should be such that the samples with label 1 can be separated by a hyperplane from samples with label −1.
Is this assumption sufficient?

Refined assumption: We not only want the data to be separated, but the separation should be good enough!

Perceptron Convergence - Separability Assumption

Linear Separability Assumption

Let D = {(x^t, y^t)}_{t=1}^∞ denote the training data where x^t ∈ R^d, y^t ∈ {+1, −1}, ∀t = 1, 2, . . .. Then there exist w* ∈ R^d, w* ≠ 0, and γ > 0, such that

    ⟨w*, x^t⟩ > γ where y^t = 1,
    ⟨w*, x^t⟩ < −γ where y^t = −1.

Equivalently, the two conditions can be written as the single condition

    y^t ⟨w*, x^t⟩ > γ, ∀t = 1, 2, . . ..
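
For a finite sample, the condition y^t ⟨w*, x^t⟩ > γ is easy to test for a candidate w*. The sketch below is illustrative only: the names `smallest_margin` and `satisfies_assumption`, the candidate w*, and the data are assumptions, and the lecture's assumption is stated for an infinite sequence of samples rather than a finite list.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def smallest_margin(w_star, data):
    """Smallest value of y * <w*, x> over the sample."""
    return min(y * dot(w_star, x) for x, y in data)


def satisfies_assumption(w_star, data, gamma):
    """True if y * <w*, x> > gamma holds (strictly) for every sample."""
    return all(y * dot(w_star, x) > gamma for x, y in data)


# Assumed toy sample and candidate separator.
data = [((2, 1), 1), ((-1, -1), -1), ((1, 2), 1)]
w_star = (1, 1)
print(smallest_margin(w_star, data))
print(satisfies_assumption(w_star, data, 1.5))
print(satisfies_assumption(w_star, data, 2))
```

Because the inequality is strict, γ equal to the smallest value of y ⟨w*, x⟩ does not satisfy the condition, while any smaller positive γ does.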

Perceptron Convergence - Mistake Bound

We will try to derive useful bounds on the number of mistakes that a perceptron can commit during its training.

Assumption on data: Linear Separability

Assume that T rounds of training have been completed in perceptron training. Assume T to be some large number.

Assume that M mistakes are made by the perceptron in these T rounds. (Obviously, M ≤ T.)

We ask if the number of mistakes M can be bounded by some suitable quantity.

We begin the analysis by considering an arbitrary round t ∈ {1, 2, . . . , T} where a mistake is made by the perceptron.
Recall: During this t-th round:
  - Perceptron computes ŷ^t = sign(⟨w^t, x^t⟩).
  - ŷ^t ≠ y^t.
  - Perceptron update: w^{t+1} = w^t + y^t x^t.
Now from linear separability assumption, we have w* ≠ 0 such that y^t ⟨w*, x^t⟩ > γ.
First step: To bound the difference ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩.
Now we can write

⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ = ⟨w*, w^t + y^t x^t⟩ − ⟨w*, w^t⟩
                           = ⟨w*, w^t⟩ + ⟨w*, y^t x^t⟩ − ⟨w*, w^t⟩   (how?)
                           = ⟨w*, y^t x^t⟩
                           = y^t ⟨w*, x^t⟩   (how?)
                           > γ.

Hence, in every round t where a mistake is made,

⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ > γ.
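
The per-round inequality can be checked on a single concrete mistake round. The values of w*, w^t, x^t, y^t and γ below are assumptions picked for this sketch; sign(0) is taken to be +1, in line with the prediction rule used earlier.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Assumed values for one round in which the perceptron errs.
w_star = [1, 1]
gamma = 2
w_t = [-1, -1]
x_t, y_t = [2, 1], 1

y_hat = 1 if dot(w_t, x_t) >= 0 else -1            # prediction in round t
assert y_hat != y_t                                 # this round is a mistake

w_next = [wi + y_t * xi for wi, xi in zip(w_t, x_t)]  # w^{t+1} = w^t + y^t x^t

gain = dot(w_star, w_next) - dot(w_star, w_t)       # <w*, w^{t+1}> - <w*, w^t>
margin = y_t * dot(w_star, x_t)                     # y^t <w*, x^t>
print(gain, margin, gain > gamma)
```

The gain in ⟨w*, w⟩ equals y^t ⟨w*, x^t⟩ exactly, which is why the separability assumption bounds it from below by γ.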

Now when no mistake is made in round t, we have w^{t+1} = w^t.

Hence ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ = 0.

Recall our assumptions:

Assume that T rounds of training have been completed in perceptron training. Assume T to be some large number.

Assume that M mistakes are made by the perceptron in these T rounds. (Obviously, M ≤ T.)

Σ_{t=1}^{T} ( ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ )
    = Σ_{t ∈ {1,...,T} : mistake is made at round t} ( ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ )
      + Σ_{t ∈ {1,...,T} : no mistake is made at round t} ( ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ )
    = Σ_{t ∈ {1,...,T} : mistake is made at round t} ( ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ ) + 0   (how?)
    > Mγ   (how?)

Also note:
Σ_{t=1}^{T} ( ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ ) = ⟨w*, w^{T+1}⟩   (homework!)

Hence we have:
Σ_{t=1}^{T} ( ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ ) > Mγ
⟹ ⟨w*, w^{T+1}⟩ > Mγ
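
As a sanity check (not a proof), the inequality ⟨w*, w^{T+1}⟩ > Mγ can be observed on a small run. Everything specific below is an assumption for illustration: the function name `run_and_track`, the toy data, the choice w* = (1, 1) with γ = 1, the start w^1 = 0, and the convention sign(0) = +1.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def run_and_track(data, w_star):
    """Run T = len(data) perceptron rounds starting from w^1 = 0.

    Returns the final weights w^{T+1}, the mistake count M, and the sum over
    all rounds of <w*, w^{t+1}> - <w*, w^t>.
    """
    w = [0] * len(w_star)
    mistakes = 0
    total_gain = 0
    for x, y in data:
        y_hat = 1 if dot(w, x) >= 0 else -1
        if y_hat != y:
            w_next = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
        else:
            w_next = w
        total_gain += dot(w_star, w_next) - dot(w_star, w)
        w = w_next
    return w, mistakes, total_gain


# Assumed toy data; each sample satisfies y * <w*, x> = 2 > gamma = 1.
w_star, gamma = [1, 1], 1
data = [((1, -3), -1), ((-1, 3), 1), ((-3, 1), -1), ((3, -1), 1)]
w_final, M, total_gain = run_and_track(data, w_star)
print(w_final, M, total_gain, dot(w_star, w_final), M * gamma)
```

In this run the accumulated sum coincides with ⟨w*, w^{T+1}⟩ and exceeds Mγ, as the derivation predicts for M ≥ 1.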

