IE 643 Lecture 3 - August 21, 2020


Deep Learning - Theory and Practice

IE 643
Lecture 3

August 21, 2020.

P. Balamurugan


Outline

1 Recap
   - Supervised Machine Learning
   - Perceptron and Learning

2 Perceptron Convergence


Recap: Supervised Machine Learning


Binary Classification
Recall: e-mail Spam Classification


Input: e-mail messages =⇒ some feature space ⊆ R^d
   - x ∈ X ⊆ R^d

Output: Spam/Not spam =⇒ {+1, −1}
   - y ∈ Y = {+1, −1}

Generally, n input/output pairs {(x_i, y_i)}_{i=1}^n ⊂ (X × Y) are given for learning the machine learning model.

D = {(x_i, y_i)}_{i=1}^n is called the training data.
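As a purely illustrative aside (not from the slides), such a training set could be represented as follows; the feature values and labels are made up, with d = 3 hypothetical features per e-mail.

```python
import numpy as np

# Hypothetical toy version of D = {(x_i, y_i)}_{i=1}^n with d = 3 made-up features
# per e-mail (e.g. counts of certain words); all numbers are for illustration only.
X = np.array([
    [4.0, 1.0, 0.0],   # x_1
    [0.0, 0.0, 2.0],   # x_2
    [3.0, 2.0, 1.0],   # x_3
    [0.0, 1.0, 3.0],   # x_4
])
y = np.array([+1, -1, +1, -1])   # y_i in {+1, -1}: +1 = spam, -1 = not spam

n, d = X.shape
print(f"n = {n} training pairs, each x_i in R^{d}")
```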


General Nature of a Supervised Machine Learning Task

Training
   Input: Training data D = {(x_i, y_i)}_{i=1}^n
   Aim: Learn a model h : X → Y

Testing
   Given x̂, predict ŷ = h(x̂)

Recap: Perceptron and Learning


Perceptron

Input is of the form x = (x_1, x_2, . . . , x_d) ∈ R^d.

We associate weights w = (w_1, w_2, . . . , w_d) ∈ R^d to the connections.

Prediction Rule:
   - ⟨w, x⟩ ≥ θ =⇒ predict 1.
   - ⟨w, x⟩ < θ =⇒ predict −1.
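A minimal sketch of this thresholding rule in Python; the weight vector and threshold below are arbitrary values chosen only to illustrate the rule.

```python
import numpy as np

def perceptron_predict(w: np.ndarray, x: np.ndarray, theta: float) -> int:
    """Prediction rule: +1 if <w, x> >= theta, else -1."""
    return 1 if np.dot(w, x) >= theta else -1

# Arbitrary illustrative values (not from the slides)
w = np.array([1.0, 1.0])
theta = 4.0
print(perceptron_predict(w, np.array([3.0, 2.0]), theta))   # <w, x> = 5 >= 4  -> prints 1
print(perceptron_predict(w, np.array([1.0, 1.0]), theta))   # <w, x> = 2 <  4  -> prints -1
```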

Perceptron - Geometric Idea

Prediction Rule
   ⟨w, x⟩ ≥ θ =⇒ predict 1.
   ⟨w, x⟩ < θ =⇒ predict −1.

Geometric Idea: To find a separating hyperplane.

Perceptron - Geometric Idea

Hyperplane in R^d
   H = {x ∈ R^d : ∃ w ≠ 0 and b ∈ R s.t. ⟨w, x⟩ = b}.


Perceptron - Geometric Idea

Open Halfspaces associated with a hyperplane ⟨w, x⟩ = b
   Q1 = {x ∈ R^d : ⟨w, x⟩ > b} and Q2 = {x ∈ R^d : ⟨w, x⟩ < b}.

Perceptron - Geometric Idea

Closed Halfspaces associated with a hyperplane ⟨w, x⟩ = b
   S1 = {x ∈ R^d : ⟨w, x⟩ ≥ b} and S2 = {x ∈ R^d : ⟨w, x⟩ ≤ b}.
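To make the picture concrete, here is a small sketch that locates a point relative to a hyperplane ⟨w, x⟩ = b; the hyperplane and test points are arbitrary examples.

```python
import numpy as np

def locate(w: np.ndarray, b: float, x: np.ndarray) -> str:
    """Report whether x lies in Q1, in Q2, or on the hyperplane H itself."""
    s = np.dot(w, x)
    if s > b:
        return "Q1: <w, x> > b (also in the closed halfspace S1)"
    if s < b:
        return "Q2: <w, x> < b (also in the closed halfspace S2)"
    return "on H: <w, x> = b (in both S1 and S2)"

w, b = np.array([1.0, 1.0]), 4.0        # example hyperplane: x1 + x2 = 4
for x in [np.array([3.0, 2.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]:
    print(x, "->", locate(w, b, x))
```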

Perceptron - Geometric Idea

Prediction Rule
   ⟨w, x⟩ ≥ θ =⇒ predict 1.
   ⟨w, x⟩ < θ =⇒ predict −1.

Geometric Idea: To find a separating hyperplane (w, θ) such that samples with class labels 1 and −1 lie on alternate sides of the hyperplane.


Perceptron

Input is of the form x̃ = (x, 1) = (x_1, x_2, . . . , x_d, 1) ∈ R^{d+1}.

We associate weights w̃ = (w, −θ) = (w_1, w_2, . . . , w_d, −θ) ∈ R^{d+1} to the connections.

Prediction Rule:
   - ⟨w̃, x̃⟩ = ⟨w, x⟩ − θ ≥ 0 =⇒ predict 1.
   - ⟨w̃, x̃⟩ = ⟨w, x⟩ − θ < 0 =⇒ predict −1.
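A short sketch checking that the augmented form reproduces the earlier thresholded rule; the particular w and θ are again arbitrary illustrative values.

```python
import numpy as np

def predict_threshold(w, theta, x):
    # Original rule: compare <w, x> against theta
    return 1 if np.dot(w, x) >= theta else -1

def predict_augmented(w_tilde, x):
    # Augmented rule: append 1 to x and compare <w~, x~> against 0
    x_tilde = np.append(x, 1.0)              # x~ = (x, 1)
    return 1 if np.dot(w_tilde, x_tilde) >= 0 else -1

w, theta = np.array([1.0, 1.0]), 4.0
w_tilde = np.append(w, -theta)                # w~ = (w, -theta)

for x in [np.array([3.0, 2.0]), np.array([1.0, 1.0]), np.array([4.0, 0.0])]:
    assert predict_threshold(w, theta, x) == predict_augmented(w_tilde, x)
print("augmented form agrees with the thresholded form on these points")
```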

Perceptron - Data Perspective

Input: data point x = (x_1, x_2, . . . , x_d), label y ∈ {+1, −1}.



Perceptron - Training

Perceptron Training Procedure

1: w̃^1 = 0
2: for t ← 1, 2, 3, . . . do
3:    receive (x^t, y^t), x^t ∈ R^d, y^t ∈ {+1, −1}.
4:    Transform x^t into x̃^t = (x^t, 1) ∈ R^{d+1}.
5:    ŷ = Perceptron(x̃^t; w̃^t)
6:    if ŷ ≠ y^t then
7:       w̃^{t+1} = w̃^t + y^t x̃^t
8:    else
9:       w̃^{t+1} = w̃^t
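A compact Python sketch of this procedure. The slides describe an endless stream of (x^t, y^t); the version below instead sweeps repeatedly over a fixed dataset and caps the number of epochs, which is my own adaptation, but the mistake-driven update (step 7 above) is unchanged.

```python
import numpy as np

def perceptron_train(X: np.ndarray, y: np.ndarray, max_epochs: int = 100) -> np.ndarray:
    """Mistake-driven perceptron training: on an error, w~ <- w~ + y_t * x~_t."""
    n, d = X.shape
    w_tilde = np.zeros(d + 1)                       # step 1: w~_1 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):                  # receive (x_t, y_t)
            x_tilde = np.append(x_t, 1.0)           # step 4: x~_t = (x_t, 1)
            y_hat = 1 if np.dot(w_tilde, x_tilde) >= 0 else -1   # step 5
            if y_hat != y_t:                        # step 6: mistake
                w_tilde = w_tilde + y_t * x_tilde   # step 7: update
                mistakes += 1
        if mistakes == 0:                           # no mistakes in a full sweep: done
            break
    return w_tilde
```

For example, with the toy (X, y) from the earlier illustrative snippet, perceptron_train(X, y) stops after a few sweeps and returns a w̃ = (w, −θ) that classifies all four points correctly.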


Perceptron Update Rule - Geometric Idea

The current weights are w̃^t = (1, 1, −4).

Suppose a new point x^t = (2, 1), y^t = 1 arrives.

With the current weights, the perceptron outputs ŷ^t = −1 for the point x^t = (2, 1).

An error occurs and w̃^t gets updated to w̃^{t+1} = w̃^t + y^t x̃^t.

After the update, the new weights become w̃^{t+1} = (3, 2, −3).
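The arithmetic of this particular update can be checked in a couple of lines:

```python
import numpy as np

w_tilde_t = np.array([1.0, 1.0, -4.0])        # current weights w~_t = (1, 1, -4)
x_t, y_t = np.array([2.0, 1.0]), 1            # new point x_t = (2, 1) with label y_t = 1

x_tilde_t = np.append(x_t, 1.0)               # x~_t = (2, 1, 1)
score = np.dot(w_tilde_t, x_tilde_t)          # 2 + 1 - 4 = -1 < 0, so the perceptron predicts -1
w_tilde_next = w_tilde_t + y_t * x_tilde_t    # mistake, so update: (1,1,-4) + (2,1,1) = (3, 2, -3)

print(score)          # -1.0
print(w_tilde_next)   # [ 3.  2. -3.]
```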


Convergence of Perceptron Training


Perceptron Convergence - Geometric Intuition

What are some natural assumptions to expect the perceptron training to converge?

Let us first motivate such assumptions through geometric intuition.


Can the data be separated by a hyperplane?

First assumption: At least the data should be such that the samples with label 1 can be separated by a hyperplane from the samples with label −1.

Is this assumption sufficient?

[A sequence of figure-only slides titled "Perceptron Convergence - Geometric Intuition" follows here in the original deck; the figures are not reproduced in this text.]
Perceptron Convergence - Geometric Intuition

Refined assumption: We not only want the data to be separated, but the separation should be good enough!

Perceptron Convergence - Separability Assumption

Linear Separability Assumption

Let D = {(x^t, y^t)}_{t=1}^∞ denote the training data, where x^t ∈ R^d, y^t ∈ {+1, −1}, ∀t = 1, 2, . . .. Then there exist w* ∈ R^d with w* ≠ 0 and γ > 0, such that:

   ⟨w*, x^t⟩ > γ    where y^t = 1,
   ⟨w*, x^t⟩ < −γ   where y^t = −1.

Equivalently, the assumption can be stated compactly as:

   y^t ⟨w*, x^t⟩ > γ,   ∀t = 1, 2, . . .
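For a finite dataset, this assumption can be checked numerically against a candidate w*: the condition holds precisely when min_t y^t ⟨w*, x^t⟩ is positive, and any 0 < γ below that minimum works. The data and w* below are made-up examples.

```python
import numpy as np

def worst_margin(X: np.ndarray, y: np.ndarray, w_star: np.ndarray) -> float:
    """Return min_t y_t <w*, x_t>; separability w.r.t. w* needs this to be positive,
    and then any 0 < gamma < worst_margin satisfies y_t <w*, x_t> > gamma for all t."""
    return float(np.min(y * (X @ w_star)))

# Made-up separable example (not from the slides)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w_star = np.array([1.0, 1.0])
print(worst_margin(X, y, w_star))   # 3.0 here, so e.g. gamma = 1 satisfies the assumption
```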


Perceptron Convergence - Mistake Bound

We will try to derive useful bounds on the number of mistakes that a perceptron can commit during its training.

Assumption on data: Linear Separability.

Assume that T rounds of perceptron training have been completed, where T is some large number.

Assume that M mistakes are made by the perceptron in these T rounds. (Obviously, M ≤ T.)

We ask if the number of mistakes M can be bounded by some suitable quantity.
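As a small empirical illustration of what this bound is about (entirely my own construction, not from the slides): stream linearly separable points through the perceptron update, count the mistakes M, and compare M to T.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linearly separable stream: labels come from a fixed w_star, and points too
# close to the separating hyperplane are discarded to enforce a margin.
w_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(2000, 3))
X = X[np.abs(X @ w_star) > 0.5]
y = np.where(X @ w_star > 0, 1, -1)

w = np.zeros(3)        # perceptron weights (no bias term, to match the assumption above)
M = 0                  # mistake counter
for x_t, y_t in zip(X, y):
    y_hat = 1 if np.dot(w, x_t) >= 0 else -1
    if y_hat != y_t:
        w = w + y_t * x_t
        M += 1

print(f"T = {len(X)} rounds, M = {M} mistakes")
```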

Perceptron Convergence - Mistake Bound

We begin the analysis by considering an arbitrary round t ∈ {1, 2, . . . , T} where a mistake is made by the perceptron.

Recall: During this t-th round:
   - Perceptron computes ŷ^t = sign(⟨w^t, x^t⟩).
   - ŷ^t ≠ y^t.
   - Perceptron update: w^{t+1} = w^t + y^t x^t.

Now from the linear separability assumption, we have w* ≠ 0 such that y^t ⟨w*, x^t⟩ > γ.

First step: To bound the difference ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩.

Now we can write

   ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ = ⟨w*, w^t + y^t x^t⟩ − ⟨w*, w^t⟩
                             = ⟨w*, w^t⟩ + ⟨w*, y^t x^t⟩ − ⟨w*, w^t⟩   (how?)
                             = ⟨w*, y^t x^t⟩
                             = y^t ⟨w*, x^t⟩   (how?)
                             > γ.

Hence, for any round t where a mistake is made,

   ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ > γ.
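For reference, the two "(how?)" steps above are just the linearity of the inner product in its second argument; a brief sketch:

```latex
% Additivity: used in the second line of the chain above.
\langle w^*,\, w^t + y^t x^t \rangle \;=\; \langle w^*, w^t \rangle + \langle w^*, y^t x^t \rangle.

% Homogeneity: y^t is a scalar, so it can be pulled out of the inner product.
\langle w^*,\, y^t x^t \rangle \;=\; y^t \,\langle w^*, x^t \rangle.
```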


Now when no mistake is made in round t, we have w^{t+1} = w^t. Hence ⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩ = 0.


Recall our assumptions:

   - T rounds of perceptron training have been completed, where T is some large number.
   - M mistakes are made by the perceptron in these T rounds. (Obviously, M ≤ T.)


Now sum the per-round differences over all T rounds and split the sum according to whether a mistake was made:

   Σ_{t=1}^{T} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩)
      = Σ_{t ∈ {1,...,T}: mistake at round t} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩)
        + Σ_{t ∈ {1,...,T}: no mistake at round t} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩)
      = Σ_{t ∈ {1,...,T}: mistake at round t} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩) + 0   (how?)
      > Mγ   (how?)


Also note:

   Σ_{t=1}^{T} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩) = ⟨w*, w^{T+1}⟩   (homework!)


Hence we have:

   Σ_{t=1}^{T} (⟨w*, w^{t+1}⟩ − ⟨w*, w^t⟩) > Mγ
   =⇒ ⟨w*, w^{T+1}⟩ > Mγ.
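For the "homework" identity used in the last step, a sketch (assuming the weights are initialized to zero, as in the training procedure): the sum telescopes, and only the first and last terms survive.

```latex
\sum_{t=1}^{T}\bigl(\langle w^*, w^{t+1}\rangle - \langle w^*, w^{t}\rangle\bigr)
  = \langle w^*, w^{T+1}\rangle - \langle w^*, w^{1}\rangle
  = \langle w^*, w^{T+1}\rangle
  \qquad \text{(since } w^{1} = 0 \text{).}
```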
