IE 643 Lecture 9 (September 15, 2020) - Moodle

Deep Learning - Theory and Practice

IE 643
Lecture 9

September 15, 2020.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 1 / 75


Outline

1 Recap
MLP for prediction tasks

2 MLP for multi-class classification


KL-Divergence
Cross-entropy
Training MLP for multi-class classification

3 MLP for multi-label classification

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 2 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

Aim of training MLP: To learn a parametrized map h_w : X → Y such that for the training data D, we have y^i = h_w(x^i), ∀i ∈ {1, . . . , S}.

Aim of using the trained MLP model: For an unseen sample x̂ ∈ X, predict ŷ = h_w(x̂) = MLP(x̂; w).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 3 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Methodology for training MLP


Design a suitable loss (or error) function e : Y × Y → [0, +∞) to compare the actual label y^i and the prediction ŷ^i made by the MLP using e(y^i, ŷ^i), ∀i ∈ {1, . . . , S}.
Usually the error is parametrized by the weights w of the MLP and is denoted by e(ŷ^i, y^i; w).
Use Gradient descent/SGD/mini-batch SGD to minimize the total error:

E = Σ_{i=1}^S e(ŷ^i, y^i; w) =: Σ_{i=1}^S e^i(w).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 4 / 75


Recap MLP for prediction tasks

Stochastic Gradient Descent for training MLP


SGD Algorithm to train MLP
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i;
MLP architecture, max epochs K, learning rates γ_k, ∀k ∈ {1, . . . , K}.

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . . , K
  - Choose a sample j_k ∈ {1, . . . , S}.
  - Find ŷ^{j_k} = MLP(x^{j_k}; w^k). (forward pass)
  - Compute error e^{j_k}(w^k).
  - Compute error gradient ∇_w e^{j_k}(w^k) using backpropagation.
  - Update: w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

Output: w^∗ = w^{K+1}.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 5 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall forward pass: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the prediction ŷ is computed using forward pass as:

ŷ = MLP(x; w) = φ(W^L φ(W^{L−1} . . . φ(W^1 x) . . .)).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 6 / 75
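As an illustration (not part of the slides), the forward pass above can be written as a short loop; the NumPy sketch below assumes tanh activations and layer sizes chosen purely for demonstration.

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 6, 3]                 # d = 4 inputs, two hidden layers, 3 outputs (illustrative)
weights = [0.1 * rng.normal(size=(m, n))   # W^1, ..., W^L with compatible shapes
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def mlp_forward(x, weights, phi=np.tanh):
    # computes yhat = phi(W^L phi(W^{L-1} ... phi(W^1 x) ...))
    a = x
    for W in weights:
        a = phi(W @ a)
    return a

x = rng.normal(size=layer_sizes[0])
print(mlp_forward(x, weights))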


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 7 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T

where Diag(φ'^ℓ) is the diagonal matrix with diagonal entries φ'(z_1^ℓ), . . . , φ'(z_{N_ℓ}^ℓ), δ^ℓ = (∂e/∂a_1^ℓ, . . . , ∂e/∂a_{N_ℓ}^ℓ)^T, and a^{ℓ−1} = (a_1^{ℓ−1}, . . . , a_{N_{ℓ−1}}^{ℓ−1})^T.
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 8 / 75
Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L (a^{ℓ−1})^T

where V^{ℓ+1} = (W^{ℓ+1})^T Diag(φ'^{ℓ+1}).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 9 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Task considered so far: Y = {+1, −1}.

Corresponds to two-class (or binary) classification.

Usually a single neuron at the last (L-th) layer of MLP, with logistic sigmoid function σ : R → (0, 1), σ(z) = 1/(1 + e^{−z}) for z ∈ R.

Prediction: MLP(x̂; w) = σ(W^L a^{L−1}), followed by a thresholding function.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 10 / 75
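A tiny sketch of this last step (assuming NumPy, an illustrative weight vector, and a threshold of 0.5 for mapping the sigmoid output to {+1, −1}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_L = np.array([[0.7, -1.2, 0.4]])      # single output neuron (illustrative weights)
a_prev = np.array([0.3, -0.5, 0.9])     # activations a^{L-1} from the previous layer
score = sigmoid(W_L @ a_prev)[0]        # sigma(W^L a^{L-1}) lies in (0, 1)
label = +1 if score >= 0.5 else -1      # thresholding function
print(score, label)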


MLP for multi-class classification

MLP for multi-class classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: Y = {1, . . . , C }, C ≥ 2.

Corresponds to multi-class classification.

Question 1: What is a suitable architecture for the MLP’s last (or output) layer?
Question 2: What is a suitable loss (or error) function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 11 / 75


MLP for multi-class classification

MLP for multi-class classification

Question 1: Can the same MLP architecture with single output neuron used in binary
classification be used for multi-class classification?
Question 2: Can the same logistic sigmoidal activation function for the output neuron used in
binary classification be used for multi-class classification?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 12 / 75


MLP for multi-class classification

MLP for multi-class classification


We will use the following approach for multi-class classification:
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: Y = {1, . . . , C}, C ≥ 2, corresponds to multi-class classification.

Transform y = c to y^onehotenc = (0, . . . , 0, 1, 0, . . . , 0)^T.

Note: y^onehotenc ∈ {0, 1}^C corresponding to y = c ∈ Y has a 1 at the c-th coordinate, and other entries as zeros.

y^onehotenc is called the one-hot encoding of y.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 13 / 75


MLP for multi-class classification

MLP for multi-class classification


We will use the following approach for multi-class classification:
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: Y = {1, . . . , C}, C ≥ 2, corresponds to multi-class classification.

Transform y = c to y^onehotenc = (0, . . . , 0, 1, 0, . . . , 0)^T.

Note: y^onehotenc ∈ {0, 1}^C corresponding to y = c ∈ Y has a 1 at the c-th coordinate, and all other entries zero.

y^onehotenc is called the one-hot encoding of y.

y^onehotenc for y = c corresponds to a discrete probability distribution with its entire mass concentrated at the c-th coordinate.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 14 / 75
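A short sketch of this encoding (assuming NumPy and labels numbered 1, . . . , C as on the slide):

import numpy as np

def one_hot(c, C):
    # y = c  ->  y^onehotenc in {0, 1}^C with a 1 at the c-th coordinate
    v = np.zeros(C)
    v[c - 1] = 1.0          # labels start at 1 on the slide, array indices at 0
    return v

print(one_hot(3, 5))        # [0. 0. 1. 0. 0.]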


MLP for multi-class classification

MLP for multi-class classification: Discrete Probability Distribution

Recall: A discrete probability distribution is represented by a set of real values p_1, p_2, . . . , p_m (we assume m to be finite) with the following properties:

p_i ≥ 0, ∀i ∈ {1, . . . , m},
Σ_{i=1}^m p_i = 1.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 15 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 16 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 17 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 18 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

Loss function design becomes possibly simpler?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 19 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

What change can be made to the network architecture so that the


MLP outputs a discrete probability distribution?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 20 / 75


MLP for multi-class classification

MLP for multi-class classification


What change can be made to the network architecture so that the
MLP outputs a discrete probability distribution?
Step 1: Since Y = {1, . . . , C }, output layer of MLP to contain C neurons.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 21 / 75


MLP for multi-class classification

MLP for multi-class classification


What change can be made to the network architecture so that the
MLP outputs a discrete probability distribution?
Step 1: Since Y = {1, . . . , C }, output layer of MLP to contain C neurons.

However activation functions of output neurons might be arbitrary!

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 22 / 75


MLP for multi-class classification

MLP for multi-class classification


What change can be made to the network architecture so that the
MLP outputs a discrete probability distribution?
Step 1: Since Y = {1, . . . , C }, output layer of MLP to contain C neurons.

However activation functions of output neurons might be arbitrary!


How do we get probabilities as outputs?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 23 / 75


MLP for multi-class classification

MLP for multi-class classification

How do we get probabilities as outputs?


Step 2: Perform a soft-max function over the outputs from output layer so
that the outputs are transformed into probabilities.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 24 / 75


MLP for multi-class classification

MLP for multi-class classification

How do we get probabilities as outputs?


Step 2: Perform a soft-max function over the outputs from output layer so
that the outputs are transformed into probabilities.

What is a soft-max function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 25 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 26 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 27 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

Perform the following transformation:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 28 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

Perform the following transformation:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

q_1, . . . , q_C form a discrete probability distribution. (Verify this claim!)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 29 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

Perform the following transformation:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

q_1, . . . , q_C form a discrete probability distribution. (Verify this claim!)

The transformation used to obtain the probabilities q_j is called the soft-max function.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 30 / 75
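A short NumPy sketch of the soft-max transformation; subtracting the maximum activation before exponentiating is a standard numerical-stability trick and is an implementation choice, not part of the definition above:

import numpy as np

def softmax(a):
    # q_j = exp(a_j^L) / sum_r exp(a_r^L)
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

a = np.array([2.0, -1.0, 0.5, 3.0])     # arbitrary output-layer activations a^L
q = softmax(a)
print(q, q.sum())                       # entries are positive and sum to 1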


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

Loss function design becomes possibly simpler?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 31 / 75


MLP for multi-class classification

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 32 / 75


MLP for multi-class classification

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 33 / 75


MLP for multi-class classification

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 34 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

Note: The distribution p is usually called the true distribution and


the distribution q is called the predicted distribution.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 35 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

Note: The distribution p is usually called the true distribution and the distribution q is called the predicted distribution.

Also note: q_j > 0, j = 1, . . . , C, which demands the predicted distribution to have non-zero probabilities!
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 36 / 75
MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).

Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

Note: The distribution p is usually called the true distribution and the distribution q is called the predicted distribution.

Does the soft-max function give predictions q_j > 0, j = 1, . . . , C?


P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 37 / 75
MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 38 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.
Recall: A distance function d : X × X → [0, ∞) has the following
properties:
  - d(x, x) = 0, ∀x ∈ X (identity of indistinguishables)
  - d(x, y) = d(y, x), ∀x, y ∈ X (symmetry)
  - d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ X (triangle inequality)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 39 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.
Recall: A distance function d : X × X → [0, ∞) has the following
properties:
  - d(x, x) = 0, ∀x ∈ X (identity of indistinguishables)
  - d(x, y) = d(y, x), ∀x, y ∈ X (symmetry)
  - d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ X (triangle inequality)
KL-divergence does not obey the symmetry property.
  - Simple example: compute KL(p||q) and KL(q||p) for p = (1/4, 3/4) and q = (1/2, 1/2) (see the numerical check below).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 40 / 75
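A quick numerical check of this example (assuming natural logarithms and NumPy; the formula below requires strictly positive entries in both distributions):

import numpy as np

def kl(p, q):
    # KL(p||q) = sum_j p_j log(p_j / q_j)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.25, 0.75])
q = np.array([0.50, 0.50])
print(kl(p, q))    # ~0.1308
print(kl(q, p))    # ~0.1438, so KL(p||q) != KL(q||p)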


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.
Recall: A distance function d : X × X → [0, ∞) has the following properties:
  - d(x, x) = 0, ∀x ∈ X (identity of indistinguishables)
  - d(x, y) = d(y, x), ∀x, y ∈ X (symmetry)
  - d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ X (triangle inequality)
Does KL-divergence obey the triangle inequality?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 41 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


For two discrete probability distributions p = (p_1, p_2, . . . , p_C) and q = (q_1, q_2, . . . , q_C), with q_j > 0, ∀j = 1, . . . , C, we have KL(p||q) ≥ 0.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 42 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

KL-Divergence: Equivalent Representation


Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q can be written as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 43 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

KL-Divergence: Equivalent Representation


Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q can be written as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j.

Note: Σ_{j=1}^C p_j log p_j is called the negative entropy associated with distribution p (denoted by NE(p)) and −Σ_{j=1}^C p_j log q_j is called the cross-entropy between p and q (denoted by CE(p, q)).

Hence KL(p||q) = NE(p) + CE(p, q).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 44 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

Loss function design becomes possibly simpler?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 45 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Loss function using KL-Divergence:
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j = NE(p) + CE(p, q).

Our aim is to minimize the error measured using KL-divergence


between the true distribution p and the predicted distribution q.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 46 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Loss function using KL-Divergence:
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j = NE(p) + CE(p, q).

Our aim is to minimize the error measured using KL-divergence


between the true distribution p and the predicted distribution q.
However, minimizing KL-divergence between p and q is equivalent to
minimizing cross-entropy between p and q. (why?)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 47 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Loss function using KL-Divergence:
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j = NE(p) + CE(p, q).

Our aim is to minimize the error measured using KL-divergence


between the true distribution p and the predicted distribution q.
However, minimizing KL-divergence between p and q is equivalent to
minimizing cross-entropy between p and q. (why?)
Hence we will consider the cross-entropy between p and q as our loss
function.
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 48 / 75
MLP for multi-class classification Cross-entropy

MLP for multi-class classification

Cross-entropy loss function:


Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the cross-entropy between p and q is defined as:

CE(p, q) = −Σ_{j=1}^C p_j log q_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 49 / 75
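As a small aside (not on the slide), when p is a one-hot encoding with its 1 at coordinate c, the sum collapses to CE(p, q) = −log q_c; a NumPy check with illustrative values:

import numpy as np

def cross_entropy(p, q):
    # CE(p, q) = -sum_j p_j log q_j  (requires q_j > 0)
    return float(-np.sum(p * np.log(q)))

q = np.array([0.1, 0.2, 0.6, 0.1])          # predicted distribution with q_j > 0
p = np.array([0.0, 0.0, 1.0, 0.0])          # one-hot true distribution, c = 3
print(cross_entropy(p, q), -np.log(q[2]))   # both ~0.5108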


MLP for multi-class classification Cross-entropy

MLP for multi-class classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Ỹ, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

Without loss of generality, assume Ỹ = {0, 1}^C corresponding to the output space Y = {1, . . . , C}, C ≥ 2, and that the y^i are one-hot encoded vectors.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 50 / 75


MLP for multi-class classification Training MLP for multi-class classification

MLP for multi-class classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Ỹ, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

Aim of training MLP: To learn a parametrized map h_w : X → Ỹ such that for the training data D, we have y^i = h_w(x^i), ∀i ∈ {1, . . . , S}.

Aim of using the trained MLP model: For an unseen sample x̂ ∈ X, predict ŷ = h_w(x̂) = MLP(x̂; w); to find the integer label, use argmax_j ŷ_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 51 / 75


MLP for multi-class classification Training MLP for multi-class classification

MLP for multi-class classification

Methodology for training MLP


Design a suitable loss (or error) function e : Ỹ × Ỹ → [0, +∞) to compare the actual label y^i and the prediction ŷ^i made by the MLP using e(y^i, ŷ^i), ∀i ∈ {1, . . . , S}.
Recall: e(y^i, ŷ^i) = CE(y^i, ŷ^i), the cross-entropy loss function.
Usually the error is parametrized by the weights w of the MLP and is denoted by e(ŷ^i, y^i; w).
Use Gradient descent/SGD/mini-batch SGD to minimize the total error:

E = Σ_{i=1}^S e(ŷ^i, y^i; w) =: Σ_{i=1}^S e^i(w).
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 52 / 75
MLP for multi-class classification Training MLP for multi-class classification

SGD for training MLP for multi-class classification


SGD Algorithm to train MLP
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Ỹ, ∀i;
MLP architecture, max epochs K, learning rates γ_k, ∀k ∈ {1, . . . , K}.

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . . , K
  - Choose a sample j_k ∈ {1, . . . , S}.
  - Find ŷ^{j_k} = MLP(x^{j_k}; w^k). (forward pass)
  - Compute error e^{j_k}(w^k) using the cross-entropy loss function.
  - Compute error gradient ∇_w e^{j_k}(w^k) using backpropagation.
  - Update: w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

Output: w^∗ = w^{K+1}.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 53 / 75
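As a concrete illustration (not from the slides), below is a minimal NumPy sketch of this SGD loop for the simplest such network: a single linear output layer followed by soft-max, trained with the cross-entropy loss on synthetic data. It uses the gradient ∂e/∂a^L = ŷ − y that is derived later in this lecture; the data, the 0-based class labels, and the hyper-parameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
S, d, C = 300, 3, 3                              # samples, input dimension, classes (illustrative)
y_int = rng.integers(0, C, size=S)               # integer labels 0, ..., C-1
x = rng.normal(size=(S, d)) + 3.0 * np.eye(C, d)[y_int]   # class c clustered around 3 e_c
Y = np.eye(C)[y_int]                             # one-hot encoded labels

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

W = np.zeros((C, d))                             # weights of the output layer
gamma = 0.1                                      # learning rate gamma_k (kept constant here)
for k in range(2000):                            # SGD iterations
    j = rng.integers(S)                          # choose a sample j_k
    yhat = softmax(W @ x[j])                     # forward pass + soft-max
    e = -np.sum(Y[j] * np.log(yhat + 1e-12))     # cross-entropy error e^{j_k}(w^k)
    if k % 500 == 0:
        print(f"iteration {k}: cross-entropy on sample {j} = {e:.3f}")
    delta = yhat - Y[j]                          # d e / d a^L = yhat - y (derived later)
    W -= gamma * np.outer(delta, x[j])           # update with grad_W e = delta x^T

pred = np.argmax(x @ W.T, axis=1)
print("training accuracy:", np.mean(pred == y_int))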


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Recall forward pass: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the prediction ŷ is computed using forward pass as:

ŷ = MLP(x; w) = softmax(φ(W^L φ(W^{L−1} . . . φ(W^1 x) . . .))).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 54 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), how do we compute the error gradient?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 55 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 56 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.
For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T

where Diag(φ'^ℓ) is the diagonal matrix with diagonal entries φ'(z_1^ℓ), . . . , φ'(z_{N_ℓ}^ℓ), δ^ℓ = (∂e/∂a_1^ℓ, . . . , ∂e/∂a_{N_ℓ}^ℓ)^T, and a^{ℓ−1} = (a_1^{ℓ−1}, . . . , a_{N_{ℓ−1}}^{ℓ−1})^T.
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 57 / 75
MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.
For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L (a^{ℓ−1})^T

where V^{ℓ+1} = (W^{ℓ+1})^T Diag(φ'^{ℓ+1}).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 58 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.
For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L (a^{ℓ−1})^T

The main difference arises in the procedure to find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 59 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 60 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 61 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

ŷ = (ŷ_1, . . . , ŷ_C)^T.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 62 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

ŷ = (ŷ_1, . . . , ŷ_C)^T.

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 63 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

ŷ = (ŷ_1, . . . , ŷ_C)^T.

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j.

Note: ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 64 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 65 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

Aim: To find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 66 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

Aim: To find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L), ∀j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 67 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 68 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

∂e/∂ŷ_m = −y_m/ŷ_m.

∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ].

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 69 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

∂e/∂ŷ_m = −y_m/ŷ_m.

∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ] = ŷ_j (1 − ŷ_j) if j = m, and −ŷ_m ŷ_j otherwise. (Homework!)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 70 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

∂e/∂ŷ_m = −y_m/ŷ_m.

∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ] = ŷ_j (1 − ŷ_j) if j = m, and −ŷ_m ŷ_j otherwise. (Homework!)

Hence by substitution, ∂e/∂a_j^L = Σ_{m=1, m≠j}^C y_m ŷ_j − y_j (1 − ŷ_j) = ŷ_j − y_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 71 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: We wanted to find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

We have ∂e/∂a_j^L = ŷ_j − y_j.

Hence δ^L = (ŷ_1 − y_1, . . . , ŷ_C − y_C)^T = ŷ − y.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 72 / 75
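A small numerical sanity check (not part of the slides) that this analytic gradient δ^L = ŷ − y of the cross-entropy-of-soft-max error matches a finite-difference approximation; the values below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(1)
C = 5
a = rng.normal(size=C)                    # arbitrary output-layer activations a^L
y = np.eye(C)[2]                          # one-hot target (class c = 3, i.e. index 2)

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def error(a):
    # cross-entropy e(yhat, y) with yhat = softmax(a)
    return -np.sum(y * np.log(softmax(a)))

analytic = softmax(a) - y                 # delta^L = yhat - y from the slide
eps = 1e-6
numeric = np.array([(error(a + eps * np.eye(C)[j]) - error(a - eps * np.eye(C)[j])) / (2 * eps)
                    for j in range(C)])
print(np.max(np.abs(analytic - numeric))) # should be tiny (~1e-9)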


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Since the backpropagation for multi-class classification is essentially similar


to that of binary classification, it suffers from
the exploding gradient problem, and
the vanishing gradient problem.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 73 / 75


MLP for multi-label classification

MLP for multi-label classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: y^i ⊆ Y = {1, . . . , C}, C ≥ 2.

Corresponds to multi-label classification.

Question 1: What is a suitable architecture for the MLP’s last (or


output) layer?
Question 2: What is a suitable loss (or error) function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 74 / 75


MLP for multi-label classification

MLP for multi-label classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: y^i ⊆ Y = {1, . . . , C}, C ≥ 2.

Corresponds to multi-label classification.

Question 1: What is a suitable architecture for the MLP’s last (or


output) layer?
Question 2: What is a suitable loss (or error) function?
(Homework!)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 75 / 75
