IE 643 Lecture 9 (September 15, 2020) - Moodle

Deep Learning - Theory and Practice

IE 643
Lecture 9

September 15, 2020.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 1 / 75


Outline

1 Recap
MLP for prediction tasks

2 MLP for multi-class classification


KL-Divergence
Cross-entropy
Training MLP for multi-class classification

3 MLP for multi-label classification

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 2 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

Aim of training MLP: To learn a parametrized map h_w : X → Y such that for the training data D, we have y^i = h_w(x^i), ∀i ∈ {1, . . . , S}.

Aim of using the trained MLP model: For an unseen sample x̂ ∈ X, predict ŷ = h_w(x̂) = MLP(x̂; w).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 3 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Methodology for training MLP


Design a suitable loss (or error) function e : Y × Y → [0, +∞) to compare the actual label y^i and the prediction ŷ^i made by the MLP using e(y^i, ŷ^i), ∀i ∈ {1, . . . , S}.
Usually the error is parametrized by the weights w of the MLP and is denoted by e(ŷ^i, y^i; w).
Use Gradient descent/SGD/mini-batch SGD to minimize the total error:

E = Σ_{i=1}^S e(ŷ^i, y^i; w) =: Σ_{i=1}^S e^i(w).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 4 / 75


Recap MLP for prediction tasks

Stochastic Gradient Descent for training MLP


SGD Algorithm to train MLP
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i;
MLP architecture, max epochs K, learning rates γ_k, ∀k ∈ {1, . . . , K}.

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . . , K
  - Choose a sample j_k ∈ {1, . . . , S}.
  - Find ŷ^{j_k} = MLP(x^{j_k}; w^k). (forward pass)
  - Compute error e^{j_k}(w^k).
  - Compute error gradient ∇_w e^{j_k}(w^k) using backpropagation.
  - Update: w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

Output: w^∗ = w^{K+1}.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 5 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall forward pass: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the prediction ŷ is computed using forward pass as:

ŷ = MLP(x; w) = φ(W^L φ(W^{L−1} . . . φ(W^1 x) . . .)).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 6 / 75
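As an illustration (not part of the slides), the forward pass above can be written as a short loop; the NumPy sketch below assumes tanh activations and layer sizes chosen purely for demonstration.

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 6, 3]                 # d = 4 inputs, two hidden layers, 3 outputs (illustrative)
weights = [0.1 * rng.normal(size=(m, n))   # W^1, ..., W^L with compatible shapes
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def mlp_forward(x, weights, phi=np.tanh):
    # computes yhat = phi(W^L phi(W^{L-1} ... phi(W^1 x) ...))
    a = x
    for W in weights:
        a = phi(W @ a)
    return a

x = rng.normal(size=layer_sizes[0])
print(mlp_forward(x, weights))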


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 7 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T

where Diag(φ'^ℓ) is the diagonal matrix with diagonal entries φ'(z_1^ℓ), . . . , φ'(z_{N_ℓ}^ℓ), δ^ℓ = (∂e/∂a_1^ℓ, . . . , ∂e/∂a_{N_ℓ}^ℓ)^T, and a^{ℓ−1} = (a_1^{ℓ−1}, . . . , a_{N_{ℓ−1}}^{ℓ−1})^T.
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 8 / 75
Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Recall backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L (a^{ℓ−1})^T

where V^{ℓ+1} = (W^{ℓ+1})^T Diag(φ'^{ℓ+1}).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 9 / 75


Recap MLP for prediction tasks

Multi Layer Perceptron for Prediction Tasks

Task considered so far: Y = {+1, −1}.

Corresponds to two-class (or binary) classification.

Usually a single neuron at the last (L-th) layer of MLP, with logistic sigmoid function σ : R → (0, 1), σ(z) = 1/(1 + e^{−z}) for z ∈ R.

Prediction: MLP(x̂; w) = σ(W^L a^{L−1}), followed by a thresholding function.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 10 / 75
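A tiny sketch of this last step (assuming NumPy, an illustrative weight vector, and a threshold of 0.5 for mapping the sigmoid output to {+1, −1}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_L = np.array([[0.7, -1.2, 0.4]])      # single output neuron (illustrative weights)
a_prev = np.array([0.3, -0.5, 0.9])     # activations a^{L-1} from the previous layer
score = sigmoid(W_L @ a_prev)[0]        # sigma(W^L a^{L-1}) lies in (0, 1)
label = +1 if score >= 0.5 else -1      # thresholding function
print(score, label)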


MLP for multi-class classification

MLP for multi-class classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: Y = {1, . . . , C }, C ≥ 2.

Corresponds to multi-class classification.

Question 1: What is a suitable architecture for the MLP’s last (or output) layer?
Question 2: What is a suitable loss (or error) function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 11 / 75


MLP for multi-class classification

MLP for multi-class classification

Question 1: Can the same MLP architecture with single output neuron used in binary
classification be used for multi-class classification?
Question 2: Can the same logistic sigmoidal activation function for the output neuron used in
binary classification be used for multi-class classification?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 12 / 75


MLP for multi-class classification

MLP for multi-class classification


We will use the following approach for multi-class classification:
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: Y = {1, . . . , C}, C ≥ 2, corresponds to multi-class classification.

Transform y = c to y^onehotenc = (0, . . . , 0, 1, 0, . . . , 0)^T.

Note: y^onehotenc ∈ {0, 1}^C corresponding to y = c ∈ Y has a 1 at the c-th coordinate, and other entries as zeros.

y^onehotenc is called the one-hot encoding of y.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 13 / 75


MLP for multi-class classification

MLP for multi-class classification


We will use the following approach for multi-class classification:
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Y, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: Y = {1, . . . , C}, C ≥ 2, corresponds to multi-class classification.

Transform y = c to y^onehotenc = (0, . . . , 0, 1, 0, . . . , 0)^T.

Note: y^onehotenc ∈ {0, 1}^C corresponding to y = c ∈ Y has a 1 at the c-th coordinate, and all other entries zero.

y^onehotenc is called the one-hot encoding of y.

y^onehotenc for y = c corresponds to a discrete probability distribution with its entire mass concentrated at the c-th coordinate.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 14 / 75
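A short sketch of this encoding (assuming NumPy and labels numbered 1, . . . , C as on the slide):

import numpy as np

def one_hot(c, C):
    # y = c  ->  y^onehotenc in {0, 1}^C with a 1 at the c-th coordinate
    v = np.zeros(C)
    v[c - 1] = 1.0          # labels start at 1 on the slide, array indices at 0
    return v

print(one_hot(3, 5))        # [0. 0. 1. 0. 0.]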


MLP for multi-class classification

MLP for multi-class classification: Discrete Probability Distribution

Recall: A discrete probability distribution is represented by a set of real values p_1, p_2, . . . , p_m (we assume m to be finite) with the following properties:

p_i ≥ 0, ∀i ∈ {1, . . . , m},
Σ_{i=1}^m p_i = 1.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 15 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 16 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 17 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 18 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

Loss function design becomes possibly simpler?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 19 / 75


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

What change can be made to the network architecture so that the


MLP outputs a discrete probability distribution?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 20 / 75


MLP for multi-class classification

MLP for multi-class classification


What change can be made to the network architecture so that the
MLP outputs a discrete probability distribution?
Step 1: Since Y = {1, . . . , C }, output layer of MLP to contain C neurons.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 21 / 75


MLP for multi-class classification

MLP for multi-class classification


What change can be made to the network architecture so that the
MLP outputs a discrete probability distribution?
Step 1: Since Y = {1, . . . , C }, output layer of MLP to contain C neurons.

However activation functions of output neurons might be arbitrary!

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 22 / 75


MLP for multi-class classification

MLP for multi-class classification


What change can be made to the network architecture so that the
MLP outputs a discrete probability distribution?
Step 1: Since Y = {1, . . . , C }, output layer of MLP to contain C neurons.

However activation functions of output neurons might be arbitrary!


How do we get probabilities as outputs?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 23 / 75


MLP for multi-class classification

MLP for multi-class classification

How do we get probabilities as outputs?


Step 2: Perform a soft-max function over the outputs from output layer so
that the outputs are transformed into probabilities.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 24 / 75


MLP for multi-class classification

MLP for multi-class classification

How do we get probabilities as outputs?


Step 2: Perform a soft-max function over the outputs from output layer so
that the outputs are transformed into probabilities.

What is a soft-max function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 25 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 26 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 27 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

Perform the following transformation:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 28 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

Perform the following transformation:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

q_1, . . . , q_C form a discrete probability distribution. (Verify this claim!)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 29 / 75


MLP for multi-class classification

MLP for multi-class classification

What is a soft-max function?


Given arbitrary activations a_1^L, a_2^L, . . . , a_C^L from an output layer (L-th layer), how do we get probabilities?

Perform the following transformation:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

q_1, . . . , q_C form a discrete probability distribution. (Verify this claim!)

The transformation used to obtain the probabilities q_j is called the soft-max function.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 30 / 75
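A short NumPy sketch of the soft-max transformation; subtracting the maximum activation before exponentiating is a standard numerical-stability trick and is an implementation choice, not part of the definition above:

import numpy as np

def softmax(a):
    # q_j = exp(a_j^L) / sum_r exp(a_r^L)
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

a = np.array([2.0, -1.0, 0.5, 3.0])     # arbitrary output-layer activations a^L
q = softmax(a)
print(q, q.sum())                       # entries are positive and sum to 1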


MLP for multi-class classification

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

Loss function design becomes possibly simpler?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 31 / 75


MLP for multi-class classification

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 32 / 75


MLP for multi-class classification

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 33 / 75


MLP for multi-class classification

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 34 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

Note: The distribution p is usually called the true distribution and


the distribution q is called the predicted distribution.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 35 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

Note: The distribution p is usually called the true distribution and the distribution q is called the predicted distribution.

Also note: q_j > 0, j = 1, . . . , C, which demands the predicted distribution to have non-zero probabilities!
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 36 / 75
MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Now that the MLP outputs a discrete probability distribution, how
do we compare the one-hot encoding and the output distribution?
We will use the popular divergence measure called Kullback-Leibler divergence (or KL-divergence).

Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j).

Note: The distribution p is usually called the true distribution and the distribution q is called the predicted distribution.

Does the soft-max function give predictions q_j > 0, j = 1, . . . , C?


P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 37 / 75
MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 38 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.
Recall: A distance function d : X × X → [0, ∞) has the following
properties:
  - d(x, x) = 0, ∀x ∈ X (identity of indistinguishables)
  - d(x, y) = d(y, x), ∀x, y ∈ X (symmetry)
  - d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ X (triangle inequality)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 39 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.
Recall: A distance function d : X × X → [0, ∞) has the following
properties:
  - d(x, x) = 0, ∀x ∈ X (identity of indistinguishables)
  - d(x, y) = d(y, x), ∀x, y ∈ X (symmetry)
  - d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ X (triangle inequality)
KL-divergence does not obey the symmetry property.
  - Simple example: compute KL(p||q) and KL(q||p) for p = (1/4, 3/4) and q = (1/2, 1/2) (see the numerical check below).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 40 / 75
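A quick numerical check of this example (assuming natural logarithms and NumPy; the formula below requires strictly positive entries in both distributions):

import numpy as np

def kl(p, q):
    # KL(p||q) = sum_j p_j log(p_j / q_j)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.25, 0.75])
q = np.array([0.50, 0.50])
print(kl(p, q))    # ~0.1308
print(kl(q, p))    # ~0.1438, so KL(p||q) != KL(q||p)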


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


KL-divergence is not a distance function.
Recall: A distance function d : X × X → [0, ∞) has the following properties:
  - d(x, x) = 0, ∀x ∈ X (identity of indistinguishables)
  - d(x, y) = d(y, x), ∀x, y ∈ X (symmetry)
  - d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ X (triangle inequality)
Does KL-divergence obey the triangle inequality?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 41 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

Some useful properties of KL-divergence:


For two discrete probability distributions p = (p_1, p_2, . . . , p_C) and q = (q_1, q_2, . . . , q_C), with q_j > 0, ∀j = 1, . . . , C, we have KL(p||q) ≥ 0.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 42 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

KL-Divergence: Equivalent Representation


Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q can be written as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 43 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification: KL Divergence

KL-Divergence: Equivalent Representation


Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q can be written as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j.

Note: Σ_{j=1}^C p_j log p_j is called the negative entropy associated with distribution p (denoted by NE(p)) and −Σ_{j=1}^C p_j log q_j is called the cross-entropy between p and q (denoted by CE(p, q)).

Hence KL(p||q) = NE(p) + CE(p, q).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 44 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification

Question:
Why should we transform y ∈ Y taking an integer value, to a
y onehotenc which represents a discrete probability distribution?
One possible answer:
If the prediction ŷ made by MLP is also a discrete probability
distribution, then the comparison between ŷ and y onehotenc becomes a
comparison between two discrete probability distributions.

Multiple ways to compare two probability distributions.

Loss function design becomes possibly simpler?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 45 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Loss function using KL-Divergence:
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j = NE(p) + CE(p, q).

Our aim is to minimize the error measured using KL-divergence


between the true distribution p and the predicted distribution q.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 46 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Loss function using KL-Divergence:
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j = NE(p) + CE(p, q).

Our aim is to minimize the error measured using KL-divergence


between the true distribution p and the predicted distribution q.
However, minimizing KL-divergence between p and q is equivalent to
minimizing cross-entropy between p and q. (why?)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 47 / 75


MLP for multi-class classification KL-Divergence

MLP for multi-class classification


Loss function using KL-Divergence:
Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the KL-divergence between p and q is defined as:

KL(p||q) = Σ_{j=1}^C p_j log(p_j / q_j) = Σ_{j=1}^C p_j log p_j − Σ_{j=1}^C p_j log q_j = NE(p) + CE(p, q).

Our aim is to minimize the error measured using KL-divergence


between the true distribution p and the predicted distribution q.
However, minimizing KL-divergence between p and q is equivalent to
minimizing cross-entropy between p and q. (why?)
Hence we will consider the cross-entropy between p and q as our loss
function.
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 48 / 75
MLP for multi-class classification Cross-entropy

MLP for multi-class classification

Cross-entropy loss function:


Given two discrete probability distributions p = (p_1, . . . , p_C) and q = (q_1, . . . , q_C), where q_j > 0 ∀j = 1, . . . , C, the cross-entropy between p and q is defined as:

CE(p, q) = −Σ_{j=1}^C p_j log q_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 49 / 75
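As a small aside (not on the slide), when p is a one-hot encoding with its 1 at coordinate c, the sum collapses to CE(p, q) = −log q_c; a NumPy check with illustrative values:

import numpy as np

def cross_entropy(p, q):
    # CE(p, q) = -sum_j p_j log q_j  (requires q_j > 0)
    return float(-np.sum(p * np.log(q)))

q = np.array([0.1, 0.2, 0.6, 0.1])          # predicted distribution with q_j > 0
p = np.array([0.0, 0.0, 1.0, 0.0])          # one-hot true distribution, c = 3
print(cross_entropy(p, q), -np.log(q[2]))   # both ~0.5108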


MLP for multi-class classification Cross-entropy

MLP for multi-class classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Ỹ, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

Without loss of generality, assume Ỹ = {0, 1}^C corresponding to the output space Y = {1, . . . , C}, C ≥ 2, and that the y^i are one-hot encoded vectors.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 50 / 75


MLP for multi-class classification Training MLP for multi-class classification

MLP for multi-class classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Ỹ, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

Aim of training MLP: To learn a parametrized map h_w : X → Ỹ such that for the training data D, we have y^i = h_w(x^i), ∀i ∈ {1, . . . , S}.

Aim of using the trained MLP model: For an unseen sample x̂ ∈ X, predict ŷ = h_w(x̂) = MLP(x̂; w); to find the integer label, use argmax_j ŷ_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 51 / 75


MLP for multi-class classification Training MLP for multi-class classification

MLP for multi-class classification

Methodology for training MLP


Design a suitable loss (or error) function e : Ỹ × Ỹ → [0, +∞) to compare the actual label y^i and the prediction ŷ^i made by the MLP using e(y^i, ŷ^i), ∀i ∈ {1, . . . , S}.
Recall: e(y^i, ŷ^i) = CE(y^i, ŷ^i), the cross-entropy loss function.
Usually the error is parametrized by the weights w of the MLP and is denoted by e(ŷ^i, y^i; w).
Use Gradient descent/SGD/mini-batch SGD to minimize the total error:

E = Σ_{i=1}^S e(ŷ^i, y^i; w) =: Σ_{i=1}^S e^i(w).
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 52 / 75
MLP for multi-class classification Training MLP for multi-class classification

SGD for training MLP for multi-class classification


SGD Algorithm to train MLP
Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, y^i ∈ Ỹ, ∀i;
MLP architecture, max epochs K, learning rates γ_k, ∀k ∈ {1, . . . , K}.

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . . , K
  - Choose a sample j_k ∈ {1, . . . , S}.
  - Find ŷ^{j_k} = MLP(x^{j_k}; w^k). (forward pass)
  - Compute error e^{j_k}(w^k) using the cross-entropy loss function.
  - Compute error gradient ∇_w e^{j_k}(w^k) using backpropagation.
  - Update: w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

Output: w^∗ = w^{K+1}.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 53 / 75
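As a concrete illustration (not from the slides), below is a minimal NumPy sketch of this SGD loop for the simplest such network: a single linear output layer followed by soft-max, trained with the cross-entropy loss on synthetic data. It uses the gradient ∂e/∂a^L = ŷ − y that is derived later in this lecture; the data, the 0-based class labels, and the hyper-parameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
S, d, C = 300, 3, 3                              # samples, input dimension, classes (illustrative)
y_int = rng.integers(0, C, size=S)               # integer labels 0, ..., C-1
x = rng.normal(size=(S, d)) + 3.0 * np.eye(C, d)[y_int]   # class c clustered around 3 e_c
Y = np.eye(C)[y_int]                             # one-hot encoded labels

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

W = np.zeros((C, d))                             # weights of the output layer
gamma = 0.1                                      # learning rate gamma_k (kept constant here)
for k in range(2000):                            # SGD iterations
    j = rng.integers(S)                          # choose a sample j_k
    yhat = softmax(W @ x[j])                     # forward pass + soft-max
    e = -np.sum(Y[j] * np.log(yhat + 1e-12))     # cross-entropy error e^{j_k}(w^k)
    if k % 500 == 0:
        print(f"iteration {k}: cross-entropy on sample {j} = {e:.3f}")
    delta = yhat - Y[j]                          # d e / d a^L = yhat - y (derived later)
    W -= gamma * np.outer(delta, x[j])           # update with grad_W e = delta x^T

pred = np.argmax(x @ W.T, axis=1)
print("training accuracy:", np.mean(pred == y_int))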


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Recall forward pass: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), the prediction ŷ is computed using forward pass as:

ŷ = MLP(x; w) = softmax(φ(W^L φ(W^{L−1} . . . φ(W^1 x) . . .))).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 54 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: For an arbitrary sample (x, y ) from training data D, and the MLP with
weights w = (W^1, W^2, . . . , W^L), how do we compute the error gradient?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 55 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 56 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.
For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T

where Diag(φ'^ℓ) is the diagonal matrix with diagonal entries φ'(z_1^ℓ), . . . , φ'(z_{N_ℓ}^ℓ), δ^ℓ = (∂e/∂a_1^ℓ, . . . , ∂e/∂a_{N_ℓ}^ℓ)^T, and a^{ℓ−1} = (a_1^{ℓ−1}, . . . , a_{N_{ℓ−1}}^{ℓ−1})^T.
P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 57 / 75
MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.
For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L (a^{ℓ−1})^T

where V^{ℓ+1} = (W^{ℓ+1})^T Diag(φ'^{ℓ+1}).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 58 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Backpropagation: Indeed, the error gradients are very similar to the binary classification setup.
For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^T = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L (a^{ℓ−1})^T

The main difference arises in the procedure to find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 59 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 60 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 61 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

ŷ = (ŷ_1, . . . , ŷ_C)^T.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 62 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

ŷ = (ŷ_1, . . . , ŷ_C)^T.

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 63 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: For an arbitrary sample (x, y) from training data D, and the prediction ŷ from the MLP, we know the following:

y is in one-hot encoded form: y = (y_1, . . . , y_C)^T with some particular y_c taking value 1 (corresponding to the label c) and y_j = 0, j ≠ c.

y and ŷ are discrete probability distributions.

ŷ = (ŷ_1, . . . , ŷ_C)^T.

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j.

Note: ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 64 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 65 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

Aim: To find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 66 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j = 1, . . . , C.

Aim: To find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L), ∀j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 67 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 68 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

∂e/∂ŷ_m = −y_m/ŷ_m.

∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ].

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 69 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

∂e/∂ŷ_m = −y_m/ŷ_m.

∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ] = ŷ_j (1 − ŷ_j) if j = m, and −ŷ_m ŷ_j otherwise. (Homework!)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 70 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀j.

We have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L).

∂e/∂ŷ_m = −y_m/ŷ_m.

∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ] = ŷ_j (1 − ŷ_j) if j = m, and −ŷ_m ŷ_j otherwise. (Homework!)

Hence by substitution, ∂e/∂a_j^L = Σ_{m=1, m≠j}^C y_m ŷ_j − y_j (1 − ŷ_j) = ŷ_j − y_j.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 71 / 75


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification: Finding error gradients at the output layer

Recall: We wanted to find δ^L = (∂e/∂a_1^L, . . . , ∂e/∂a_C^L)^T.

We have ∂e/∂a_j^L = ŷ_j − y_j.

Hence δ^L = (ŷ_1 − y_1, . . . , ŷ_C − y_C)^T = ŷ − y.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 72 / 75
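A small numerical sanity check (not part of the slides) that this analytic gradient δ^L = ŷ − y of the cross-entropy-of-soft-max error matches a finite-difference approximation; the values below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(1)
C = 5
a = rng.normal(size=C)                    # arbitrary output-layer activations a^L
y = np.eye(C)[2]                          # one-hot target (class c = 3, i.e. index 2)

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def error(a):
    # cross-entropy e(yhat, y) with yhat = softmax(a)
    return -np.sum(y * np.log(softmax(a)))

analytic = softmax(a) - y                 # delta^L = yhat - y from the slide
eps = 1e-6
numeric = np.array([(error(a + eps * np.eye(C)[j]) - error(a - eps * np.eye(C)[j])) / (2 * eps)
                    for j in range(C)])
print(np.max(np.abs(analytic - numeric))) # should be tiny (~1e-9)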


MLP for multi-class classification Training MLP for multi-class classification

Training MLP for multi-class classification

Since the backpropagation for multi-class classification is essentially similar


to that of binary classification, it suffers from
the exploding gradient problem, and
the vanishing gradient problem.

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 73 / 75


MLP for multi-label classification

MLP for multi-label classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: y^i ⊆ Y = {1, . . . , C}, C ≥ 2.

Corresponds to multi-label classification.

Question 1: What is a suitable architecture for the MLP’s last (or


output) layer?
Question 2: What is a suitable loss (or error) function?

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 74 / 75


MLP for multi-label classification

MLP for multi-label classification

Input: Training Data D = {(x^i, y^i)}_{i=1}^S, x^i ∈ X ⊆ R^d, ∀i ∈ {1, . . . , S}, and MLP architecture parametrized by weights w.

New Task: y^i ⊆ Y = {1, . . . , C}, C ≥ 2.

Corresponds to multi-label classification.

Question 1: What is a suitable architecture for the MLP’s last (or


output) layer?
Question 2: What is a suitable loss (or error) function?
(Homework!)

P. Balamurugan Deep Learning - Theory and Practice September 15, 2020. 75 / 75
