IE 643
Lecture 9
1 Recap
MLP for prediction tasks
Aim of training MLP: To learn a parametrized map h_w : X → Y such that for the training data D, we have y^i = h_w(x^i), ∀ i ∈ {1, . . . , S}.
Aim of using the trained MLP model: For an unseen sample x̂ ∈ X, predict ŷ = h_w(x̂) = MLP(x̂; w).
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . . , K:
- Choose a sample j_k ∈ {1, . . . , S}.
- Find ŷ^{j_k} = MLP(x^{j_k}; w^k). (forward pass)
- Compute error e^{j_k}(w^k).
- Compute error gradient ∇_w e^{j_k}(w^k) using backpropagation.
- Update: w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).
Output: w* = w^{K+1}.
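A minimal NumPy sketch of this loop for a one-hidden-layer MLP with sigmoid activations and a squared error e = ½(ŷ − y)²; the hidden size, the toy data, and the choice of error function are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data D = {(x^i, y^i)}, i = 1, ..., S (illustrative values).
S, d_in, d_hidden = 8, 3, 4
X = rng.normal(size=(S, d_in))
Y = rng.integers(0, 2, size=S).astype(float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# w^0: weights of a one-hidden-layer MLP (biases omitted for brevity).
W1 = rng.normal(scale=0.5, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.5, size=(1, d_hidden))

gamma = 0.1     # step size gamma_k (kept constant here)
K = 200         # number of SGD iterations

for k in range(K + 1):
    jk = rng.integers(0, S)            # choose a sample j_k
    x, y = X[jk], Y[jk]

    # Forward pass: z^l = W^l a^{l-1}, a^l = phi(z^l).
    z1 = W1 @ x
    a1 = sigmoid(z1)
    z2 = W2 @ a1
    a2 = sigmoid(z2)                   # prediction yhat = a^2
    e = 0.5 * (a2 - y) ** 2            # error e^{j_k}(w^k)

    # Backpropagation: grad_{W^l} e = Diag(phi'(z^l)) delta^l (a^{l-1})^T.
    delta2 = a2 - y                    # de/da^2 for the squared error
    gW2 = np.outer(a2 * (1 - a2) * delta2, a1)
    delta1 = W2.T @ (a2 * (1 - a2) * delta2)   # delta^1 = V^2 delta^2
    gW1 = np.outer(a1 * (1 - a1) * delta1, x)

    # Update: w^{k+1} = w^k - gamma_k * grad e^{j_k}(w^k).
    W1 -= gamma * gW1
    W2 -= gamma * gW2
```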
Recall forward pass: For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the prediction ŷ is computed using the forward pass as: a^0 = x; z^ℓ = W^ℓ a^{ℓ−1}, a^ℓ = φ^ℓ(z^ℓ) for ℓ = 1, . . . , L; and ŷ = a^L.
Recall backpropagation: For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the error gradient with respect to the weights at the ℓ-th layer is computed as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^⊤,

where Diag(φ'^ℓ) is the diagonal matrix with entries φ'(z_1^ℓ), . . . , φ'(z_{N_ℓ}^ℓ), δ^ℓ = [∂e/∂a_1^ℓ ⋯ ∂e/∂a_{N_ℓ}^ℓ]^⊤, and a^{ℓ−1} = [a_1^{ℓ−1} ⋯ a_{N_{ℓ−1}}^{ℓ−1}]^⊤.
Unrolling the recursion for δ^ℓ, the same gradient can be written as:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} ⋯ V^L δ^L (a^{ℓ−1})^⊤,

where V^{ℓ+1} = (W^{ℓ+1})^⊤ Diag(φ'^{ℓ+1}).
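This unrolled form can be sanity-checked numerically. The sketch below, under the assumption of a small three-layer MLP with sigmoid activations and squared error (all sizes and data are illustrative, not from the lecture), compares the Diag(φ'^1) V^2 V^3 δ^L (a^0)^⊤ expression for ℓ = 1 against a finite-difference estimate of one gradient entry.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A small three-layer MLP with sigmoid activations (illustrative sizes).
sizes = [3, 5, 4, 1]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(3)]
x = rng.normal(size=sizes[0])
y = 1.0

def forward(W):
    """Forward pass; returns activations a^0..a^L and pre-activations z^1..z^L."""
    a, z = [x], []
    for Wl in W:
        z.append(Wl @ a[-1])
        a.append(sigmoid(z[-1]))
    return a, z

def error(W):
    a, _ = forward(W)
    return 0.5 * float((a[-1][0] - y) ** 2)   # squared error on the single output

# Analytic gradient w.r.t. W^1 via the unrolled formula:
# grad_{W^1} e = Diag(phi'^1) V^2 V^3 delta^L (a^0)^T, with V^l = (W^l)^T Diag(phi'^l).
a, z = forward(W)
phi_p = [al * (1 - al) for al in a[1:]]       # sigmoid'(z^l) = a^l (1 - a^l)
deltaL = a[-1] - y                            # de/da^L for the squared error
V2 = W[1].T @ np.diag(phi_p[1])
V3 = W[2].T @ np.diag(phi_p[2])
gW1 = np.diag(phi_p[0]) @ np.outer(V2 @ V3 @ deltaL, a[0])

# Finite-difference check of one entry of grad_{W^1} e.
eps, (i, j) = 1e-6, (2, 1)
Wp = [Wl.copy() for Wl in W]
Wm = [Wl.copy() for Wl in W]
Wp[0][i, j] += eps
Wm[0][i, j] -= eps
num = (error(Wp) - error(Wm)) / (2 * eps)
print(gW1[i, j], num)                         # the two values should agree closely
```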
MLP for multi-class classification

For binary classification, there is usually a single neuron at the last (L-th) layer of the MLP, with the logistic sigmoid activation σ : R → (0, 1), σ(z) = 1/(1 + e^{−z}), for z ∈ R.
Question 1: What is a suitable architecture for the MLP’s last (or output) layer?
Question 2: What is a suitable loss (or error) function?
Question 1: Can the same MLP architecture with a single output neuron, as used in binary classification, be used for multi-class classification?
Question 2: Can the same logistic sigmoid activation function for the output neuron, as used in binary classification, be used for multi-class classification?
A vector p = (p_1, . . . , p_m) represents a discrete probability distribution if:

p_i ≥ 0, ∀ i ∈ {1, . . . , m}, and Σ_{i=1}^m p_i = 1.
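For example, an integer label y ∈ {1, . . . , m} can be turned into such a distribution by one-hot encoding (the y_onehotenc transformation discussed below); a brief illustrative sketch, not code from the lecture:

```python
import numpy as np

def one_hot(y, m):
    """Encode an integer label y in {1, ..., m} as a one-hot probability vector."""
    p = np.zeros(m)
    p[y - 1] = 1.0          # labels are 1-indexed, as in {1, ..., m}
    return p

p = one_hot(3, 5)           # -> [0. 0. 1. 0. 0.]
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)   # the simplex conditions above
```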
Question: Why should we transform y ∈ Y, taking an integer value, into a y_onehotenc which represents a discrete probability distribution?

One possible answer: If the prediction ŷ made by the MLP is also a discrete probability distribution, then the comparison between ŷ and y_onehotenc becomes a comparison between two discrete probability distributions.
The last-layer outputs a_1^L, . . . , a_C^L are converted into a discrete probability distribution q = (q_1, . . . , q_C) using the softmax function:

q_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀ j = 1, . . . , C.
Note: Σ_{j=1}^C p_j log p_j is called the negative entropy associated with distribution p (denoted by NE(p)), and −Σ_{j=1}^C p_j log q_j is called the cross-entropy between p and q (denoted by CE(p, q)).
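Both quantities are straightforward to compute; a sketch for two illustrative distributions p and q (the small eps inside the log only guards against log 0 for one-hot p and is not part of the definitions above):

```python
import numpy as np

def negative_entropy(p, eps=1e-12):
    """NE(p) = sum_j p_j log p_j."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p + eps)))

def cross_entropy(p, q, eps=1e-12):
    """CE(p, q) = -sum_j p_j log q_j."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

p = np.array([0.0, 1.0, 0.0])        # e.g. a one-hot encoded label
q = np.array([0.2, 0.7, 0.1])        # e.g. an MLP's softmax output
print(negative_entropy(p), cross_entropy(p, q))
```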
Training MLP for multi-class classification

Aim of training MLP: To learn a parametrized map h_w : X → Ỹ such that for the training data D, we have y^i = h_w(x^i), ∀ i ∈ {1, . . . , S}.
Aim of using the trained MLP model: For an unseen sample x̂ ∈ X, predict ŷ = h_w(x̂) = MLP(x̂; w); to find the integer label, use arg max_j ŷ_j.
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . . , K:
- Choose a sample j_k ∈ {1, . . . , S}.
- Find ŷ^{j_k} = MLP(x^{j_k}; w^k). (forward pass)
- Compute error e^{j_k}(w^k) using the cross-entropy loss function.
- Compute error gradient ∇_w e^{j_k}(w^k) using backpropagation.
- Update: w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).
Output: w* = w^{K+1}.
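As a rough illustration of the error computation inside this loop, the sketch below combines the one-hot encoding, softmax output, and cross-entropy pieces for a single sample; the two-layer MLP, its sizes, and the identity activation at the last layer are assumptions made only for this example, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Illustrative two-layer MLP with C = 4 output classes.
d_in, d_hidden, C = 6, 8, 4
W1 = rng.normal(scale=0.5, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.5, size=(C, d_hidden))

x = rng.normal(size=d_in)            # the chosen sample x^{j_k}
y = 3                                # its integer label in {1, ..., C}
y_onehot = np.eye(C)[y - 1]          # y_onehotenc

# Forward pass: hidden layer, then softmax over the C last-layer outputs
# (identity activation assumed at the last layer, so a^L = W^2 a^1).
a1 = sigmoid(W1 @ x)
aL = W2 @ a1
y_hat = softmax(aL)                  # prediction: a discrete distribution

# Cross-entropy error e = -sum_j y_j log yhat_j (only the true class contributes).
e = -float(np.sum(y_onehot * np.log(y_hat + 1e-12)))
pred_label = int(np.argmax(y_hat)) + 1   # integer label via argmax_j yhat_j
print(e, pred_label)
```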
Recall forward pass: For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), the prediction ŷ is computed using the forward pass as: a^0 = x; z^ℓ = W^ℓ a^{ℓ−1}, a^ℓ = φ^ℓ(z^ℓ) for ℓ = 1, . . . , L; and ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀ j = 1, . . . , C.
Backpropagation: For an arbitrary sample (x, y) from training data D, and the MLP with weights w = (W^1, W^2, . . . , W^L), how do we compute the error gradient?

Indeed, the error gradients are very similar to the binary classification setup. The error gradient with respect to the weights W^ℓ in the ℓ-th layer is:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^⊤,

where Diag(φ'^ℓ) is the diagonal matrix with entries φ'(z_1^ℓ), . . . , φ'(z_{N_ℓ}^ℓ), δ^ℓ = [∂e/∂a_1^ℓ ⋯ ∂e/∂a_{N_ℓ}^ℓ]^⊤, and a^{ℓ−1} = [a_1^{ℓ−1} ⋯ a_{N_{ℓ−1}}^{ℓ−1}]^⊤.
Unrolling the recursion as before:

∇_{W^ℓ} e = Diag(φ'^ℓ) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ'^ℓ) V^{ℓ+1} V^{ℓ+2} ⋯ V^L δ^L (a^{ℓ−1})^⊤,

where V^{ℓ+1} = (W^{ℓ+1})^⊤ Diag(φ'^{ℓ+1}).

The main difference arises in the procedure to find δ^L = [∂e/∂a_1^L ⋯ ∂e/∂a_C^L]^⊤.
The prediction is the vector ŷ = [ŷ_1 ⋯ ŷ_C]^⊤.

e(ŷ, y) is the cross-entropy error function: e(ŷ, y) = −Σ_{j=1}^C y_j log ŷ_j, with

ŷ_j = exp(a_j^L) / Σ_{r=1}^C exp(a_r^L), ∀ j = 1, . . . , C.

Aim: To find δ^L = [∂e/∂a_1^L ⋯ ∂e/∂a_C^L]^⊤.

By the chain rule, we have ∂e/∂a_j^L = Σ_{m=1}^C (∂e/∂ŷ_m)(∂ŷ_m/∂a_j^L), ∀ j.

First, ∂e/∂ŷ_m = −y_m/ŷ_m.

Next, ∂ŷ_m/∂a_j^L = ∂/∂a_j^L [ exp(a_m^L) / Σ_{r=1}^C exp(a_r^L) ] = ŷ_j(1 − ŷ_j) if j = m, and −ŷ_m ŷ_j otherwise. (Homework!)

Hence by substitution, ∂e/∂a_j^L = Σ_{m≠j} y_m ŷ_j − y_j(1 − ŷ_j) = ŷ_j Σ_{m=1}^C y_m − y_j = ŷ_j − y_j, using Σ_{m=1}^C y_m = 1 since y is one-hot encoded.
Recall: We wanted to find δ^L = [∂e/∂a_1^L ⋯ ∂e/∂a_C^L]^⊤. We have ∂e/∂a_j^L = ŷ_j − y_j. Hence

δ^L = [ŷ_1 − y_1 ⋯ ŷ_C − y_C]^⊤ = ŷ − y.
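This closed form is easy to verify numerically: the sketch below compares ŷ − y with finite-difference estimates of ∂e/∂a_j^L for a softmax output ŷ and a one-hot y (all values illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def cross_entropy_from_aL(aL, y_onehot):
    """e(yhat, y) = -sum_j y_j log yhat_j with yhat = softmax(a^L)."""
    return -float(np.sum(y_onehot * np.log(softmax(aL))))

C = 5
rng = np.random.default_rng(3)
aL = rng.normal(size=C)                      # last-layer outputs a^L
y_onehot = np.eye(C)[2]                      # one-hot encoded label

analytic = softmax(aL) - y_onehot            # delta^L = yhat - y

# Finite-difference estimate of de/da_j^L, component by component.
eps = 1e-6
numeric = np.zeros(C)
for j in range(C):
    ap, am = aL.copy(), aL.copy()
    ap[j] += eps
    am[j] -= eps
    numeric[j] = (cross_entropy_from_aL(ap, y_onehot)
                  - cross_entropy_from_aL(am, y_onehot)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # should be close to 0
```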