IE 643
Lecture 8
P. Balamurugan. Deep Learning - Theory and Practice. Sep 8 & Sep 11, 2020.
Outline
1 Recap
   MLP - Data Perspective
   Optimization Concepts
   Gradient Descent
2 Stochastic Gradient Descent
3 Sample-wise Gradient Computation
Recap: MLP - Data Perspective
Optimization perspective

Given training data $D = \{(x^i, y^i)\}_{i=1}^{S}$,
$$\min \sum_{i=1}^{S} e^i = \sum_{i=1}^{S} \mathrm{Err}(y^i, \hat{y}^i) = \sum_{i=1}^{S} \mathrm{Err}(y^i, \mathrm{MLP}(x^i))$$
  x_1   x_2   y    ŷ = σ(w_{11}^1 x_1 + w_{12}^1 x_2)
  -3    -3    1    σ(−3 w_{11}^1 − 3 w_{12}^1)
  -2    -2    1    σ(−2 w_{11}^1 − 2 w_{12}^1)
   4     4    0    σ(4 w_{11}^1 + 4 w_{12}^1)
   2    -5    0    σ(2 w_{11}^1 − 5 w_{12}^1)
Aim: To minimize the total error (assumed here to be the sum of squared errors), which is
$$\min_{w_{11}^1,\, w_{12}^1} E = \sum_{i=1}^{4} \left( y^i - \frac{1}{1 + \exp\!\left(-(w_{11}^1 x_1^i + w_{12}^1 x_2^i)\right)} \right)^2$$
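As a quick illustration, here is a minimal NumPy sketch that evaluates this E(w) on the four training samples; the sigmoid σ and the squared error follow the slide, while the variable names are my own:

```python
import numpy as np

# Training samples from the table above: inputs (x1, x2) and targets y.
X = np.array([[-3.0, -3.0],
              [-2.0, -2.0],
              [ 4.0,  4.0],
              [ 2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def total_error(w):
    """Sum of squared errors E(w) for w = (w11, w12)."""
    y_hat = sigmoid(X @ w)
    return np.sum((y - y_hat) ** 2)

print(total_error(np.array([0.0, 0.0])))  # every prediction is 0.5 here, so E = 4 * 0.25 = 1.0
```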
Recap: Optimization Concepts
The general optimization problem:
$$\min_{x \in C} f(x)$$
where $C \subseteq \mathbb{R}^d$ and $f: C \to \mathbb{R}$.
Directional derivative

The directional derivative of f at x in the direction d is
$$f'(x; d) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha d) - f(x)}{\alpha}.$$

Note: If all partial derivatives of f exist at x, then $f'(x; d) = \langle \nabla f(x), d \rangle$, where $\nabla f(x) = \left[ \frac{\partial f(x)}{\partial x_1} \; \cdots \; \frac{\partial f(x)}{\partial x_d} \right]^\top$.
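A small numerical sketch of this identity, using an illustrative smooth function f (the function and the point are assumptions, not from the slides): the finite-difference quotient for a small α should be close to ⟨∇f(x), d⟩.

```python
import numpy as np

# Illustrative smooth function f(x) = x1^2 + 3*x2^2 and its gradient.
f = lambda x: x[0]**2 + 3 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, -2.0])
d = np.array([1.0, 1.0])

alpha = 1e-6
fd_estimate = (f(x + alpha * d) - f(x)) / alpha   # limit definition, small alpha
inner_product = grad_f(x) @ d                      # <grad f(x), d>
print(fd_estimate, inner_product)                  # both approximately -10.0
```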
Descent Direction

A direction $0 \neq d \in \mathbb{R}^d$ is a descent direction of f at x if the directional derivative is negative: $f'(x; d) = \langle \nabla f(x), d \rangle < 0$ (for continuously differentiable f).
Proposition
Let $f: \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function over $\mathbb{R}^d$. Let $0 \neq d \in \mathbb{R}^d$ be a descent direction of f at x. Then there exists $\epsilon > 0$ such that for all $\alpha \in (0, \epsilon]$ we have
$$f(x + \alpha d) < f(x).$$
Proof idea: By definition of the directional derivative, $\lim_{\alpha \downarrow 0} \frac{f(x+\alpha d) - f(x)}{\alpha} = f'(x; d) = \langle \nabla f(x), d \rangle < 0$. Hence for all sufficiently small $\alpha > 0$ the quotient $\frac{f(x+\alpha d) - f(x)}{\alpha}$ is negative, i.e. $f(x + \alpha d) < f(x)$.
Consider (GEN-OPT): $\min_{x \in \mathbb{R}^d} f(x)$, where $f: \mathbb{R}^d \to \mathbb{R}$.

Algorithm to solve (GEN-OPT)
- Start with $x^0 \in \mathbb{R}^d$.
- For $k = 0, 1, 2, \ldots$
  - Find a descent direction $d^k$ of f at $x^k$ and $\alpha_k > 0$ such that $f(x^k + \alpha_k d^k) < f(x^k)$.
  - $x^{k+1} = x^k + \alpha_k d^k$.
  - Check for some stopping criterion and break from the loop.
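A minimal sketch of this loop, with the negative gradient as an illustrative descent direction and a crude step-size search that simply halves α until f decreases (both are assumptions for the sketch, not the only admissible choices):

```python
import numpy as np

def generic_descent(f, grad_f, x0, max_iters=100, tol=1e-8):
    """Generic descent scheme: at each iteration pick a descent direction and a
    step size alpha_k for which f decreases, then update x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:          # a simple stopping criterion
            break
        d = -g                                 # a descent direction at x
        alpha = 1.0
        while f(x + alpha * d) >= f(x):        # shrink alpha until f decreases
            alpha *= 0.5
            if alpha < 1e-16:
                return x
        x = x + alpha * d                      # x^{k+1} = x^k + alpha_k d^k
    return x

# Example on a simple quadratic: minimum at the origin.
print(generic_descent(lambda v: v @ v, lambda v: 2 * v, [3.0, -4.0]))
```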
Proposition
Let $f: C \to \mathbb{R}$ be a function over the set $C \subseteq \mathbb{R}^d$. Let $x^\ast \in \mathrm{int}(C)$ be a local optimum point of f, and let all partial derivatives of f exist at $x^\ast$. Then $\nabla f(x^\ast) = 0$.
Proof idea:
- Consider $e_i = [0 \; \ldots \; 0 \; 1 \; 0 \; \ldots \; 0]^\top$, containing 1 at the i-th coordinate.
- Let $g(\alpha) = f(x^\ast + \alpha e_i)$. Note that g is a scalar function.
- $x^\ast$ is a local optimum point of f $\implies$ 0 is a local optimum point of g. (Why?)
- $\implies g'(0) = 0$.
- However $g'(0) = \langle \nabla f(x^\ast), e_i \rangle = \frac{\partial f(x^\ast)}{\partial x_i} = 0$. (How?)
- Since this holds for every coordinate i, $\implies \nabla f(x^\ast) = 0$. (Why?)
Algorithm to solve (GEN-OPT), where $f: \mathbb{R}^d \to \mathbb{R}$
- Start with $x^0 \in \mathbb{R}^d$.
- For $k = 0, 1, 2, \ldots$
  - If $\|\nabla f(x^k)\|_2 = 0$, set $x^\ast = x^k$ and break from the loop. (Caveat alert!)
  - Find a descent direction $d^k$ of f at $x^k$ and $\alpha_k > 0$ such that $f(x^k + \alpha_k d^k) < f(x^k)$.
  - $x^{k+1} = x^k + \alpha_k d^k$.
- Output $x^\ast$.
Recap: Gradient Descent

Gradient Descent Algorithm to solve (GEN-OPT), where $f: \mathbb{R}^d \to \mathbb{R}$
- Start with $x^0 \in \mathbb{R}^d$.
- For $k = 0, 1, 2, \ldots$
  - If $\|\nabla f(x^k)\|_2 = 0$, set $x^\ast = x^k$ and break from the loop.
  - $d^k = -\nabla f(x^k)$.
  - $\alpha_k = \arg\min_{\alpha > 0} f(x^k + \alpha d^k)$.
  - $x^{k+1} = x^k + \alpha_k d^k$.
- Output $x^\ast$.
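A sketch of this procedure applied to the toy objective E(w) from the recap, with the exact line search approximated by a bounded numerical 1-D minimization (SciPy's minimize_scalar); the search interval (0, 10] and the iteration budget are assumptions made only for the sketch:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy dataset and squared-error objective E(w) from the earlier slides.
X = np.array([[-3.0, -3.0], [-2.0, -2.0], [4.0, 4.0], [2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def E(w):
    return np.sum((y - sigmoid(X @ w)) ** 2)

def grad_E(w):
    p = sigmoid(X @ w)
    return (-2.0 * (y - p) * p * (1.0 - p)) @ X   # sum of per-sample gradients

w = np.zeros(2)
for k in range(100):
    g = grad_E(w)
    if np.linalg.norm(g) <= 1e-10:
        break
    d = -g
    # alpha_k = argmin_{alpha > 0} E(w + alpha d), approximated by a bounded 1-D search.
    alpha = minimize_scalar(lambda a: E(w + a * d), bounds=(0.0, 10.0), method="bounded").x
    w = w + alpha * d
print(w, E(w))
```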
$$\min_{w=(w_{11}^1,\, w_{12}^1)} E(w) = \sum_{i=1}^{4} e^i(w) = \sum_{i=1}^{4} \left( y^i - \frac{1}{1 + \exp\!\left(-(w_{11}^1 x_1^i + w_{12}^1 x_2^i)\right)} \right)^2$$
where $E: \mathbb{R}^2 \to \mathbb{R}$.
Gradient Descent:
- Function values $E(w^k)$ exhibit $O(1/\sqrt{k})$ convergence under minor assumptions and the assumption that a local optimum exists.
- $O(1/k^2)$ convergence is possible (e.g., with acceleration).
- Linear convergence is also possible for strongly convex and smooth functions E.
- Arbitrary accuracy is possible: $|E(w^{gd}) - E(w^\ast)| \approx O(10^{-15})$.
Gradient Descent:
- Blind to the structure of E(w).
- Finding a proper $\alpha_k$ at each iteration k is computationally intensive: it takes at least $O(Sd)$ time.
- Storage complexity: $O(d)$.
Stochastic Gradient Descent

At iteration k, update using the gradient of a single sample's error $e^{j_k}$:
$$w^{k+1} \leftarrow w^k - \gamma_k \nabla_w e^{j_k}(w^k),$$
where $j_k$ is the index of the sample chosen at iteration k and $\gamma_k > 0$ is a step size.
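A minimal sketch of SGD on the same toy E(w); the constant step size γ = 0.1, the iteration count, and uniform random sampling of the index are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset carried over from the earlier table.
X = np.array([[-3.0, -3.0], [-2.0, -2.0], [4.0, 4.0], [2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def grad_sample_error(w, j):
    """Gradient of e^j(w) = (y^j - sigmoid(w . x^j))^2 with respect to w."""
    p = sigmoid(X[j] @ w)
    return -2.0 * (y[j] - p) * p * (1.0 - p) * X[j]

w = np.zeros(2)
gamma = 0.1                      # constant step size, an illustrative choice
for k in range(1000):
    j = rng.integers(len(y))     # pick one sample index at random
    w = w - gamma * grad_sample_error(w, j)
print(w)
```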
Mini-batch SGD
$$w^{k+1} \leftarrow w^k - \gamma_k \sum_{j \in B_k} \nabla_w e^{j}(w^k),$$
where $B_k$ is the mini-batch of sample indices used at iteration k.
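A minimal sketch of the mini-batch update on the same toy problem; the batch size of 2 and sampling without replacement are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[-3.0, -3.0], [-2.0, -2.0], [4.0, 4.0], [2.0, -5.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def grad_sample_error(w, j):
    # Gradient of e^j(w) = (y^j - sigmoid(w . x^j))^2 with respect to w.
    p = sigmoid(X[j] @ w)
    return -2.0 * (y[j] - p) * p * (1.0 - p) * X[j]

w, gamma, batch_size = np.zeros(2), 0.1, 2
for k in range(500):
    B_k = rng.choice(len(y), size=batch_size, replace=False)    # mini-batch of sample indices
    w = w - gamma * sum(grad_sample_error(w, j) for j in B_k)   # summed mini-batch gradient
print(w)
```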
These update rules can be applied directly to the example problem $\min_{w=(w_{11}^1,\, w_{12}^1)} E(w) = \sum_{i=1}^{4} e^i(w)$ considered above.
Sample-wise Gradient Computation

Consider a single training sample passed through the MLP. At layer $\ell$, neuron i has pre-activation $z_i^\ell$ and activation $a_i^\ell = \phi(z_i^\ell)$, where $\phi$ is the activation function; $w_{ij}^\ell$ denotes the weight connecting neuron j of layer $\ell-1$ to neuron i of layer $\ell$. The network output is $\hat{y}$ and e denotes the error on this sample. We now compute the gradients of e with respect to the individual weights.
Sample-wise Gradient Computation
2 a1 + w 2 a1 .
We have at layer L2 : a12 = φ z12 = φ w11
1 12 2
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 54 / 100
Sample-wise Gradient Computation
2 a1 + w 2 a1 .
We have at layer L2 : a12 = φ z12 = φ w11
1 12 2
2
∂e ∂z1 ∂e 1
Hence, ∇w 2 e = ∂z12 ∂w11
2 = a .
∂z12 1
11
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 55 / 100
Sample-wise Gradient Computation
2 a1 + w 2 a1 .
We have at layer L2 : a12 = φ z12 = φ w11
1 12 2
2
∂e ∂z1 ∂e 1
Hence, ∇w 2 e = ∂z12 ∂w11
2 = a .
∂z12 1
11
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 56 / 100
Sample-wise Gradient Computation
2 a1 + w 2 a1 .
We have at layer L2 : a12 = φ z12 = φ w11
1 12 2
2 2
∂e ∂z1 ∂e 1 ∂e ∂a1 1 ∂e 0 2 1
Hence, ∇w 2 e = ∂z12 ∂w11
2 = a
∂z12 1
= a
∂a12 ∂z12 1
= ∂a12
φ (z1 )a1 .
11
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 57 / 100
Sample-wise Gradient Computation
2 a1 + w 2 a1 .
We have at layer L2 : a12 = φ z12 = φ w11
1 12 2
2 2
∂e ∂z1 ∂e 1 ∂e ∂a1 1 ∂e 0 2 1
Hence, ∇w 2 e = 2 = ∂z 2 a1 = ∂a2 ∂z 2 a1
∂z12 ∂w11
= ∂a12
φ (z1 )a1 .
11 1
1 1
Now recall that z13 = w11 3 a2 + w 3 a2 .
1 12 2
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 58 / 100
Since $\hat{y} = \phi(z_1^3)$, the chain rule gives $\frac{\partial e}{\partial a_1^2} = \frac{\partial e}{\partial \hat{y}}\, \phi'(z_1^3)\, w_{11}^3$.

Thus, $\nabla_{w_{11}^2} e = \frac{\partial e}{\partial \hat{y}}\, \phi'(z_1^3)\, w_{11}^3\, \phi'(z_1^2)\, a_1^1$.
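To sanity-check the hand derivation, here is a small sketch comparing this closed-form partial derivative against a finite-difference estimate. The 2-2-1 architecture, sigmoid φ, squared error, and the specific weight values are all illustrative assumptions:

```python
import numpy as np

phi = lambda t: 1.0 / (1.0 + np.exp(-t))
dphi = lambda t: phi(t) * (1.0 - phi(t))

W1 = np.array([[0.3, -0.2], [0.5, 0.4]])   # layer 1 weights, rows are [w_i1^1, w_i2^1]
W2 = np.array([[0.7, -0.6], [0.1, 0.2]])   # layer 2 weights
W3 = np.array([[0.9, -0.3]])               # layer 3 (output) weights
x, y = np.array([1.0, -2.0]), 1.0

def error(W2_):
    """Forward pass returning e = (y - yhat)^2 and all intermediate quantities."""
    z1 = W1 @ x;   a1 = phi(z1)
    z2 = W2_ @ a1; a2 = phi(z2)
    z3 = W3 @ a2;  y_hat = phi(z3)[0]
    return (y - y_hat) ** 2, z1, a1, z2, a2, z3, y_hat

e, z1, a1, z2, a2, z3, y_hat = error(W2)
# Hand-derived formula: de/dw_11^2 = de/dyhat * phi'(z_1^3) * w_11^3 * phi'(z_1^2) * a_1^1
de_dyhat = -2.0 * (y - y_hat)
grad_formula = de_dyhat * dphi(z3)[0] * W3[0, 0] * dphi(z2)[0] * a1[0]

# Finite-difference check of the same partial derivative.
eps = 1e-6
W2_pert = W2.copy(); W2_pert[0, 0] += eps
grad_fd = (error(W2_pert)[0] - e) / eps
print(grad_formula, grad_fd)   # the two values should agree closely
```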
We have at layer $L_1$: $a_1^1 = \phi(z_1^1) = \phi(w_{11}^1 x_1 + w_{12}^1 x_2)$.

Note: $\nabla_{w_{11}^1} e = \frac{\partial e}{\partial z_1^1}\, x_1 = \frac{\partial e}{\partial a_1^1}\, \phi'(z_1^1)\, x_1$.

Now we see that $a_1^1$ contributes to both $z_1^2$ and $z_2^2$.

Recall: $z_1^2 = w_{11}^2 a_1^1 + w_{12}^2 a_2^1$ and $z_2^2 = w_{21}^2 a_1^1 + w_{22}^2 a_2^1$.

Hence $\frac{\partial e}{\partial a_1^1} = \sum_{i=1}^{2} \frac{\partial e}{\partial z_i^2}\frac{\partial z_i^2}{\partial a_1^1} = \sum_{i=1}^{2} \frac{\partial e}{\partial z_i^2}\, w_{i1}^2$.

Recall: We have already computed $\frac{\partial e}{\partial z_i^2} = \frac{\partial e}{\partial a_i^2}\, \phi'(z_i^2)$, $i = 1, 2$.
Generalized setting:
$$\frac{\partial e}{\partial w_{ij}^\ell} = \frac{\partial e}{\partial z_i^\ell}\, a_j^{\ell-1}$$
$$\frac{\partial e}{\partial z_i^\ell} = \frac{\partial e}{\partial a_i^\ell}\, \phi'(z_i^\ell)$$
$$\frac{\partial e}{\partial a_i^\ell} = \sum_{m=1}^{N_{\ell+1}} \frac{\partial e}{\partial z_m^{\ell+1}}\, w_{mi}^{\ell+1} = \sum_{m=1}^{N_{\ell+1}} \frac{\partial e}{\partial a_m^{\ell+1}}\, \phi'(z_m^{\ell+1})\, w_{mi}^{\ell+1}$$
Writing the relation for $\frac{\partial e}{\partial a_i^\ell}$ for all $i = 1, \ldots, N_\ell$ at once:
$$
\begin{bmatrix} \frac{\partial e}{\partial a_1^\ell} \\ \vdots \\ \frac{\partial e}{\partial a_{N_\ell}^\ell} \end{bmatrix}
=
\begin{bmatrix}
\phi'(z_1^{\ell+1})\, w_{11}^{\ell+1} & \cdots & \phi'(z_{N_{\ell+1}}^{\ell+1})\, w_{N_{\ell+1} 1}^{\ell+1} \\
\vdots & \cdots & \vdots \\
\phi'(z_1^{\ell+1})\, w_{1 N_\ell}^{\ell+1} & \cdots & \phi'(z_{N_{\ell+1}}^{\ell+1})\, w_{N_{\ell+1} N_\ell}^{\ell+1}
\end{bmatrix}
\begin{bmatrix} \frac{\partial e}{\partial a_1^{\ell+1}} \\ \vdots \\ \frac{\partial e}{\partial a_{N_{\ell+1}}^{\ell+1}} \end{bmatrix}
$$

Denoting $\delta^\ell := \left[ \frac{\partial e}{\partial a_1^\ell} \; \cdots \; \frac{\partial e}{\partial a_{N_\ell}^\ell} \right]^\top$ and $V^{\ell+1} := (W^{\ell+1})^\top \mathrm{Diag}(\phi'(z^{\ell+1}))$, this reads
$$
\delta^\ell = (W^{\ell+1})^\top \mathrm{Diag}(\phi'(z^{\ell+1}))\, \delta^{\ell+1} = V^{\ell+1} \delta^{\ell+1} = V^{\ell+1} V^{\ell+2} \delta^{\ell+2} = V^{\ell+1} V^{\ell+2} \cdots V^{L} \delta^{L}.
$$
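A direct translation of this recursion into NumPy, building the V matrices explicitly and chaining them; the layer sizes, weights, pre-activations, and δ^L below are placeholder values used purely for illustration:

```python
import numpy as np

# Sigmoid derivative, assuming phi is the sigmoid (an illustrative choice).
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
dphi = lambda t: sig(t) * (1.0 - sig(t))

rng = np.random.default_rng(0)
W = {2: rng.normal(size=(2, 2)), 3: rng.normal(size=(1, 2))}   # W^2, W^3 for layer sizes 2-2-1
z = {2: rng.normal(size=2), 3: rng.normal(size=1)}             # pre-activations z^2, z^3
delta_L = np.array([0.5])                                      # delta^3 = de/da^3, a placeholder

V = {l: W[l].T @ np.diag(dphi(z[l])) for l in (2, 3)}          # V^l = (W^l)^T Diag(phi'(z^l))
delta1 = V[2] @ V[3] @ delta_L                                 # delta^1 = V^2 V^3 delta^3
print(delta1)
```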
Generalized setting:
$$
\frac{\partial e}{\partial w_{ij}^\ell} = \frac{\partial e}{\partial z_i^\ell}\, a_j^{\ell-1} = \frac{\partial e}{\partial a_i^\ell}\, \phi'(z_i^\ell)\, a_j^{\ell-1}
$$

Stacking this over all i for a fixed j, and then over all j:
$$
\begin{bmatrix} \frac{\partial e}{\partial w_{1j}^\ell} \\ \vdots \\ \frac{\partial e}{\partial w_{N_\ell j}^\ell} \end{bmatrix}
=
\begin{bmatrix} \frac{\partial e}{\partial a_1^\ell}\, \phi'(z_1^\ell)\, a_j^{\ell-1} \\ \vdots \\ \frac{\partial e}{\partial a_{N_\ell}^\ell}\, \phi'(z_{N_\ell}^\ell)\, a_j^{\ell-1} \end{bmatrix}
=
\begin{bmatrix} \phi'(z_1^\ell)\, \frac{\partial e}{\partial a_1^\ell} \\ \vdots \\ \phi'(z_{N_\ell}^\ell)\, \frac{\partial e}{\partial a_{N_\ell}^\ell} \end{bmatrix} a_j^{\ell-1}
$$
$$
\begin{bmatrix}
\frac{\partial e}{\partial w_{11}^\ell} & \cdots & \frac{\partial e}{\partial w_{1 N_{\ell-1}}^\ell} \\
\vdots & \cdots & \vdots \\
\frac{\partial e}{\partial w_{N_\ell 1}^\ell} & \cdots & \frac{\partial e}{\partial w_{N_\ell N_{\ell-1}}^\ell}
\end{bmatrix}
=
\begin{bmatrix} \phi'(z_1^\ell)\, \frac{\partial e}{\partial a_1^\ell} \\ \vdots \\ \phi'(z_{N_\ell}^\ell)\, \frac{\partial e}{\partial a_{N_\ell}^\ell} \end{bmatrix}
\begin{bmatrix} a_1^{\ell-1} & \cdots & a_{N_{\ell-1}}^{\ell-1} \end{bmatrix}
$$
$$
\implies \nabla_{W^\ell} e = \mathrm{Diag}(\phi'(z^\ell))\, \delta^\ell\, (a^{\ell-1})^\top = \mathrm{Diag}(\phi'(z^\ell))\, V^{\ell+1} \cdots V^{L}\, \delta^{L}\, (a^{\ell-1})^\top
$$

Homework: Assume each neuron has a bias term and compute the gradients of the loss with respect to the bias terms.
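Putting the forward and backward passes together, here is a compact NumPy sketch of the recursions above for a single sample (biases omitted, as in the derivation). The network sizes, random weights, input, and squared error are illustrative assumptions:

```python
import numpy as np

# Per-sample backpropagation for a fully-connected network with activation phi
# in every layer and squared error e = sum_i (y_i - yhat_i)^2.
phi = lambda t: 1.0 / (1.0 + np.exp(-t))
dphi = lambda t: phi(t) * (1.0 - phi(t))

def backprop(weights, x, y):
    """weights[l] maps layer l activations to layer l+1 pre-activations;
    returns one gradient matrix per weight matrix."""
    # Forward pass: store pre-activations z^l and activations a^l for every layer.
    a, zs, acts = x, [], [x]
    for W in weights:
        z = W @ a
        a = phi(z)
        zs.append(z)
        acts.append(a)
    y_hat = a

    # delta^L = [de/da_i^L]_i for the squared error.
    delta = -2.0 * (y - y_hat)
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # grad of e w.r.t. W^l: Diag(phi'(z^l)) delta^l (a^{l-1})^T
        grads[l] = np.outer(dphi(zs[l]) * delta, acts[l])
        # delta for the layer below: (W^l)^T Diag(phi'(z^l)) delta^l
        delta = weights[l].T @ (dphi(zs[l]) * delta)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 2)), rng.normal(size=(2, 2)), rng.normal(size=(1, 2))]
grads = backprop(weights, x=np.array([1.0, -2.0]), y=np.array([1.0]))
print([g.shape for g in grads])   # one gradient matrix per weight matrix
```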
Taking norms:
$$
\|\nabla_{W^\ell} e\|_2 \le \|\mathrm{Diag}(\phi'(z^\ell))\|_2\; \|V^{\ell+1} \cdots V^{L} \delta^{L}\|_2\; \|(a^{\ell-1})^\top\|_2
$$
Recall:
$$
\delta^{L} = \begin{bmatrix} \frac{\partial e}{\partial a_1^{L}} \\ \vdots \\ \frac{\partial e}{\partial a_{N_L}^{L}} \end{bmatrix}
$$
$\frac{\partial e}{\partial a_i^{L}} =: \frac{\partial e}{\partial \hat{y}_i}$ denotes the gradient term with respect to the i-th neuron in the last (L-th) layer.

So far we have considered the squared error function. We will see more examples of constructing appropriate error functions and the corresponding gradient computation.
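As a concrete instance (worked out here for illustration): with the squared error $e = \sum_i (y_i - \hat{y}_i)^2$ used so far, each last-layer term is simply $\frac{\partial e}{\partial \hat{y}_i} = -2\,(y_i - \hat{y}_i)$.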
References

Gradient Descent introduced in:
- Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus des séances de l'Académie des sciences de Paris 25, 536-538, 1847.

Idea of SGD introduced in:
- H. Robbins and S. Monro: A stochastic approximation method. Annals of Mathematical Statistics, Vol. 22(3), pp. 400-407, 1951.

Backpropagation introduced in:
- P. J. Werbos: Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD Thesis, Harvard University, 1974.
- D. E. Rumelhart, G. E. Hinton, R. J. Williams: Learning internal representations by error propagation. Chapter in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I: Foundations, MIT Press, 1986.

Acknowledgments: CalcPlot3D website for plotting.