
Deep Learning - Theory and Practice

IE 643
Lecture 8

Sep 8 & Sep 11, 2020.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 1 / 100
Outline

1 Recap
MLP - Data Perspective
Optimization Concepts
Gradient Descent

2 Stochastic Gradient Descent


Mini-batch SGD

3 Sample-wise Gradient Computation

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 2 / 100
Recap MLP - Data Perspective

Multi Layer Perceptron - Data Perspective

Given data (x, y), the multi layer perceptron predicts:

    ŷ = φ(W^3 φ(W^2 φ(W^1 x))) =: MLP(x)

Similar to the perceptron, if y ≠ ŷ, an error Err(y, ŷ) is incurred.

Aim: To change the weights W^1, W^2, W^3 such that the error Err(y, ŷ) is
minimized.
Leads to an error minimization problem.
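
As an illustration (not part of the slides), a minimal sketch of this forward pass in Python, assuming a sigmoid activation φ and arbitrary 2-2-2-1 layer sizes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, W2, W3, phi=sigmoid):
    # y_hat = phi(W3 phi(W2 phi(W1 x)))
    a1 = phi(W1 @ x)
    a2 = phi(W2 @ a1)
    return phi(W3 @ a2)

# Hypothetical weights for layer sizes 2 -> 2 -> 2 -> 1
rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
x = np.array([1.0, -2.0])
y_hat = mlp_forward(x, W1, W2, W3)
err = (y_hat - 1.0) ** 2          # squared error against a label y = 1
print(y_hat, err)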

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 3 / 100
Recap MLP - Data Perspective

Multi Layer Perceptron - Data Perspective

Optimization perspective
Given training data D = {(x^i, y^i)}_{i=1}^{S},

    min Σ_{i=1}^{S} e^i = Σ_{i=1}^{S} Err(y^i, ŷ^i) = Σ_{i=1}^{S} Err(y^i, MLP(x^i))

Note: The minimization is over the weights W^1, . . . , W^L of the MLP,
where L denotes the number of layers in the MLP.
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 4 / 100
Recap MLP - Data Perspective

MLP - Data Perspective: A Simple Example

x_1   x_2   y     ŷ = σ(w^1_{11} x_1 + w^1_{12} x_2)
-3    -3    1     σ(−3 w^1_{11} − 3 w^1_{12})
-2    -2    1     σ(−2 w^1_{11} − 2 w^1_{12})
 4     4    0     σ(4 w^1_{11} + 4 w^1_{12})
 2    -5    0     σ(2 w^1_{11} − 5 w^1_{12})

Aim: To minimize the total error (assumed here to be the sum of squared errors), which is

    min_{w^1_{11}, w^1_{12}}  E = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2
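
A small numerical sketch of this objective (assuming σ(z) = 1/(1+exp(−z)) and the four samples from the table; the initial weights below are arbitrary):

import numpy as np

# The four training samples from the table above
X = np.array([[-3., -3.], [-2., -2.], [4., 4.], [2., -5.]])
y = np.array([1., 1., 0., 0.])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_error(w):
    # E(w) = sum_i ( y_i - sigmoid(w11*x1 + w12*x2) )^2
    y_hat = sigmoid(X @ w)
    return np.sum((y - y_hat) ** 2)

w = np.array([0.1, -0.2])     # some initial weights (w11, w12)
print(total_error(w))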

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 5 / 100
Recap MLP - Data Perspective

MLP - Data Perspective: A Simple Example


Visualizing the loss surface:

x_1   x_2   y     ŷ = σ(w^1_{11} x_1 + w^1_{12} x_2)
-3    -3    1     σ(−3 w^1_{11} − 3 w^1_{12})
-2    -2    1     σ(−2 w^1_{11} − 2 w^1_{12})
 4     4    0     σ(4 w^1_{11} + 4 w^1_{12})
 2    -5    0     σ(2 w^1_{11} − 5 w^1_{12})

    E = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 6 / 100
Recap Optimization Concepts

Optimization Concepts

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 7 / 100
Recap Optimization Concepts

General Optimization Problem

min_{x∈C} f(x)

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 8 / 100
Recap Optimization Concepts

General Optimization Problem

min_{x∈C} f(x)

f is called the objective function and C is called the feasible set.

Let f^* = min_{x∈C} f(x) denote the optimal objective function value.
Optimal solution set: S^* = {x ∈ C : f(x) = f^*}.
Let us denote by x^* an optimal solution in S^*.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 9 / 100
Recap Optimization Concepts

General Optimization Problem

min_{x∈C} f(x)    (OP)

Local Optimal Solution

A solution z to (OP) is called a local optimal solution if f(z) ≤ f(ẑ),
∀ẑ ∈ N(z, ε), for some ε > 0.

Global Optimal Solution

A solution z to (OP) is called a global optimal solution if f(z) ≤ f(ẑ),
∀ẑ ∈ C.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 10 / 100
Recap Optimization Concepts

General Optimization Problem

min_{x∈C} f(x)

C ⊆ R^d.

f : C −→ R.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 11 / 100
Recap Optimization Concepts

Directional derivative

Let f : C −→ R be a function defined over C ⊆ R^d. Let x ∈ int(C). Let
d ≠ 0 ∈ R^d. If the limit

    lim_{α↓0} ( f(x + αd) − f(x) ) / α

exists, then it is called the directional derivative of f at x along the
direction d, and is denoted by f′(x; d).

Note: If all partial derivatives of f exist at x, then f′(x; d) = ⟨∇f(x), d⟩,
where ∇f(x) = [ ∂f(x)/∂x_1  . . .  ∂f(x)/∂x_d ]^⊤.
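
A small numerical illustration of this note (the test function and the direction d below are arbitrary choices, not from the slides): the one-sided difference quotient approaches ⟨∇f(x), d⟩ as α ↓ 0.

import numpy as np

def f(x):
    # A smooth test function: f(x) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    # Hand-computed gradient of the test function
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([0.5, -1.0])

for alpha in [1e-1, 1e-3, 1e-5]:
    dir_deriv = (f(x + alpha * d) - f(x)) / alpha    # difference quotient
    print(alpha, dir_deriv, grad_f(x) @ d)           # approaches <grad f(x), d>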

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 12 / 100
Recap Optimization Concepts

Directional derivative

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 13 / 100
Recap Optimization Concepts

Descent Direction

Let f : R^d −→ R be a continuously differentiable function over R^d. Then
a vector 0 ≠ d ∈ R^d is called a descent direction of f at x if the
directional derivative of f at x along d is negative; that is,

    f′(x; d) = ⟨∇f(x), d⟩ < 0.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 14 / 100
Recap Optimization Concepts

Descent Direction

Let f : R^d −→ R be a continuously differentiable function over R^d. Then
a vector 0 ≠ d ∈ R^d is called a descent direction of f at x if the
directional derivative of f at x along d is negative; that is,

    f′(x; d) = ⟨∇f(x), d⟩ < 0.

Note: A natural candidate for a descent direction is d = −∇f(x).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 15 / 100
Recap Optimization Concepts

Descent Direction

Proposition
Let f : R^d −→ R be a continuously differentiable function over R^d. Let
0 ≠ d ∈ R^d be a descent direction of f at x. Then there exists ε > 0 such
that ∀α ∈ (0, ε] we have

    f(x + αd) < f(x).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 16 / 100
Recap Optimization Concepts

Descent Direction

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 17 / 100
Recap Optimization Concepts

Descent Direction

Proposition
Let f : R^d −→ R be a continuously differentiable function over R^d. Let
0 ≠ d ∈ R^d be a descent direction of f at x. Then there exists ε > 0 such
that ∀α ∈ (0, ε] we have

    f(x + αd) < f(x).

Proof idea:

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 18 / 100
Recap Optimization Concepts

Descent Direction

Proposition
Let f : R^d −→ R be a continuously differentiable function over R^d. Let
0 ≠ d ∈ R^d be a descent direction of f at x. Then there exists ε > 0 such
that ∀α ∈ (0, ε] we have

    f(x + αd) < f(x).

Proof idea: Since 0 ≠ d ∈ R^d is a descent direction, by definition of the
directional derivative we have

    f′(x; d) = lim_{α↓0} ( f(x + αd) − f(x) ) / α < 0

=⇒ ∃ ε > 0 such that ∀α ∈ (0, ε], f(x + αd) < f(x).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 19 / 100
Recap Optimization Concepts

Descent Direction

Proposition
Let f : R^d −→ R be a continuously differentiable function over R^d. Let
0 ≠ d ∈ R^d be a descent direction of f at x. Then there exists ε > 0 such
that ∀α ∈ (0, ε] we have

    f(x + αd) < f(x).

Proof idea: Since 0 ≠ d ∈ R^d is a descent direction, by definition of the
directional derivative we have

    f′(x; d) = lim_{α↓0} ( f(x + αd) − f(x) ) / α < 0

=⇒ ∃ ε > 0 such that ∀α ∈ (0, ε], f(x + αd) < f(x).

Note: If we cannot find such an ε, d is no longer a descent direction. Why?

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 20 / 100
Recap Optimization Concepts

Algorithm Development using Descent Direction


Consider the general optimization problem:

    min_{x∈R^d} f(x)    (GEN-OPT)

where f : R^d −→ R.

Algorithm to solve (GEN-OPT)
Start with x^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Find a descent direction d^k of f at x^k and α_k > 0 such that
    f(x^k + α_k d^k) < f(x^k).
  ▶ x^{k+1} = x^k + α_k d^k.
  ▶ Check for some stopping criterion and break from the loop.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 21 / 100
Recap Optimization Concepts

Characterization Of Local Optimum

Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 22 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 23 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:
Consider e_i = [0 . . . 0 1 0 . . . 0]^⊤, containing 1 at the i-th coordinate.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 24 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:
Consider e_i = [0 . . . 0 1 0 . . . 0]^⊤, containing 1 at the i-th coordinate.
Let g(α) = f(x^⋆ + α e_i). Note that g is a scalar function.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 25 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:
Consider e_i = [0 . . . 0 1 0 . . . 0]^⊤, containing 1 at the i-th coordinate.
Let g(α) = f(x^⋆ + α e_i). Note that g is a scalar function.
x^⋆ is a local optimum point of f =⇒ 0 is a local optimum point of
g. (Why?)

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 26 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:
Consider e_i = [0 . . . 0 1 0 . . . 0]^⊤, containing 1 at the i-th coordinate.
Let g(α) = f(x^⋆ + α e_i). Note that g is a scalar function.
x^⋆ is a local optimum point of f =⇒ 0 is a local optimum point of
g. (Why?)
=⇒ g′(0) = 0.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 27 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:
Consider e_i = [0 . . . 0 1 0 . . . 0]^⊤, containing 1 at the i-th coordinate.
Let g(α) = f(x^⋆ + α e_i). Note that g is a scalar function.
x^⋆ is a local optimum point of f =⇒ 0 is a local optimum point of
g. (Why?)
=⇒ g′(0) = 0.
However g′(0) = ⟨∇f(x^⋆), e_i⟩ = ∂f(x^⋆)/∂x_i = 0. (How?)

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 28 / 100
Recap Optimization Concepts

Characterization Of Local Optimum


Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x^⋆ ∈ int(C) be a
local optimum point of f. Let all partial derivatives of f exist at x^⋆. Then
∇f(x^⋆) = 0.

Proof idea:
Consider e_i = [0 . . . 0 1 0 . . . 0]^⊤, containing 1 at the i-th coordinate.
Let g(α) = f(x^⋆ + α e_i). Note that g is a scalar function.
x^⋆ is a local optimum point of f =⇒ 0 is a local optimum point of
g. (Why?)
=⇒ g′(0) = 0.
However g′(0) = ⟨∇f(x^⋆), e_i⟩ = ∂f(x^⋆)/∂x_i = 0. (How?)
=⇒ ∇f(x^⋆) = 0. (Why?)
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 29 / 100
Recap Optimization Concepts

Algorithm Development using Descent Direction


Consider the general optimization problem:

    min_{x∈R^d} f(x)    (GEN-OPT)

where f : R^d −→ R.

Algorithm to solve (GEN-OPT)
Start with x^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ If ‖∇f(x^k)‖_2 = 0, set x^* = x^k, break from the loop. (Caveat alert!)
  ▶ Find a descent direction d^k of f at x^k and α_k > 0 such that
    f(x^k + α_k d^k) < f(x^k).
  ▶ x^{k+1} = x^k + α_k d^k.

Output x^*.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 30 / 100
Recap Gradient Descent

Algorithm Development using Descent Direction


Consider the general optimization problem:

    min_{x∈R^d} f(x)    (GEN-OPT)

where f : R^d −→ R.

Gradient Descent Algorithm to solve (GEN-OPT)
Start with x^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ If ‖∇f(x^k)‖_2 = 0, set x^* = x^k, break from the loop.
  ▶ d^k = −∇f(x^k).
  ▶ α_k = argmin_{α>0} f(x^k + α d^k).
  ▶ x^{k+1} = x^k + α_k d^k.

Output x^*.
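
A minimal runnable sketch of this algorithm (with the exact line search α_k = argmin_{α>0} f(x^k + α d^k) replaced by a simple backtracking rule, which is an assumption made only for the sketch):

import numpy as np

def gradient_descent(f, grad_f, x0, tol=1e-8, max_iter=1000):
    # Backtracking line search stands in for the exact alpha_k = argmin_{alpha>0} f(x^k + alpha d^k)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:       # stopping criterion: ||grad f(x^k)||_2 ~ 0
            break
        d = -g                             # descent direction d^k = -grad f(x^k)
        alpha = 1.0
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha *= 0.5                   # shrink alpha until f decreases
        x = x + alpha * d                  # x^{k+1} = x^k + alpha_k d^k
    return x

# Example: minimize f(x) = (x_1 - 1)^2 + 2 (x_2 + 3)^2, whose minimizer is (1, -3)
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(gradient_descent(f, grad_f, x0=[0.0, 0.0]))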
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 31 / 100
Recap Gradient Descent

Gradient Descent for our MLP Problem


Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

where E : R^2 −→ R.

Gradient Descent Algorithm to solve the MLP Loss Minimization Problem

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ If ‖∇E(w^k)‖_2 = 0, set w^* = w^k, break from the loop.
  ▶ d^k = −∇E(w^k).
  ▶ α_k = argmin_{α>0} E(w^k + α d^k).
  ▶ w^{k+1} = w^k + α_k d^k.
Output w^*.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 32 / 100
Recap Gradient Descent

Gradient Descent for our MLP Problem


Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

Gradient Descent Algorithm to solve the MLP Loss Minimization Problem

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ If ‖∇E(w^k)‖_2 = 0, set w^* = w^k, break from the loop.
  ▶ d^k = −Σ_{i=1}^{4} ∇e^i(w^k).
  ▶ α_k = argmin_{α>0} E(w^k + α d^k).
  ▶ w^{k+1} = w^k + α_k d^k.
Output w^*.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 33 / 100
Recap Gradient Descent

Gradient Descent for our MLP Problem

Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

Gradient Descent:
  ▶ Function values E(w^k) exhibit O(1/√k) convergence under minor
    assumptions and the assumption of existence of a local optimum.
  ▶ O(1/k^2) convergence is possible.
  ▶ Linear convergence is also possible for a strongly convex and smooth
    function E.
  ▶ Arbitrary accuracy is possible: |E(w_gd) − E(w^*)| ≈ O(10^{−15}).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 34 / 100
Recap Gradient Descent

Gradient Descent for our MLP Problem

Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

Gradient Descent:
  ▶ Blind to the structure of E(w).
  ▶ Finding a proper α_k at each k is computationally intensive; it takes at
    least O(Sd) time.
  ▶ Storage complexity: O(d).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 35 / 100
Stochastic Gradient Descent

Stochastic Gradient Descent for our MLP Problem

Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

Stochastic Gradient Descent Algorithm to solve the MLP Loss
Minimization Problem
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a sample j_k ∈ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 36 / 100
Stochastic Gradient Descent

Stochastic Gradient Descent for our MLP Problem


Stochastic Gradient Descent Algorithm to solve the MLP Loss
Minimization Problem
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a sample j_k ∈ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

∇_w e^{j_k}(w^k): the gradient of e^{j_k} with respect to w, evaluated at the point w^k. Takes only O(d)
time.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 37 / 100
Stochastic Gradient Descent

Stochastic Gradient Descent for our MLP Problem


Stochastic Gradient Descent Algorithm to solve the MLP Loss
Minimization Problem
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a sample j_k ∈ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

∇_w e^{j_k}(w^k): the gradient of e^{j_k} with respect to w, evaluated at the point w^k. Takes only O(d)
time.
Under suitable conditions on γ_k (Σ_k γ_k^2 < ∞, Σ_k γ_k → ∞), this procedure
converges asymptotically.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 38 / 100
Stochastic Gradient Descent

Stochastic Gradient Descent for our MLP Problem


Stochastic Gradient Descent Algorithm to solve the MLP Loss
Minimization Problem
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a sample j_k ∈ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

∇_w e^{j_k}(w^k): the gradient of e^{j_k} with respect to w, evaluated at the point w^k. Takes only O(d)
time.
Under suitable conditions on γ_k (Σ_k γ_k^2 < ∞, Σ_k γ_k → ∞), this procedure
converges asymptotically.
For smooth functions, O(1/k) convergence is possible (in theory!).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 39 / 100
Stochastic Gradient Descent

Stochastic Gradient Descent for our MLP Problem


Stochastic Gradient Descent Algorithm to solve the MLP Loss
Minimization Problem
Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a sample j_k ∈ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

∇_w e^{j_k}(w^k): the gradient of e^{j_k} with respect to w, evaluated at the point w^k. Takes only O(d)
time.
Under suitable conditions on γ_k (Σ_k γ_k^2 < ∞, Σ_k γ_k → ∞), this procedure
converges asymptotically.
For smooth functions, O(1/k) convergence is possible (in theory!).
Typical choice: γ_k = 1/(k+1).
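
A minimal SGD sketch for the four-sample example (uniform random sampling of j_k and stopping after a fixed number of iterations are assumptions; the gradient of e^j below is that of the squared error composed with the sigmoid):

import numpy as np

X = np.array([[-3., -3.], [-2., -2.], [4., 4.], [2., -5.]])
y = np.array([1., 1., 0., 0.])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_sample(w, j):
    # gradient of e^j(w) = (y_j - sigmoid(w . x_j))^2 with respect to w
    p = sigmoid(X[j] @ w)
    return 2.0 * (p - y[j]) * p * (1.0 - p) * X[j]

rng = np.random.default_rng(0)
w = np.zeros(2)
for k in range(10000):
    j = rng.integers(len(y))            # choose a sample j_k
    gamma = 1.0 / (k + 1)               # step size gamma_k = 1/(k+1)
    w = w - gamma * grad_sample(w, j)   # SGD update
print(w)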

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 40 / 100
Stochastic Gradient Descent Mini-batch SGD

Mini-Batch Stochastic Gradient Descent for our MLP


Problem

Mini-batch SGD Algorithm to solve the MLP Loss Minimization Problem

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a block of samples B_k ⊆ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k Σ_{j∈B_k} ∇_w e^j(w^k).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 41 / 100
Stochastic Gradient Descent Mini-batch SGD

Mini-batch Stochastic Gradient Descent for our MLP


Problem

Mini-batch SGD Algorithm to solve the MLP Loss Minimization Problem

Start with w^0 ∈ R^d.
For k = 0, 1, 2, . . .
  ▶ Choose a block of samples B_k ⊆ {1, . . . , 4}.
  ▶ w^{k+1} ← w^k − γ_k Σ_{j∈B_k} ∇_w e^j(w^k).

Restrictions on γ_k are similar to those in SGD.

Asymptotic convergence!
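
The mini-batch variant changes only the update step; a sketch reusing X, y, sigmoid and grad_sample from the SGD sketch above (the batch size 2 is an arbitrary choice):

# Mini-batch SGD: sum the sample-wise gradients over a random block B_k
rng = np.random.default_rng(0)
w = np.zeros(2)
for k in range(10000):
    batch = rng.choice(len(y), size=2, replace=False)    # block of samples B_k
    gamma = 1.0 / (k + 1)
    w = w - gamma * sum(grad_sample(w, j) for j in batch)
print(w)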

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 42 / 100
Stochastic Gradient Descent Mini-batch SGD

GD/SGD: Crucial Step


Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

Crucial step in Gradient Descent Algorithm

    w^{k+1} = w^k − α_k Σ_{i=1}^{4} ∇e^i(w^k)

Crucial step in Stochastic Gradient Descent Algorithm

    w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

Crucial step in Mini-batch SGD Algorithm

    w^{k+1} ← w^k − γ_k Σ_{j∈B_k} ∇_w e^j(w^k).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 43 / 100
Stochastic Gradient Descent Mini-batch SGD

GD/SGD for MLP: Crucial Step


Recall: For the MLP, the loss minimization problem is:

    min_{w=(w^1_{11}, w^1_{12})}  E(w) = Σ_{i=1}^{4} e^i(w) = Σ_{i=1}^{4} ( y^i − 1 / (1 + exp(−(w^1_{11} x^i_1 + w^1_{12} x^i_2))) )^2

Crucial step in Gradient Descent Algorithm

    w^{k+1} = w^k − α_k Σ_{i=1}^{4} ∇e^i(w^k)

Crucial step in Stochastic Gradient Descent Algorithm

    w^{k+1} ← w^k − γ_k ∇_w e^{j_k}(w^k).

Crucial step in Mini-batch SGD Algorithm

    w^{k+1} ← w^k − γ_k Σ_{j∈B_k} ∇_w e^j(w^k).

Note: ∇e^i(w^k) and ∇_w e^{j_k}(w^k) denote sample-wise gradient computation.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 44 / 100
Stochastic Gradient Descent Mini-batch SGD

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 45 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 46 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 47 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.
Aim: To find ∇_w e = [∇_{w^1_{11}} e  ∇_{w^1_{12}} e  . . .  ∇_{w^3_{12}} e]^⊤.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 48 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.
Note: ∇_{w^3_{11}} e = (∂e/∂z^3_1) (∂z^3_1/∂w^3_{11}).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 49 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.
Note: ∇_{w^3_{11}} e = (∂e/∂z^3_1) (∂z^3_1/∂w^3_{11}) = (∂e/∂z^3_1) a^2_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 50 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.
Note: ∇_{w^3_{11}} e = (∂e/∂z^3_1) (∂z^3_1/∂w^3_{11}) = (∂e/∂a^3_1) (∂a^3_1/∂z^3_1) a^2_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 51 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.
Note: ∇_{w^3_{11}} e = (∂e/∂z^3_1) (∂z^3_1/∂w^3_{11}) = (∂e/∂a^3_1) (∂a^3_1/∂z^3_1) a^2_1 = (∂e/∂ŷ) φ′(z^3_1) a^2_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 52 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Consider an arbitrary training sample (x, y ) ∈ D.


At layer L_3: ŷ = a^3_1 = φ(z^3_1) = φ(w^3_{11} a^2_1 + w^3_{12} a^2_2).
Sample-wise error: e = (ŷ − y)^2.
Note: ∇_{w^3_{11}} e = (∂e/∂z^3_1) (∂z^3_1/∂w^3_{11}) = (∂e/∂a^3_1) (∂a^3_1/∂z^3_1) a^2_1 = (∂e/∂ŷ) φ′(z^3_1) a^2_1.
Similarly, ∇_{w^3_{12}} e = (∂e/∂ŷ) φ′(z^3_1) a^2_2.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 53 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 54 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 55 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 56 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1 = (∂e/∂a^2_1) (∂a^2_1/∂z^2_1) a^1_1 = (∂e/∂a^2_1) φ′(z^2_1) a^1_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 57 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1 = (∂e/∂a^2_1) (∂a^2_1/∂z^2_1) a^1_1 = (∂e/∂a^2_1) φ′(z^2_1) a^1_1.
Now recall that z^3_1 = w^3_{11} a^2_1 + w^3_{12} a^2_2.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 58 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1 = (∂e/∂a^2_1) (∂a^2_1/∂z^2_1) a^1_1 = (∂e/∂a^2_1) φ′(z^2_1) a^1_1.
Now recall that z^3_1 = w^3_{11} a^2_1 + w^3_{12} a^2_2.
Hence ∂e/∂a^2_1 = (∂e/∂z^3_1) (∂z^3_1/∂a^2_1) = (∂e/∂z^3_1) w^3_{11}.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 59 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1 = (∂e/∂a^2_1) (∂a^2_1/∂z^2_1) a^1_1 = (∂e/∂a^2_1) φ′(z^2_1) a^1_1.
Now recall that z^3_1 = w^3_{11} a^2_1 + w^3_{12} a^2_2.
Hence ∂e/∂a^2_1 = (∂e/∂z^3_1) (∂z^3_1/∂a^2_1) = (∂e/∂z^3_1) w^3_{11}.
Recall: We have already computed ∂e/∂z^3_1 = (∂e/∂ŷ) φ′(z^3_1).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 60 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1 = (∂e/∂a^2_1) (∂a^2_1/∂z^2_1) a^1_1 = (∂e/∂a^2_1) φ′(z^2_1) a^1_1.
Now recall that z^3_1 = w^3_{11} a^2_1 + w^3_{12} a^2_2.
Hence ∂e/∂a^2_1 = (∂e/∂z^3_1) (∂z^3_1/∂a^2_1) = (∂e/∂z^3_1) w^3_{11} = (∂e/∂ŷ) φ′(z^3_1) w^3_{11}.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 61 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_2: a^2_1 = φ(z^2_1) = φ(w^2_{11} a^1_1 + w^2_{12} a^1_2).
Hence, ∇_{w^2_{11}} e = (∂e/∂z^2_1) (∂z^2_1/∂w^2_{11}) = (∂e/∂z^2_1) a^1_1 = (∂e/∂a^2_1) (∂a^2_1/∂z^2_1) a^1_1 = (∂e/∂a^2_1) φ′(z^2_1) a^1_1.
Now recall that z^3_1 = w^3_{11} a^2_1 + w^3_{12} a^2_2.
Hence ∂e/∂a^2_1 = (∂e/∂z^3_1) (∂z^3_1/∂a^2_1) = (∂e/∂z^3_1) w^3_{11} = (∂e/∂ŷ) φ′(z^3_1) w^3_{11}.
Combining, we have ∇_{w^2_{11}} e = (∂e/∂ŷ) φ′(z^3_1) w^3_{11} φ′(z^2_1) a^1_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 62 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Thus, ∇_{w^2_{11}} e = (∂e/∂ŷ) φ′(z^3_1) w^3_{11} φ′(z^2_1) a^1_1.

Similarly, ∇_{w^2_{12}} e = (∂e/∂ŷ) φ′(z^3_1) w^3_{11} φ′(z^2_1) a^1_2.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 63 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Also, we have at layer L_2: a^2_2 = φ(z^2_2) = φ(w^2_{21} a^1_1 + w^2_{22} a^1_2).
Hence, ∇_{w^2_{21}} e = ?, ∇_{w^2_{22}} e = ?

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 64 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 65 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) (∂z^1_1/∂w^1_{11}) = (∂e/∂z^1_1) x_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 66 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 67 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.
Now we see that a^1_1 contributes to both z^2_1 and z^2_2.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 68 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.
Now we see that a^1_1 contributes to both z^2_1 and z^2_2.
Recall: z^2_1 = w^2_{11} a^1_1 + w^2_{12} a^1_2 and z^2_2 = w^2_{21} a^1_1 + w^2_{22} a^1_2.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 69 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.
Now we see that a^1_1 contributes to both z^2_1 and z^2_2.
Recall: z^2_1 = w^2_{11} a^1_1 + w^2_{12} a^1_2 and z^2_2 = w^2_{21} a^1_1 + w^2_{22} a^1_2.
Hence ∂e/∂a^1_1 = Σ_{i=1}^{2} (∂e/∂z^2_i) (∂z^2_i/∂a^1_1).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 70 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.
Now we see that a^1_1 contributes to both z^2_1 and z^2_2.
Recall: z^2_1 = w^2_{11} a^1_1 + w^2_{12} a^1_2 and z^2_2 = w^2_{21} a^1_1 + w^2_{22} a^1_2.
Hence ∂e/∂a^1_1 = Σ_{i=1}^{2} (∂e/∂z^2_i) (∂z^2_i/∂a^1_1) = Σ_{i=1}^{2} (∂e/∂z^2_i) w^2_{i1}.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 71 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.
Now we see that a^1_1 contributes to both z^2_1 and z^2_2.
Recall: z^2_1 = w^2_{11} a^1_1 + w^2_{12} a^1_2 and z^2_2 = w^2_{21} a^1_1 + w^2_{22} a^1_2.
Hence ∂e/∂a^1_1 = Σ_{i=1}^{2} (∂e/∂z^2_i) (∂z^2_i/∂a^1_1) = Σ_{i=1}^{2} (∂e/∂z^2_i) w^2_{i1}.
Recall: We have already computed ∂e/∂z^2_i, i = 1, 2.
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 72 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

We have at layer L_1: a^1_1 = φ(z^1_1) = φ(w^1_{11} x_1 + w^1_{12} x_2).
Note: ∇_{w^1_{11}} e = (∂e/∂z^1_1) x_1 = (∂e/∂a^1_1) φ′(z^1_1) x_1.
Now we see that a^1_1 contributes to both z^2_1 and z^2_2.
Recall: z^2_1 = w^2_{11} a^1_1 + w^2_{12} a^1_2 and z^2_2 = w^2_{21} a^1_1 + w^2_{22} a^1_2.
Hence ∂e/∂a^1_1 = Σ_{i=1}^{2} (∂e/∂z^2_i) (∂z^2_i/∂a^1_1) = Σ_{i=1}^{2} (∂e/∂z^2_i) w^2_{i1}.
Recall: We have already computed ∂e/∂z^2_i = (∂e/∂a^2_i) φ′(z^2_i), i = 1, 2.
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 73 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 74 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 75 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

∂e/∂a^ℓ_i = Σ_{m=1}^{N_{ℓ+1}} (∂e/∂z^{ℓ+1}_m) w^{ℓ+1}_{mi}

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 76 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

∂e/∂a^ℓ_i = Σ_{m=1}^{N_{ℓ+1}} (∂e/∂z^{ℓ+1}_m) w^{ℓ+1}_{mi}
          = Σ_{m=1}^{N_{ℓ+1}} (∂e/∂a^{ℓ+1}_m) φ′(z^{ℓ+1}_m) w^{ℓ+1}_{mi}

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 77 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

∂e/∂a^ℓ_i = Σ_{m=1}^{N_{ℓ+1}} (∂e/∂z^{ℓ+1}_m) w^{ℓ+1}_{mi}
          = Σ_{m=1}^{N_{ℓ+1}} (∂e/∂a^{ℓ+1}_m) φ′(z^{ℓ+1}_m) w^{ℓ+1}_{mi}
          = [ φ′(z^{ℓ+1}_1) w^{ℓ+1}_{1i}  . . .  φ′(z^{ℓ+1}_{N_{ℓ+1}}) w^{ℓ+1}_{N_{ℓ+1} i} ] [ ∂e/∂a^{ℓ+1}_1  . . .  ∂e/∂a^{ℓ+1}_{N_{ℓ+1}} ]^⊤

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 78 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

    [ ∂e/∂a^ℓ_1     ]   [ φ′(z^{ℓ+1}_1) w^{ℓ+1}_{11}     . . .   φ′(z^{ℓ+1}_{N_{ℓ+1}}) w^{ℓ+1}_{N_{ℓ+1} 1}   ] [ ∂e/∂a^{ℓ+1}_1         ]
    [      ⋮        ] = [          ⋮                      . . .                ⋮                              ] [        ⋮               ]
    [ ∂e/∂a^ℓ_{N_ℓ} ]   [ φ′(z^{ℓ+1}_1) w^{ℓ+1}_{1 N_ℓ}  . . .   φ′(z^{ℓ+1}_{N_{ℓ+1}}) w^{ℓ+1}_{N_{ℓ+1} N_ℓ} ] [ ∂e/∂a^{ℓ+1}_{N_{ℓ+1}} ]

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 79 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

    [ ∂e/∂a^ℓ_1     ]   [ φ′(z^{ℓ+1}_1) w^{ℓ+1}_{11}     . . .   φ′(z^{ℓ+1}_{N_{ℓ+1}}) w^{ℓ+1}_{N_{ℓ+1} 1}   ] [ ∂e/∂a^{ℓ+1}_1         ]
    [      ⋮        ] = [          ⋮                      . . .                ⋮                              ] [        ⋮               ]
    [ ∂e/∂a^ℓ_{N_ℓ} ]   [ φ′(z^{ℓ+1}_1) w^{ℓ+1}_{1 N_ℓ}  . . .   φ′(z^{ℓ+1}_{N_{ℓ+1}}) w^{ℓ+1}_{N_{ℓ+1} N_ℓ} ] [ ∂e/∂a^{ℓ+1}_{N_{ℓ+1}} ]

         [ ∂e/∂a^ℓ_1     ]   [ w^{ℓ+1}_{11}      . . .   w^{ℓ+1}_{N_{ℓ+1} 1}   ] [ φ′(z^{ℓ+1}_1)                          ] [ ∂e/∂a^{ℓ+1}_1         ]
    =⇒   [      ⋮        ] = [      ⋮            . . .         ⋮               ] [               ⋱                         ] [        ⋮               ]
         [ ∂e/∂a^ℓ_{N_ℓ} ]   [ w^{ℓ+1}_{1 N_ℓ}   . . .   w^{ℓ+1}_{N_{ℓ+1} N_ℓ} ] [                  φ′(z^{ℓ+1}_{N_{ℓ+1}}) ] [ ∂e/∂a^{ℓ+1}_{N_{ℓ+1}} ]

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 80 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation


Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

    [ ∂e/∂a^ℓ_1     ]   [ w^{ℓ+1}_{11}      . . .   w^{ℓ+1}_{N_{ℓ+1} 1}   ] [ φ′(z^{ℓ+1}_1)                          ] [ ∂e/∂a^{ℓ+1}_1         ]
    [      ⋮        ] = [      ⋮            . . .         ⋮               ] [               ⋱                         ] [        ⋮               ]
    [ ∂e/∂a^ℓ_{N_ℓ} ]   [ w^{ℓ+1}_{1 N_ℓ}   . . .   w^{ℓ+1}_{N_{ℓ+1} N_ℓ} ] [                  φ′(z^{ℓ+1}_{N_{ℓ+1}}) ] [ ∂e/∂a^{ℓ+1}_{N_{ℓ+1}} ]

i.e., writing δ^ℓ := [ ∂e/∂a^ℓ_1  . . .  ∂e/∂a^ℓ_{N_ℓ} ]^⊤,

    δ^ℓ = (W^{ℓ+1})^⊤ Diag(φ^{ℓ+1}′) δ^{ℓ+1}

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 81 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation


Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

With δ^ℓ := [ ∂e/∂a^ℓ_1  . . .  ∂e/∂a^ℓ_{N_ℓ} ]^⊤, the recursion above in matrix form is

    δ^ℓ = (W^{ℓ+1})^⊤ Diag(φ^{ℓ+1}′) δ^{ℓ+1} = V^{ℓ+1} δ^{ℓ+1}

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 82 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation


Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

With δ^ℓ := [ ∂e/∂a^ℓ_1  . . .  ∂e/∂a^ℓ_{N_ℓ} ]^⊤, the recursion above in matrix form is

    δ^ℓ = (W^{ℓ+1})^⊤ Diag(φ^{ℓ+1}′) δ^{ℓ+1} = V^{ℓ+1} δ^{ℓ+1} = V^{ℓ+1} V^{ℓ+2} δ^{ℓ+2}

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 83 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation


Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j

∂e/∂z^ℓ_i = (∂e/∂a^ℓ_i) φ′(z^ℓ_i)

With δ^ℓ := [ ∂e/∂a^ℓ_1  . . .  ∂e/∂a^ℓ_{N_ℓ} ]^⊤, the recursion above in matrix form is

    δ^ℓ = (W^{ℓ+1})^⊤ Diag(φ^{ℓ+1}′) δ^{ℓ+1} = V^{ℓ+1} δ^{ℓ+1} = V^{ℓ+1} V^{ℓ+2} δ^{ℓ+2} = V^{ℓ+1} V^{ℓ+2} . . . V^L δ^L

Assume: The last layer in the network is L.


P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 84 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j = (∂e/∂a^ℓ_i) φ′(z^ℓ_i) a^{ℓ−1}_j

         [ ∂e/∂w^ℓ_{1j}     ]   [ (∂e/∂a^ℓ_1) φ′(z^ℓ_1) a^{ℓ−1}_j         ]
    =⇒   [       ⋮          ] = [                 ⋮                        ]
         [ ∂e/∂w^ℓ_{N_ℓ j}  ]   [ (∂e/∂a^ℓ_{N_ℓ}) φ′(z^ℓ_{N_ℓ}) a^{ℓ−1}_j ]

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 85 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j = (∂e/∂a^ℓ_i) φ′(z^ℓ_i) a^{ℓ−1}_j

         [ ∂e/∂w^ℓ_{1j}     ]   [ (∂e/∂a^ℓ_1) φ′(z^ℓ_1) a^{ℓ−1}_j         ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮          ] = [                 ⋮                        ] = [            ⋱               ] [      ⋮        ] a^{ℓ−1}_j
         [ ∂e/∂w^ℓ_{N_ℓ j}  ]   [ (∂e/∂a^ℓ_{N_ℓ}) φ′(z^ℓ_{N_ℓ}) a^{ℓ−1}_j ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 86 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j = (∂e/∂a^ℓ_i) φ′(z^ℓ_i) a^{ℓ−1}_j

         [ ∂e/∂w^ℓ_{1j}     ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮          ] = [            ⋱               ] [      ⋮        ] a^{ℓ−1}_j
         [ ∂e/∂w^ℓ_{N_ℓ j}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

         [ ∂e/∂w^ℓ_{11}      . . .  ∂e/∂w^ℓ_{1 N_{ℓ−1}}    ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮           . . .          ⋮              ] = [            ⋱               ] [      ⋮        ] [ a^{ℓ−1}_1  . . .  a^{ℓ−1}_{N_{ℓ−1}} ]
         [ ∂e/∂w^ℓ_{N_ℓ 1}   . . .  ∂e/∂w^ℓ_{N_ℓ N_{ℓ−1}}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 87 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j = (∂e/∂a^ℓ_i) φ′(z^ℓ_i) a^{ℓ−1}_j

         [ ∂e/∂w^ℓ_{1j}     ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮          ] = [            ⋱               ] [      ⋮        ] a^{ℓ−1}_j
         [ ∂e/∂w^ℓ_{N_ℓ j}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

         [ ∂e/∂w^ℓ_{11}      . . .  ∂e/∂w^ℓ_{1 N_{ℓ−1}}    ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮           . . .          ⋮              ] = [            ⋱               ] [      ⋮        ] [ a^{ℓ−1}_1  . . .  a^{ℓ−1}_{N_{ℓ−1}} ]
         [ ∂e/∂w^ℓ_{N_ℓ 1}   . . .  ∂e/∂w^ℓ_{N_ℓ N_{ℓ−1}}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

    =⇒   ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 88 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j = (∂e/∂a^ℓ_i) φ′(z^ℓ_i) a^{ℓ−1}_j

         [ ∂e/∂w^ℓ_{1j}     ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮          ] = [            ⋱               ] [      ⋮        ] a^{ℓ−1}_j
         [ ∂e/∂w^ℓ_{N_ℓ j}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

         [ ∂e/∂w^ℓ_{11}      . . .  ∂e/∂w^ℓ_{1 N_{ℓ−1}}    ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮           . . .          ⋮              ] = [            ⋱               ] [      ⋮        ] [ a^{ℓ−1}_1  . . .  a^{ℓ−1}_{N_{ℓ−1}} ]
         [ ∂e/∂w^ℓ_{N_ℓ 1}   . . .  ∂e/∂w^ℓ_{N_ℓ N_{ℓ−1}}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

    =⇒   ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 89 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation


Generalized setting:

∂e/∂w^ℓ_{ij} = (∂e/∂z^ℓ_i) a^{ℓ−1}_j = (∂e/∂a^ℓ_i) φ′(z^ℓ_i) a^{ℓ−1}_j

         [ ∂e/∂w^ℓ_{1j}     ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮          ] = [            ⋱               ] [      ⋮        ] a^{ℓ−1}_j
         [ ∂e/∂w^ℓ_{N_ℓ j}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

         [ ∂e/∂w^ℓ_{11}      . . .  ∂e/∂w^ℓ_{1 N_{ℓ−1}}    ]   [ φ′(z^ℓ_1)                  ] [ ∂e/∂a^ℓ_1     ]
    =⇒   [       ⋮           . . .          ⋮              ] = [            ⋱               ] [      ⋮        ] [ a^{ℓ−1}_1  . . .  a^{ℓ−1}_{N_{ℓ−1}} ]
         [ ∂e/∂w^ℓ_{N_ℓ 1}   . . .  ∂e/∂w^ℓ_{N_ℓ N_{ℓ−1}}  ]   [               φ′(z^ℓ_{N_ℓ}) ] [ ∂e/∂a^ℓ_{N_ℓ} ]

    =⇒   ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

Homework: Assume each neuron has a bias term, and compute the
gradients of the loss with respect to the bias terms.
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 90 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

Recall: W^ℓ represents the matrix of weights connecting layer ℓ − 1 to
layer ℓ.
Recall: δ^L represents the error gradients with respect to the
activations at the last layer.
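
A compact sketch of this backward recursion for a small MLP (the 2-2-1 architecture, the sigmoid activation and the squared error are assumptions matching the running example; names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_sample(x, y, Ws):
    """Return the gradients grad_{W^l} e for one sample, using
    delta^l = (W^{l+1})^T Diag(phi'(z^{l+1})) delta^{l+1} and
    grad_{W^l} e = Diag(phi'(z^l)) delta^l (a^{l-1})^T."""
    # Forward pass: store pre-activations z^l and activations a^l
    a, zs, acts = np.asarray(x, float), [], [np.asarray(x, float)]
    for W in Ws:
        z = W @ a
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    y_hat = acts[-1]
    # delta^L: gradient of e = (y_hat - y)^2 with respect to the last-layer activation
    delta = 2.0 * (y_hat - y)
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        phi_prime = sigmoid(zs[l]) * (1.0 - sigmoid(zs[l]))   # phi'(z^l)
        grads[l] = np.outer(phi_prime * delta, acts[l])       # Diag(phi'^l) delta^l (a^{l-1})^T
        delta = Ws[l].T @ (phi_prime * delta)                 # delta^{l-1} = (W^l)^T Diag(phi'^l) delta^l
    return grads

# Tiny 2-2-1 network on one sample (the weights are arbitrary illustrations)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 2)), rng.normal(size=(2, 2)), rng.normal(size=(1, 2))]
grads = backprop_sample(x=[1.0, -2.0], y=np.array([1.0]), Ws=Ws)
print([g.shape for g in grads])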

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 91 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

Recall: W ` represents the matrix of weights connecting layer ` − 1 to


layer `.
Recall: δ L represents the error gradients with respect to the
activations at the last layer.

Hence, the error gradients with respect to weights W ` depend on the


error gradients δ L at the last layer.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 92 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

Recall: W ` represents the matrix of weights connecting layer ` − 1 to


layer `.
Recall: δ L represents the error gradients with respect to the
activations at the last layer.

Hence, the error gradients with respect to weights W ` depend on the


error gradients δ L at the last layer.
Or the error gradients at the last layer flow back into the previous
layers.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 93 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation


Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

Recall: W ` represents the matrix of weights connecting layer ` − 1 to


layer `.
Recall: δ L represents the error gradients with respect to the
activations at the last layer.

Hence, the error gradients with respect to weights W ` depend on the


error gradients δ L at the last layer.
Or the error gradients at the last layer flow back into the previous
layers.
This error gradient flow back is called Backpropagation!

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 94 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

If V^{ℓ+1} . . . V^L δ^L has large values (in magnitude), then the ∇_{W^ℓ} e
gradients can also become large (in magnitude).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 95 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

If V^{ℓ+1} . . . V^L δ^L has large values (in magnitude), then the ∇_{W^ℓ} e
gradients can also become large (in magnitude).
Similarly, if V^{ℓ+1} . . . V^L δ^L has small values (in magnitude), then the
∇_{W^ℓ} e gradients can also approach zero (in magnitude).

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 96 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

If V^{ℓ+1} . . . V^L δ^L has large values (in magnitude), then the ∇_{W^ℓ} e
gradients can also become large (in magnitude). This problem is
called the exploding gradient problem.
Similarly, if V^{ℓ+1} . . . V^L δ^L has small values (in magnitude), then the
∇_{W^ℓ} e gradients can also approach zero (in magnitude). This problem
is called the vanishing gradient problem.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 97 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

    =⇒ ‖∇_{W^ℓ} e‖_2 ≤ ‖Diag(φ^ℓ′)‖_2 ‖V^{ℓ+1} . . . V^L δ^L‖_2 ‖(a^{ℓ−1})^⊤‖_2

If V^{ℓ+1} . . . V^L δ^L has large values (in magnitude), then the ∇_{W^ℓ} e
gradients can also become large (in magnitude). This problem is
called the exploding gradient problem.
Similarly, if V^{ℓ+1} . . . V^L δ^L has small values (in magnitude), then the
∇_{W^ℓ} e gradients can also approach zero (in magnitude). This problem
is called the vanishing gradient problem.
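
A small numerical illustration of these two regimes (the depth, width, and weight scales below are arbitrary assumptions chosen only to show the effect):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_norm(weight_scale, depth=30, width=10):
    # Track || V^{l+1} ... V^L delta^L ||_2 as the product grows backwards through the layers
    delta = rng.normal(size=width)                 # stand-in for delta^L
    for _ in range(depth):
        W = weight_scale * rng.normal(size=(width, width))
        z = rng.normal(size=width)                 # stand-in pre-activations
        phi_prime = sigmoid(z) * (1 - sigmoid(z))  # sigmoid' is at most 0.25
        delta = W.T @ (phi_prime * delta)          # one more V = W^T Diag(phi') factor
    return np.linalg.norm(delta)

print(backprop_norm(weight_scale=0.5))    # shrinks towards 0: vanishing gradients
print(backprop_norm(weight_scale=20.0))   # blows up: exploding gradients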

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 98 / 100
Sample-wise Gradient Computation

GD/SGD for MLP: Sample-wise Gradient Computation

Generalized setting:

    ∇_{W^ℓ} e = Diag(φ^ℓ′) δ^ℓ (a^{ℓ−1})^⊤ = Diag(φ^ℓ′) V^{ℓ+1} . . . V^L δ^L (a^{ℓ−1})^⊤

Recall: δ^L = [ ∂e/∂a^L_1  . . .  ∂e/∂a^L_{N_L} ]^⊤

∂e/∂a^L_i =: ∂e/∂ŷ_i denotes the gradient term with respect to the i-th neuron in
the last (L-th) layer.
So far we have considered the squared error function.
We will see more examples of constructing appropriate error functions
and the corresponding gradient computation.

P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 99 / 100
Sample-wise Gradient Computation

References
Gradient Descent introduced in:
Cauchy, A.: Méthode générale pour la résolution des systèmes
d'équations simultanées. Comptes rendus des séances de l'Académie
des sciences de Paris 25, 536–538, 1847.
Idea of SGD introduced in:
H. Robbins, and S. Monro: A stochastic approximation method.
Annals of Mathematical Statistics. Vol. 22(3), pp. 400-407, 1951.
Backpropagation introduced in:
P. J. Werbos: Beyond regression: new tools for prediction and analysis
in the behavioral sciences. PhD Thesis. Harvard University, 1974.
D. E. Rumelhart, G. E. Hinton, R. J. Williams: Learning internal
representations by error propagation. Chapter in Parallel Distributed
Processing: Explorations in the Microstructure of Cognition. Vol. 1:
Foundations, MIT Press, 1986.

Acknowledgments:
CalcPlot3D website for plotting.
P. Balamurugan Deep Learning - Theory and Practice Sep 8 & Sep 11, 2020. 100 / 100