
Optimization and Backpropagation

I2DL: Prof. Niessner


Lecture 3 Recap



Neural Network
• Linear score function f = Wx

(Figure: the linear score function illustrated on CIFAR-10 and on ImageNet. Credit: Li/Karpathy/Johnson)
Neural Network
• Linear score function f = Wx
• A neural network is a nesting of ‘functions’:
  – 2 layers: f = W₂ max(0, W₁x)
  – 3 layers: f = W₃ max(0, W₂ max(0, W₁x))
  – 4 layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
  – 5 layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
  – … up to hundreds of layers


Neural Network

(Figure: a small network drawn as a graph with an input layer, one hidden layer, and an output layer. Credit: Li/Karpathy/Johnson)
Neural Network

(Figure: a deeper network with an input layer, hidden layers 1-3, and an output layer. The number of neurons per layer is the width; the number of layers is the depth.)
Activation Functions
• Sigmoid: σ(x) = 1/(1 + e⁻ˣ)
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Parametric ReLU: max(αx, x)
• Maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)
• ELU: f(x) = x if x > 0, α(eˣ − 1) if x ≤ 0
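
As a quick reference, here is a minimal NumPy sketch of these activations (not part of the original slides; the function names and default α values are our own choices for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def parametric_relu(x, alpha):
    return np.maximum(alpha * x, x)

def maxout(x, w1, b1, w2, b2):
    # element-wise max of two affine transformations
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

(tanh is available directly as np.tanh.)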
Loss Functions
• Measure the goodness of the predictions (or equivalently, the network's performance)

• Regression losses
  – L1 loss: L(y, ŷ; θ) = (1/n) Σᵢ ‖yᵢ − ŷᵢ‖₁
  – MSE loss: L(y, ŷ; θ) = (1/n) Σᵢ ‖yᵢ − ŷᵢ‖₂²

• Classification loss (for multi-class classification)
  – Cross-entropy loss: E(y, ŷ; θ) = − Σᵢ Σₖ yᵢₖ · log ŷᵢₖ (sum over the n samples and the K classes)
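
A small NumPy sketch of these losses (our own helper names; y and y_hat are assumed to be arrays of targets and predictions):

import numpy as np

def l1_loss(y, y_hat):
    # (1/n) sum_i ||y_i - y_hat_i||_1
    return np.mean(np.sum(np.abs(y - y_hat), axis=-1))

def mse_loss(y, y_hat):
    # (1/n) sum_i ||y_i - y_hat_i||_2^2
    return np.mean(np.sum((y - y_hat) ** 2, axis=-1))

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # y: one-hot targets (n, K), y_hat: predicted class probabilities (n, K)
    return -np.sum(y * np.log(y_hat + eps))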


Computational Graphs
• A neural network is a computational graph
  – It has compute nodes
  – It has edges that connect nodes
  – It is directional
  – It is organized in ‘layers’


Backprop



The Importance of Gradients
• Our optimization schemes are based on computing gradients ∇_θ L(θ)
• One can compute gradients analytically, but what if our function is too complex?
• Break the gradient computation down ⟶ Backpropagation

Done by many people before, but often credited to Rumelhart 1986
Backprop: Forward Pass
• f(x, y, z) = (x + y) · z
• Initialization: x = 1, y = −3, z = 4
• Forward pass: the sum node computes d = x + y = −2, the mult node computes f = d · z = −8


Backprop: Backward Pass
f(x, y, z) = (x + y) · z with x = 1, y = −3, z = 4
Forward pass: d = x + y = −2, f = d · z = −8

Local derivatives:
d = x + y ⇒ ∂d/∂x = 1, ∂d/∂y = 1
f = d · z ⇒ ∂f/∂d = z, ∂f/∂z = d

What is ∂f/∂x, ∂f/∂y, ∂f/∂z?
• Start at the output: ∂f/∂f = 1
• ∂f/∂z = d = −2
• ∂f/∂d = z = 4
• Chain rule: ∂f/∂y = ∂f/∂d · ∂d/∂y = 4 · 1 = 4
• Chain rule: ∂f/∂x = ∂f/∂d · ∂d/∂x = 4 · 1 = 4
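
A minimal Python sketch of this forward/backward pass (our own code, just to mirror the numbers above):

# forward pass
x, y, z = 1.0, -3.0, 4.0
d = x + y          # sum node:  d = -2
f = d * z          # mult node: f = -8

# backward pass (chain rule, starting from df/df = 1)
df_df = 1.0
df_dz = d * df_df            # -2
df_dd = z * df_df            #  4
df_dx = df_dd * 1.0          #  4
df_dy = df_dd * 1.0          #  4
print(f, df_dx, df_dy, df_dz)  # -8.0 4.0 4.0 -2.0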
Compute Graphs -> Neural Networks
• x_k: input variables
• w_{l,m,n}: network weights (note the 3 indices)
  – l: which layer
  – m: which neuron in the layer
  – n: which weight of the neuron
• ŷ_i: computed outputs (i indexes the output dimensions, i = 1, …, n_out)
• y_i: ground-truth targets
• L: loss function


Compute Graphs -> Neural Networks

(Figure: a one-layer compute graph. The inputs x₀, x₁ are multiplied by the weights w₀, w₁ (the unknowns!), summed to give the output ŷ₀, and compared to the target y₀, e.g. a class label or regression target, using an L2 loss (ŷ₀ − y₀)² as the loss/cost.)


Compute Graphs -> Neural Networks

(Figure: the same one-layer compute graph, now with a ReLU activation max(0, x) between the weighted sum and the output ŷ₀ (not arguing this is the right choice here), followed by the L2 loss against the target y₀.)

We want to compute gradients w.r.t. all weights W.
Compute Graphs -> Neural Networks

(Figure: the same construction with three outputs ŷ₀, ŷ₁, ŷ₂ and targets y₀, y₁, y₂. Each output i has its own weights w_{i,0}, w_{i,1} applied to the inputs x₀, x₁ and its own loss/cost node.)

We want to compute gradients w.r.t. all weights W.
Compute Graphs -> Neural Networks
Goal: we want to compute gradients of the loss function L w.r.t. all weights W

L = Σᵢ Lᵢ : sum over the loss per sample, e.g. L2 loss ⟶ simply sum up squares: Lᵢ = (ŷᵢ − yᵢ)²

ŷᵢ = A(bᵢ + Σₖ xₖ w_{i,k}), with A the activation function and bᵢ the bias

⟶ use the chain rule to compute the partials:
∂L/∂w_{i,k} = ∂L/∂ŷᵢ · ∂ŷᵢ/∂w_{i,k}

We want to compute gradients w.r.t. all weights W AND all biases b.
NNs as Computational Graphs
• We can express any kind of function in a computational graph, e.g.
  f(w, x) = 1 / (1 + e^(−(b + w₀x₀ + w₁x₁)))
  (this is the sigmoid function σ(x) = 1/(1 + e⁻ˣ) applied to b + w₀x₀ + w₁x₁)

(Figure: the corresponding compute graph built from the nodes ∗w₀, ∗w₁, +, ∗(−1), exp(∙), +1 and 1/x.)
NNs as Computational Graphs
• f(w, x) = 1 / (1 + e^(−(b + w₀x₀ + w₁x₁)))
• Forward pass with w₀ = −3, x₀ = −2, w₁ = 2, x₁ = −1, b = −3:
  w₀x₀ = 6, w₁x₁ = −2, sum: 6 + (−2) + (−3) = 1
  ∗(−1): −1, exp(∙): 0.37, +1: 1.37, 1/x: 0.73 = f


NNs as Computational Graphs
• f(w, x) = 1 / (1 + e^(−(b + w₀x₀ + w₁x₁)))

Local gradients of the node types:
g(x) = 1/x ⇒ ∂g/∂x = −1/x²
g_α(x) = α + x ⇒ ∂g/∂x = 1
g(x) = eˣ ⇒ ∂g/∂x = eˣ
g_α(x) = αx ⇒ ∂g/∂x = α

Backward pass (upstream gradient · local gradient), starting from the output with gradient 1:
• 1/x node: 1 · (−1/1.37²) = −0.53
• +1 node: −0.53 · 1 = −0.53
• exp(∙) node: −0.53 · e⁻¹ = −0.2
• ∗(−1) node: −0.2 · (−1) = 0.2
• + node: passes 0.2 on to b and to both product branches
• ∗w₀, ∗w₁ nodes:
  ∂f/∂w₀ = 0.2 · x₀ = −0.4, ∂f/∂x₀ = 0.2 · w₀ = −0.6,
  ∂f/∂w₁ = 0.2 · x₁ = −0.2, ∂f/∂x₁ = 0.2 · w₁ = 0.4, ∂f/∂b = 0.2
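
A short Python check of these numbers (our own sketch; it re-runs the forward pass and the hand-derived gradients, using the fact that the four inner nodes together form a sigmoid with derivative σ'(s) = σ(s)(1 − σ(s))):

import math

w0, x0, w1, x1, b = -3.0, -2.0, 2.0, -1.0, -3.0

# forward pass through the graph
s = w0 * x0 + w1 * x1 + b        # 1.0
f = 1.0 / (1.0 + math.exp(-s))   # ~0.73

# backward pass: sigmoid'(s) = f * (1 - f) collapses the four inner nodes
ds = f * (1.0 - f)               # ~0.2
grads = {
    "w0": ds * x0,   # ~ -0.4
    "x0": ds * w0,   # ~ -0.6
    "w1": ds * x1,   # ~ -0.2
    "x1": ds * w1,   # ~  0.4
    "b":  ds * 1.0,  # ~  0.2
}
print(f, grads)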


Gradient Descent

x* = arg minₓ f(x)

(Figure: gradient descent iterating on a loss landscape from the initialization to the optimum.)


Gradient Descent
• From derivative to gradient: df(x)/dx ⟶ ∇ₓ f(x)
  The gradient points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x − α ∇ₓ f(x), where α is the learning rate
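
A minimal gradient descent loop in Python on a toy quadratic (our own example; the function and the learning rate are arbitrary choices):

# minimize f(x) = (x - 3)^2 with gradient descent
def grad_f(x):
    return 2.0 * (x - 3.0)   # analytic gradient

x = 0.0          # initialization
alpha = 0.1      # learning rate
for _ in range(100):
    x = x - alpha * grad_f(x)   # x' = x - alpha * grad f(x)
print(x)  # close to the optimum x* = 3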


Gradient Descent for Neural Networks

(Figure: a fully-connected network with an input layer, hidden layers 1-3 of m neurons each, an output layer, and l layers in total.)
Gradient Descent for Neural Networks
For a given training pair {x, y}, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all weights:

∇_W f_{x,y}(W) = (∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n})

Gradient step:
W' = W − α ∇_W f_{x,y}(W)


NNs can Become Quite Complex…
• These graphs can be huge!

[Szegedy et al., CVPR'15] Going Deeper with Convolutions
The Flow of the Gradients
• Many, many of these nodes (neurons) form a neural network
• Each one has its own work to do: a forward and a backward pass


The Flow of the Gradients

(Figure: a single compute node f with input activations x, y and output z = f(x, y), where f is the activation function. In the forward pass it computes z; in the backward pass it receives the upstream gradient ∂L/∂z and passes back ∂L/∂x = (∂L/∂z)(∂z/∂x) and ∂L/∂y = (∂L/∂z)(∂z/∂y) using its local gradients ∂z/∂x and ∂z/∂y.)
Gradient Descent for Neural Networks
Loss function: Lᵢ = (ŷᵢ − yᵢ)²

ŷᵢ = A(b_{1,i} + Σⱼ hⱼ w_{1,i,j})
hⱼ = A(b_{0,j} + Σₖ xₖ w_{0,j,k})

A is just the simple ReLU: A(x) = max(0, x)
Gradient Descent for Neural Networks
Backpropagation: just go through the network layer by layer.

Last layer:
∂L/∂w_{1,i,j} = ∂L/∂ŷᵢ · ∂ŷᵢ/∂w_{1,i,j}
with ∂Lᵢ/∂ŷᵢ = 2(ŷᵢ − yᵢ) and ∂ŷᵢ/∂w_{1,i,j} = hⱼ if the pre-activation is > 0, else 0

First layer:
∂L/∂w_{0,j,k} = ∂L/∂ŷᵢ · ∂ŷᵢ/∂hⱼ · ∂hⱼ/∂w_{0,j,k}

using hⱼ = A(b_{0,j} + Σₖ xₖ w_{0,j,k}), ŷᵢ = A(b_{1,i} + Σⱼ hⱼ w_{1,i,j}), Lᵢ = (ŷᵢ − yᵢ)²
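
A NumPy sketch of one forward and backward pass through this two-layer ReLU network for a single sample (our own code; the layer sizes are illustrative assumptions, and the biases are handled like the weights):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # one input sample
y = rng.normal(size=2)                           # its target
W0, b0 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer (4 neurons)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(2)    # output layer (2 neurons)

# forward pass
s0 = W0 @ x + b0
h = np.maximum(0.0, s0)              # h_j = A(b_0j + sum_k x_k w_0jk)
s1 = W1 @ h + b1
y_hat = np.maximum(0.0, s1)          # y_hat_i = A(b_1i + sum_j h_j w_1ij)
L = np.sum((y_hat - y) ** 2)         # L = sum_i (y_hat_i - y_i)^2

# backward pass, layer by layer
dL_dy_hat = 2.0 * (y_hat - y)
dL_ds1 = dL_dy_hat * (s1 > 0)        # ReLU gate of the output layer
dL_dW1 = np.outer(dL_ds1, h)         # dL/dw_1ij = dL/ds1_i * h_j
dL_db1 = dL_ds1

dL_dh = W1.T @ dL_ds1
dL_ds0 = dL_dh * (s0 > 0)            # ReLU gate of the hidden layer
dL_dW0 = np.outer(dL_ds0, x)         # dL/dw_0jk = dL/ds0_j * x_k
dL_db0 = dL_ds0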
Gradient Descent for Neural Networks
How many unknown weights does this network have?
• Output layer: 2 · 4 + 2
• Hidden layer: 4 · 3 + 4
i.e., #neurons · #input channels + #biases per layer

hⱼ = A(b_{0,j} + Σₖ xₖ w_{0,j,k})
ŷᵢ = A(b_{1,i} + Σⱼ hⱼ w_{1,i,j})
Lᵢ = (ŷᵢ − yᵢ)²

Note that some activations also have weights.
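
The same count in a few lines of Python (a sketch for the 3-4-2 network of the slide's example):

layers = [3, 4, 2]   # input dim, hidden neurons, output neurons
params = sum(n_out * n_in + n_out            # weights + biases per layer
             for n_in, n_out in zip(layers[:-1], layers[1:]))
print(params)        # (4*3 + 4) + (2*4 + 2) = 26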


Derivatives of Cross Entropy Loss
Binary cross-entropy loss:
L = − Σ_{i=1}^{n_out} (yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ))
with ŷᵢ = 1/(1 + e^(−sᵢ)) and output scores sᵢ = Σⱼ hⱼ wⱼᵢ

Gradients of the weights of the last layer:
∂L/∂wⱼᵢ = ∂L/∂ŷᵢ · ∂ŷᵢ/∂sᵢ · ∂sᵢ/∂wⱼᵢ

∂L/∂ŷᵢ = −yᵢ/ŷᵢ + (1 − yᵢ)/(1 − ŷᵢ) = (ŷᵢ − yᵢ) / (ŷᵢ(1 − ŷᵢ))
∂ŷᵢ/∂sᵢ = ŷᵢ(1 − ŷᵢ)
∂sᵢ/∂wⱼᵢ = hⱼ

⟹ ∂L/∂wⱼᵢ = (ŷᵢ − yᵢ) hⱼ,  ∂L/∂sᵢ = ŷᵢ − yᵢ
Derivatives of Cross Entropy Loss
Gradients of the weights of the first layer:

∂L/∂hⱼ = Σ_{i=1}^{n_out} ∂L/∂ŷᵢ · ∂ŷᵢ/∂sᵢ · ∂sᵢ/∂hⱼ = Σ_{i=1}^{n_out} ∂L/∂ŷᵢ · ŷᵢ(1 − ŷᵢ) wⱼᵢ = Σ_{i=1}^{n_out} (ŷᵢ − yᵢ) wⱼᵢ

∂L/∂sⱼ¹ = Σ_{i=1}^{n_out} ∂L/∂sᵢ · ∂sᵢ/∂hⱼ · ∂hⱼ/∂sⱼ¹ = Σ_{i=1}^{n_out} (ŷᵢ − yᵢ) wⱼᵢ · hⱼ(1 − hⱼ)

∂L/∂wₖⱼ¹ = ∂L/∂sⱼ¹ · ∂sⱼ¹/∂wₖⱼ¹ = Σ_{i=1}^{n_out} (ŷᵢ − yᵢ) wⱼᵢ · hⱼ(1 − hⱼ) · xₖ

(Here sⱼ¹ is the score of hidden neuron j; the factor hⱼ(1 − hⱼ) assumes a sigmoid activation for the hidden layer.)
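
A quick numerical check of the last-layer formula ∂L/∂wⱼᵢ = (ŷᵢ − yᵢ) hⱼ using finite differences (our own sketch with arbitrary small sizes):

import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=4)            # hidden activations
W = rng.normal(size=(4, 2))       # weights w_ji (4 hidden -> 2 outputs)
y = np.array([1.0, 0.0])          # binary targets

def loss(W):
    s = h @ W                                 # output scores s_i
    y_hat = 1.0 / (1.0 + np.exp(-s))          # sigmoid
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# analytic gradient from the slide: (y_hat_i - y_i) * h_j
s = h @ W
y_hat = 1.0 / (1.0 + np.exp(-s))
grad_analytic = np.outer(h, y_hat - y)

# finite-difference check of one entry
eps = 1e-6
W_eps = W.copy(); W_eps[0, 0] += eps
print(grad_analytic[0, 0], (loss(W_eps) - loss(W)) / eps)   # nearly equal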


Back to Compute Graphs & NNs
• Inputs x and targets y
• A two-layer NN for regression with ReLU activation
• Function we want to optimize:
  Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂²

(Figure: the compute graph x → ∗w₁ → z → max(0,∙) → σ → ∗w₂ → ŷ → L, where the loss L compares ŷ to the target y.)


Gradient Descent for Neural Networks
Initialize x = 1, y = 0, w₁ = 1/3, w₂ = 2
Forward pass: z = x · w₁ = 1/3, σ = max(0, z) = 1/3, ŷ = w₂ · σ = 2/3, L = 4/9

L(y, ŷ; θ) = (1/n) Σᵢ ‖ŷᵢ − yᵢ‖²; in our case n, d = 1:
L = (ŷ − y)² ⇒ ∂L/∂ŷ = 2(ŷ − y)
ŷ = w₂ · σ ⇒ ∂ŷ/∂w₂ = σ

Backpropagation:
∂L/∂w₂ = ∂L/∂ŷ · ∂ŷ/∂w₂ = 2 · (2/3) · (1/3) = 4/9


Gradient Descent for Neural Networks
Initialize x = 1, y = 0, w₁ = 1/3, w₂ = 2 (forward pass as before)

In our case n, d = 1:
L = (ŷ − y)² ⇒ ∂L/∂ŷ = 2(ŷ − y)
ŷ = w₂ · σ ⇒ ∂ŷ/∂σ = w₂
σ = max(0, z) ⇒ ∂σ/∂z = 1 if z > 0, else 0
z = x · w₁ ⇒ ∂z/∂w₁ = x

Backpropagation:
∂L/∂w₁ = ∂L/∂ŷ · ∂ŷ/∂σ · ∂σ/∂z · ∂z/∂w₁ = 2 · (2/3) · 2 · 1 · 1 = 8/3


Gradient Descent for Neural Networks
• Function we want to optimize:
  f(x, w) = Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂²
• We computed the gradients w.r.t. the weights w₁ and w₂
• Now: update the weights
  w' = w − α · ∇_w f = (w₁, w₂)ᵀ − α · (∇_{w₁} f, ∇_{w₂} f)ᵀ = (1/3, 2)ᵀ − α · (8/3, 4/9)ᵀ

But: how to choose a good learning rate α?
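
A Python sketch of this worked example, reproducing the two gradients and one update step (our own code; α = 0.1 is an arbitrary choice):

x, y = 1.0, 0.0
w1, w2 = 1.0 / 3.0, 2.0
alpha = 0.1

# forward pass
z = x * w1                 # 1/3
sigma = max(0.0, z)        # 1/3
y_hat = w2 * sigma         # 2/3
L = (y_hat - y) ** 2       # 4/9

# backward pass
dL_dy_hat = 2.0 * (y_hat - y)                            # 4/3
dL_dw2 = dL_dy_hat * sigma                               # 4/9
dL_dw1 = dL_dy_hat * w2 * (1.0 if z > 0 else 0.0) * x    # 8/3

# gradient step
w1, w2 = w1 - alpha * dL_dw1, w2 - alpha * dL_dw2
print(dL_dw1, dL_dw2, w1, w2)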
Gradient Descent
• How to pick a good learning rate?
• How to compute the gradient for a single training pair?
• How to compute the gradient for a large training set?
• How to speed things up?

More to see in the next lectures…


Regularization



Recap: Basic Recipe for ML
• Split your data: 60% train / 20% validation / 20% test
• Use the validation set to find your hyperparameters
• Other splits are also possible (e.g., 80%/10%/10%)
Over- and Underfitting

(Figure: three fits to the same data, from left to right: underfitted, appropriate, overfitted. Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017)


Training a Neural Network
• Training/validation curve

(Figure: training and validation error over training time. Early on the training error is too high; later the generalization gap between validation and training error becomes too big. Credits: Deep Learning, Goodfellow et al.)

How can we prevent our model from overfitting? ⟶ Regularization


Regularization
• Loss function: L(y, ŷ; θ) = Σ_{i=1}^n ‖ŷᵢ − yᵢ‖² + λR(θ)
  ⟶ add a regularization term to the loss function
• Regularization techniques (more details later)
  – L2 regularization
  – L1 regularization
  – Max norm regularization
  – Dropout
  – Early stopping
  – ...




Regularization: Example
• Input: 3 features x = [1, 2, 1]
• Two linear classifiers that give the same result (xᵀθ = 1.5 for both):
  – θ₁ = [0, 0.75, 0] ⟶ ignores 2 features
  – θ₂ = [0.25, 0.5, 0.25] ⟶ takes information from all features


Regularization: Example
• Loss: L(y, ŷ; θ) = Σ_{i=1}^n (xᵢᵀθ − yᵢ)² + λR(θ)
• L2 regularization: R(θ) = Σᵢ θᵢ²
  θ₁: 0 + 0.75² + 0 = 0.5625
  θ₂: 0.25² + 0.5² + 0.25² = 0.375 ⟵ preferred by the minimization

with x = [1, 2, 1], θ₁ = [0, 0.75, 0], θ₂ = [0.25, 0.5, 0.25]
Regularization: Example
• Loss: L(y, ŷ; θ) = Σ_{i=1}^n (xᵢᵀθ − yᵢ)² + λR(θ)
• L1 regularization: R(θ) = Σᵢ |θᵢ|
  θ₁: 0 + 0.75 + 0 = 0.75 ⟵ preferred by the minimization
  θ₂: 0.25 + 0.5 + 0.25 = 1

with x = [1, 2, 1], θ₁ = [0, 0.75, 0], θ₂ = [0.25, 0.5, 0.25]
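
The two penalties in a few lines of Python (our own check of the numbers above):

import numpy as np

theta1 = np.array([0.0, 0.75, 0.0])
theta2 = np.array([0.25, 0.5, 0.25])
x = np.array([1.0, 2.0, 1.0])

print(x @ theta1, x @ theta2)                          # same prediction: 1.5, 1.5
print(np.sum(theta1**2), np.sum(theta2**2))            # L2 penalty: 0.5625 vs 0.375
print(np.sum(np.abs(theta1)), np.sum(np.abs(theta2)))  # L1 penalty: 0.75 vs 1.0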


Regularization: Example
• Input: 3 features x = [1, 2, 1]
• Two linear classifiers that give the same result:
  – θ₁ = [0, 0.75, 0] ⟶ L1 regularization enforces sparsity
  – θ₂ = [0.25, 0.5, 0.25] ⟶ L2 regularization enforces that the weights have similar values
Regularization: Effect
• A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears
• L1 regularization will focus all the attention on a few key features
Regularization: Effect
• A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears
• L2 regularization will take all information into account to make decisions
Regularization for Neural Networks

(Figure: the two-layer compute graph x → ∗w₁ → max(0,∙) → ∗w₂ → L2 loss against y, with an additional node R(w₁, w₂) that is scaled by λ and added to the loss L.)

Combining nodes, i.e. network output + L2 loss + regularization:
Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂² + λR(w₁, w₂)

For L2 regularization, R(w₁, w₂) = ‖(w₁, w₂)ᵀ‖₂² = w₁² + w₂², so the objective becomes
Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂² + λ(w₁² + w₂²)
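
A sketch of how the regularization term changes the gradients of the toy example from before (our own code; λ = 0.1 is an arbitrary choice):

x, y = 1.0, 0.0
w1, w2 = 1.0 / 3.0, 2.0
lam = 0.1

# forward pass with L2 regularization
z = x * w1
sigma = max(0.0, z)
y_hat = w2 * sigma
L = (y_hat - y) ** 2 + lam * (w1 ** 2 + w2 ** 2)

# backward pass: data term as before, plus 2*lambda*w from the regularizer
dL_dy_hat = 2.0 * (y_hat - y)
dL_dw2 = dL_dy_hat * sigma + 2.0 * lam * w2                              # 4/9 + 0.4
dL_dw1 = dL_dy_hat * w2 * (1.0 if z > 0 else 0.0) * x + 2.0 * lam * w1   # 8/3 + 1/15
print(L, dL_dw1, dL_dw2)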
Regularization

(Figure: decision boundaries of a classifier trained with regularization strengths λ = 0, 0.00001, 0.001, 1, 10. Credit: University of Washington)

What happens to the training error?
What is the goal of regularization?
Regularization
• Any strategy that aims to lower the validation error, possibly at the cost of an increased training error


Next Lecture
• This week:
– Check exercises!
– Check piazza / post questions ☺

• Next lecture
– Optimization of Neural Networks
– In particular, introduction to SGD (our main method!)



See you next week ☺



Further Reading
• Backpropagation
– Chapter 6.5 (6.5.1 - 6.5.3) in
http://www.deeplearningbook.org/contents/mlp.html
– Chapter 5.3 in Bishop, Pattern Recognition and Machine Learning
– http://cs231n.github.io/optimization-2/
• Regularization
– Chapter 7.1 (esp. 7.1.1 & 7.1.2)
http://www.deeplearningbook.org/contents/regularization.html
– Chapter 5.5 in Bishop, Pattern Recognition and Machine Learning

