
Optimization and Backpropagation

I2DL: Prof. Niessner


Lecture 3 Recap



Neural Network
• Linear score function f = Wx

(Figure: the linear score function illustrated on CIFAR-10 and on ImageNet. Credit: Li/Karpathy/Johnson)
Neural Network
• Linear score function f = Wx
• A neural network is a nesting of ‘functions’:
  – 2 layers: f = W₂ max(0, W₁x)
  – 3 layers: f = W₃ max(0, W₂ max(0, W₁x))
  – 4 layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
  – 5 layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
  – … up to hundreds of layers


Neural Network

(Figure: a small network drawn as a graph with an input layer, one hidden layer, and an output layer. Credit: Li/Karpathy/Johnson)
Neural Network

(Figure: a deeper network with an input layer, hidden layers 1-3, and an output layer. The number of neurons per layer is the width; the number of layers is the depth.)
Activation Functions
• Sigmoid: σ(x) = 1/(1 + e⁻ˣ)
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Parametric ReLU: max(αx, x)
• Maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)
• ELU: f(x) = x if x > 0, α(eˣ − 1) if x ≤ 0
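
As a quick reference, here is a minimal NumPy sketch of these activations (not part of the original slides; the function names and default α values are our own choices for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def parametric_relu(x, alpha):
    return np.maximum(alpha * x, x)

def maxout(x, w1, b1, w2, b2):
    # element-wise max of two affine transformations
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

(tanh is available directly as np.tanh.)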
Loss Functions
• Measure the goodness of the predictions (or equivalently, the network's performance)

• Regression losses
  – L1 loss: L(y, ŷ; θ) = (1/n) Σᵢ ‖yᵢ − ŷᵢ‖₁
  – MSE loss: L(y, ŷ; θ) = (1/n) Σᵢ ‖yᵢ − ŷᵢ‖₂²

• Classification loss (for multi-class classification)
  – Cross-entropy loss: E(y, ŷ; θ) = − Σᵢ Σₖ yᵢₖ · log ŷᵢₖ (sum over the n samples and the K classes)
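
A small NumPy sketch of these losses (our own helper names; y and y_hat are assumed to be arrays of targets and predictions):

import numpy as np

def l1_loss(y, y_hat):
    # (1/n) sum_i ||y_i - y_hat_i||_1
    return np.mean(np.sum(np.abs(y - y_hat), axis=-1))

def mse_loss(y, y_hat):
    # (1/n) sum_i ||y_i - y_hat_i||_2^2
    return np.mean(np.sum((y - y_hat) ** 2, axis=-1))

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # y: one-hot targets (n, K), y_hat: predicted class probabilities (n, K)
    return -np.sum(y * np.log(y_hat + eps))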


Computational Graphs
• A neural network is a computational graph
  – It has compute nodes
  – It has edges that connect nodes
  – It is directional
  – It is organized in ‘layers’


Backprop



The Importance of Gradients
• Our optimization schemes are based on computing gradients ∇_θ L(θ)
• One can compute gradients analytically, but what if our function is too complex?
• Break the gradient computation down ⟶ Backpropagation

Done by many people before, but often credited to Rumelhart 1986
Backprop: Forward Pass
• f(x, y, z) = (x + y) · z
• Initialization: x = 1, y = −3, z = 4
• Forward pass: the sum node computes d = x + y = −2, the mult node computes f = d · z = −8


Backprop: Backward Pass
f(x, y, z) = (x + y) · z with x = 1, y = −3, z = 4
Forward pass: d = x + y = −2, f = d · z = −8

Local derivatives:
d = x + y ⇒ ∂d/∂x = 1, ∂d/∂y = 1
f = d · z ⇒ ∂f/∂d = z, ∂f/∂z = d

What is ∂f/∂x, ∂f/∂y, ∂f/∂z?
• Start at the output: ∂f/∂f = 1
• ∂f/∂z = d = −2
• ∂f/∂d = z = 4
• Chain rule: ∂f/∂y = ∂f/∂d · ∂d/∂y = 4 · 1 = 4
• Chain rule: ∂f/∂x = ∂f/∂d · ∂d/∂x = 4 · 1 = 4
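
A minimal Python sketch of this forward/backward pass (our own code, just to mirror the numbers above):

# forward pass
x, y, z = 1.0, -3.0, 4.0
d = x + y          # sum node:  d = -2
f = d * z          # mult node: f = -8

# backward pass (chain rule, starting from df/df = 1)
df_df = 1.0
df_dz = d * df_df            # -2
df_dd = z * df_df            #  4
df_dx = df_dd * 1.0          #  4
df_dy = df_dd * 1.0          #  4
print(f, df_dx, df_dy, df_dz)  # -8.0 4.0 4.0 -2.0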
Compute Graphs -> Neural Networks
• x_k: input variables
• w_{l,m,n}: network weights (note the 3 indices)
  – l: which layer
  – m: which neuron in the layer
  – n: which weight of the neuron
• ŷ_i: computed outputs (i indexes the output dimensions, i = 1, …, n_out)
• y_i: ground-truth targets
• L: loss function


Compute Graphs -> Neural Networks

(Figure: a one-layer compute graph. The inputs x₀, x₁ are multiplied by the weights w₀, w₁ (the unknowns!), summed to give the output ŷ₀, and compared to the target y₀, e.g. a class label or regression target, using an L2 loss (ŷ₀ − y₀)² as the loss/cost.)


Compute Graphs -> Neural Networks

(Figure: the same one-layer compute graph, now with a ReLU activation max(0, x) between the weighted sum and the output ŷ₀ (not arguing this is the right choice here), followed by the L2 loss against the target y₀.)

We want to compute gradients w.r.t. all weights W.
Compute Graphs -> Neural Networks

(Figure: the same construction with three outputs ŷ₀, ŷ₁, ŷ₂ and targets y₀, y₁, y₂. Each output i has its own weights w_{i,0}, w_{i,1} applied to the inputs x₀, x₁ and its own loss/cost node.)

We want to compute gradients w.r.t. all weights W.
Compute Graphs -> Neural Networks
Goal: we want to compute gradients of the loss function L w.r.t. all weights W

L = Σᵢ Lᵢ : sum over the loss per sample, e.g. L2 loss ⟶ simply sum up squares: Lᵢ = (ŷᵢ − yᵢ)²

ŷᵢ = A(bᵢ + Σₖ xₖ w_{i,k}), with A the activation function and bᵢ the bias

⟶ use the chain rule to compute the partials:
∂L/∂w_{i,k} = ∂L/∂ŷᵢ · ∂ŷᵢ/∂w_{i,k}

We want to compute gradients w.r.t. all weights W AND all biases b.
NNs as Computational Graphs
• We can express any kind of function in a computational graph, e.g.
  f(w, x) = 1 / (1 + e^(−(b + w₀x₀ + w₁x₁)))
  (this is the sigmoid function σ(x) = 1/(1 + e⁻ˣ) applied to b + w₀x₀ + w₁x₁)

(Figure: the corresponding compute graph built from the nodes ∗w₀, ∗w₁, +, ∗(−1), exp(∙), +1 and 1/x.)
NNs as Computational Graphs
• f(w, x) = 1 / (1 + e^(−(b + w₀x₀ + w₁x₁)))
• Forward pass with w₀ = −3, x₀ = −2, w₁ = 2, x₁ = −1, b = −3:
  w₀x₀ = 6, w₁x₁ = −2, sum: 6 + (−2) + (−3) = 1
  ∗(−1): −1, exp(∙): 0.37, +1: 1.37, 1/x: 0.73 = f


NNs as Computational Graphs
• f(w, x) = 1 / (1 + e^(−(b + w₀x₀ + w₁x₁)))

Local gradients of the node types:
g(x) = 1/x ⇒ ∂g/∂x = −1/x²
g_α(x) = α + x ⇒ ∂g/∂x = 1
g(x) = eˣ ⇒ ∂g/∂x = eˣ
g_α(x) = αx ⇒ ∂g/∂x = α

Backward pass (upstream gradient · local gradient), starting from the output with gradient 1:
• 1/x node: 1 · (−1/1.37²) = −0.53
• +1 node: −0.53 · 1 = −0.53
• exp(∙) node: −0.53 · e⁻¹ = −0.2
• ∗(−1) node: −0.2 · (−1) = 0.2
• + node: passes 0.2 on to b and to both product branches
• ∗w₀, ∗w₁ nodes:
  ∂f/∂w₀ = 0.2 · x₀ = −0.4, ∂f/∂x₀ = 0.2 · w₀ = −0.6,
  ∂f/∂w₁ = 0.2 · x₁ = −0.2, ∂f/∂x₁ = 0.2 · w₁ = 0.4, ∂f/∂b = 0.2
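
A short Python check of these numbers (our own sketch; it re-runs the forward pass and the hand-derived gradients, using the fact that the four inner nodes together form a sigmoid with derivative σ'(s) = σ(s)(1 − σ(s))):

import math

w0, x0, w1, x1, b = -3.0, -2.0, 2.0, -1.0, -3.0

# forward pass through the graph
s = w0 * x0 + w1 * x1 + b        # 1.0
f = 1.0 / (1.0 + math.exp(-s))   # ~0.73

# backward pass: sigmoid'(s) = f * (1 - f) collapses the four inner nodes
ds = f * (1.0 - f)               # ~0.2
grads = {
    "w0": ds * x0,   # ~ -0.4
    "x0": ds * w0,   # ~ -0.6
    "w1": ds * x1,   # ~ -0.2
    "x1": ds * w1,   # ~  0.4
    "b":  ds * 1.0,  # ~  0.2
}
print(f, grads)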


Gradient Descent

x* = arg minₓ f(x)

(Figure: gradient descent iterating on a loss landscape from the initialization to the optimum.)


Gradient Descent
• From derivative to gradient: df(x)/dx ⟶ ∇ₓ f(x)
  The gradient points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x − α ∇ₓ f(x), where α is the learning rate
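
A minimal gradient descent loop in Python on a toy quadratic (our own example; the function and the learning rate are arbitrary choices):

# minimize f(x) = (x - 3)^2 with gradient descent
def grad_f(x):
    return 2.0 * (x - 3.0)   # analytic gradient

x = 0.0          # initialization
alpha = 0.1      # learning rate
for _ in range(100):
    x = x - alpha * grad_f(x)   # x' = x - alpha * grad f(x)
print(x)  # close to the optimum x* = 3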


Gradient Descent for Neural Networks

(Figure: a fully-connected network with an input layer, hidden layers 1-3 of m neurons each, an output layer, and l layers in total.)
Gradient Descent for Neural Networks
For a given training pair {x, y}, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all weights:

∇_W f_{x,y}(W) = (∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n})

Gradient step:
W' = W − α ∇_W f_{x,y}(W)


NNs can Become Quite Complex…
• These graphs can be huge!

[Szegedy et al., CVPR'15] Going Deeper with Convolutions
The Flow of the Gradients
• Many, many of these nodes (neurons) form a neural network
• Each one has its own work to do: a forward and a backward pass


The Flow of the Gradients

(Figure: a single compute node f with input activations x, y and output z = f(x, y), where f is the activation function. In the forward pass it computes z; in the backward pass it receives the upstream gradient ∂L/∂z and passes back ∂L/∂x = (∂L/∂z)(∂z/∂x) and ∂L/∂y = (∂L/∂z)(∂z/∂y) using its local gradients ∂z/∂x and ∂z/∂y.)
Gradient Descent for Neural Networks
Loss function: Lᵢ = (ŷᵢ − yᵢ)²

ŷᵢ = A(b_{1,i} + Σⱼ hⱼ w_{1,i,j})
hⱼ = A(b_{0,j} + Σₖ xₖ w_{0,j,k})

A is just the simple ReLU: A(x) = max(0, x)
Gradient Descent for Neural Networks
Backpropagation: just go through the network layer by layer.

Last layer:
∂L/∂w_{1,i,j} = ∂L/∂ŷᵢ · ∂ŷᵢ/∂w_{1,i,j}
with ∂Lᵢ/∂ŷᵢ = 2(ŷᵢ − yᵢ) and ∂ŷᵢ/∂w_{1,i,j} = hⱼ if the pre-activation is > 0, else 0

First layer:
∂L/∂w_{0,j,k} = ∂L/∂ŷᵢ · ∂ŷᵢ/∂hⱼ · ∂hⱼ/∂w_{0,j,k}

using hⱼ = A(b_{0,j} + Σₖ xₖ w_{0,j,k}), ŷᵢ = A(b_{1,i} + Σⱼ hⱼ w_{1,i,j}), Lᵢ = (ŷᵢ − yᵢ)²
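
A NumPy sketch of one forward and backward pass through this two-layer ReLU network for a single sample (our own code; the layer sizes are illustrative assumptions, and the biases are handled like the weights):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # one input sample
y = rng.normal(size=2)                           # its target
W0, b0 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer (4 neurons)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(2)    # output layer (2 neurons)

# forward pass
s0 = W0 @ x + b0
h = np.maximum(0.0, s0)              # h_j = A(b_0j + sum_k x_k w_0jk)
s1 = W1 @ h + b1
y_hat = np.maximum(0.0, s1)          # y_hat_i = A(b_1i + sum_j h_j w_1ij)
L = np.sum((y_hat - y) ** 2)         # L = sum_i (y_hat_i - y_i)^2

# backward pass, layer by layer
dL_dy_hat = 2.0 * (y_hat - y)
dL_ds1 = dL_dy_hat * (s1 > 0)        # ReLU gate of the output layer
dL_dW1 = np.outer(dL_ds1, h)         # dL/dw_1ij = dL/ds1_i * h_j
dL_db1 = dL_ds1

dL_dh = W1.T @ dL_ds1
dL_ds0 = dL_dh * (s0 > 0)            # ReLU gate of the hidden layer
dL_dW0 = np.outer(dL_ds0, x)         # dL/dw_0jk = dL/ds0_j * x_k
dL_db0 = dL_ds0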
Gradient Descent for Neural Networks
How many unknown weights does this network have?
• Output layer: 2 · 4 + 2
• Hidden layer: 4 · 3 + 4
i.e., #neurons · #input channels + #biases per layer

hⱼ = A(b_{0,j} + Σₖ xₖ w_{0,j,k})
ŷᵢ = A(b_{1,i} + Σⱼ hⱼ w_{1,i,j})
Lᵢ = (ŷᵢ − yᵢ)²

Note that some activations also have weights.
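
The same count in a few lines of Python (a sketch for the 3-4-2 network of the slide's example):

layers = [3, 4, 2]   # input dim, hidden neurons, output neurons
params = sum(n_out * n_in + n_out            # weights + biases per layer
             for n_in, n_out in zip(layers[:-1], layers[1:]))
print(params)        # (4*3 + 4) + (2*4 + 2) = 26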


Derivatives of Cross Entropy Loss
Binary cross-entropy loss:
L = − Σ_{i=1}^{n_out} (yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ))
with ŷᵢ = 1/(1 + e^(−sᵢ)) and output scores sᵢ = Σⱼ hⱼ wⱼᵢ

Gradients of the weights of the last layer:
∂L/∂wⱼᵢ = ∂L/∂ŷᵢ · ∂ŷᵢ/∂sᵢ · ∂sᵢ/∂wⱼᵢ

∂L/∂ŷᵢ = −yᵢ/ŷᵢ + (1 − yᵢ)/(1 − ŷᵢ) = (ŷᵢ − yᵢ) / (ŷᵢ(1 − ŷᵢ))
∂ŷᵢ/∂sᵢ = ŷᵢ(1 − ŷᵢ)
∂sᵢ/∂wⱼᵢ = hⱼ

⟹ ∂L/∂wⱼᵢ = (ŷᵢ − yᵢ) hⱼ,  ∂L/∂sᵢ = ŷᵢ − yᵢ
Derivatives of Cross Entropy Loss
Gradients of the weights of the first layer:

∂L/∂hⱼ = Σ_{i=1}^{n_out} ∂L/∂ŷᵢ · ∂ŷᵢ/∂sᵢ · ∂sᵢ/∂hⱼ = Σ_{i=1}^{n_out} ∂L/∂ŷᵢ · ŷᵢ(1 − ŷᵢ) wⱼᵢ = Σ_{i=1}^{n_out} (ŷᵢ − yᵢ) wⱼᵢ

∂L/∂sⱼ¹ = Σ_{i=1}^{n_out} ∂L/∂sᵢ · ∂sᵢ/∂hⱼ · ∂hⱼ/∂sⱼ¹ = Σ_{i=1}^{n_out} (ŷᵢ − yᵢ) wⱼᵢ · hⱼ(1 − hⱼ)

∂L/∂wₖⱼ¹ = ∂L/∂sⱼ¹ · ∂sⱼ¹/∂wₖⱼ¹ = Σ_{i=1}^{n_out} (ŷᵢ − yᵢ) wⱼᵢ · hⱼ(1 − hⱼ) · xₖ

(Here sⱼ¹ is the score of hidden neuron j; the factor hⱼ(1 − hⱼ) assumes a sigmoid activation for the hidden layer.)
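
A quick numerical check of the last-layer formula ∂L/∂wⱼᵢ = (ŷᵢ − yᵢ) hⱼ using finite differences (our own sketch with arbitrary small sizes):

import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=4)            # hidden activations
W = rng.normal(size=(4, 2))       # weights w_ji (4 hidden -> 2 outputs)
y = np.array([1.0, 0.0])          # binary targets

def loss(W):
    s = h @ W                                 # output scores s_i
    y_hat = 1.0 / (1.0 + np.exp(-s))          # sigmoid
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# analytic gradient from the slide: (y_hat_i - y_i) * h_j
s = h @ W
y_hat = 1.0 / (1.0 + np.exp(-s))
grad_analytic = np.outer(h, y_hat - y)

# finite-difference check of one entry
eps = 1e-6
W_eps = W.copy(); W_eps[0, 0] += eps
print(grad_analytic[0, 0], (loss(W_eps) - loss(W)) / eps)   # nearly equal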


Back to Compute Graphs & NNs
• Inputs x and targets y
• A two-layer NN for regression with ReLU activation
• Function we want to optimize:
  Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂²

(Figure: the compute graph x → ∗w₁ → z → max(0,∙) → σ → ∗w₂ → ŷ → L, where the loss L compares ŷ to the target y.)


Gradient Descent for Neural Networks
Initialize x = 1, y = 0, w₁ = 1/3, w₂ = 2
Forward pass: z = x · w₁ = 1/3, σ = max(0, z) = 1/3, ŷ = w₂ · σ = 2/3, L = 4/9

L(y, ŷ; θ) = (1/n) Σᵢ ‖ŷᵢ − yᵢ‖²; in our case n, d = 1:
L = (ŷ − y)² ⇒ ∂L/∂ŷ = 2(ŷ − y)
ŷ = w₂ · σ ⇒ ∂ŷ/∂w₂ = σ

Backpropagation:
∂L/∂w₂ = ∂L/∂ŷ · ∂ŷ/∂w₂ = 2 · (2/3) · (1/3) = 4/9


Gradient Descent for Neural Networks
Initialize x = 1, y = 0, w₁ = 1/3, w₂ = 2 (forward pass as before)

In our case n, d = 1:
L = (ŷ − y)² ⇒ ∂L/∂ŷ = 2(ŷ − y)
ŷ = w₂ · σ ⇒ ∂ŷ/∂σ = w₂
σ = max(0, z) ⇒ ∂σ/∂z = 1 if z > 0, else 0
z = x · w₁ ⇒ ∂z/∂w₁ = x

Backpropagation:
∂L/∂w₁ = ∂L/∂ŷ · ∂ŷ/∂σ · ∂σ/∂z · ∂z/∂w₁ = 2 · (2/3) · 2 · 1 · 1 = 8/3


Gradient Descent for Neural Networks
• Function we want to optimize:
  f(x, w) = Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂²
• We computed the gradients w.r.t. the weights w₁ and w₂
• Now: update the weights
  w' = w − α · ∇_w f = (w₁, w₂)ᵀ − α · (∇_{w₁} f, ∇_{w₂} f)ᵀ = (1/3, 2)ᵀ − α · (8/3, 4/9)ᵀ

But: how to choose a good learning rate α?
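
A Python sketch of this worked example, reproducing the two gradients and one update step (our own code; α = 0.1 is an arbitrary choice):

x, y = 1.0, 0.0
w1, w2 = 1.0 / 3.0, 2.0
alpha = 0.1

# forward pass
z = x * w1                 # 1/3
sigma = max(0.0, z)        # 1/3
y_hat = w2 * sigma         # 2/3
L = (y_hat - y) ** 2       # 4/9

# backward pass
dL_dy_hat = 2.0 * (y_hat - y)                            # 4/3
dL_dw2 = dL_dy_hat * sigma                               # 4/9
dL_dw1 = dL_dy_hat * w2 * (1.0 if z > 0 else 0.0) * x    # 8/3

# gradient step
w1, w2 = w1 - alpha * dL_dw1, w2 - alpha * dL_dw2
print(dL_dw1, dL_dw2, w1, w2)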
Gradient Descent
• How to pick a good learning rate?
• How to compute the gradient for a single training pair?
• How to compute the gradient for a large training set?
• How to speed things up?

More to see in the next lectures…


Regularization



Recap: Basic Recipe for ML
• Split your data: 60% train / 20% validation / 20% test
• Use the validation set to find your hyperparameters
• Other splits are also possible (e.g., 80%/10%/10%)
Over- and Underfitting

(Figure: three fits to the same data, from left to right: underfitted, appropriate, overfitted. Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017)


Training a Neural Network
• Training/validation curve

(Figure: training and validation error over training time. Early on the training error is too high; later the generalization gap between validation and training error becomes too big. Credits: Deep Learning, Goodfellow et al.)

How can we prevent our model from overfitting? ⟶ Regularization


Regularization
• Loss function: L(y, ŷ; θ) = Σ_{i=1}^n ‖ŷᵢ − yᵢ‖² + λR(θ)
  ⟶ add a regularization term to the loss function
• Regularization techniques (more details later)
  – L2 regularization
  – L1 regularization
  – Max norm regularization
  – Dropout
  – Early stopping
  – ...




Regularization: Example
• Input: 3 features x = [1, 2, 1]
• Two linear classifiers that give the same result (xᵀθ = 1.5 for both):
  – θ₁ = [0, 0.75, 0] ⟶ ignores 2 features
  – θ₂ = [0.25, 0.5, 0.25] ⟶ takes information from all features


Regularization: Example
• Loss: L(y, ŷ; θ) = Σ_{i=1}^n (xᵢᵀθ − yᵢ)² + λR(θ)
• L2 regularization: R(θ) = Σᵢ θᵢ²
  θ₁: 0 + 0.75² + 0 = 0.5625
  θ₂: 0.25² + 0.5² + 0.25² = 0.375 ⟵ preferred by the minimization

with x = [1, 2, 1], θ₁ = [0, 0.75, 0], θ₂ = [0.25, 0.5, 0.25]
Regularization: Example
• Loss: L(y, ŷ; θ) = Σ_{i=1}^n (xᵢᵀθ − yᵢ)² + λR(θ)
• L1 regularization: R(θ) = Σᵢ |θᵢ|
  θ₁: 0 + 0.75 + 0 = 0.75 ⟵ preferred by the minimization
  θ₂: 0.25 + 0.5 + 0.25 = 1

with x = [1, 2, 1], θ₁ = [0, 0.75, 0], θ₂ = [0.25, 0.5, 0.25]
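
The two penalties in a few lines of Python (our own check of the numbers above):

import numpy as np

theta1 = np.array([0.0, 0.75, 0.0])
theta2 = np.array([0.25, 0.5, 0.25])
x = np.array([1.0, 2.0, 1.0])

print(x @ theta1, x @ theta2)                          # same prediction: 1.5, 1.5
print(np.sum(theta1**2), np.sum(theta2**2))            # L2 penalty: 0.5625 vs 0.375
print(np.sum(np.abs(theta1)), np.sum(np.abs(theta2)))  # L1 penalty: 0.75 vs 1.0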


Regularization: Example
• Input: 3 features x = [1, 2, 1]
• Two linear classifiers that give the same result:
  – θ₁ = [0, 0.75, 0] ⟶ L1 regularization enforces sparsity
  – θ₂ = [0.25, 0.5, 0.25] ⟶ L2 regularization enforces that the weights have similar values
Regularization: Effect
• A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears
• L1 regularization will focus all the attention on a few key features
Regularization: Effect
• A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears
• L2 regularization will take all information into account to make decisions
Regularization for Neural Networks

(Figure: the two-layer compute graph x → ∗w₁ → max(0,∙) → ∗w₂ → L2 loss against y, with an additional node R(w₁, w₂) that is scaled by λ and added to the loss L.)

Combining nodes, i.e. network output + L2 loss + regularization:
Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂² + λR(w₁, w₂)

For L2 regularization, R(w₁, w₂) = ‖(w₁, w₂)ᵀ‖₂² = w₁² + w₂², so the objective becomes
Σ_{i=1}^n ‖w₂ max(0, w₁xᵢ) − yᵢ‖₂² + λ(w₁² + w₂²)
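
A sketch of how the regularization term changes the gradients of the toy example from before (our own code; λ = 0.1 is an arbitrary choice):

x, y = 1.0, 0.0
w1, w2 = 1.0 / 3.0, 2.0
lam = 0.1

# forward pass with L2 regularization
z = x * w1
sigma = max(0.0, z)
y_hat = w2 * sigma
L = (y_hat - y) ** 2 + lam * (w1 ** 2 + w2 ** 2)

# backward pass: data term as before, plus 2*lambda*w from the regularizer
dL_dy_hat = 2.0 * (y_hat - y)
dL_dw2 = dL_dy_hat * sigma + 2.0 * lam * w2                              # 4/9 + 0.4
dL_dw1 = dL_dy_hat * w2 * (1.0 if z > 0 else 0.0) * x + 2.0 * lam * w1   # 8/3 + 1/15
print(L, dL_dw1, dL_dw2)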
Regularization

(Figure: decision boundaries of a classifier trained with regularization strengths λ = 0, 0.00001, 0.001, 1, 10. Credit: University of Washington)

What happens to the training error?
What is the goal of regularization?
Regularization
• Any strategy that aims to lower the validation error, possibly at the cost of an increased training error


Next Lecture
• This week:
– Check exercises!
– Check piazza / post questions ☺

• Next lecture
– Optimization of Neural Networks
– In particular, introduction to SGD (our main method!)



See you next week ☺



Further Reading
• Backpropagation
– Chapter 6.5 (6.5.1 - 6.5.3) in
http://www.deeplearningbook.org/contents/mlp.html
– Chapter 5.3 in Bishop, Pattern Recognition and Machine Learning
– http://cs231n.github.io/optimization-2/
• Regularization
– Chapter 7.1 (esp. 7.1.1 & 7.1.2)
http://www.deeplearningbook.org/contents/regularization.html
– Chapter 5.5 in Bishop, Pattern Recognition and Machine Learning

