Lec6
Sudeshna Sarkar
Spring 2018
16 Jan 2018
BACKPROPAGATION:
INTRODUCTION
How do we learn weights?
[Figure: a computational graph. The input image x and the weights W produce scores s; the hinge loss on s plus a regularization term R(W) gives the total loss L.]
e.g. x = -2, y = 5, z = -4
We can write f(x, y, z) = g(h(x, y), z), where h(x, y) = x + y and g(a, b) = a · b.
By the chain rule, ∂f/∂x = (∂f/∂h)(∂h/∂x) and ∂f/∂y = (∂f/∂h)(∂h/∂y).
[Figure: a sequence of slides steps backward through the graph of f(x, y, z) = (x + y) · z with x = -2, y = 5, z = -4, computing the gradients ∂f/∂x, ∂f/∂y, ∂f/∂z one node at a time.]
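The same backward pass can be written out in a few lines of plain Python. This is a minimal sketch of the running example, not code from the slides; the intermediate name q and the gradient names are mine.

# forward pass for f(x, y, z) = (x + y) * z with the example values
x, y, z = -2, 5, -4
q = x + y            # add gate:      q = 3
f = q * z            # multiply gate: f = -12

# backward pass: apply the chain rule node by node, from the output back
df_dq = z * 1.0      # multiply gate: local gradient wrt q is z  -> -4.0
df_dz = q * 1.0      # multiply gate: local gradient wrt z is q  ->  3.0
df_dx = 1.0 * df_dq  # add gate: local gradient wrt x is 1       -> -4.0
df_dy = 1.0 * df_dq  # add gate: local gradient wrt y is 1       -> -4.0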
Differentiating a Computation Graph
[Figure: the same example annotated with the chain rule at each node. During the backward pass, every gate multiplies the gradient flowing in from above (the upstream gradient) by its own "local gradient" with respect to each input, and passes the result back to those inputs.]
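Plugging the running example into this rule (writing q for the intermediate sum x + y): the multiply gate's local gradients are ∂f/∂q = z = -4 and ∂f/∂z = q = 3, each multiplied by the upstream gradient ∂f/∂f = 1; the add gate's local gradients ∂q/∂x = ∂q/∂y = 1 then give ∂f/∂x = 1 · (-4) = -4 and ∂f/∂y = 1 · (-4) = -4.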
Another example: a graph containing a sigmoid gate. [Figure: the sigmoid function σ(x) = 1 / (1 + e^(-x)) has the convenient local gradient dσ/dx = (1 - σ(x)) σ(x), so the whole group of nodes making up the sigmoid can be backpropagated through in a single step.]
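A minimal Python sketch of that shortcut (my own illustration, not code from the slides): the forward pass caches the sigmoid's output, and the backward pass through the entire gate is a single multiplication.

import math

def sigmoid_forward(x):
    # forward pass of the sigmoid gate; keep the output for the backward pass
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_backward(s, upstream):
    # backward pass: local gradient (1 - s) * s times the upstream gradient
    return (1.0 - s) * s * upstream

s  = sigmoid_forward(1.0)         # ~0.731
dx = sigmoid_backward(s, 1.0)     # ~0.197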
[Figure: a complex graph connecting inputs x to outputs y.]
Forward-mode differentiation tracks the effect of one thing (a single input) on many things (all the outputs y). Reverse-mode differentiation tracks the effect of many things (all the inputs) on one thing (a single output). Reverse mode is more common simply because most nets have a single scalar loss function as output, so one backward pass yields the gradient of that loss with respect to every input and parameter.
Gradients of vector-valued functions: suppose f(x) = max(0, x) is applied elementwise to a 4096-d input vector, producing a 4096-d output vector. The local gradient is now a Jacobian matrix, of size 4096 × 4096. Because the operation is elementwise, each output depends only on the matching input, so the Jacobian is diagonal and we never need to form it explicitly. In practice we also process an entire minibatch (e.g. 100 examples) at one time, so the full Jacobian would technically be 409,600 × 409,600.
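A numpy sketch of how this plays out in code (my own illustration): the backward pass of an elementwise ReLU is just a mask applied to the upstream gradient, not an explicit Jacobian.

import numpy as np

N, D = 100, 4096              # minibatch size and feature dimension
x = np.random.randn(N, D)     # input to the ReLU layer
out = np.maximum(0, x)        # forward pass, elementwise

dout = np.random.randn(N, D)  # upstream gradient, same shape as out
dx = dout * (x > 0)           # backward pass: the diagonal of the Jacobian
                              # acts as a mask that zeroes entries where x <= 0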
• neural nets will be very large: no hope of writing down the gradient formula by hand for all parameters
• backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
• implementations maintain a graph structure, where the nodes implement a forward() / backward() API (see the sketch below)
• forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
• backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
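A minimal sketch of what that API could look like for a single multiply gate (my own illustration; real frameworks differ in how graphs and upstream gradients are managed):

class MultiplyGate:
    def forward(self, x, y):
        # compute the result and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradient of each input times the upstream gradient dz
        return self.y * dz, self.x * dz

# usage on the running example, where q = x + y = 3 and z = -4
gate = MultiplyGate()
f = gate.forward(3, -4)        # forward pass: -12
dq, dz = gate.backward(1.0)    # backward pass: dq = -4.0, dz = 3.0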
[Figure: a 2-layer neural network. A 3072-d input x is mapped by weights W1 to a 100-d hidden layer h, which is mapped by weights W2 to 10 class scores s.]
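A rough numpy sketch of that forward pass (assuming, as in the cs231n slides, a ReLU nonlinearity on the hidden layer; the shapes follow the figure):

import numpy as np

x  = np.random.randn(3072)        # input vector, e.g. a flattened 32x32x3 image
W1 = np.random.randn(100, 3072)   # first-layer weights
W2 = np.random.randn(10, 100)     # second-layer weights

h = np.maximum(0, W1.dot(x))      # 100-d hidden layer: ReLU(W1 x)
s = W2.dot(h)                     # 10 class scores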
Activation functions:
• Sigmoid: σ(x) = 1 / (1 + e^(-x))
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout
• ELU
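The elementwise ones above are one-liners in numpy (a sketch; Maxout and ELU take extra parameters and are omitted here):

import numpy as np

sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh       = np.tanh
relu       = lambda x: np.maximum(0, x)
leaky_relu = lambda x: np.maximum(0.1 * x, x)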
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson