1d_Backprop4
Two-Layer Neural Network: The XOR Problem

x1  x2  target
 0   0    0
 0   1    1
 1   0    1
 1   1    0

[Figure: network with input units ak, hidden units aj (weights Wk,j), and output units ai (weights Wj,i)]

XOR data cannot be learned with a perceptron, but can be achieved using a 2-layer network with two hidden units.
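As a concrete check, such a 2-layer network can be wired by hand to compute XOR. The particular weights below are one illustrative choice (hidden unit h1 computes OR, h2 computes AND), not taken from the slides:

```python
def step(s):
    """Step activation: fires when the weighted sum is non-negative."""
    return 1 if s >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # OR of the inputs
    h2 = step(x1 + x2 - 1.5)    # AND of the inputs
    return step(h1 - h2 - 0.5)  # OR and not AND, i.e. XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

The single-layer perceptron fails here because XOR is not linearly separable; the hidden layer first maps the inputs into a space where it is.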
Local Search in Weight Space

Problem: because of the step function, the landscape will not be smooth, but will instead consist almost entirely of flat local regions and “shoulders”, with occasional discontinuous jumps.

Continuous Activation Functions (3.10)

[Figure: plots of activation functions ai, ranging between −1 and +1]

Key Idea: Replace the (discontinuous) step function with a differentiable function, such as the sigmoid:

    g(s) = 1/(1 + e^(−s))

or hyperbolic tangent:

    g(s) = tanh(s) = (e^s − e^(−s))/(e^s + e^(−s)) = 2[1/(1 + e^(−2s))] − 1
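A quick numerical sketch confirming the identity relating tanh to the sigmoid, using a handful of arbitrary test points:

```python
import math

def sigmoid(s):
    """Logistic sigmoid g(s) = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

# tanh(s) = 2 * sigmoid(2s) - 1 at several sample points
for s in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(math.tanh(s) - (2.0 * sigmoid(2.0 * s) - 1.0)) < 1e-12
print("identity holds")
```

So tanh is just a rescaled, recentred sigmoid: same shape, but with outputs in (−1, +1) instead of (0, 1).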
Forward Pass

[Figure: 2-2-1 network with inputs x1, x2; weights w11, w12, w21, w22 and biases b1, b2 feeding hidden sums u1, u2; weights v1, v2 and bias c feeding output sum s]

u1 = b1 + w11 x1 + w12 x2
y1 = g(u1)
s = c + v1 y1 + v2 y2
z = g(s)
E = ½ (z − t)²

Backpropagation

Useful notation:

δout = ∂E/∂s,   δ1 = ∂E/∂u1,   δ2 = ∂E/∂u2

Then

∂E/∂z = z − t
dz/ds = g′(s) = z(1 − z)
δout = (z − t) z (1 − z)

∂E/∂y1 = δout v1,   ∂E/∂v1 = δout y1

dy1/du1 = y1(1 − y1)
δ1 = δout v1 y1(1 − y1)

∂E/∂w11 = δ1 x1

Partial derivatives can be calculated efficiently by backpropagating deltas through the network.
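The forward pass and the delta computations can be sketched directly in code. The weight values below are arbitrary placeholders, u2 is written by analogy with u1, and the analytic gradient ∂E/∂w11 = δ1 x1 is checked against a central-difference estimate:

```python
import math

def g(s):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-s))

def forward(w11, w12, w21, w22, b1, b2, v1, v2, c, x1, x2, t):
    u1 = b1 + w11 * x1 + w12 * x2
    u2 = b2 + w21 * x1 + w22 * x2   # by analogy with u1
    y1, y2 = g(u1), g(u2)
    s = c + v1 * y1 + v2 * y2
    z = g(s)
    E = 0.5 * (z - t) ** 2
    return y1, y2, z, E

# Arbitrary example weights, inputs, and target
p = dict(w11=0.3, w12=-0.2, w21=0.5, w22=0.1,
         b1=0.1, b2=-0.4, v1=0.7, v2=-0.6, c=0.2)
x1, x2, t = 1.0, 0.5, 1.0
y1, y2, z, E = forward(**p, x1=x1, x2=x2, t=t)

# Backward pass: the deltas defined on the slide
delta_out = (z - t) * z * (1 - z)              # dE/ds
dE_dv1 = delta_out * y1
delta1 = delta_out * p['v1'] * y1 * (1 - y1)   # dE/du1
dE_dw11 = delta1 * x1

# Central-difference check of dE/dw11
eps = 1e-6
_, _, _, E_hi = forward(**dict(p, w11=p['w11'] + eps), x1=x1, x2=x2, t=t)
_, _, _, E_lo = forward(**dict(p, w11=p['w11'] - eps), x1=x1, x2=x2, t=t)
numeric = (E_hi - E_lo) / (2 * eps)
assert abs(dE_dw11 - numeric) < 1e-8
```

Each delta is just the chain rule applied once per layer, which is why backpropagation costs about the same as a single forward pass.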
Training Tips

ALVINN (Pomerleau 1991, 1993)
[Figure: ALVINN network; output units encode steering direction, from Sharp Left through Straight Ahead to Sharp Right]
Data Augmentation

Momentum (8.3)

If the landscape is shaped like a “rain gutter”, weights will tend to oscillate without much improvement. We can add a momentum factor:

δw ← α δw − η ∂E/∂w
w ← w + δw

Hopefully, this will dampen sideways oscillations but amplify downhill motion by a factor of 1/(1 − α). Momentum can also help to escape from local minima, or move quickly across flat regions in the loss landscape.

When momentum is used, we generally reduce the learning rate at the same time, in order to compensate for the implicit factor of 1/(1 − α).
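The 1/(1 − α) amplification can be verified on a toy example with a constant gradient (a steady downhill slope); the α and η values here are purely illustrative:

```python
alpha, eta = 0.9, 0.1   # illustrative momentum factor and learning rate
grad = 1.0              # pretend dE/dw is constant along the descent direction

dw, w = 0.0, 0.0
for _ in range(200):
    dw = alpha * dw - eta * grad   # momentum update from the slide
    w += dw

# At equilibrium each step has size eta*grad/(1 - alpha): here 10x a plain step.
assert abs(dw - (-eta * grad / (1 - alpha))) < 1e-9
```

On the oscillating (sideways) direction the gradient flips sign each step, so successive contributions cancel instead of accumulating, which is the damping effect described above.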
mt = β1 mt−1 + (1 − β1) gt
vt = β2 vt−1 + (1 − β2) gt²

To speed up training in the early stages, compensating for the fact that mt, vt are initialized to zero, we rescale as follows:

m̂t = mt/(1 − β1^t),   v̂t = vt/(1 − β2^t)
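With a constant gradient g, the running average satisfies mt = (1 − β1^t) g exactly, so the rescaled m̂t recovers g even at t = 1, while the raw mt starts near zero. A short sketch (the β values are common defaults, assumed here rather than given in the text):

```python
beta1, beta2 = 0.9, 0.999   # common defaults, assumed for illustration
g = 0.5                     # pretend the gradient is constant

m = v = 0.0
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    # Raw m is biased toward zero early on; the corrected estimates are not.
    assert abs(m_hat - g) < 1e-9
    assert abs(v_hat - g * g) < 1e-9
```

Without the correction, the first few updates would be scaled down by roughly (1 − β1) and (1 − β2), slowing early training for no benefit.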