Deep Learning
Week 1d. Backpropagation
Alan Blair
School of Computer Science and Engineering
May 28, 2024
Outline
Recall: Limitations of Perceptrons
Problem: many useful functions are not linearly separable (e.g. XOR)
(Figure: three plots in the (I1, I2) plane showing (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2. The first two are linearly separable; XOR is not.)
Possible solution:
x1 XOR x2 can be written as: (x1 AND x2 ) NOR (x1 NOR x2 )
Recall that AND, OR and NOR can be implemented by perceptrons.
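Since AND and NOR are each linearly separable, the construction above can be checked directly. A minimal sketch (not from the lecture) using step-function perceptrons:

```python
# A sketch (not from the lecture) checking the construction above:
# step-function perceptrons for AND and NOR, composed to give XOR.

def perceptron(weights, bias):
    """Return a unit that fires (outputs 1) when bias + w.x >= 0."""
    def unit(*inputs):
        s = bias + sum(w * x for w, x in zip(weights, inputs))
        return 1 if s >= 0 else 0
    return unit

AND = perceptron([1, 1], -1.5)    # fires only when both inputs are 1
NOR = perceptron([-1, -1], 0.5)   # fires only when both inputs are 0

def XOR(x1, x2):
    # x1 XOR x2 = (x1 AND x2) NOR (x1 NOR x2)
    return NOR(AND(x1, x2), NOR(x1, x2))
```

Enumerating all four input pairs reproduces the XOR truth table.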
Multi-Layer Neural Networks
(Figure: a two-layer network computing XOR. A hidden AND unit (bias −1.5) and a hidden NOR unit (bias +0.5) receive x1 and x2 with weights +1 and −1 respectively; an output NOR unit (bias +0.5, weights −1 and −1) combines their outputs.)
Two-Layer Neural Network
(Figure: a feed-forward network. Input units ak connect through weights Wk,j to hidden units aj, which connect through weights Wj,i to output units ai.)
The XOR Problem
x1  x2 | target
 0   0 |   0
 0   1 |   1
 1   0 |   1
 1   1 |   0

The XOR data cannot be learned by a perceptron, but can be learned by a
2-layer network with two hidden units.
Neural Network Equations
(Figure: the network. Inputs x1, x2 feed hidden units with activations y1, y2, which feed the output z, with the trainable weights labelled as below.)

u1 = b1 + w11 x1 + w12 x2
u2 = b2 + w21 x1 + w22 x2
y1 = g(u1),  y2 = g(u2)
s = c + v1 y1 + v2 y2
z = g(s)

We sometimes use w as a shorthand for any of the trainable weights
{c, v1, v2, b1, b2, w11, w21, w12, w22}.
NN Training as Cost Minimization
We define an error function or loss function E to be (half) the sum over all input
patterns of the square of the difference between actual output and target output
E = (1/2) Σi (zi − ti)²
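A one-line transcription of this loss (a sketch; here zi are the network outputs and ti the targets over the input patterns):

```python
# Half the sum, over all patterns, of the squared output errors.
def loss(z, t):
    return 0.5 * sum((zi - ti) ** 2 for zi, ti in zip(z, t))
```

For example, `loss([0.9, 0.1], [1.0, 0.0])` gives (1/2)(0.01 + 0.01), approximately 0.01.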
Local Search in Weight Space
Problem: because of the step function, the landscape will not be smooth,
but will instead consist almost entirely of flat local regions and “shoulders”,
with occasional discontinuous jumps.
Continuous Activation Functions (3.10)
(Figure: three activation functions ai plotted against the net input, with values ranging between −1 and +1.)
Key Idea: Replace the (discontinuous) step function with a differentiable function,
such as the sigmoid:

g(s) = 1 / (1 + e^(−s))

or hyperbolic tangent:

g(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s)) = 2/(1 + e^(−2s)) − 1
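A quick numerical check (not from the lecture) that the sigmoid and tanh formulas agree with the identity tanh(s) = 2/(1 + e^(−2s)) − 1:

```python
import math

# Verify that tanh(s) equals 2*sigmoid(2s) - 1, i.e. 2/(1 + e^(-2s)) - 1.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def tanh_via_sigmoid(s):
    return 2.0 * sigmoid(2.0 * s) - 1.0

for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(math.tanh(s) - tanh_via_sigmoid(s)) < 1e-12
```

So tanh is just a rescaled, recentred sigmoid, with outputs in (−1, +1) instead of (0, 1).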
Gradient Descent (4.3)
Recall that the loss function E is (half) the sum over all input patterns of the
square of the difference between actual output and target output
E = (1/2) Σi (zi − ti)²

Each weight is adjusted in the direction of steepest descent:

w ← w − η ∂E/∂w
Parameter η is called the learning rate.
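A minimal gradient-descent sketch on a toy one-weight loss E(w) = 0.5 (w − 3)², whose gradient is dE/dw = w − 3 (an assumed example, not from the lecture):

```python
# Repeatedly step against the gradient of E(w) = 0.5 * (w - 3)^2.

eta = 0.1            # learning rate
w = 0.0
for _ in range(200):
    grad = w - 3.0   # dE/dw for this toy loss
    w = w - eta * grad
```

After 200 updates, w is very close to the minimum at w = 3.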
Chain Rule (6.5.2)
If, say,

y = y(u),  u = u(x)

then

∂y/∂x = (∂y/∂u)(∂u/∂x)

This principle can be used to compute the partial derivatives in an efficient and
localized manner. Note that the transfer function must be differentiable (usually
sigmoid or tanh).

Note: if z(s) = 1/(1 + e^(−s)), then z′(s) = z(1 − z);
if z(s) = tanh(s), then z′(s) = 1 − z².
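The two derivative identities can be verified numerically with a central finite-difference approximation (a sketch, not from the lecture):

```python
import math

# Check z'(s) = z(1 - z) for sigmoid and z'(s) = 1 - z^2 for tanh
# against a central finite difference.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

h = 1e-6
for s in (-1.0, 0.0, 2.0):
    z = sigmoid(s)
    numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
    assert abs(numeric - z * (1 - z)) < 1e-6     # z'(s) = z(1 - z)

    z = math.tanh(s)
    numeric = (math.tanh(s + h) - math.tanh(s - h)) / (2 * h)
    assert abs(numeric - (1 - z * z)) < 1e-6     # z'(s) = 1 - z^2
```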
Forward Pass
(Figure: the same network as on the earlier slide, with trainable weights labelled as below.)

u1 = b1 + w11 x1 + w12 x2
u2 = b2 + w21 x1 + w22 x2
y1 = g(u1),  y2 = g(u2)
s = c + v1 y1 + v2 y2
z = g(s)
E = (1/2) Σ (z − t)²   (summed over all training patterns)
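The forward pass can be transcribed directly. The weight values below are arbitrary, chosen only to illustrate the computation (a sketch, not the lecture's code):

```python
import math

# Forward pass for the 2-2-1 network, returning the output z and the
# loss E = 0.5*(z - t)^2 for a single pattern.

def g(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(w, x1, x2, t):
    u1 = w["b1"] + w["w11"] * x1 + w["w12"] * x2
    u2 = w["b2"] + w["w21"] * x1 + w["w22"] * x2
    y1, y2 = g(u1), g(u2)
    s = w["c"] + w["v1"] * y1 + w["v2"] * y2
    z = g(s)
    E = 0.5 * (z - t) ** 2   # loss for this single pattern
    return z, E

w = dict(c=-0.2, v1=0.5, v2=-0.5, b1=0.1, b2=0.0,
         w11=0.2, w12=-0.3, w21=0.4, w22=0.1)
z, E = forward(w, 1, 0, 1)
```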
Backpropagation
Useful notation:

δout = ∂E/∂s,   δ1 = ∂E/∂u1,   δ2 = ∂E/∂u2

Partial derivatives (assuming sigmoid activation):

∂E/∂z = z − t
dz/ds = g′(s) = z(1 − z)

Then

δout = (z − t) z (1 − z)
∂E/∂v1 = δout y1

and, since ∂s/∂y1 = v1 and dy1/du1 = y1(1 − y1),

δ1 = δout v1 y1 (1 − y1)
∂E/∂w11 = δ1 x1

Partial derivatives can be calculated efficiently by backpropagating deltas through
the network.
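Putting the forward pass and the delta rules together gives a complete training loop. Below is a sketch (an assumed implementation, not the lecture's own code) of stochastic gradient descent on the 2-2-1 sigmoid network, applied to the XOR data:

```python
import math
import random

def g(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(w, x1, x2):
    u1 = w["b1"] + w["w11"] * x1 + w["w12"] * x2
    u2 = w["b2"] + w["w21"] * x1 + w["w22"] * x2
    y1, y2 = g(u1), g(u2)
    s = w["c"] + w["v1"] * y1 + w["v2"] * y2
    return y1, y2, g(s)

def total_loss(w, data):
    return sum(0.5 * (forward(w, x1, x2)[2] - t) ** 2
               for (x1, x2), t in data)

random.seed(0)
w = {k: random.uniform(-1, 1)
     for k in ("c", "v1", "v2", "b1", "b2", "w11", "w12", "w21", "w22")}
eta = 1.0
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

before = total_loss(w, data)
for epoch in range(5000):
    for (x1, x2), t in data:
        y1, y2, z = forward(w, x1, x2)
        d_out = (z - t) * z * (1 - z)           # delta_out = (z - t) z (1 - z)
        d1 = d_out * w["v1"] * y1 * (1 - y1)    # delta_1
        d2 = d_out * w["v2"] * y2 * (1 - y2)    # delta_2
        w["c"]   -= eta * d_out                 # dE/dc   = delta_out
        w["v1"]  -= eta * d_out * y1            # dE/dv1  = delta_out y1
        w["v2"]  -= eta * d_out * y2
        w["b1"]  -= eta * d1                    # dE/db1  = delta_1
        w["b2"]  -= eta * d2
        w["w11"] -= eta * d1 * x1               # dE/dw11 = delta_1 x1
        w["w12"] -= eta * d1 * x2
        w["w21"] -= eta * d2 * x1
        w["w22"] -= eta * d2 * x2
after = total_loss(w, data)
```

Training reduces the total loss over the four patterns; depending on the random initialization, the network may occasionally settle in a local minimum rather than the exact XOR solution.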
Two-Layer NNs – Applications
➛ Medical Diagnosis
➛ Autonomous Driving
➛ Game Playing
➛ Credit Card Fraud Detection
➛ Handwriting Recognition
➛ Financial Prediction
Example: Pima Indians Diabetes Dataset
(Figure: the dataset's input attributes.)
Based on these inputs, try to predict whether the patient will develop
Diabetes (1) or Not (0).
Training Tips
ALVINN (Pomerleau 1991, 1993)
ALVINN
(Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units covering steering directions from Sharp Left through Straight Ahead to Sharp Right.)
ALVINN
Data Augmentation
Momentum (8.3)
If the landscape is shaped like a “rain gutter”, weights will tend to oscillate without
much improvement. We can add a momentum factor
δw ← α δw − η ∂E/∂w
w ← w + δw

Hopefully, this will dampen sideways oscillations but amplify downhill motion by a
factor of up to 1/(1 − α). Momentum can also help to escape from local minima, or
move quickly across flat regions in the loss landscape.
When momentum is used, we generally reduce the learning rate at the same time,
in order to compensate for the implicit factor of 1/(1 − α).
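The momentum update can be sketched on a "rain gutter" shaped quadratic E(w1, w2) = 0.5 (w1² + 100 w2²), steep in w2 and shallow in w1 (an assumed test function, not from the lecture):

```python
# Heavy-ball momentum: delta_w <- alpha*delta_w - eta*dE/dw, w <- w + delta_w.

alpha = 0.9      # momentum factor
eta = 0.005      # learning rate
w = [4.0, 1.0]
dw = [0.0, 0.0]  # the running update delta_w

for _ in range(500):
    grad = [w[0], 100.0 * w[1]]                # dE/dw for this quadratic
    for i in range(2):
        dw[i] = alpha * dw[i] - eta * grad[i]  # accumulate momentum
        w[i] = w[i] + dw[i]                    # apply the update
```

Both coordinates decay towards the minimum at the origin; without the momentum term, a learning rate small enough to keep the steep w2 direction stable would make progress along the shallow w1 direction very slow.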
Adaptive Moment Estimation (Adam)
Maintain a running average of the gradients (mt) and squared gradients (vt) for
each weight in the network:

mt = β1 mt−1 + (1 − β1) gt
vt = β2 vt−1 + (1 − β2) gt²

To speed up training in the early stages, compensating for the fact that mt, vt are
initialized to zero, we rescale as follows:

m̂t = mt / (1 − β1^t),   v̂t = vt / (1 − β2^t)
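The moment estimates above can be sketched on a toy one-weight loss E(w) = 0.5 (w − 1)². The final parameter update w ← w − η m̂t / (√v̂t + ε) and the constants below are the usual choices, added here for completeness; they are not given on the slide:

```python
import math

# Adam on a toy quadratic: running moments, bias correction, and the
# standard parameter update (the update step is an assumption here).

beta1, beta2, eta, eps = 0.9, 0.999, 0.1, 1e-8
w, m, v = 5.0, 0.0, 0.0

for t in range(1, 501):
    g = w - 1.0                        # gradient g_t of E(w) = 0.5*(w - 1)^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)       # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)       # bias-corrected second moment
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)
```

Note how the bias correction matters most for small t: after the first step, m is only (1 − β1) g, and dividing by (1 − β1^1) restores it to the full gradient g.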