L3 Backpropagation
• 2V demo here
• Instead of showing that the weights get closer to a good set of weights, show that the actual output values get closer to the target values.
Logistic neurons

$$z = b + \sum_i x_i w_i \qquad y = \frac{1}{1 + e^{-z}}$$

• These give a real-valued output that is a smooth and bounded function of their total input.
– They have nice derivatives which make learning easy.

[Figure: the logistic curve y(z), rising smoothly from 0 through 0.5 to 1 as z increases.]
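The definition above can be sketched directly in code (a minimal illustration; the function and variable names are my own):

```python
import math

def logistic_neuron(x, w, b):
    """Total input z = b + sum_i x_i * w_i, squashed by the logistic function."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    y = 1.0 / (1.0 + math.exp(-z))  # smooth and bounded in (0, 1)
    return z, y

z, y = logistic_neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
```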
The derivatives of a logistic neuron

$$z = b + \sum_i x_i w_i \qquad y = \frac{1}{1 + e^{-z}}$$

$$\frac{\partial z}{\partial w_i} = x_i \qquad \frac{\partial z}{\partial x_i} = w_i \qquad \frac{dy}{dz} = y(1 - y)$$
The derivatives of a logistic neuron

$$y = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$$

$$\frac{dy}{dz} = \frac{-1(-e^{-z})}{(1 + e^{-z})^2} = \left(\frac{1}{1 + e^{-z}}\right)\left(\frac{e^{-z}}{1 + e^{-z}}\right) = y(1 - y)$$

• To learn the weights we need the derivative of the output with respect to each weight:

$$\frac{\partial y}{\partial w_i} = \frac{\partial z}{\partial w_i}\,\frac{dy}{dz} = x_i\, y(1 - y)$$
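The algebra above is easy to check numerically; this sketch compares the analytic slope y(1 − y) with a centered finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7
y = sigmoid(z)
analytic = y * (1.0 - y)                            # dy/dz = y(1 - y)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
# the two values agree to many decimal places
```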
delta-rule

$$\frac{\partial E}{\partial w_i} = \sum_n \frac{\partial y^n}{\partial w_i}\,\frac{\partial E}{\partial y^n} = -\sum_n x_i^n\, y^n(1 - y^n)\,(t^n - y^n)$$

extra term = slope of logistic
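A minimal sketch of the batch delta rule for a single logistic neuron with squared error (the data values are illustrative, and the test below confirms the formula against a numerical gradient):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_rule_grad(xs, ts, w, b):
    """dE/dw_i = -sum_n x_i^n y^n (1-y^n) (t^n - y^n), with E = 1/2 sum_n (t^n - y^n)^2."""
    grad = [0.0] * len(w)
    for x, t in zip(xs, ts):
        y = sigmoid(b + sum(xi * wi for xi, wi in zip(x, w)))
        for i, xi in enumerate(x):
            grad[i] -= xi * y * (1.0 - y) * (t - y)
    return grad

xs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # illustrative training inputs
ts = [0.0, 1.0, 1.0]                       # illustrative targets
g = delta_rule_grad(xs, ts, w=[0.2, -0.3], b=0.1)
```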
BACKPROPAGATION
Sketch of the backpropagation algorithm on a single case

$$\frac{\partial E}{\partial y_i} = \sum_j \frac{dz_j}{dy_i}\,\frac{\partial E}{\partial z_j} = \sum_j w_{ij}\,\frac{\partial E}{\partial z_j}$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}}\,\frac{\partial E}{\partial z_j} = y_i\,\frac{\partial E}{\partial z_j}$$
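The two recurrences above are enough to backpropagate through one hidden layer. A minimal sketch for a 2-2-1 network of logistic units with squared error (all weights and inputs are illustrative; biases are omitted for brevity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative 2-2-1 network: input -> hidden (logistic) -> output (logistic).
x = [0.5, -0.3]
W1 = [[0.1, 0.4], [-0.2, 0.3]]   # W1[i][j]: input i -> hidden j
W2 = [0.7, -0.5]                 # hidden j -> output
t = 1.0                          # target

# Forward pass.
h = [sigmoid(sum(x[i] * W1[i][j] for i in range(2))) for j in range(2)]
y = sigmoid(sum(h[j] * W2[j] for j in range(2)))

# Backward pass, with E = 1/2 (t - y)^2.
dE_dz_out = (y - t) * y * (1.0 - y)
# dE/dw_ij = y_i * dE/dz_j for the hidden -> output weights:
dW2 = [h[j] * dE_dz_out for j in range(2)]
# dE/dy_i = sum_j w_ij * dE/dz_j, then through the hidden logistic slope:
dE_dz_hid = [W2[j] * dE_dz_out * h[j] * (1.0 - h[j]) for j in range(2)]
dW1 = [[x[i] * dE_dz_hid[j] for j in range(2)] for i in range(2)]
```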
THE SOFTMAX OUTPUT FUNCTION

Softmax

The output units in a softmax group use a non-local non-linearity:

$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}} \qquad \frac{\partial y_i}{\partial z_i} = y_i(1 - y_i)$$

$z_i$ is called the "logit".
Cross-entropy: the right cost function to use with softmax
Credit: Matt Mazur
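A sketch of a softmax group paired with the cross-entropy cost (function names are my own). The well-known payoff of this pairing is that the gradient at the logits simplifies to y_i − t_i, which the test verifies numerically:

```python
import math

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), stabilized by subtracting max(z)."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, t):
    """E = -sum_i t_i * log(y_i)."""
    return -sum(ti * math.log(yi) for yi, ti in zip(y, t) if ti > 0)

z = [2.0, 1.0, 0.1]
t = [1.0, 0.0, 0.0]                         # one-hot target
y = softmax(z)
dE_dz = [yi - ti for yi, ti in zip(y, t)]   # gradient of cross-entropy w.r.t. logits
```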
Values to work with
• In order to have some numbers to work with,
here are the initial weights, the biases, and
training inputs/outputs:
Backpropagation Goal
The goal of backpropagation is to optimize the
weights so that the neural network can learn
how to correctly map arbitrary inputs to
outputs.
Forward Pass
• We figure out the total net input to each hidden-layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output-layer neurons.
Here's how we calculate the total net input for :
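The slide figures with the concrete numbers are not reproduced here, so this sketch uses initial values in the style of Matt Mazur's worked example (inputs 0.05 and 0.10, hidden weights 0.15–0.30, hidden bias 0.35); treat them as illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative initial values, following Matt Mazur's example.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
b1 = 0.35

# Total net input to hidden neuron h1, then squash with the logistic function.
net_h1 = w1 * i1 + w2 * i2 + b1     # 0.3775
out_h1 = sigmoid(net_h1)

# Same process for hidden neuron h2.
net_h2 = w3 * i1 + w4 * i2 + b1
out_h2 = sigmoid(net_h2)
```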
The Backward Pass
• Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.
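One such update can be sketched as gradient descent on a single output weight, w := w − η·∂E/∂w, with the gradient assembled by the chain rule. The numbers below are illustrative, in the style of Matt Mazur's example:

```python
import math

# Illustrative values: hidden output feeding the weight, and the output neuron's state.
out_h1 = 0.593269992
target_o1, out_o1 = 0.01, 0.751365070
eta = 0.5                                 # learning rate (illustrative)

# Chain rule: dE/dw5 = (dE/dout) * (dout/dnet) * (dnet/dw5), with E = 1/2 (target - out)^2.
dE_dout = out_o1 - target_o1
dout_dnet = out_o1 * (1.0 - out_o1)       # slope of the logistic
dnet_dw5 = out_h1
dE_dw5 = dE_dout * dout_dnet * dnet_dw5

w5 = 0.40                                 # current weight (illustrative)
w5_new = w5 - eta * dE_dw5
```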
GRADIENT DESCENT
Closed Form
• Set the derivatives equal to zero and solve for the parameters.
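For a one-parameter linear model the recipe is visible directly: set dJ/dθ = 0 and solve (the data values are illustrative):

```python
# Fit y ~ theta * x by minimizing J(theta) = 1/2 * sum_n (theta*x_n - y_n)^2.
# dJ/dtheta = sum_n x_n * (theta*x_n - y_n) = 0  =>  theta = sum(x*y) / sum(x*x)
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
theta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```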
Background: Matrix Derivatives

• For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define:

$$\nabla_A f(A) = \begin{pmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{pmatrix}$$

• Trace: $\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$, $\operatorname{tr} a = a$, $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
$$J(\theta) = \frac{1}{2}\operatorname{tr}\!\left(\theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y\right) = \frac{1}{2}\left(\operatorname{tr}\,\theta^T X^T X \theta - 2\operatorname{tr}\, y^T X \theta + \operatorname{tr}\, y^T y\right)$$

• To minimize $J(\theta)$, take the derivative and set it to zero:

$$\nabla_\theta J = \frac{1}{2}\left(X^T X \theta + X^T X \theta - 2 X^T y\right) = X^T X \theta - X^T y = 0$$

The normal equations: $X^T X \theta = X^T y$, so $\theta^* = (X^T X)^{-1} X^T y$
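The closed-form solution is easy to check numerically (this sketch assumes NumPy; in practice `np.linalg.solve` or `np.linalg.lstsq` is preferred over forming an explicit inverse):

```python
import numpy as np

# Synthetic noiseless problem so the solution is recoverable exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

# theta* = (X^T X)^{-1} X^T y  -- the normal equations.
theta_star = np.linalg.inv(X.T @ X) @ X.T @ y

# At theta*, the gradient X^T X theta - X^T y vanishes.
grad = X.T @ X @ theta_star - X.T @ y
```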
Closed Form – why not then?
• It is straightforward to write down a set of equations, one per datum, and to solve for the best set of parameters. So why don't we?
• We want a method that can be generalized to non-convex, non-linear error models.
– The analytic solution relies on the model being linear and on having a squared error measure.
– Iterative methods are usually less efficient, but they are much easier to generalize.
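The iterative alternative can be sketched as plain gradient descent on the same quadratic objective; on a linear least-squares problem it converges to the closed-form solution (step size and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = X @ np.array([0.7, -1.3])

# Gradient descent on J(theta) = 1/2 ||X theta - y||^2.
theta = np.zeros(2)
lr = 0.01
for _ in range(2000):
    grad = X.T @ (X @ theta - y)   # gradient of the squared error
    theta -= lr * grad

# Compare against the normal-equations solution.
theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y
```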
In summary (the big expressions)
Learning: finds the parameters that minimize some objective function
Gradient Descent