Lecture 3: Backpropagation


Deep Learning

Learning the weights of a neural network

LEARNING THE WEIGHTS OF A LINEAR OUTPUT NEURON
Why the perceptron learning procedure cannot be generalised to hidden layers
• The perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every generously feasible set of weights.

• This type of guarantee cannot be extended to more complex networks in which the average of two good solutions may be a bad solution.


• Instead of showing that the weights get closer to a good set of weights, show that the actual output values get closer to the target values.
Linear neurons

• The neuron has a real-valued output which is a weighted sum of its inputs:

  $y = \sum_i w_i x_i = \mathbf{w}^T \mathbf{x}$

  where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the neuron's input vector, and $y$ is the neuron's estimate of the desired output.
• The aim of learning is to minimize the error summed over all examples.
  – The error is the squared difference between the desired output and the actual output.
Why don’t we solve it analytically?

• It is straightforward to write down a set of equations, one per example, and to solve for the best set of weights.
  – This is the standard engineering approach, so why don't we use it?
• Scientific answer: We want a method that real neurons could use.
• Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural networks.
  – The analytic solution relies on the model being linear and having a squared error measure.
  – Iterative methods are usually less efficient but they are much easier to generalize.
Deriving the delta rule
• The simplest example is a linear neuron with a squared error measure.
• Define the error as the squared residuals summed over all training examples:

  $E = \tfrac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$

• Now differentiate to get error derivatives for the weights:

  $\dfrac{\partial E}{\partial w_i} = \tfrac{1}{2} \sum_n \dfrac{\partial y^n}{\partial w_i} \dfrac{dE^n}{dy^n} = -\sum_n x_i^n (t^n - y^n)$

• The batch delta rule changes the weights in proportion to their error derivatives summed over all examples:

  $\Delta w_i = -\varepsilon \dfrac{\partial E}{\partial w_i} = \sum_n \varepsilon\, x_i^n (t^n - y^n)$
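To make the batch delta rule concrete, here is a minimal NumPy sketch (not from the slides; the name `eps` for the learning rate and the defaults are illustrative):

```python
import numpy as np

def batch_delta_rule(X, t, eps=0.01, epochs=100):
    """Train a linear neuron y = w^T x with the batch delta rule.

    X: (n_cases, n_inputs) training inputs; t: (n_cases,) targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = X @ w                    # outputs for all training cases
        w += eps * X.T @ (t - y)     # delta w_i = eps * sum_n x_i^n (t^n - y^n)
    return w
```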
LEARNING THE WEIGHTS OF A LOGISTIC OUTPUT NEURON

Logistic neurons

• These give a real-valued output that is a smooth and bounded function of their total input:

  $z = b + \sum_i x_i w_i, \qquad y = \dfrac{1}{1 + e^{-z}}$

  – They have nice derivatives which make learning easy.

[figure: the logistic curve, with y rising from 0 to 1 and crossing 0.5 at z = 0]
The derivatives of a logistic neuron

• The derivatives of the logit, z, with respect to the inputs and the weights are very simple:

  $z = b + \sum_i x_i w_i$

  $\dfrac{\partial z}{\partial w_i} = x_i, \qquad \dfrac{\partial z}{\partial x_i} = w_i$

• The derivative of the output with respect to the logit is simple if you express it in terms of the output:

  $y = \dfrac{1}{1 + e^{-z}}, \qquad \dfrac{dy}{dz} = y(1 - y)$
The derivatives of a logistic neuron

$y = \dfrac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$

$\dfrac{dy}{dz} = \dfrac{-1(-e^{-z})}{(1 + e^{-z})^2} = \left(\dfrac{1}{1 + e^{-z}}\right)\left(\dfrac{e^{-z}}{1 + e^{-z}}\right) = y(1 - y)$

because $\dfrac{e^{-z}}{1 + e^{-z}} = \dfrac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \dfrac{1 + e^{-z}}{1 + e^{-z}} - \dfrac{1}{1 + e^{-z}} = 1 - y$
Using the chain rule to get the derivatives needed for learning the weights of a logistic unit

• To learn the weights we need the derivative of the output with respect to each weight:

  $\dfrac{\partial y}{\partial w_i} = \dfrac{\partial z}{\partial w_i} \dfrac{dy}{dz} = x_i\, y(1 - y)$

• This gives the delta rule with an extra term, the slope of the logistic:

  $\dfrac{\partial E}{\partial w_i} = \sum_n \dfrac{\partial y^n}{\partial w_i} \dfrac{\partial E}{\partial y^n} = -\sum_n x_i^n\, y^n (1 - y^n)(t^n - y^n)$
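A minimal NumPy sketch of this gradient computation for a logistic unit (not from the slides; the function and variable names are illustrative):

```python
import numpy as np

def logistic_squared_error_grad(X, t, w, b):
    """Gradients of the squared error for a logistic neuron.

    Implements dE/dw_i = -sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n)."""
    z = b + X @ w
    y = 1.0 / (1.0 + np.exp(-z))       # logistic output
    delta = y * (1 - y) * (t - y)      # extra y(1-y) factor: slope of the logistic
    return -X.T @ delta, -delta.sum()  # gradients w.r.t. w and b
```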
BACKPROPAGATION

Sketch of the backpropagation algorithm on a single case

• First convert the discrepancy between each output and its target value into an error derivative:

  $E = \tfrac{1}{2} \sum_{j \in \text{output}} (t_j - y_j)^2, \qquad \dfrac{\partial E}{\partial y_j} = -(t_j - y_j)$

• Then compute error derivatives in each hidden layer from error derivatives in the layer above: from $\partial E / \partial y_j$ to $\partial E / \partial y_i$.
• Then use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.
Backpropagating dE/dy

For a logistic unit j with output $y_j$ and logit $z_j$, receiving input $y_i$ through weight $w_{ij}$:

$\dfrac{\partial E}{\partial z_j} = \dfrac{dy_j}{dz_j} \dfrac{\partial E}{\partial y_j} = y_j (1 - y_j) \dfrac{\partial E}{\partial y_j}$

$\dfrac{\partial E}{\partial y_i} = \sum_j \dfrac{dz_j}{dy_i} \dfrac{\partial E}{\partial z_j} = \sum_j w_{ij} \dfrac{\partial E}{\partial z_j}$

$\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial z_j}{\partial w_{ij}} \dfrac{\partial E}{\partial z_j} = y_i \dfrac{\partial E}{\partial z_j}$
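These three equations translate directly into code. A sketch for one logistic layer, assuming (my convention, not the slides') that weights are stored as `w[i, j]` from unit i below to unit j above:

```python
import numpy as np

def backprop_layer(y_below, w, y_above, dE_dy_above):
    """Backpropagate error derivatives through one logistic layer."""
    dE_dz = y_above * (1 - y_above) * dE_dy_above   # dE/dz_j = y_j(1-y_j) dE/dy_j
    dE_dy_below = w @ dE_dz                         # dE/dy_i = sum_j w_ij dE/dz_j
    dE_dw = np.outer(y_below, dE_dz)                # dE/dw_ij = y_i dE/dz_j
    return dE_dy_below, dE_dw
```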
THE SOFTMAX OUTPUT FUNCTION

Softmax

• The output units in a softmax group use a non-local non-linearity:

  $y_i = \dfrac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}$

  where $z_i$ is called the "logit".

  $\dfrac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$
Cross-entropy: the right cost function to use with softmax

• The right cost function is the negative log probability of the right answer:

  $C = -\sum_j t_j \log y_j$

  where $t_j$ is the target value.
• C has a very big gradient when the target value is 1 and the output is almost zero.
  – A value of 0.000001 is much better than 0.000000001.
  – The steepness of dC/dy exactly balances the flatness of dy/dz:

  $\dfrac{\partial C}{\partial z_i} = \sum_j \dfrac{\partial C}{\partial y_j} \dfrac{\partial y_j}{\partial z_i} = y_i - t_i$
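A quick numerical sketch of the softmax/cross-entropy pair and its simple gradient (the logit and target values below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift logits for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # made-up logits
t = np.array([1.0, 0.0, 0.0])        # one-hot target

y = softmax(z)
C = -(t * np.log(y)).sum()           # cross-entropy cost
grad = y - t                         # dC/dz_i simplifies to y_i - t_i
```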
Problems with squared error

• The squared error measure has some drawbacks:


– If the desired output is 1 and the actual output is 0.00000001 there is
almost no gradient for a logistic unit to fix up the error.
– If we are trying to assign probabilities to mutually exclusive class
labels, we know that the outputs should sum to 1, but we are
depriving the network of this knowledge.

• Is there a different cost function that works better?


– Yes: Force the outputs to represent a probability distribution across
discrete alternatives.
A solved example ….
What is Backprop?
• Backpropagation is the most common
method for training a neural network.
• Burn it into your mental firmware!!
Basic Structure

Credit: Matt Mazur
Values to work with
• In order to have some numbers to work with,
here are the initial weights, the biases, and
training inputs/outputs:
Backpropagation Goal
The goal of backpropagation is to optimize the
weights so that the neural network can learn
how to correctly map arbitrary inputs to
outputs.
Forward Pass
• We figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output layer neurons.

Here's how we calculate the total net input for each hidden neuron:

Forward Pass (Cont…)
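The slide images with the actual numbers do not survive in this text, so here is a sketch of the forward pass assuming the values from Matt Mazur's post (inputs 0.05 and 0.10, targets 0.01 and 0.99, biases 0.35 and 0.60); treat them as illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values assumed from Matt Mazur's worked example
x  = np.array([0.05, 0.10])                   # inputs i1, i2
W1 = np.array([[0.15, 0.25],                  # W1[i, j]: input i -> hidden j
               [0.20, 0.30]])
b1 = 0.35
W2 = np.array([[0.40, 0.50],                  # W2[i, j]: hidden i -> output j
               [0.45, 0.55]])
b2 = 0.60
t  = np.array([0.01, 0.99])                   # target outputs

h = sigmoid(x @ W1 + b1)                      # hidden outputs, approx [0.593, 0.597]
y = sigmoid(h @ W2 + b2)                      # network outputs, approx [0.751, 0.773]
E = 0.5 * ((t - y) ** 2).sum()                # total squared error, approx 0.298
```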
Error Calculation – Mean Squared Error

[figure: the squared error for each output neuron, computed against its target value]
The Backward Pass
• Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.
The Backward Pass (Cont…)
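Continuing the forward-pass sketch above, here is one full backward pass and weight update, assuming the learning rate of 0.5 used in Mazur's post (biases are left fixed, as there):

```python
eta = 0.5                                     # learning rate (assumed from the post)

# Output layer: dE/dz_j = -(t_j - y_j) * y_j (1 - y_j)
delta_o = -(t - y) * y * (1 - y)
dE_dW2 = np.outer(h, delta_o)                 # dE/dw_ij = h_i * delta_j

# Hidden layer: propagate deltas back through the (not-yet-updated) W2
delta_h = (W2 @ delta_o) * h * (1 - h)
dE_dW1 = np.outer(x, delta_h)

W2 -= eta * dE_dW2                            # e.g. w5: 0.40 -> approx 0.3589
W1 -= eta * dE_dW1
```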
GRADIENT DESCENT

Closed Form
• Set derivatives equal to zero and solve for parameters.
Background: Matrix Derivatives
• For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define:

  $\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$

• Trace: $\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}, \quad \operatorname{tr} a = a, \quad \operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
• Some facts about matrix derivatives (without proof):

  $\nabla_A \operatorname{tr} AB = B^T, \quad \nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T, \quad \nabla_A |A| = |A| (A^{-1})^T$

Credit: Eric Xing
The normal equations
• Write the cost function in matrix form:

  $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2 = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$

  where $X$ stacks the training inputs $\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T$ as rows and $\vec{y} = (y_1, \ldots, y_n)^T$.
• To minimize $J(\theta)$, take the derivative and set it to zero:

  $\nabla_\theta J = \frac{1}{2} \nabla_\theta \operatorname{tr}\left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right) = \frac{1}{2} \nabla_\theta \left( \operatorname{tr} \theta^T X^T X \theta - 2 \operatorname{tr} \vec{y}^T X \theta + \operatorname{tr} \vec{y}^T \vec{y} \right)$

  $= \frac{1}{2} \left( X^T X \theta + X^T X \theta - 2 X^T \vec{y} \right) = X^T X \theta - X^T \vec{y} = 0$

  This gives the normal equations, $X^T X \theta = X^T \vec{y}$, whose solution is $\theta^* = (X^T X)^{-1} X^T \vec{y}$.
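In NumPy the closed-form solution is one line; a sketch with made-up data (in practice `lstsq` is preferred to forming the inverse explicitly):

```python
import numpy as np

X = np.array([[1.0, 2.0],                     # made-up design matrix (bias column + feature)
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([3.0, 5.0, 9.0])                 # made-up targets (exactly y = 2x - 1)

theta, *_ = np.linalg.lstsq(X, y, rcond=None) # numerically stable least squares
# equivalent but less stable: np.linalg.solve(X.T @ X, X.T @ y)
```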
Closed Form – why not then?
• It is straightforward to write down a set of equations, one per datum, and to solve for the best set of parameters. So why don't we?
• We want a method that can be generalized to non-convex, non-linear error models.
  – The analytic solution relies on the model being linear and having a squared error measure.
  – Iterative methods are usually less efficient but they are much easier to generalize.
In summary (the big expressions)

Learning: finds the parameters that minimize some objective function.

We minimize the sum of the squares:

$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2$
Gradient Descent

• In order to apply GD to linear regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives).
Gradient Descent

• There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance.
• Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
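A minimal gradient-descent loop for linear regression using the gradient-norm stopping rule described above (function name, learning rate, and tolerance are illustrative):

```python
import numpy as np

def gradient_descent(X, y, lr=0.001, tol=1e-6, max_iters=100_000):
    """Minimize J(theta) = 1/2 ||X theta - y||^2 by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y)          # gradient of the objective
        if np.linalg.norm(grad) < tol:        # converged: L2 norm of gradient is small
            break
        theta -= lr * grad
    return theta
```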
Example: your favorite EM instance
• As an example, a mixture model induces a multi-modal likelihood.
• Hence gradient ascent can only find a local maximum.

© Eric Xing @ CMU, 2006-2011


