Back Propagation


Back-Propagation

[Figure: multi-layer perceptron - inputs X1..Xn (Layer 0, Input), hidden layer m with neurons #1..#k and outputs x'1, x'2, ..., x'i, and output layer m+1 with neurons #1..#p producing Y1(m+1)..Yp(m+1) (Output)]

Multi-Layer Perceptron

Back Propagation
Supervised learning mechanism for multilayered, generalized feed-forward networks
Discovered independently by several researchers [Werbos (1974), Parker (1982), and Rumelhart (1986)]
Played a major role in the reemergence of neural networks as a tool for solving a wide variety of problems

Back Propagation
The most well-known and widely used of the current types of NN systems
Uses a differentiable activation function
Robust and stable (based on the gradient descent technique, i.e. approximate steepest descent)
Recognizes patterns similar to those it has learned (no ability to recognize genuinely new patterns - true for all supervised learning)

BackProp: Basic Neuron

Activation functions

The sigmoid function performs a soft threshold that is rounded and differentiable, in contrast to the hard step function.
Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold

Sigmoid Function
Nonlinear activation function
Smooth and differentiable
Relatively flat at both ends
Rapid rise in the middle
Advantage of automatic gain control (no saturation)

Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold
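As an illustrative sketch (not from the slides), the logistic sigmoid and its derivative in NumPy:

    import numpy as np

    def sigmoid(x):
        # Logistic sigmoid: a smooth, differentiable soft threshold with outputs in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        # The derivative can be written in terms of the function value: F'(x) = F(x) * (1 - F(x))
        s = sigmoid(x)
        return s * (1.0 - s)

    print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5 0.25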

BP-Model

Limin Fu, Neural Networks in Computer Intelligence, McGraw Hill

3-layer BP network

Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold

5-layer BP network

Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold

BP Learning Procedure
The back-prop algorithm is an iterative gradient algorithm designed to minimize the mean square error between the actual output of a multilayer feed-forward perceptron and the desired output. It requires continuous differentiable nonlinearities. The following assumes a sigmoid logistic non-linearity is used.
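For concreteness, a minimal sketch of the quantity being minimized, assuming NumPy arrays of actual and desired outputs (names are illustrative):

    import numpy as np

    def mean_square_error(actual, desired):
        # Mean square error between the network's actual outputs and the desired outputs
        return np.mean((desired - actual) ** 2)

    print(mean_square_error(np.array([0.2, 0.9]), np.array([0.0, 1.0])))   # 0.025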

BP Learning Procedure
[Figure: the multi-layer perceptron again - inputs X1..Xn (Layer 0), hidden layer m with neurons #1..#k, and output layer m+1 with neurons #1..#p producing Y1(m+1)..Yp(m+1)]

BP Learning Procedure
Step 1: Initialize Weights (Wkm1..Wkmn) and Threshold (Wkm0)
Set all weights and (optionally) the thresholds to small bipolar random values. Note that k denotes neuron k of layer m and n is the total number of inputs to neuron k.
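A minimal sketch of Step 1 in NumPy, treating the threshold as the weight on the fixed bias input x0 = 1; the ±0.5 range is an assumption, since the slide's range did not survive extraction:

    import numpy as np

    rng = np.random.default_rng(0)

    def init_neuron_weights(n, low=-0.5, high=0.5):
        # One weight per input (w_km1..w_kmn) plus the threshold w_km0 for the fixed bias x0 = 1
        return rng.uniform(low, high, size=n + 1)

    w = init_neuron_weights(3)   # small bipolar random values for a neuron with 3 inputs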

BP Learning Procedure
Step 2:
Present New Input and Desired Output
Present the input vector x1, x2, ..., xn along with the desired output dk(t).
Note:
** x0 is a fixed bias input and is always set equal to 1.
** dk(t) is the desired output for output neuron k and takes a value between 0 and 1.

** The input could be new on each trial or samples from the training set could be presented cyclically until weights stabilize.
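A small sketch of Step 2 (illustrative names), pairing a bias-augmented input vector with its desired output:

    import numpy as np

    def present_pattern(x, d_k):
        # Prepend the fixed bias input x0 = 1 to the raw input vector x1..xn
        return np.concatenate(([1.0], x)), d_k

    x_aug, d = present_pattern(np.array([0.3, 0.7]), 1.0)   # desired output in [0, 1]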

BP Learning Procedure
Step 3:
Calculate Actual Outputs [ykm(t)]
ykm(t) = Fs( Σ (i = 0 to n) wkmi(t) · xi(t) )

where
Fs(sum) = 1 / (1 + e^(-sum))
wkmi(t) is the weight associated with the i-th input to neuron k of layer m
ykm(t) is the output from neuron k of layer m
xi(t) is the i-th input to neuron k of layer m
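A minimal NumPy sketch of Step 3 for one layer, assuming the weight matrix W holds one row per neuron with the threshold in column 0 and that x already contains the bias x0 = 1:

    import numpy as np

    def Fs(s):
        # Fs(sum) = 1 / (1 + e^(-sum))
        return 1.0 / (1.0 + np.exp(-s))

    def layer_forward(W, x):
        # y_km(t) = Fs( sum over i of w_kmi(t) * x_i(t) ), computed for every neuron k in the layer
        return Fs(W @ x)

    # Example: a layer of 2 neurons with 3 inputs; column 0 holds the thresholds, x0 = 1 is the bias
    W = np.array([[ 0.1,  0.4, -0.2,  0.3],
                  [-0.3,  0.2,  0.5, -0.1]])
    x = np.array([1.0, 0.3, 0.7, 0.2])
    y = layer_forward(W, x)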

BP Forward Pass

ykm(t) = Fs( Σ (i = 0 to n) wkmi(t) · xi(t) )

BP - Forward Pass: Example

Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold

BP Learning Procedure
Step 4: Adapt Weights
Use a recursive algorithm, starting at the output nodes and working back to the first hidden layer. Adjust the weights by

wkmi(t+1) = wkmi(t) + η · δkm · xi(t),   0 ≤ i ≤ n

where
η is the learning rate, usually a small number between 0 and 1 (typically ≤ 1/n)
xi(t) is the i-th input to neuron k of layer m
δkm is an error term associated with neuron k of layer m

BP Learning Procedure
δkm is defined as follows:
A) If m is the output layer:
δkm = ykm · (1 - ykm) · (dk - ykm)
B) If m is a hidden layer:
δkm = ykm · (1 - ykm) · Σ (q = 1 to p) wq(m+1)k(t) · δq(m+1)

where
ykm is the actual output of neuron k in layer m,
wq(m+1)k(t) is the weight associated with input k of neuron q in layer m+1,
p is the total number of neurons in layer m+1, and
δq(m+1) is the error term associated with neuron q of layer m+1
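A minimal NumPy sketch of these formulas; W_next is assumed to hold the layer m+1 weights on the layer m outputs only (threshold column excluded), and all names are illustrative:

    import numpy as np

    def delta_output(y, d):
        # Output layer: delta_km = y_km * (1 - y_km) * (d_k - y_km)
        return y * (1.0 - y) * (d - y)

    def delta_hidden(y, W_next, delta_next):
        # Hidden layer: delta_km = y_km * (1 - y_km) * sum over q of w_q(m+1)k * delta_q(m+1)
        return y * (1.0 - y) * (W_next.T @ delta_next)

    def update_weights(W, delta, x, eta=0.1):
        # w_kmi(t+1) = w_kmi(t) + eta * delta_km * x_i(t)
        return W + eta * np.outer(delta, x)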

BP Backward Pass

wkmi(t+1) = wkmi(t) + η · δkm · xi(t)
δkm = ykm · (1 - ykm) · (dk - ykm)

BP - Backward at Output

Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold

BP Backward Pass

wkmi(t+1) = wkmi(t) + η · δkm · xi(t)
δkm = ykm · (1 - ykm) · Σ (q = 1 to p) wq(m+1)k(t) · δq(m+1)

BP - Backward at Hidden layer

Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold

BP Learning Procedure
Step 5: Repeat Steps 2 to 4
Repeat until the differences between the desired outputs and the actual network outputs are within an acceptable error range (such as 1% error) for all input vectors in the training set.
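Putting Steps 2 through 5 together, a compact two-layer sketch trained on XOR; the network size, learning rate, and stopping threshold are illustrative assumptions, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def train_xor(eta=0.5, n_hidden=4, max_epochs=20000, tol=0.1):
        # Training set: XOR; each input vector already carries the fixed bias x0 = 1
        X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
        D = np.array([0.0, 1.0, 1.0, 0.0])
        W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, 3))      # hidden layer weights (incl. thresholds)
        W2 = rng.uniform(-0.5, 0.5, size=(1, n_hidden + 1))  # output layer weights (incl. threshold)
        for _ in range(max_epochs):
            worst = 0.0
            for x, d in zip(X, D):
                h = sigmoid(W1 @ x)                           # Step 3: forward pass, hidden layer
                h_b = np.concatenate(([1.0], h))              # bias input for the output layer
                y = sigmoid(W2 @ h_b)                         # forward pass, output layer
                delta_out = y * (1 - y) * (d - y)             # Step 4: output-layer error term
                delta_hid = h * (1 - h) * (W2[:, 1:].T @ delta_out)   # hidden-layer error terms
                W2 += eta * np.outer(delta_out, h_b)          # weight updates
                W1 += eta * np.outer(delta_hid, x)
                worst = max(worst, abs(d - y[0]))
            if worst < tol:                                   # Step 5: stop once all patterns are close enough
                return W1, W2
        return W1, W2

    W1, W2 = train_xor()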

Summary-BP Learning
Apply the next exemplar
Calculate the output of the network
Adjust the weights (error back-propagation)
Repeat while any pattern is not yet trained to the desired accuracy

Derivation of BP
BP learning rule

LiMin Fu, Neural Networks in Computer Intelligence, McGraw Hill

Derivation of BP

LiMin Fu, Neural Networks in Computer Intelligence, McGraw Hill

Derivation of BP

LiMin Fu, Neural Networks in Computer Intelligence, McGraw Hill

Derivation of BP
If the activation function is the sigmoid function F(x) = 1/(1 + e^(-x)),
then F'(x) = (1 - F(x)) · F(x).
Since F(x) = Oj:
At the output layer: δj = (1 - Oj) · Oj · (Tj - Oj)
At a hidden layer: δj = (1 - Oj) · Oj · Σk δk · Wkj

LiMin Fu, Neural Networks in Computer Intelligence, McGraw Hill
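A quick numerical check of the identity F'(x) = (1 - F(x)) · F(x) (illustrative, not from the text):

    import numpy as np

    F = lambda x: 1.0 / (1.0 + np.exp(-x))          # sigmoid activation
    x = np.linspace(-5.0, 5.0, 11)
    numeric = (F(x + 1e-6) - F(x - 1e-6)) / 2e-6    # central-difference approximation of F'(x)
    analytic = (1.0 - F(x)) * F(x)                  # the identity F'(x) = (1 - F(x)) * F(x)
    print(np.allclose(numeric, analytic, atol=1e-6))   # True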

Derivation of BP
If the activation function is the bipolar sigmoid function F(x) = 2/(1 + e^(-x)) - 1,
then F'(x) = (1/2) · (1 - F(x)) · (1 + F(x)).
Since F(x) = Oj:
At the output layer: δj = (1/2) · (1 - Oj) · (1 + Oj) · (Tj - Oj)
At a hidden layer: δj = (1/2) · (1 - Oj) · (1 + Oj) · Σk δk · Wkj

LiMin Fu, Neural Networks in Computer Intelligence, McGraw Hill
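The same kind of numerical check for the bipolar sigmoid (illustrative):

    import numpy as np

    F = lambda x: 2.0 / (1.0 + np.exp(-x)) - 1.0    # bipolar sigmoid activation
    x = np.linspace(-5.0, 5.0, 11)
    numeric = (F(x + 1e-6) - F(x - 1e-6)) / 2e-6    # central-difference approximation of F'(x)
    analytic = 0.5 * (1.0 - F(x)) * (1.0 + F(x))    # the identity F'(x) = 1/2 (1 - F(x)) (1 + F(x))
    print(np.allclose(numeric, analytic, atol=1e-6))   # True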

Choice of initial weights and Biases


Random Initialization
Influences whether the net reaches a global (or only a local) minimum of the error and, if so, how quickly it converges. Weights are commonly initialized to random values between -0.5 and 0.5 (or between -1 and 1).

Nguyen-Widrow Initialization
Designed to improve the ability of the hidden units to learn by distributing the initial weights and bias so that, for each input pattern, it is likely that the net input to one of the hidden units will be in the range in which that hidden neuron will learn most readily
Laurene Fausett, Fundamentals of Neural Networks, Prentice Hall

Nguyen-Widrow Initialization
Let
n = number of input units
p = number of hidden units
β = scale factor: β = 0.7 · p^(1/n)
For each hidden unit (j = 1, ..., p):
wij(old) = random number between -0.5 and 0.5
Compute ||wj(old)||
Reinitialize weights: wij = β · wij(old) / ||wj(old)||
Set bias: w0j = random number between -β and β
Laurene Fausett, Fundamentals of Neural Networks, Prentice Hall
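A minimal NumPy sketch of this initialization (variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def nguyen_widrow_init(n, p):
        # Scale factor: beta = 0.7 * p**(1/n)
        beta = 0.7 * p ** (1.0 / n)
        # Start from random weights between -0.5 and 0.5, one row per hidden unit j
        W = rng.uniform(-0.5, 0.5, size=(p, n))
        # Rescale each row so that ||w_j|| = beta
        W = beta * W / np.linalg.norm(W, axis=1, keepdims=True)
        # Biases drawn between -beta and beta
        b = rng.uniform(-beta, beta, size=p)
        return W, b

    W, b = nguyen_widrow_init(n=2, p=4)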

Nguyen-Widrow Initialization
Nguyen-Widrow analysis is based on the activation function tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Laurene Fausett, Fundamentals of Neural Networks, Prentice Hall
