Back Propagation
[Figure: Multi-layer perceptron. Inputs x1 .. xn enter Layer 0; hidden layers up to Layer m (neurons 1 .. k); output Layer m+1 (neurons 1 .. p) produces y1(m+1) .. yp(m+1).]
Multi-Layer Perceptron
Back Propagation
Supervised learning mechanism for multilayered, generalized feed-forward networks
Discovered independently by several researchers [Werbos (1974), Parker (1982), and Rumelhart (1986)]
Played a major role in the reemergence of neural networks as a tool for solving a wide variety of problems
The most well-known and widely used of the current types of NN systems
Uses a differentiable activation function
Robust and stable (based on the gradient-descent technique, i.e. approximate steepest descent)
Recognizes patterns similar to those it has learned (it cannot recognize genuinely new patterns; this is true of all supervised learning)
Activation functions
The sigmoid function performs a sort of soft threshold: rounded and differentiable, in contrast to the step function
Judith Dayhoff, Neural Network Architectures: An Introduction, Van Nostrand Reinhold
Sigmoid Function
Nonlinear activation function
Smooth, hence differentiable
Relatively flat at both ends
Rapid rise in the middle
Advantage of automatic gain control (no saturation)
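These properties can be checked with a few evaluations. A minimal sketch in Python (the function name `sigmoid` is mine, not from the source):

```python
import math

def sigmoid(x):
    """Logistic sigmoid: a smooth, differentiable soft threshold."""
    return 1.0 / (1.0 + math.exp(-x))

# Relatively flat at both ends, rapid rise in the middle:
low = sigmoid(-6)   # ~0.0025 (flat low end)
mid = sigmoid(0)    # 0.5     (steepest point)
high = sigmoid(6)   # ~0.9975 (flat high end)
```

The symmetry sigmoid(-x) = 1 - sigmoid(x) also follows directly from the definition.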
BP-Model
3 layered BP network
5 layered BP network
BP Learning Procedure
The back-prop algorithm is an iterative gradient algorithm designed to minimize the mean square error between the actual output of a multilayer feed-forward perceptron and the desired output. It requires continuous differentiable nonlinearities. The following assumes a sigmoid logistic non-linearity is used.
Step 1: Initialize Weights (Wkm1 .. Wkmn) and Threshold (Wkm0)
Set all weights and thresholds (optional) to small bipolar random values. Note that k represents neuron k of layer m and n represents the total number of inputs to neuron k.
Step 2:
Present New Input and Desired Output
Present the input vector x1, x2, ..., xn along with the desired output dk(t). Notes:
** x0 is a fixed bias input and is always set to 1.
** dk(t) is the desired output for output neuron k and takes a value between 0 and 1.
** The input could be new on each trial, or samples from the training set could be presented cyclically until the weights stabilize.
Step 3:
Calculate Actual Outputs [ykm(t)]

ykm(t) = Fs( Σi=0..n wkmi(t) · xi(t) )

where
Fs(sum) = 1/(1+e^-sum)
wkmi(t) is the weight associated with the i-th input to neuron k of layer m
ykm(t) is the output from neuron k of layer m
xi(t) is the i-th input to neuron k of layer m
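Step 3 for a single neuron can be sketched as follows. This is my own minimal illustration (the name `neuron_output` and the sample numbers are not from the source):

```python
import math

def neuron_output(weights, inputs):
    """ykm(t) = Fs( sum over i of wkmi(t) * xi(t) ), with x0 = 1 as the bias input."""
    x = [1.0] + list(inputs)           # x0 is the fixed bias input
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-s))  # Fs(sum) = 1/(1+e^-sum)

# weights[0] is the threshold wkm0; the rest pair with x1 .. xn
y = neuron_output([0.1, 0.4, -0.2], [1.0, 2.0])
```

Here the weighted sum is 0.1 + 0.4·1.0 - 0.2·2.0 = 0.1, so the output is Fs(0.1).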
BP Forward Pass
Step 4: Adapt Weights
Use a recursive algorithm starting at the output nodes and working back to the first hidden layer. Adjust weights by

wkmi(t+1) = wkmi(t) + η · δkm · xi(t),   0 ≤ i ≤ n

where
η is the learning rate and usually is a small number ranging from 0 to 1 (typically ≤ 1/n)
xi(t) is the i-th input to neuron k of layer m
δkm is an error term associated with neuron k of layer m
δkm is defined as
A) If m is the output layer: δkm = ykm · (1 - ykm) · (dk - ykm)
B) If m is a hidden layer: δkm = ykm · (1 - ykm) · Σq=1..p wq(m+1)k(t) · δq(m+1)

where
ykm is the actual output of neuron k in layer m,
wq(m+1)k(t) is a weight associated with input k of neuron q in layer m+1,
p is the total number of neurons in layer m+1, and
δq(m+1) is an error term associated with neuron q of layer m+1
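The two δ cases and the weight update can be written out directly. A minimal sketch (function names are mine; the numeric values in the usage line are arbitrary):

```python
def delta_output(y, d):
    """Output-layer error term: delta = y * (1 - y) * (d - y)."""
    return y * (1.0 - y) * (d - y)

def delta_hidden(y, next_weights, next_deltas):
    """Hidden-layer error term: delta = y * (1 - y) * sum_q w_q(m+1)k * delta_q(m+1)."""
    return y * (1.0 - y) * sum(w * dq for w, dq in zip(next_weights, next_deltas))

def update_weight(w, eta, delta, x):
    """wkmi(t+1) = wkmi(t) + eta * delta_km * xi(t)."""
    return w + eta * delta * x

# Example: output y = 0.5, target d = 1.0 gives delta = 0.125
d_out = delta_output(0.5, 1.0)
w_new = update_weight(0.2, 0.1, d_out, 1.0)
```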
BP Backward Pass
[Figures: backward pass at the output layer and at the hidden layers; the error term δ is computed at the output and propagated back through the weights.]
Step 5: Repeat Steps 2 to 4
Repeat until the difference between the desired outputs and the actual network outputs is within an acceptable error range (such as 1% error) for all the input vectors of the training set.
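Steps 2-5 can be combined into a complete training loop. The sketch below trains a small 2-3-1 network on the XOR problem using the update rules above; all names, the architecture, the learning rate of 0.5, and the epoch count are my own choices, not from the source:

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_step(w_hid, w_out, x, d, eta):
    """One pass of Steps 2-4: forward pass, then back-propagate and adapt weights."""
    xb = [1.0] + x                                       # x0 = 1 (bias input)
    h = [sigmoid(sum(w * xi for w, xi in zip(wj, xb))) for wj in w_hid]
    hb = [1.0] + h
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, hb)))
    delta_out = y * (1 - y) * (d - y)                    # output-layer error term
    delta_hid = [h[j] * (1 - h[j]) * w_out[j + 1] * delta_out
                 for j in range(len(h))]                 # hidden-layer error terms
    for i in range(len(w_out)):                          # adapt output weights
        w_out[i] += eta * delta_out * hb[i]
    for j, wj in enumerate(w_hid):                       # adapt hidden weights
        for i in range(len(wj)):
            wj[i] += eta * delta_hid[j] * xb[i]
    return (d - y) ** 2

random.seed(1)
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]       # XOR truth table
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(4)]

first_err = sum(train_step(w_hid, w_out, x, d, 0.5) for x, d in patterns)
for _ in range(5000):
    last_err = sum(train_step(w_hid, w_out, x, d, 0.5) for x, d in patterns)
```

Presenting the training patterns cyclically, as Step 2 allows, the summed squared error should fall as the weights stabilize.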
Summary-BP Learning
Apply the next exemplar
Calculate the output of the network
Adjust the weights (error back-propagation)
Repeat while all patterns are not yet trained to the desired accuracy
Derivation of BP
BP learning rule
If the activation function is the sigmoid function
F(x) = 1/(1+e^-x)
then
F'(x) = (1-F(x))·F(x)
Since F(x) = Oj:
At the output layer: δj = (1-Oj)·Oj·(Tj-Oj)
At a hidden layer: δj = (1-Oj)·Oj·Σk δk·Wkj
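The identity F'(x) = (1-F(x))·F(x) is easy to confirm numerically against a central-difference approximation; a quick check (my own, with an arbitrary test point):

```python
import math

def F(x):
    """Logistic sigmoid, as in the derivation above."""
    return 1.0 / (1.0 + math.exp(-x))

# Numerically verify F'(x) = F(x) * (1 - F(x)) at x = 0.7
x, h = 0.7, 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)   # central-difference derivative
analytic = F(x) * (1 - F(x))
```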
If the activation function is the bipolar sigmoid function
F(x) = 2/(1+e^-x) - 1
then
F'(x) = 1/2·(1-F(x))·(1+F(x))
Since F(x) = Oj:
At the output layer: δj = 1/2·(1-Oj)·(1+Oj)·(Tj-Oj)
At a hidden layer: δj = 1/2·(1-Oj)·(1+Oj)·Σk δk·Wkj
Nguyen-Widrow Initialization
Designed to improve the ability of the hidden units to learn by distributing the initial weights and bias so that, for each input pattern, it is likely that the net input to one of the hidden units will be in the range in which that hidden neuron will learn most readily
Laurene Fausett, Fundamentals of Neural Networks, Prentice Hall
Nguyen-Widrow Initialization
Let
n = number of input units
p = number of hidden units
β = scale factor = 0.7·p^(1/n)
For each hidden unit (j = 1, ..., p):
wij(old) = random number between -0.5 and 0.5
Compute ||wj(old)||
Reinitialize weights: wij = β·wij(old) / ||wj(old)||
Set bias: w0j = random number between -β and β
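The procedure above translates directly into code. A sketch of the initialization as I read it from Fausett's description (the function name and seeding are my own):

```python
import math, random

def nguyen_widrow_init(n, p, seed=0):
    """Nguyen-Widrow initialization of the input-to-hidden weights.

    n: number of input units; p: number of hidden units.
    Returns (weights, biases) with weights[j][i] for hidden unit j, input i.
    """
    rng = random.Random(seed)
    beta = 0.7 * p ** (1.0 / n)                        # scale factor
    weights, biases = [], []
    for _ in range(p):
        wj = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        norm = math.sqrt(sum(w * w for w in wj))
        weights.append([beta * w / norm for w in wj])  # rescale row to length beta
        biases.append(rng.uniform(-beta, beta))        # bias drawn from (-beta, beta)
    return weights, biases
```

After reinitialization every hidden unit's weight vector has length exactly β, which is what places its net input in the neuron's most responsive range.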
Nguyen-Widrow Initialization
The Nguyen-Widrow analysis is based on the activation function tanh(x) = (e^x - e^-x)/(e^x + e^-x)