
Application of Soft Computing (KCS 056)
Date: the 13th of November 2021

Unit – ii

Neural Networks – ii

Back Propagation Networks

Backpropagation:

Backpropagation is a supervised learning algorithm for training multi-layer perceptrons
(artificial neural networks).

Why We Need Backpropagation?

While designing a neural network, we begin by initializing the weights with some random
values.

We cannot expect these initial weight values to be correct or to fit our model well.

Suppose that, with the weights we selected in the beginning, our model output is way
different from the actual output, i.e. the error value is huge.

Now, how will we reduce the error?

We need to change the parameters (weights) of the model in such a way that the error
becomes minimum.

To put it another way, we need to train our model.

One way to train our model is called Backpropagation. Consider the diagram below:

Let me summarize the steps for us:

• Calculate the error – How far is your model output from the actual output.
• Minimum Error – Check whether the error is minimized or not.
• Update the parameters – If the error is huge, update the parameters (weights and
biases). After that, check the error again. Repeat the process until the error becomes
minimum.
• Model is ready to make a prediction – Once the error becomes minimum, we can feed
some inputs to our model and it will produce the output.

By now it should be clear why we need Backpropagation and what it means to train a
model.

Now is the right time to understand what Backpropagation is.

What is Backpropagation?

The Backpropagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize the
error function are then considered to be a solution to the learning problem.

Let’s understand how it works with an example:

We have a dataset, which has labels.

Consider the below table:

Input   Desired Output
0       0
1       2
2       4

Now the output of our model when the ‘W’ value is 3:

Input   Desired Output   Model output (W=3)
0       0                0
1       2                3
2       4                6

Notice the difference between the model output and the desired output:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error
0       0                0                    0                0
1       2                3                    1                1
2       4                6                    2                4

Let’s change the value of ‘W’. Notice the error when ‘W’ = 4:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error   Model output (W=4)   Square Error
0       0                0                    0                0              0                    0
1       2                3                    1                1              4                    4
2       4                6                    2                4              8                    16

Now if we notice, when we increase the value of ‘W’ the error has increased. So, obviously there
is no point in increasing the value of ‘W’ further. But, what happens if I decrease the value of
‘W’? Consider the table below:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error   Model output (W=2)   Square Error
0       0                0                    0                0              0                    0
1       2                3                    1                1              2                    0
2       4                6                    2                4              4                    0

Now, what we did here:

• We first initialized ‘W’ with some random value and propagated forward.
• Then, we noticed that there is some error. To reduce that error, we propagated backwards
and increased the value of ‘W’.
• After that, we noticed that the error has increased. We came to know that we can’t
increase the ‘W’ value.
• So, we again propagated backwards and decreased the ‘W’ value.
• Now, we noticed that the error has reduced.

So, we are trying to get the value of weight such that the error becomes minimum. Basically, we
need to figure out whether we need to increase or decrease the weight value. Once we know that,
we keep on updating the weight value in that direction until error becomes minimum. You might
reach a point, where if you further update the weight, the error will increase. At that time you
need to stop, and that is your final weight value.
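The trial-and-error search over ‘W’ can be sketched in a few lines of Python. This is only an illustration: the model form output = W * input is inferred from the tables above, and the helper name total_square_error is hypothetical.

```python
# Dataset from the tables above: (input, desired output) pairs.
data = [(0, 0), (1, 2), (2, 4)]

def total_square_error(w):
    """Sum of squared errors for the model output = w * input."""
    return sum((w * x - desired) ** 2 for x, desired in data)

# Try the same three weight values as in the tables.
for w in (3, 4, 2):
    print(f"W={w}: total square error = {total_square_error(w)}")
```

Increasing W from 3 to 4 raises the total square error, while decreasing it to 2 brings the error to zero, which matches the direction found in the walkthrough.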

Consider the graph below:

We need to reach the ‘Global Loss Minimum’.

This is nothing but Backpropagation.

Let’s now understand the math behind Backpropagation.

How Backpropagation Works?

Consider the below Neural Network:

The above network contains the following:

• two inputs
• two hidden neurons
• two output neurons
• two biases

Below are the steps involved in Backpropagation:

• Step – 1: Forward Propagation
• Step – 2: Backward Propagation
• Step – 3: Putting all the values together and calculating the updated weight value

Step – 1: Forward Propagation

We will start by propagating forward.

We will repeat this process for the output layer neurons, using the output from the hidden layer
neurons as inputs.

Now, let’s see what is the value of the error:
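The figures with the actual numbers are not reproduced in this copy, so here is a minimal Python sketch of a forward pass and error calculation for a network of this shape. Everything numeric below is an assumption chosen for illustration (inputs, weights, biases, targets), as are the sigmoid activation, the squared-error formula, and the wiring of W5 to W8 from the hidden neurons to O1 and O2.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only (not read from the original figure).
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden weights
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output weights
b1, b2 = 0.35, 0.60                        # hidden-layer bias, output-layer bias
t1, t2 = 0.01, 0.99                        # target outputs for O1 and O2

# Hidden layer: net input, then activation.
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)

# Output layer: the hidden outputs act as inputs.
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

# Total error as the sum of half squared differences (assumed error function).
e_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2
print(out_o1, out_o2, e_total)
```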

Step – 2: Backward Propagation

Now, we will propagate backwards. This way we will try to reduce the error by changing the
values of weights and biases.

Consider W5. We will calculate the rate of change of the error w.r.t. the change in weight W5.

Since we are propagating backwards, the first thing we need to do is calculate the change in the
total error w.r.t. the outputs O1 and O2.

Now, we will propagate further backwards and calculate the change in output O1 w.r.t. its total
net input.

Now let’s see how much the total net input of O1 changes w.r.t. W5.

Step – 3: Putting all the values together and calculating the updated weight value

Now, let’s put all the values together:

Let’s calculate the updated value of W5:
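The worked numbers are not reproduced in this copy. As a sketch of what the three partial results combine into, the chain rule for W5 and the resulting update can be written as below. The symbol names (E_total, out_O1, net_O1, out_H1, and the learning rate \eta) are this sketch's notation, and the closed forms in the second line assume a squared-error function and a sigmoid activation, with W5 connecting hidden neuron H1 to O1.

```latex
\frac{\partial E_{total}}{\partial W_5}
  = \frac{\partial E_{total}}{\partial out_{O1}}
    \cdot \frac{\partial out_{O1}}{\partial net_{O1}}
    \cdot \frac{\partial net_{O1}}{\partial W_5}

\frac{\partial E_{total}}{\partial W_5}
  = -(target_{O1} - out_{O1}) \cdot out_{O1}\,(1 - out_{O1}) \cdot out_{H1}

W_5^{new} = W_5 - \eta \, \frac{\partial E_{total}}{\partial W_5}
```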

• Similarly, we can calculate the other weight values as well.
• After that we will again propagate forward and calculate the output. Again, we will
calculate the error.
• If the error is minimum we will stop right there, else we will again propagate backwards
and update the weight values.
• This process will keep on repeating until error becomes minimum.

Types of Backpropagation Network

Static Backpropagation

Static backpropagation is one type of network that aims to produce a mapping from a static input
to a static output. These kinds of networks are capable of solving static classification problems
like optical character recognition (OCR).

Recurrent Backpropagation

The recurrent backpropagation is another type of network, employed in fixed-point learning. The
activations in recurrent backpropagation are fed forward until they attain a fixed value. Following
this, an error is calculated and propagated backward. The NeuroSolutions software has the ability
to perform recurrent backpropagation.

The key difference: static backpropagation offers an immediate mapping, while the mapping in
recurrent backpropagation is not immediate.

Conclusion:

Well, if I have to conclude Backpropagation, the best option is to write pseudo code for the
same.
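The pseudo code itself is not reproduced in this copy. As a stand-in, here is a minimal Python sketch of the loop described in the steps above (forward pass, error, backward pass, weight update). It is not the original author's code: the 2-2-1 network size, the sigmoid activation, the squared error, the logical OR dataset, the starting weights, and the learning rate of 0.5 are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (logical OR), chosen only for illustration.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

# Fixed, arbitrary starting weights so the run is repeatable.
w_hidden = [[0.1, -0.2], [0.3, 0.4]]   # w_hidden[j][i]: input i -> hidden j
b_hidden = [0.0, 0.0]
w_out = [0.2, -0.1]                    # hidden j -> output neuron
b_out = 0.0
lr = 0.5                               # learning rate (assumed)

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], x)) + b_hidden[j])
         for j in range(2)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)
    return h, o

def total_error():
    return sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in data)

initial_error = total_error()
for epoch in range(2000):
    for x, t in data:
        h, o = forward(x)                  # Step 1: forward propagation
        delta_o = (o - t) * o * (1 - o)    # Step 2: error signal at the output
        delta_h = [delta_o * w_out[j] * h[j] * (1 - h[j]) for j in range(2)]
        for j in range(2):                 # Step 3: update weights and biases
            w_out[j] -= lr * delta_o * h[j]
            for i in range(2):
                w_hidden[j][i] -= lr * delta_h[j] * x[i]
            b_hidden[j] -= lr * delta_h[j]
        b_out -= lr * delta_o
final_error = total_error()
print(initial_error, final_error)
```

The hidden-layer error signals are computed before any weight is changed, so every update in a step uses the weights from the same forward pass.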

Perceptron Architecture

A perceptron is a single-layer neural network, and a multi-layer perceptron is called a neural
network.

The perceptron is a linear (binary) classifier. It is used in supervised learning and helps to
classify the given input data. But how does it work?

A typical neural network looks like this:

As we can see it has multiple layers.

The perceptron consists of 4 parts.

1. Input values or One input layer

2. Weights and Bias

3. Net sum

4. Activation Function

Neural networks work the same way as the perceptron. So, to understand how a neural
network works, first learn how a perceptron works.

Fig : Perceptron

But how does it work?

The perceptron works on these simple steps

a. All the inputs x are multiplied with their weights w. Let’s call it k.

Fig: Multiplying inputs with weights for 5 inputs

b. Add all the multiplied values and call them Weighted Sum.

Fig: Adding with Summation

c. Apply that weighted sum to the correct Activation Function.

For Example: Unit Step Activation Function.

Fig: Unit Step Activation Function
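Steps a to c can be sketched in Python as follows. The input values, weights, and bias are hypothetical, and the convention that the unit step outputs 1 at exactly zero is an assumption, since the figure is not reproduced here.

```python
def unit_step(z):
    """Unit step activation: 1 if z >= 0, else 0 (assumed convention at 0)."""
    return 1 if z >= 0 else 0

def perceptron_output(inputs, weights, bias):
    products = [x * w for x, w in zip(inputs, weights)]   # step a: k = x * w
    weighted_sum = sum(products) + bias                    # step b: weighted sum
    return unit_step(weighted_sum)                         # step c: activation

# Hypothetical numbers for a perceptron with 5 inputs.
x = [1.0, 0.0, 1.0, 1.0, 0.0]
w = [0.2, -0.5, 0.1, -0.4, 0.3]
print(perceptron_output(x, w, bias=0.05))
```

With these numbers the weighted sum is slightly negative, so the perceptron outputs 0; raising the bias to 0.5 would push the sum above zero and flip the output to 1.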

Why do we need Weights and Bias?

Weights show the strength of the particular node.

A bias value allows you to shift the activation function curve left or right along the input
axis, so the point at which the neuron activates can move away from zero.

Why do we need Activation Function?

In short, the activation functions are used to map the input between the required values like
(0, 1) or (-1, 1).

Where we use Perceptron?

Perceptron is usually used to classify the data into two parts. Therefore, it is also known as
a Linear Binary Classifier.

Perceptron Algorithm
The Perceptron algorithm is a two-class (binary) classification machine learning algorithm.
It is a type of neural network model, perhaps the simplest type of neural network model.

It consists of a single node or neuron that takes a row of data as input and predicts a class label.
This is achieved by calculating the weighted sum of the inputs and a bias (set to 1). The weighted
sum of the input of the model is called the activation.

• Activation = Weights * Inputs + Bias

If the activation is above 0.0, the model will output 1.0; otherwise, it will output 0.0.

• Predict 1: If Activation > 0.0
• Predict 0: If Activation <= 0.0

Given that the inputs are multiplied by model coefficients, like linear regression and logistic
regression, it is good practice to normalize or standardize data prior to using the model.

The Perceptron is a linear classification algorithm. This means that it learns a decision boundary
that separates two classes using a line (called a hyperplane) in the feature space. As such, it is
appropriate for those problems where the classes can be separated well by a line or linear model,
referred to as linearly separable.

The coefficients of the model are referred to as input weights and are trained using the stochastic
gradient descent optimization algorithm.

Examples from the training dataset are shown to the model one at a time, the model makes a
prediction, and error is calculated. The weights of the model are then updated to reduce the errors
for the example. This is called the Perceptron update rule. This process is repeated for all
examples in the training dataset, called an epoch. This process of updating the model using
examples is then repeated for many epochs.
Model weights are updated with a small proportion of the error each batch, and the proportion is
controlled by a hyperparameter called the learning rate, typically set to a small value. This is to
ensure learning does not occur too quickly, resulting in a possibly lower skill model, referred to
as premature convergence of the optimization (search) procedure for the model weights.

• weights(t + 1) = weights(t) + learning_rate * (expected_i – predicted_i) * input_i


Training is stopped when the error made by the model falls to a low level or no longer improves,
or a maximum number of epochs is performed.
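The update rule and stopping conditions can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the logical AND dataset, the zero starting weights, and the function names predict and train are assumptions, and shuffling between epochs is left out so that the run is repeatable.

```python
# Toy, linearly separable dataset (logical AND); illustrative only.
dataset = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def predict(weights, bias, inputs):
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0.0 else 0

def train(dataset, learning_rate=0.1, max_epochs=100):
    weights, bias = [0.0, 0.0], 0.0
    for epoch in range(max_epochs):
        errors = 0
        for inputs, expected in dataset:
            error = expected - predict(weights, bias, inputs)
            if error != 0:
                errors += 1
                # weights(t + 1) = weights(t) + learning_rate * error * input_i
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error   # the bias input is fixed at 1
        if errors == 0:    # stop once a full epoch makes no mistakes
            break
    return weights, bias

weights, bias = train(dataset)
print([predict(weights, bias, x) for x, _ in dataset])
```

Because the data is linearly separable, the loop ends well before the epoch limit; on data that is not linearly separable only the maximum number of epochs would stop it.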

The initial values for the model weights are set to small random values. Additionally, the training
dataset is shuffled prior to each training epoch. This is by design to accelerate and improve the
model training process. Because of this, the learning algorithm is stochastic and may achieve
different results each time it is run. As such, it is good practice to summarize the performance of
the algorithm on a dataset using repeated evaluation and reporting the mean classification
accuracy.

The learning rate and number of training epochs are hyperparameters of the algorithm that can be
set using heuristics or hyperparameter tuning.

Multi-Layer Perceptrons

The field of artificial neural networks is often just called neural networks or multi-layer
perceptrons after perhaps the most useful type of neural network. A perceptron is a single neuron
model that was a precursor to larger neural networks.

It is a field that investigates how simple models of biological brains can be used to solve difficult
computational tasks like the predictive modeling tasks we see in machine learning. The goal is
not to create realistic models of the brain, but instead to develop robust algorithms and data
structures that we can use to model difficult problems.

The power of neural networks comes from their ability to learn the representation in your
training data and how to best relate it to the output variable that you want to predict. In this sense
neural networks learn a mapping. Mathematically, they are capable of learning any mapping
function and have been proven to be a universal approximation algorithm.

The predictive capability of neural networks comes from the hierarchical or multi-layered
structure of the networks. The data structure can pick out (learn to represent) features at different
scales or resolutions and combine them into higher-order features. For example from lines, to
collections of lines to shapes.

2. Neurons
The building blocks for neural networks are artificial neurons.

These are simple computational units that have weighted input signals and produce an output
signal using an activation function.

Model of a Simple Neuron

Neuron Weights
You may be familiar with linear regression, in which case the weights on the inputs are very
much like the coefficients used in a regression equation.

Like linear regression, each neuron also has a bias which can be thought of as an input that
always has the value 1.0 and it too must be weighted.

For example, a neuron may have two inputs in which case it requires three weights. One for each
input and one for the bias.

Weights are often initialized to small random values, such as values in the range 0 to 0.3,
although more complex initialization schemes can be used.

Like linear regression, larger weights indicate increased complexity and fragility. It is desirable
to keep weights in the network small and regularization techniques can be used.

Activation
The weighted inputs are summed and passed through an activation function, sometimes called a
transfer function.

An activation function is a simple mapping of summed weighted input to the output of the
neuron. It is called an activation function because it governs the threshold at which the neuron is
activated and strength of the output signal.

Historically simple step activation functions were used where if the summed input was above a
threshold, for example 0.5, then the neuron would output a value of 1.0, otherwise it would
output a 0.0.

Traditionally non-linear activation functions are used. This allows the network to combine the
inputs in more complex ways and in turn provide a richer capability in the functions they can
model. Non-linear functions like the logistic also called the sigmoid function were used that
output a value between 0 and 1 with an s-shaped distribution, and the hyperbolic tangent function
also called tanh that outputs the same distribution over the range -1 to +1.

More recently the rectifier activation function has been shown to provide better results.
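The activation functions named above can be written out in a few lines of Python using only the standard library. The function names and the sample inputs are illustrative choices.

```python
import math

def step(z, threshold=0.5):
    """Step activation: 1.0 above the threshold, otherwise 0.0."""
    return 1.0 if z > threshold else 0.0

def sigmoid(z):
    """Logistic function: output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: output between -1 and +1."""
    return math.tanh(z)

def relu(z):
    """Rectifier: passes positive input through, clips negative input to 0."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, step(z), sigmoid(z), tanh(z), relu(z))
```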

3. Networks of Neurons
Neurons are arranged into networks of neurons.

A row of neurons is called a layer and one network can have multiple layers. The architecture of
the neurons in the network is often called the network topology.

Model of a Simple Network

Input or Visible Layers

The bottom layer that takes input from your dataset is called the visible layer, because it is the
exposed part of the network. Often a neural network is drawn with a visible layer with one
neuron per input value or column in your dataset. These are not neurons as described above, but
simply pass the input value through to the next layer.

Hidden Layers
Layers after the input layer are called hidden layers because they are not directly exposed to the
input. The simplest network structure is to have a single neuron in the hidden layer that directly
outputs the value.

Given increases in computing power and efficient libraries, very deep neural networks can be
constructed. Deep learning can refer to having many hidden layers in your neural network. They
are deep because they would have been unimaginably slow to train historically, but may take
seconds or minutes to train using modern techniques and hardware.

Output Layer
The final layer is called the output layer and it is responsible for outputting a value or
vector of values that correspond to the format required for the problem.

The choice of activation function in the output layer is strongly constrained by the type of
problem that you are modeling. For example:

• A regression problem may have a single output neuron and the neuron may have no activation
function.
• A binary classification problem may have a single output neuron and use a sigmoid activation
function to output a value between 0 and 1 to represent the probability of predicting a value for
the class 1. This can be turned into a crisp class value by using a threshold of 0.5, snapping values
less than the threshold to 0 and all other values to 1.
• A multi-class classification problem may have multiple neurons in the output layer, one for each
class (e.g. three neurons for the three classes in the famous iris flowers classification problem).
In this case a softmax activation function may be used to output a probability of the network
predicting each of the class values. Selecting the output with the highest probability can be used
to produce a crisp class classification value.
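The two classification cases can be sketched in Python. The raw scores and the helper names softmax and to_class are hypothetical; subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def to_class(probability, threshold=0.5):
    """Binary case: snap a sigmoid output to a crisp class value."""
    return 0 if probability < threshold else 1

# Multi-class case: three hypothetical output scores, one per class.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class, to_class(0.73))
```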

4. Training Networks
Once configured, the neural network needs to be trained on your dataset.

Data Preparation
You must first prepare your data for training on a neural network.

Data must be numerical, for example real values. If you have categorical data, such as a sex
attribute with the values “male” and “female”, you can convert it to a real-valued representation
called a one hot encoding. This is where one new column is added for each class value (two
columns in the case of sex of male and female) and a 0 or 1 is added for each row depending on
the class value for that row.
This same one hot encoding can be used on the output variable in classification problems with
more than one class. This would create a binary vector from a single column that would be easy
to directly compare to the output of the neuron in the network’s output layer, that as described
above, would output one value for each class.

Neural networks require the input to be scaled in a consistent way. You can rescale it to the
range between 0 and 1 called normalization. Another popular technique is to standardize it so
that the distribution of each column has the mean of zero and the standard deviation of 1.
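A minimal sketch of these three preparation steps in plain Python is shown below. The helper names are hypothetical, the one hot columns are ordered alphabetically by class value as an arbitrary choice, and the standardization uses the population standard deviation. A column with a single repeated value would divide by zero here and would need special handling.

```python
def one_hot(values):
    """One new 0/1 column per distinct class value (columns sorted by name)."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values]

def normalize(column):
    """Rescale a numeric column to the range 0 to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]

print(one_hot(["male", "female", "female"]))
print(normalize([10.0, 20.0, 30.0]))
print(standardize([10.0, 20.0, 30.0]))
```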

Scaling also applies to image pixel data. Data such as words can be converted to integers, such as
the popularity rank of the word in the dataset and other encoding techniques.

Stochastic Gradient Descent


The classical and still preferred training algorithm for neural networks is called stochastic
gradient descent.

This is where one row of data is exposed to the network at a time as input. The network
processes the input upward activating neurons as it goes to finally produce an output value. This
is called a forward pass on the network. It is the type of pass that is also used after the network is
trained in order to make predictions on new data.

The output of the network is compared to the expected output and an error is calculated. This
error is then propagated back through the network, one layer at a time, and the weights are
updated according to the amount that they contributed to the error. This clever bit of math is
called the backpropagation algorithm.
The process is repeated for all of the examples in your training data. One round of updating the
network for the entire training dataset is called an epoch. A network may be trained for tens,
hundreds or many thousands of epochs.

Weight Updates
The weights in the network can be updated from the errors calculated for each training example
and this is called online learning. It can result in fast but also chaotic changes to the network.

Alternatively, the errors can be saved up across all of the training examples and the network can
be updated at the end. This is called batch learning and is often more stable.

Typically, because datasets are so large and because of computational efficiencies, the size of the
batch, the number of examples the network is shown before an update is often reduced to a small
number, such as tens or hundreds of examples.

The amount that weights are updated is controlled by a configuration parameter called the
learning rate. It is also called the step size and controls the step or change made to network
weight for a given error. Often small step sizes are used such as 0.1 or 0.01 or smaller.

The update equation can be complemented with additional configuration terms that you can set.

• Momentum is a term that incorporates the properties from the previous weight update to allow
the weights to continue to change in the same direction even when there is less error being
calculated.

• Learning Rate Decay is used to decrease the learning rate over epochs to allow the network to
make large changes to the weights at the beginning and smaller fine tuning changes later in the
training schedule.
Prediction
Once a neural network has been trained it can be used to make predictions.

You can make predictions on test or validation data in order to estimate the skill of the model on
unseen data. You can also deploy it operationally and use it to make predictions continuously.

The network topology and the final set of weights is all that you need to save from the model.
Predictions are made by providing the input to the network and performing a forward-pass
allowing it to generate an output that you can use as a prediction.

Effect of learning rule coefficient

What Is the Learning Rate?


Deep learning neural networks are trained using the stochastic gradient descent algorithm.

Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the
current state of the model using examples from the training dataset, then updates the weights of
the model using the back-propagation of errors algorithm, referred to as simply backpropagation.
Specifically, the learning rate is a configurable hyperparameter used in the training of neural
networks that has a small positive value, often in the range between 0.0 and 1.0.

The learning rate is often represented using the notation of the lowercase Greek letter eta (η).
During training, the backpropagation of error estimates the amount of error for which the
weights of a node in the network are responsible. Instead of updating the weight with the full
amount, it is scaled by the learning rate.

For example, with a learning rate of 0.1, a traditionally common default value, the
weights in the network are updated by 0.1 * (estimated weight error), or 10% of the estimated
weight error, each time the weights are updated.

Effect of Learning Rate
A neural network learns or approximates a function to best map inputs to outputs from examples
in the training dataset.

The learning rate hyperparameter controls the rate or speed at which the model learns.
Specifically, it controls the amount of apportioned error that the weights of the model are
updated with each time they are updated, such as at the end of each batch of training examples.

Given a perfectly configured learning rate, the model will learn to best approximate the function
given available resources (the number of layers and the number of nodes per layer) in a given
number of training epochs (passes through the training data).

Generally, a large learning rate allows the model to learn faster, at the cost of arriving on a sub-
optimal final set of weights. A smaller learning rate may allow the model to learn a more optimal
or even globally optimal set of weights but may take significantly longer to train.

At extremes, a learning rate that is too large will result in weight updates that will be too large
and the performance of the model (such as its loss on the training dataset) will oscillate over
training epochs. Oscillating performance is said to be caused by weights that diverge (are
divergent). A learning rate that is too small may never converge or may get stuck on a
suboptimal solution.
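The three regimes can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss L(w) = w**2, which is an assumption chosen because its gradient (2 * w) and minimum (w = 0) are known; the specific learning rates and the starting point are illustrative.

```python
def gradient_descent(learning_rate, steps=20, w=5.0):
    """Minimise the toy loss L(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {gradient_descent(lr):.4f}")
```

With the smallest rate the weight has barely moved from its starting value after 20 steps, with the middle rate it is close to the minimum, and with the largest rate each step overshoots the minimum by more than the last, so the weight oscillates in sign and grows in size.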

Therefore, we should not use a learning rate that is too large or too small. Nevertheless, we must
configure the model in such a way that on average a “good enough” set of weights is found to
approximate the mapping problem as represented by the training dataset.

How to Configure Learning Rate


It is important to find a good value for the learning rate for your model on your training dataset.

The learning rate may, in fact, be the most important hyperparameter to configure for your
model.

In fact, if there are resources to tune hyperparameters, much of this time should be dedicated to
tuning the learning rate.

Unfortunately, we cannot analytically calculate the optimal learning rate for a given model on a
given dataset. Instead, a good (or good enough) learning rate must be discovered via trial and
error.

The range of values to consider for the learning rate is less than 1.0 and greater than 10^-6.

The learning rate will interact with many other aspects of the optimization process, and the
interactions may be nonlinear. Nevertheless, in general, smaller learning rates will require more
training epochs. Conversely, larger learning rates will require fewer training epochs. Further,
smaller batch sizes are better suited to smaller learning rates given the noisy estimate of the error
gradient.
A traditional default value for the learning rate is 0.1 or 0.01, and this may represent a good
starting point on your problem.

Diagnostic plots can be used to investigate how the learning rate impacts the rate of learning and
learning dynamics of the model. One example is to create a line plot of loss over training epochs
during training. The line plot can show many properties, such as:

• The rate of learning over training epochs, such as fast or slow.
• Whether model has learned too quickly (sharp rise and plateau) or is learning too slowly (little or
no change).
• Whether the learning rate might be too large via oscillations in loss.

Configuring the learning rate is challenging and time-consuming.

Add Momentum to the Learning Process


Training a neural network can be made easier with the addition of history to the weight update.

Specifically, an exponentially weighted average of the prior updates to the weight can be
included when the weights are updated. This change to stochastic gradient descent is called
“momentum” and adds inertia to the update procedure, causing many past updates in one
direction to continue in that direction in the future.

Momentum can accelerate learning on those problems where the high-dimensional “weight
space” that is being navigated by the optimization process has structures that mislead the
gradient descent algorithm, such as flat regions or steep curvature.

Momentum is set to a value greater than 0.0 and less than one, where common values such as 0.9
and 0.99 are used in practice.

Momentum does not make it easier to configure the learning rate, as the step size is independent
of the momentum. Instead, momentum can improve the speed of the optimization process in
concert with the step size, improving the likelihood that a better set of weights is discovered in
fewer training epochs.
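One common way to write the momentum update is sketched below, assuming a velocity term that accumulates past updates. The toy loss w**2, the learning rate of 0.01, and the name sgd_momentum are illustrative assumptions; setting momentum to 0.0 reduces the loop to plain gradient descent.

```python
def sgd_momentum(grad_fn, w, learning_rate=0.01, momentum=0.9, steps=100):
    """Gradient descent where each update carries over part of the previous one."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w

def grad(w):
    return 2 * w          # gradient of the toy loss w**2

plain = sgd_momentum(grad, w=5.0, momentum=0.0)
with_momentum = sgd_momentum(grad, w=5.0, momentum=0.9)
print(plain, with_momentum)
```

With the same step size and number of steps, the run with momentum ends closer to the minimum at w = 0 on this toy problem.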

Use a Learning Rate Schedule


An alternative to using a fixed learning rate is to instead vary the learning rate over the training
process.

The way in which the learning rate changes over time (training epochs) is referred to as the
learning rate schedule or learning rate decay.

Perhaps the simplest learning rate schedule is to decrease the learning rate linearly from a large
initial value to a small value. This allows large weight changes in the beginning of the learning
process and small changes or fine-tuning towards the end of the learning process.

In fact, using a learning rate schedule may be a best practice when training neural networks. Instead
of choosing a fixed learning rate hyperparameter, the configuration challenge involves choosing the
initial learning rate and a learning rate schedule. It is possible that the choice of the initial learning
rate is less sensitive than choosing a fixed learning rate, given the better performance that a learning
rate schedule may permit.

The learning rate can be decayed to a small value close to zero. Alternately, the learning rate can be
decayed over a fixed number of training epochs, then kept constant at a small value for the remaining
training epochs to facilitate more time fine-tuning.
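A linear schedule that decays over a fixed number of epochs and then stays constant can be sketched as follows. The function name linear_decay and the default values are assumptions for illustration.

```python
def linear_decay(epoch, initial_lr=0.1, final_lr=0.001, decay_epochs=50):
    """Learning rate for a given epoch: linear decay, then held constant."""
    if epoch >= decay_epochs:
        return final_lr
    fraction = epoch / decay_epochs
    return initial_lr + fraction * (final_lr - initial_lr)

for epoch in (0, 25, 50, 75):
    print(epoch, linear_decay(epoch))
```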

Adaptive Learning Rates


The performance of the model on the training dataset can be monitored by the learning algorithm
and the learning rate can be adjusted in response.

This is called an adaptive learning rate.

Perhaps the simplest implementation is to make the learning rate smaller once the performance
of the model plateaus, such as by decreasing the learning rate by a factor of two or an order of
magnitude.

Alternately, the learning rate can be increased again if performance does not improve for a fixed
number of training epochs.

An adaptive learning rate method will generally outperform a model with a badly configured
learning rate.
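The simplest adaptive scheme mentioned above (shrink the learning rate once performance plateaus) can be sketched in Python. The function name, the patience of 3 epochs, the factor of 0.5, and the loss history are hypothetical.

```python
def reduce_on_plateau(losses, lr=0.1, factor=0.5, patience=3):
    """Return the learning rate used at each epoch for a given loss history."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        history.append(lr)
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:          # plateau: shrink the learning rate
                lr, wait = lr * factor, 0
    return history

print(reduce_on_plateau([1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6]))
```

Here the loss stays at 0.7 for three epochs after its last improvement, so the learning rate is halved for the final epoch.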

Factors affecting Back Propagation Training

Back Propagation : Learning Factors

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn’t
fully appreciated until a famous paper in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald
Williams. The paper describes several neural networks where backpropagation works far faster
than earlier approaches to learning, making it possible to use neural nets to solve problems which
had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning
in neural networks.

Although backpropagation is the most widely used and most successful algorithm for training
neural networks, there are several factors which affect the Error-Backpropagation
training algorithm. These factors are as follows.

1. Initial Weights

The initial weights of the neural network to be trained contribute to the final solution. Before
training, the weights of the network are assigned small random values drawn from a uniform
distribution. If all weights start out with equal values, there is a high chance of a bad solution
when the required final weights are unequal. Similarly, if the initial weights are not small random
uniform values (between 0 and 1), there is a chance of getting stuck in local minima, and because
the learning rate is small the steps towards the global minimum are very small. Since the error of
the network is a function of its weights, learning is very slow if the initial weights fall far from the
global minimum of the error plot. For faster learning, the initial weights of the network should
fall near the global minimum of the error plot. Many empirical studies of the algorithm point out
that continuing training beyond a certain low-error plateau results in an undesirable drift of
weights, which causes the error to increase and the quality of the mapping implemented by the
network to decrease. To address this issue, training should be restarted with newly initialized
random uniform weights.

2. Cumulative weight adjustment vs Incremental Updating

The Error backpropagation learning technique is based on single-pattern error reduction, in which
a small adjustment of the weights follows each training pattern in the training data. This
technique of adjusting the weights every time a pattern is applied to the network is called
Incremental Updating. The Error backpropagation or Gradient Descent technique can also implement
gradient descent minimization of the overall error function computed over a complete cycle of
patterns, provided that the learning constant is sufficiently small. This scheme is known as cumulative
weight adjustment and error for this technique is calculated using the following expression.
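The expression itself is not reproduced in this copy. A commonly used form of the cumulative error, offered here as a sketch rather than as the original formula, sums the squared output errors over all P training patterns and all K output neurons. The symbols d_{pk} (desired output) and o_{pk} (actual output) are this sketch's notation.

```latex
E = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} \left( d_{pk} - o_{pk} \right)^2
```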

Although both of these techniques can produce satisfactory solutions, note that training works
best under random conditions. For incremental updating, patterns of different classes should be
chosen randomly from the training set so that the network does not overfit to patterns of a
single class.
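
The difference between the two schemes can be sketched for a single linear neuron trained with the
delta rule. The linear neuron, the function names and the toy data are assumptions chosen to keep
the example short; a full backpropagation network would apply the same idea to every layer.

```python
import numpy as np

def incremental_epoch(w, X, d, eta, rng):
    """One epoch of incremental updating: adjust w after every pattern."""
    w = w.copy()
    for i in rng.permutation(len(X)):   # present patterns in random order
        error = d[i] - X[i] @ w         # error for this single pattern
        w += eta * error * X[i]         # immediate small adjustment
    return w

def cumulative_epoch(w, X, d, eta):
    """One epoch of cumulative adjustment: sum over all patterns, update once."""
    errors = d - X @ w                  # errors for the complete cycle
    return w + eta * X.T @ errors       # single adjustment per cycle

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))    # toy training patterns
true_w = np.array([0.5, -0.3])          # weights that generated the targets
d = X @ true_w

w_inc = np.zeros(2)
w_cum = np.zeros(2)
for _ in range(200):
    w_inc = incremental_epoch(w_inc, X, d, eta=0.05, rng=rng)
    w_cum = cumulative_epoch(w_cum, X, d, eta=0.05)

print(w_inc, w_cum)  # both should end up close to true_w
```

The random permutation inside incremental_epoch corresponds to the advice above about choosing
patterns randomly.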

3. The steepness of the activation function 𝜆

The gradient descent learning algorithm requires a continuous activation function; the most
commonly used one is the sigmoid function (unipolar or bipolar). The sigmoid function is
characterised by a steepness factor 𝜆. The derivative of the activation function serves as a
multiplying factor in building the error signal term of a neuron, so both the choice and the shape
of the activation function affect the speed of network learning. The derivative of the unipolar
sigmoid activation function is given by
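
The formula is missing from this copy of the notes. Assuming the usual definition of the unipolar
sigmoid with steepness 𝜆, the function and its derivative can be written as:

```latex
f(net) = \frac{1}{1 + e^{-\lambda\, net}}, \qquad
f'(net) = \frac{\lambda\, e^{-\lambda\, net}}{\left(1 + e^{-\lambda\, net}\right)^{2}}
        = \lambda\, f(net)\,\bigl(1 - f(net)\bigr)
```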

The following figure shows the slope of the activation function and illustrates how the
steepness 𝜆 affects the learning of the network.

Figure: Derivative of the activation function for different 𝜆 values

For a fixed learning constant, all weight adjustments are proportional to the steepness
coefficient 𝜆. Using a large value of 𝜆 therefore gives a result similar to using a large
learning constant 𝜂.
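
The effect of 𝜆 on the slope can be checked numerically. The sketch below assumes the standard
unipolar sigmoid f(net) = 1 / (1 + exp(-𝜆·net)); the function names are chosen here for
illustration.

```python
import numpy as np

def sigmoid(net, lam=1.0):
    """Unipolar sigmoid with steepness factor lam."""
    return 1.0 / (1.0 + np.exp(-lam * net))

def sigmoid_derivative(net, lam=1.0):
    """Derivative of the unipolar sigmoid: lam * f(net) * (1 - f(net))."""
    f = sigmoid(net, lam)
    return lam * f * (1.0 - f)

# The slope peaks at net = 0, where it equals lam / 4.
for lam in (0.5, 1.0, 2.0):
    print(lam, sigmoid_derivative(0.0, lam))
```

Because the derivative multiplies every weight adjustment, doubling 𝜆 doubles the peak slope, which
is why a large 𝜆 behaves much like a large 𝜂.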

4. Learning Constant 𝜂

The effectiveness and convergence of error backpropagation depend on the value of the learning
constant 𝜂. The amount by which the weights of the network are updated is directly proportional to
𝜂, so it scales the contribution of each neuron's error signal term to the weight update.
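
The formula is missing from this copy of the notes. One common form for an output-layer neuron is
given here as an assumption, where y is the input carried by the weight, d the desired output and o
the actual output (symbols not taken from the notes):

```latex
\Delta w = \eta\, \delta\, y, \qquad \delta = (d - o)\, f'(net)
```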

where δ is the error signal term of the neuron and f′(net) is the derivative of the activation function.

When we use a larger value of 𝜂, the network takes wider steps towards the global minimum of the
error plot. With such a large 𝜂, there is a chance of overshooting the global minimum if it lies
in a narrow region of the error plot. Similarly, if we use a smaller value of 𝜂, the network takes
shorter steps towards the global minimum, but in this case there is a chance of getting stuck in a
local minimum of the error plot. To overcome these situations, the weights of the network are
newly initialized with small random values and the network is retrained.
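
The effect of the step size can be illustrated on a one-dimensional error function. The choice
E(w) = w², the starting point and the three values of 𝜂 are assumptions made for this sketch. It
shows slow progress and overshooting only; it has no local minima.

```python
def gradient_descent(w0, eta, steps):
    """Minimise E(w) = w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w   # the gradient of w**2 is 2w
    return w

small = gradient_descent(1.0, eta=0.01, steps=50)     # short steps: slow progress
moderate = gradient_descent(1.0, eta=0.4, steps=50)   # reaches the minimum quickly
too_large = gradient_descent(1.0, eta=1.1, steps=50)  # overshoots and diverges

print(small, moderate, too_large)
```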

5. Momentum method

The momentum method addresses the convergence of the gradient descent learning algorithm. Its
purpose is to accelerate the convergence of learning. The method supplements the current weight
adjustment with a fraction of the most recent weight adjustment. This is usually done with the
following formula,
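
The formula is missing from this copy of the notes. The usual form of the momentum update, assuming
∇E(t) denotes the gradient of the error with respect to the weights at step t, is:

```latex
\Delta w(t) = -\eta\, \nabla E(t) + \alpha\, \Delta w(t-1)
```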

where t and t-1 denote the current and previous steps and 𝛂 is a user-selected momentum constant
(which should be positive). The second term on the right-hand side of the equation is the momentum
term. For a total of N steps using the momentum method, the weight change can be expressed as
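
This expression is also missing from this copy. Assuming the update has the usual form
Δw(t) = -𝜂 ∇E(t) + 𝛂 Δw(t-1) and that no adjustment precedes the first step, unrolling it over N
steps gives:

```latex
\Delta w(t) = -\eta \sum_{n=0}^{N} \alpha^{n}\, \nabla E(t-n)
```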

Typically 𝛂 is chosen between 0.1 and 0.8. By adding the momentum term, the weight adjustment is
enhanced by a fraction of the weight adjustment made at the previous step. This approach moves
towards the minimum of the error curve with much larger steps.
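
A minimal sketch of the momentum update on a one-dimensional error function follows. The choice
E(w) = w², the values of 𝜂 and 𝛂, and the function name descend are assumptions made for
illustration.

```python
def descend(w0, eta, alpha, steps):
    """Gradient descent on E(w) = w**2 with a momentum term.

    alpha = 0 gives plain gradient descent.
    """
    w, delta_prev = w0, 0.0
    for _ in range(steps):
        grad = 2.0 * w                             # gradient of w**2
        delta = -eta * grad + alpha * delta_prev   # current step plus momentum term
        w = w + delta
        delta_prev = delta                         # remembered for the next step
    return w

w_plain = descend(1.0, eta=0.01, alpha=0.0, steps=50)
w_momentum = descend(1.0, eta=0.01, alpha=0.8, steps=50)

print(abs(w_plain), abs(w_momentum))  # the momentum run ends much closer to 0
```

With the same small 𝜂, the run with 𝛂 = 0.8 gets much nearer to the minimum in the same number of
steps, which is the acceleration described above.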

Disadvantages of Backpropagation

• Backpropagation can be sensitive to noisy data and irregularities
• Its performance depends heavily on the input data
• Training can take an excessive amount of time
• It needs a matrix-based method for backpropagation instead of mini-batch

Applications of Backpropagation

• The neural network is trained to enunciate each letter of a word and a sentence
• It is used in the field of speech recognition
• It is used in the field of character and face recognition

Numerical
➢ Implementation of logic functions like AND, OR, XOR, NAND, NOR, XNOR using
and Perceptrons.
➢ Implement MADALINE NETWORK to solve XOR problem
➢ Perceptron Training Algorithm

• https://towardsdatascience.com/perceptrons-logical-functions-and-the-xor-problem-37ca5025790a
• https://towardsdatascience.com/implementing-the-xor-gate-using-backpropagation-in-neural-networks-c1f255b4f20d

“Do solve the previous year NUMERICAL questions”

Video Lectures

• https://nptel.ac.in/courses/106/106/106106184/

REFERENCES

• https://medium.com/@omee0805/backpropagation-learning-factors-aca9c6e3bc1
• https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/
• https://machinelearningmastery.com/neural-networks-crash-course/
• https://www.edureka.co/blog/backpropagation/
• https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53
