Lesson 8
8.1 Introduction
Lesson 7 discussed counterpropagation networks and statistical training. Both strategies can
be used to train multilayer ANNs in a supervised manner. Counterpropagation networks can be
trained very fast compared with backpropagation training. In contrast, statistical training
sometimes accepts weight changes even when they increase the error, which makes a training
session dynamic enough to escape local minima. In this way, statistical training and
counterpropagation networks offer solutions to some issues in backpropagation, which remains
the most heavily used ANN training algorithm to date. In this lesson we learn some guidelines
for the design and training of ANNs. Note that these guidelines are heuristics and can only
give you a starting point for designing and training an ANN. They cover three aspects:
• Preparation of inputs
• Design of network architecture
• Controlling of training sessions
It should be noted that there is a large number of freely available toolkits for the
development of ANNs. However, knowledge of the above three aspects is essential for the
effective use of such tools as well. In fact, many tools hide these aspects from the user and
work as black boxes for training ANNs. Therefore, we discuss each aspect in some detail.
Preparation of inputs
In general, we can divide the components of an input vector by a large number, such as 100,
to make them small. Alternatively, the input vector can be normalized so that its components
become small. Normalization is accepted as the better technique for this purpose.
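As an illustration, here is a minimal sketch of both techniques in Python with NumPy (the language and library choice are ours, not the lesson's):

import numpy as np

def normalize(x):
    # Scale the vector to unit length; every component then lies in [-1, 1].
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

x = np.array([120.0, 45.0, 300.0])
print(x / 100.0)        # divide-by-a-large-number technique
print(normalize(x))     # normalization, usually the better choice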
It should be noted that weight initialization for an ANN must be done not only with small
values but also with a combination of positive and negative values. Otherwise, if all weights
are positive, even though each of them is small, the net value produced at some point can be
considerably large.
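A minimal sketch, assuming NumPy and an arbitrary 3-by-4 weight matrix:

import numpy as np

rng = np.random.default_rng(seed=42)
# Small values drawn from [-0.5, 0.5]: a mixture of positive and negative
# weights, so their contributions to the net value cannot all add up.
W = rng.uniform(-0.5, 0.5, size=(3, 4))
print(W)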
It is also advisable not to use 0 as the value of a component of an input or of a weight.
This is because multiplication of any number by zero yields zero, so such a component
contributes nothing to learning. Therefore, you may encode 0 as another number, such as -1,
to avoid this issue.
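For example, a binary input vector can be re-encoded in bipolar form as follows (again a NumPy sketch):

import numpy as np

x_binary = np.array([0, 1, 0, 1])
x_bipolar = np.where(x_binary == 0, -1, x_binary)  # gives [-1, 1, -1, 1]
print(x_bipolar)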
Bias
There can be an input that has 0 as the value of all of its components, while its desired
output is non-zero. Think of an input pair such as X = [0, 0, 0], D = [1, 1]. Here, whatever
weights you introduce, the net value is always 0, so the network can never produce 1 as an
output. In order to obtain a non-zero output through the net, we modify all the inputs by
appending a non-zero component, usually 1 or -1, to every input vector. This quantity is
called the bias. For instance, with a bias of -1, the above input is modified to
X = [0, 0, 0, -1].
In fact, introducing a bias for all inputs is customary practice, regardless of whether an
input has all of its components zero. Of course, we apply the bias to all neurons, including
those in the hidden layers and the output layer. This does no harm.
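The sketch below appends a bias component of -1 to every input vector in a small data set (NumPy assumed; the value -1 follows the example above):

import numpy as np

X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0]])
bias = -np.ones((X.shape[0], 1))   # one bias component per input vector
X_biased = np.hstack([X, bias])    # e.g. [0, 0, 0] becomes [0, 0, 0, -1]
print(X_biased)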
Design of network architecture
The following heuristics guide how we can decide on a network architecture, or topology,
comprising multiple layers with several neurons each.
The number of neurons in the output layer equals the number of components in a desired
output vector. For instance, if a data set has desired outputs such as D1 = [1, 0, 1, 1], we
introduce four neurons in the output layer. This decision does not depend on the nature of
the inputs.
In contrast, the number of hidden layers and their neurons must be decided through
experiments. Note that the more hidden layers a network has, the higher its representational
power. However, with too many hidden layers, training sessions become too long due to the
extensive calculations involved.
Since a three-layer ANN can model any real-world problem, we may begin with an architecture
containing one hidden layer and then experiment with larger numbers of hidden layers and
neurons.
Controlling of training sessions
A training session can be controlled through the following:
• Threshold function
• Learning constant
• Momentum coefficient
• Order of application of data
Generally, the bipolar continuous function offers higher flexibility in a training session,
yet it involves more calculation. The learning constant (η) can be increased or decreased if
the error cannot be kept below Emax. Generally, smaller η values give smoother training
sessions but are liable to lead to network paralysis. Thus we must find the best value
through experiments. Different values for the momentum coefficient can also be tried out.
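The roles of η and the momentum coefficient (here called α, our notation) can be seen in a single weight-update step. The following is a sketch, not a full backpropagation implementation:

import numpy as np

eta, alpha = 0.1, 0.8   # learning constant and momentum coefficient

def update(w, grad, prev_delta):
    # New step: gradient-descent term plus a fraction of the previous step.
    # The momentum term keeps the weights moving through flat error regions,
    # which helps against network paralysis.
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

w = np.zeros(4)
delta = np.zeros(4)
w, delta = update(w, np.array([0.2, -0.1, 0.0, 0.3]), delta)
print(w)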
The order in which training data are applied to an ANN is also crucial. Suppose you are
training an ANN to recognize handwritten numbers. If you keep applying various forms of one,
and then proceed to train on different handwritten twos, the error will suddenly go up,
because the patterns of 2 are very different from those of 1. In order to address this
issue, we never repeatedly apply input data from the same class or category; instead, we
randomize the application of inputs over the different classes or categories.
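In practice this means reshuffling the training set at the start of every cycle, for example (a sketch with hypothetical XOR-style data):

import numpy as np

rng = np.random.default_rng(0)
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D_train = np.array([[0], [1], [1], [0]], dtype=float)

for cycle in range(3):
    order = rng.permutation(len(X_train))   # a new random order each cycle
    for x, d in zip(X_train[order], D_train[order]):
        pass  # present (x, d) to the network and update the weights here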
This scenario is very familiar to us. For instance, when you prepared for the GCE (A/L), you
never studied chemistry for one whole day, then physics, and then biology. We prefer to
randomize access to subjects within a learning session. Otherwise, the brain tends to
memorize rather than generalize, and it returns a very high error when something reasonably
new is presented.
Example 8.1
Propose (a) the simplest possible and (b) a more appropriate network architecture for
training the following input pairs from a data set. Also propose a suitable learning
algorithm for this network.
Solution
In this architecture, as noted above, the output layer should have two neurons. However, to
achieve a better distribution of the learning effect, it is appropriate to have more links.
Thus, four neurons in the input layer would be justifiable from a mathematical viewpoint. In
addition, there should be at least one hidden layer for the most promising yet simplest
architecture. The number of neurons in the hidden layer must be decided through experiments.
Figure 8.3 shows a more appropriate architecture to model the above problem. We can use the
delta learning rule with the backpropagation algorithm for this purpose.
[Figure 8.3: a network with four input neurons (I1–I4), a hidden layer of four neurons
(H1–H4), and an output layer of two neurons (O1, O2).]
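A forward pass through such a 4-4-2 network can be sketched as follows; the weights here are randomly initialized, and the unipolar sigmoid is one possible choice of threshold function (both are our assumptions, not prescribed by the example):

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

W_hidden = rng.uniform(-0.5, 0.5, size=(4, 4))  # 4 hidden neurons, 4 inputs
W_output = rng.uniform(-0.5, 0.5, size=(2, 4))  # 2 output neurons, 4 hidden

x = np.array([1.0, -1.0, 1.0, -1.0])   # one (bipolar-encoded) input pattern
h = sigmoid(W_hidden @ x)              # hidden-layer outputs
o = sigmoid(W_output @ h)              # the two network outputs
print(o)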
[Figure: training error (vertical axis) plotted against the training cycle (horizontal
axis), with the Emax threshold marked as a horizontal line.]
This kind of graph helps us to detect issues in a training session and apply solutions
accordingly. Some interesting situations are described below.
Case 1
The error always stays above Emax. Obviously, we have to resort to strategies such as weight
re-initialization or changing the threshold function, the learning constant, and so on. This
can also happen if our Emax is too ambitious (too small); then it is better to try a larger
Emax value. You can also change the number of neurons, and even the number of layers, in the
network architecture.
Case 2
The error stays above Emax but neither increases nor decreases over the cycles. This happens
due to network paralysis. We can adjust the momentum coefficient to address this issue.
Case 3
The error has been decreasing over the cycles, yet it suddenly goes up again before reaching
Emax. You may have introduced an input that is very different from the previously applied
inputs. We had better reshuffle the input data to address this issue.
Activity 8.1
Download a toolkit for training ANNs. Study its features and the examples it provides on the
use of ANN technology.