
2011 10th Mexican International Conference on Artificial Intelligence

Comparison of PSO and DE for training neural networks
Espinal Andrés, Sotelo-Figueroa Marco, Soria-Alcaraz Jorge A, Ornelas Manuel, Puga Hector, Carpio Martín,
Baltazar Rosario, Rico J.L.
División de Estudios de Posgrado e Investigación,
Instituto Tecnológico de León
León, Guanajuato, México
espinalandres86@hotmail.com, masotelof@gmail.com, soajorgea@gmail.com

Abstract: The computational resources required to train a Feed-Forward Artificial Neural Network (FFANN) with classical techniques such as the backpropagation learning rule can be prohibitive in some applications. A good training phase is needed for a neural network to perform well. In the search for alternative training methods for FFANN, several metaheuristic techniques have been used for this task. This paper compares the performance of Particle Swarm Optimization (PSO) and Differential Evolution (DE) as training methods for FFANN on several well-known pattern recognition instances.

Keywords: Neural Networks, Particle Swarm Optimization, Differential Evolution

I. INTRODUCTION

The Feed-Forward Artificial Neural Network (FFANN) [1] [2] is a common type of neural network used in a wide range of classification problems. This kind of neural network usually needs to be trained in order to achieve a good classification rate. The training phase usually relies on the backpropagation algorithm. However, this algorithm has shown some disadvantages related to the computational resources it uses in some applications [3]. Some new approaches have emerged in response. These approaches commonly adapt metaheuristic techniques to the FFANN training phase [4][5]. This paper compares the performance of Particle Swarm Optimization (PSO) [6] and Differential Evolution (DE) [7] applied as training methods for FFANN.

PSO and DE were implemented for the training phase of FFANN. Our benchmark tests were the training processes on the Ionosphere, Iris Plant, Glass, Teaching Assistant Evaluation and Wine datasets, which were taken from the UCI Machine Learning Repository. A fixed number of function points was performed, and the training error allowed us to discern the performance of the training metaheuristics. The paper is organized as follows: Section 2 presents the theory behind our approach. Section 3 describes the proposed methodology. Section 4 explains the experiments and results. Section 5 discusses the conclusions.

II. CONCEPTS AND DEFINITIONS

In this section we present the theory of FFANN, PSO and DE, which is essential for the development of this work.

A. Feed-Forward Artificial Neural Network

In FFANN, the term feedforward describes how this neural network processes and recalls patterns. In a feedforward neural network, neurons are only connected forward. Each layer of the neural network contains connections to the next layer, but there are no back connections [2].

The feed-forward process can be explained as follows. Let expression 1 be the notation for a feedforward neural network [3]:

$$x^0 \xrightarrow{W^1, b^1} x^1 \xrightarrow{W^2, b^2} \cdots \xrightarrow{W^L, b^L} x^L \tag{1}$$

where $x^l \in \mathbb{R}^{n_l}$ for all $l = 0, \ldots, L$ and $W^l$ is an $n_l \times n_{l-1}$ matrix for all $l = 1, \ldots, L$. There are $L + 1$ layers of neurons and $L$ layers of synaptic weights.

Forward pass. The input vector $x^0$ is transformed into the output vector $x^L$ by a feedforward process that evaluates equation 2:

$$x_i^l = f(u_i^l) = f\left(\sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l\right) \tag{2}$$

for $l = 1$ to $L$, where $f(u_i^l)$ is the evaluation of the input $u_i^l$ of the neuron $x_i^l$ by an activation function.

B. Particle Swarm Optimization

Particle Swarm Optimization (PSO) [6] [8] [9] is a metaheuristic inspired by flocks of birds or schools of fish. It was developed by J. Kennedy and R. Eberhart in 1995. It is based on a concept called the social metaphor. This metaheuristic simulates a society where all individuals contribute their knowledge to obtain a better solution. In this metaheuristic each individual is called a particle and moves through a multidimensional space that represents the social space or search space. The dimension of the space depends on the variables used to represent the problem.

In the search space, the position of each particle is updated by using its current location and its velocity vector; this vector tells how fast the particle will move. Eq. 3 calculates the velocity vector, eq. 4 estimates a constriction coefficient and eq. 5 updates the position of a particle.

978-0-7695-4605-6/11 $26.00 © 2011 IEEE
DOI 10.1109/MICAI.2011.16
$$v_i = \chi\left(v_i + \varphi_1 * (x_i - B_{Global}) + \varphi_2 * (x_i - B_{Local})\right) \tag{3}$$

$$\chi = \frac{2}{\varphi - 2 + \sqrt{\varphi^2 - 4\varphi}} \tag{4}$$

$$x_i = x_i + v_i \tag{5}$$

where:
• $v_i$ is the velocity of the i-th particle.
• $x_i$ is the position of the i-th particle.
• $B_{Global}$ is the best position found so far by all particles.
• $B_{Local}$ is the best position found by the i-th particle.
• $\varphi_1$ determines the magnitude of the forces in the direction of the neighbourhood best $B_{Global}$.
• $\varphi_2$ determines the magnitude of the forces in the direction of the personal best $B_{Local}$.
• $\varphi$ is a parameter for the constriction coefficient, where $\varphi = \varphi_1 + \varphi_2 > 4$.

Algorithm 1 shows the PSO metaheuristic.

Algorithm 1 Particle Swarm Optimization Algorithm
Require: $\varphi_1$ neighbourhood memory coefficient, $\varphi_2$ personal memory coefficient, n swarm size.
1: Start the swarm particles.
2: Start the $v_i$ for each particle in the swarm.
3: while stopping criterion not met do
4:   for i = 1 to n do
5:     If the i-particle's fitness is better than that of the $B_{Local}$ then replace the $B_{Local}$ with the i-particle.
6:     If the i-particle's fitness is better than that of the $B_{Global}$ then replace the $B_{Global}$ with the i-particle.
7:     Update the $v_i$ by eq. 3.
8:     Update the $x_i$ by eq. 5.
9:   end for
10: end while
11: return $B_{Global}$

C. Differential Evolution

Differential Evolution (DE) [7] was developed by R. Storn and K. Price in 1996. It is a vector-based evolutionary algorithm and can be considered a further development of genetic algorithms. It is a stochastic search algorithm with a self-organizing tendency and does not use derivative information [10].

As in genetic algorithms [11], the design parameters in a d-dimensional search space are represented as vectors, and various genetic operators are applied to their bit strings. However, unlike genetic algorithms, differential evolution carries out operations over each component (each dimension of the solution).

For a d-dimensional optimization problem with d parameters, a population of n solutions is initially generated, so we have solution vectors $x_i$ where $i = 1, 2, \ldots, n$. For each solution $x_i$ at any generation t, we use the conventional notation

$$x_i^t = (x_{1,i}^t, x_{2,i}^t, \ldots, x_{d,i}^t) \tag{6}$$

which consists of d components in the d-dimensional space. This vector can be considered as the chromosome or genome.

This metaheuristic consists of two main steps: mutation and selection.

Differential evolution has several schemes for carrying out the mutation step, which generates a so-called donor vector. One of these schemes is known as DE/Current to best/1, which is shown in eq. 7.

$$v_i^{t+1} = F(x_{best}^t - x_i^t) + F(x_p^t - x_q^t) \tag{7}$$

where $v_i^{t+1}$ is the donor vector, $F \in [0, 2]$ is a parameter often referred to as the differential weight, $x_{best}^t$ is the genome with the best fitness value in the population, $x_i^t$ is the current genome, and $x_p^t$ and $x_q^t$ are two randomly chosen genomes (the indexes i, p and q must all be different from each other).

Selection is essentially the same as that used in genetic algorithms. It consists of selecting the solution with the best fitness. For a minimization problem, the minimum objective value must be selected. Therefore, we have

$$x_i^{t+1} = \begin{cases} v_i^{t+1} & \text{if } f(v_i^{t+1}) \le f(x_i^t) \\ x_i^t & \text{otherwise} \end{cases} \tag{8}$$

Both components can be seen in the pseudocode shown in algorithm 2. The overall search efficiency is controlled by the differential weight F.

Algorithm 2 shows the DE metaheuristic.

Algorithm 2 Differential Evolution Algorithm
Require: n population size.
1: Initialize the population.
2: Set the weight $F \in [0, 2]$
3: while stopping criterion not met do
4:   for i = 1 to n do
5:     For each $x_i^t$ randomly choose 2 distinct genomes $x_p^t$ and $x_q^t$; the vector $x_{best}^t$ is the genome with the best fitness value in the population.
6:     Generate a new donor vector $v_i^{t+1}$ by eq. 7 and update the genome $x_i^t$ by eq. 8.
7:   end for
8: end while
9: return $x_{best}^t$, the best genome in the last run, as the solution vector.

III. METHODOLOGY PROPOSED

The methodology evaluates two FFANNs trained by means of PSO and DE. For each test instance, we use the same architecture for both FFANNs.

The basic idea for carrying out the FFANN training process with the PSO and DE metaheuristics is to work in a multi-dimensional search space and try to minimize a learning measure of the FFANN. The dimension of the search space needs to be equal to the number of neural weights. Figure 1 shows the transformation of the FFANN to a particle or genome.
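The encoding of a FFANN as a particle or genome can be sketched as follows. This is only a minimal illustration: the function names `count_parameters` and `decode` are hypothetical, and both the inclusion of the biases in the vector and the weights-then-biases, row-major ordering are assumptions, since the paper does not specify the layout used in Fig. 1.

```python
def count_parameters(sizes):
    """Number of weights plus biases for layer sizes such as [4, 7, 3].

    Counting the biases as part of the search space is an assumption.
    """
    return sum(n_out * n_in + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def decode(vector, sizes):
    """Split a flat particle/genome into per-layer (W, b) pairs.

    W is a list of n_out rows with n_in entries each; b has n_out entries.
    The row-major, weights-then-biases order is an assumed convention.
    """
    if len(vector) != count_parameters(sizes):
        raise ValueError("vector length does not match the architecture")
    layers, pos = [], 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        # Slice out one weight row per neuron of the next layer.
        W = [vector[pos + r * n_in: pos + (r + 1) * n_in] for r in range(n_out)]
        pos += n_out * n_in
        # The biases of this layer follow its weights.
        b = vector[pos: pos + n_out]
        pos += n_out
        layers.append((W, b))
    return layers
```

Under these assumptions, the length returned by `count_parameters` is the dimension of the search space that both metaheuristics explore.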

The training phase was performed with the batch training approach [4]. This means that the quadratic error [3] between the target output and the neural network output (eq. 9) is calculated over all patterns of the training dataset before all the weights of the neural network are adjusted.

$$\text{Quadratic Error} = \sum_{p=1}^{M} \sum_{k=1}^{m} \left(t_k^{(p)} - y_k^{(p)}\right)^2 \tag{9}$$

where: M = number of patterns, m = number of neural outputs, t = target output, y = neural network output.

Since we have a continuous space, the Quadratic Error can be considered as a fitness function. This function can be easily implemented in a metaheuristic technique such as PSO and DE, shown in the previous section. The dimension of the problem is the number of weights of the links between nodes in our FFANN model.

The final goal of these two training techniques is the same as that of the backpropagation learning rule: to minimize the quadratic error of the neural network outputs.

Fig. 1. Codification of a Particle/Genome for representing a FFANN configuration.

IV. EXPERIMENTS AND RESULTS

In this section we discuss the experiment design, execution and results of the comparison between PSO and DE for the training phase.

A. Dataset description

We chose five well-known dataset instances from the UCI Machine Learning Repository:

Glass: The Glass [12] dataset is a study of the classification of types of glass. It was motivated by criminological investigation. This dataset is formed by 7 classes and has 214 instances with 10 attributes each.

Ionosphere: The Ionosphere [13] dataset was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal. The dataset has 351 instances with 35 attributes each. The first 34 attributes are continuous and the 35th is the class, good or bad. Good radar returns are those showing evidence of some type of structure in the ionosphere and bad returns are those that do not; their signals pass through the ionosphere.

Iris Plant: The Iris Plant [14] dataset contains 3 classes of 50 instances each (150 instances in total), where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other. Each instance has 4 attributes: sepal length in cm, sepal width in cm, petal length in cm and petal width in cm. The classes are: Iris Setosa, Iris Versicolour and Iris Virginica.

Teaching Assistant Evaluation (TAE): The Teaching Assistant Evaluation [15] dataset consists of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories to form the class variable. Each instance is formed by 5 attributes.

Wine: The Wine [16] dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivations. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The dataset has 178 instances with 13 attributes each.

B. Experimental design

The configuration of the experiments carried out for this paper is described in the next paragraphs.

1) FFANN Configuration: As each dataset has a very particular structure (number of classes and number of features), the neural network architecture changed for each dataset. Since the main goal of this paper is to know which metaheuristic is the best at minimizing the quadratic error, the authors decided to use FFANNs with only one hidden layer for reasons of simplicity. The number of neurons in the input layer was given by the number of features of the dataset, the number of neurons in the hidden layer was set to double the number of neurons in the input layer minus one, and the number of neurons in the output layer is equal to the number of classes of the dataset. All hidden and output neurons have the sigmoid function as their activation function. Eq. 10 shows the sigmoid function.

$$f(u_i^l) = \frac{1}{1 + e^{-u_i^l}} \tag{10}$$

where $u_i^l$ is the input of the neuron $x_i^l$, as explained in section II-A.
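The FFANN configuration, the forward pass of eq. 2 with the sigmoid of eq. 10, and the batch quadratic error of eq. 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the hidden layer size is read as 2n − 1 for n input neurons, and each layer is assumed to be given as a pair (W, b) of a weight matrix (list of rows) and a bias list.

```python
import math


def layer_sizes(n_features, n_classes):
    """Input, hidden and output sizes; hidden is read here as 2n - 1 (assumption)."""
    return [n_features, 2 * n_features - 1, n_classes]


def sigmoid(u):
    """Eq. 10, written in a form that avoids overflow for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def forward(x, layers):
    """Eq. 2: propagate the input x through every (W, b) layer."""
    for W, b in layers:
        x = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x


def quadratic_error(layers, patterns, targets):
    """Eq. 9: squared error summed over all patterns and all outputs."""
    return sum((t - y) ** 2
               for x, target in zip(patterns, targets)
               for t, y in zip(target, forward(x, layers)))
```

For the Iris Plant dataset (4 features, 3 classes) this reading gives the layer sizes 4, 7 and 3. Because `quadratic_error` looks at the whole training set before returning a single value, it matches the batch training approach and can serve directly as the fitness function of either metaheuristic.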

2) Metaheuristics Configuration: In order to gather information about the performance of the two proposed metaheuristics (PSO and DE) in the training phase of the FFANN, we used the quadratic error between the neural network output and the desired output. Our main objective is to minimize the difference between the current solution and the desired solution.
Equation 9 was used as the objective function in the proposed metaheuristic techniques PSO and DE. This approach transforms the original classification problem into an optimization problem, where we need to find the weights that minimize the fitness function.
For each dataset we applied 10,000 iterations of the proposed metaheuristics. Finally we compared the last fitness reported by each heuristic to analyze its performance.
Fig. 2. Training/Optimization Process for Glass dataset

PSO setup: PSO used 20 particles (each particle represents a complete FFANN array of weights), with $\chi$ as the self-adaptation parameter. The neighbourhood memory coefficient is $\varphi_1 = 2.05$ and the personal memory coefficient is $\varphi_2 = 2.05$.
DE setup: DE used 20 individuals (as in PSO, each individual represents the complete set of weights of a FFANN) and $F = 0.15$. The mutation step used the DE/Current to best/1 scheme (eq. 7).
We applied this design to each dataset in order to achieve a fair comparison between PSO and DE.
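The two update rules with the parameter values of this setup can be sketched as follows. This is a hedged illustration, not the authors' code, and it departs from the printed equations in three labeled ways. First, the PSO step uses the conventional attraction form (best position minus current position), whereas eq. 3 is printed with the opposite order of the differences. Second, each $\varphi$ term is multiplied by a uniform random factor, as is customary in PSO but not shown in eq. 3. Third, the DE donor adds the current genome as its base, as in the conventional DE/current-to-best/1 scheme, whereas eq. 7 as printed has no base term. All function names are hypothetical.

```python
import math
import random


def constriction(phi1, phi2):
    """Constriction coefficient of eq. 4; requires phi1 + phi2 > 4."""
    phi = phi1 + phi2
    if phi <= 4:
        raise ValueError("phi1 + phi2 must exceed 4")
    return 2.0 / (phi - 2.0 + math.sqrt(phi * phi - 4.0 * phi))


def pso_step(x, v, b_local, b_global, phi1=2.05, phi2=2.05, rng=random):
    """One velocity and position update for a single particle (eqs. 3 and 5).

    Assumptions: attraction sign (best - x) and uniform random factors.
    phi1 weights the neighbourhood best, phi2 the personal best.
    """
    chi = constriction(phi1, phi2)
    new_v = [chi * (vd
                    + phi1 * rng.random() * (gd - xd)   # pull toward B_Global
                    + phi2 * rng.random() * (ld - xd))  # pull toward B_Local
             for xd, vd, ld, gd in zip(x, v, b_local, b_global)]
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v


def de_step(population, i, fitness, F=0.15, rng=random):
    """Current-to-best/1 donor followed by the greedy selection of eq. 8.

    Assumption: the donor uses the current genome as its base vector.
    Needs at least 3 genomes so that i, p and q can all differ.
    """
    current = population[i]
    best = min(population, key=fitness)
    p, q = rng.sample([k for k in range(len(population)) if k != i], 2)
    donor = [xd + F * (bd - xd) + F * (pd - qd)
             for xd, bd, pd, qd in zip(current, best, population[p], population[q])]
    # Keep the donor only if it is at least as good as the current genome.
    return donor if fitness(donor) <= fitness(current) else current
```

With $\varphi_1 = \varphi_2 = 2.05$, `constriction` gives $\chi \approx 0.7298$. Passing the quadratic error of eq. 9 as `fitness` turns both steps into training rules for the FFANN weights.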

C. Results
Once the tests were executed for each dataset with the proposed metaheuristics, we obtained the results shown in Table I. Figures 2 to 6 show the training/optimization process carried out on each dataset by both metaheuristics, PSO and DE.

Dataset   PSO       DE
Glass     57.881    133.51
Iono      0.362     12.574
Iris      1.49e−3   2.00
TAE       48.67     78.30
Wine      0.003     58.054

TABLE I
FINAL QUADRATIC ERROR FROM EACH DATASET

Fig. 3. Training/Optimization Process for Ionosphere dataset

These results can be analyzed by simple visual inspection. PSO reaches a lower final quadratic error than DE on all five datasets, so it performs better than DE as a training method for FFANN.

Fig. 4. Training/Optimization Process for Iris dataset

D. Discussion
PSO clearly outperforms DE, but we need to consider the number of operations of both algorithms.
The number of operations of each algorithm directly affects the computational resources used, i.e. a higher number of operations produces a higher resource consumption. This situation is relevant for the results. The DE algorithm is simpler than the PSO algorithm for the following reasons:

• DE makes a mutation by means of a vector subtraction, and then evaluates the fitness function in order to decide if the new solution replaces the old one.
• PSO needs to update each particle velocity.
• PSO requires more complex operations to generate a new solution, i.e. it needs to evaluate local and global knowledge to discern a new solution.
• PSO checks for each particle if another particle has found a better solution than the ones currently reported as $B_{Global}$ and $B_{Local}$.

If we consider each difference, we can assume that the PSO algorithm uses at least twice the computational resources of the DE algorithm. This excess of operations may be the reason for its better performance.

Fig. 5. Training/Optimization Process for TAE dataset

Fig. 6. Training/Optimization Process for Wine dataset

V. CONCLUSIONS

This paper has compared two metaheuristics, PSO and DE, for the FFANN training phase. DE has shown a simpler computation, since the number of operations made by this algorithm is at most half the number of operations made by PSO. The tests show that PSO works better than DE in the minimization of the quadratic error function used as the fitness function.
This paper only works on the training phase, so as further work we propose to complete the classification process in order to validate that a minimum error in the training phase translates into a better classification rate.

ACKNOWLEDGEMENT

The authors thank CONACYT and DGEST (Grant 3528.10-P) for the support received.

REFERENCES

[1] E. P. P. A. Derks, M. S. S. Pastor, and L. M. C. Buydens, “Robustness analysis of radial base function and multi-layered feed-forward neural network models,” Chemometrics and Intelligent Laboratory Systems, vol. 28, no. 1, pp. 49–60, 1995. [Online]. Available: http://www.sciencedirect.com/science/article/pii/016974399580039C
[2] J. Heaton, Introduction to Neural Networks for Java, Second Edition. Heaton Research, Inc., 2008.
[3] M. Friedman and A. Kandel, Introduction to pattern recognition statistical, structural, neural and fuzzy logic approaches. World Scientific, 2000.
[4] V. G. Gudise and G. K. Venayagamoorthy, “Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks,” in Proceedings of the IEEE Swarm Intelligence Symposium 2003 (SIS 2003), 2003, pp. 110–117.
[5] J. Ilonen, J.-K. Kamarainen, and J. Lampinen, “Differential evolution training algorithm for feed-forward neural networks,” Neural Processing Letters, vol. 17, no. 1, pp. 93–105, 2003.
[6] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” IEEE Int. Conf. Neural Netw, vol. 4, pp. 1942–1948, 1995.
[7] R. Storn and K. Price, “Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces,” J. of Global Optimization, vol. 11, pp. 341–359, December 1997. [Online]. Available: http://portal.acm.org/citation.cfm?id=596061.596146
[8] M. Clerc, Particle Swarm Optimization. USA: Wiley-ISTE, 2006.
[9] R. Poli, J. Kennedy, and T. Blackwell, “Particle swarm optimization,” Swarm Intelligence, vol. 1, no. 1, pp. 33–57, Jun. 2007. [Online]. Available: http://dx.doi.org/10.1007/s11721-007-0002-0
[10] X. S. Yang, Nature Inspired Metaheuristic Algorithms, 2nd ed. Luniver Press, 2008.
[11] J. Holland, “Adaptation in natural and artificial systems,” University of Michigan Press, 1975. [Online]. Available: http://mitpress.mit.edu/
[12] P. Zhong and M. Fukushima, “A regularized nonsmooth newton method for multi-class support vector machines,” in Systems Analysis, Optimization and Data Mining in Biomedicine. Taylor and Francis, Mar 2007, vol. 22, pp. 225–236.
[13] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. Baker, “Classification of radar returns from the ionosphere using neural networks,” Johns Hopkins APL Technical Digest, vol. 10, pp. 262–266, 1989.
[14] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Human Genetics, vol. 7, no. 2, pp. 179–188, 1936. [Online]. Available: http://dx.doi.org/10.1111/j.1469-1809.1936.tb02137.x
[15] W. Loh and Y. Shih, “Split selection methods for classification trees,” Statistica Sinica, 1997.
[16] K. Ali and M. Pazzani, “Error reduction through learning multiple descriptions, in press,” Machine Learning, vol. 24, 1996.
