Bacardit - Learning Classifier Systems - 2009
Learning
Classifier Systems
11th International Workshop, IWLCS 2008
Atlanta, GA, USA, July 13, 2008
and 12th International Workshop, IWLCS 2009
Montreal, QC, Canada, July 9, 2009
Revised Selected Papers
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany
Volume Editors
Jaume Bacardit
University of Nottingham, Nottingham, NG8 1BB, UK
E-mail: jaume.bacardit@nottingham.ac.uk
Will Browne
Victoria University of Wellington, Wellington 6140, New Zealand
E-mail: will.browne@vuw.ac.nz
Jan Drugowitsch
University of Rochester, Rochester, NY 14627, USA
E-mail: JDrugowitsch@bcs.rochester.edu
Ester Bernadó-Mansilla
Universitat Ramon Llull, 08022 Barcelona, Spain
E-mail: esterb@salle.url.edu
Martin V. Butz
University of Würzburg, 97070 Würzburg, Germany
E-mail: mbutz@psychologie.uni-wuerzburg.de
CR Subject Classification (1998): I.2.6, I.2, H.3, D.2.4, D.2.8, F.1, H.4, H.2.8
ISSN 0302-9743
ISBN-10 3-642-17507-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-17507-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
Preface
IWLCS 2008
Organizing Committee: Jaume Bacardit (University of Nottingham, UK)
Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)
IWLCS 2009
Organizing Committee: Jaume Bacardit (University of Nottingham, UK)
Will Browne (Victoria University of Wellington, New Zealand)
Jan Drugowitsch (University of Rochester, USA)
Referees
Ester Bernadó-Mansilla, Lashon Booker, Will Browne, Larry Bull, Martin V. Butz,
Jan Drugowitsch, Ali Hamzeh, Francisco Herrera, John Holmes, Tim Kovacs,
Pier Luca Lanzi, Xavier Llorà, Daniele Loiacono, Drew Mellor,
Luis Miramontes Hercog, Albert Orriols-Puig, Wolfgang Stolzmann,
Keiki Takadama, Stewart W. Wilson
Past Workshops
1st IWLCS October 1992 NASA Johnson Space Center, Houston, TX, USA
2nd IWLCS July 1999 GECCO 1999, Orlando, FL, USA
3rd IWLCS September 2000 PPSN 2000, Paris, France
4th IWLCS July 2001 GECCO 2001, San Francisco, CA, USA
5th IWLCS September 2002 PPSN 2002, Granada, Spain
6th IWLCS July 2003 GECCO 2003, Chicago, IL, USA
7th IWLCS June 2004 GECCO 2004, Seattle, WA, USA
8th IWLCS June 2005 GECCO 2005, Washington, DC, USA
9th IWLCS July 2006 GECCO 2006, Seattle, WA, USA
10th IWLCS July 2007 GECCO 2007, London, UK
11th IWLCS July 2008 GECCO 2008, Atlanta, GA, USA
12th IWLCS July 2009 GECCO 2009, Montreal, Canada
13th IWLCS July 2010 GECCO 2010, Portland, OR, USA
Table of Contents
Function Approximation
How Fitness Estimates Interact with Reproduction Rates:
Towards Variable Offspring Set Sizes in XCSF . . . . . . . . . . . . . . . . . . . . . . . 47
Patrick O. Stalph and Martin V. Butz
Applications
Supply Chain Management Sales Using XCSR . . . . . . . . . . . . . . . . . . . . . . . 145
María Franco, Ivette Martínez, and Celso Gorrín
1 Introduction
Learning classifier systems [10,8,17] combine evolutionary computation with methods of temporal difference learning to solve classification and reinforcement learning problems. A classifier system maintains a population of condition-action-prediction rules, called classifiers, which identifies its current knowledge about the problem to be solved. At each time step, the system receives the current state of the problem and matches it against all the classifiers in the population. The result is a match set containing the classifiers that can be applied to the problem in its current state. Based on the value of the actions in the match set, the classifier system selects an action to perform on the problem to progress toward its solution. As a consequence of the executed action, the system receives a numerical reward that is distributed to the classifiers accountable for it. While the classifier system is interacting with the problem, a genetic algorithm is applied to the population to discover better classifiers through selection, recombination, and mutation.
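The interaction cycle just described hinges on the matching step. A minimal C++ sketch of match-set construction for ternary conditions follows; the types and names are illustrative, not the implementation evaluated in this paper.

```cpp
#include <cassert>
#include <vector>

// Illustrative ternary classifier: cond[i] is '0', '1', or '#' (don't care).
struct Classifier {
    std::vector<char> cond;
    int action;
};

// A condition matches when every non-'#' position equals the input bit.
bool matches(const Classifier& cl, const std::vector<char>& input) {
    for (std::size_t i = 0; i < cl.cond.size(); ++i)
        if (cl.cond[i] != '#' && cl.cond[i] != input[i])
            return false;
    return true;
}

// The match set [M] collects every classifier that applies to the state.
std::vector<const Classifier*> matchSet(const std::vector<Classifier>& pop,
                                        const std::vector<char>& input) {
    std::vector<const Classifier*> M;
    for (const Classifier& cl : pop)
        if (matches(cl, input))
            M.push_back(&cl);
    return M;
}
```

Note that `matches` stops at the first non-matching position, which is exactly the early-exit behavior whose cost profile is analyzed later in the paper.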
Matching is the main and most computationally demanding process of a classifier system [14,3]; it can occupy up to 65%-85% of the overall computation time [14]. Accordingly, several methods have been proposed in the literature to speed up matching in learning classifier systems. Llorà and Sastry [14] compared the typical encoding of classifier conditions for binary inputs, an encoding based on the underlying binary arithmetic, and a version of the
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 1-20, 2010.
© Springer-Verlag Berlin Heidelberg 2010
2 P.L. Lanzi and D. Loiacono
same encoding optimized via vector instructions. Their results show that binary encodings combined with optimizations based on the underlying integer arithmetic can speed up the matching process by up to 80 times. The analysis of Llorà and Sastry [14] did not consider the influence of classifier generality on the complexity of matching. As noted in [3], matching usually stops as soon as it is determined that the classifier cannot be applied to the current problem instance (e.g., [1,12]). Accordingly, matching a population of highly specific classifiers takes much less time than matching a population of highly general classifiers. Butz et al. [3] extended the analysis in [14] (i) by considering more encodings (the specificity-based encoding used in Butz's implementation [1] and the encoding used in some implementations of Alecsys [7]); and (ii) by taking into account classifier generality. Their results show that, overall, specificity-based matching can be 50% faster than character-based encoding when general populations are involved, but it can be slower than character-based encoding if more specific populations are considered. Binary encoding was confirmed to be the fastest option, with a reported improvement of up to 90% compared to the usual character-based encoding. Butz et al. [3] also proposed a specificity-based encoding for real-coded inputs which could halve the time required to match a population.
In this work, we took a different approach to speeding up matching in classifier systems, based on the use of Graphics Processing Units (GPUs). More precisely, we used NVIDIA's Compute Unified Device Architecture (CUDA) to implement matching for (i) real inputs using interval-based conditions and (ii) binary inputs using ternary conditions. We tested our GPU-based matching by applying the same experimental design used in [14,3]. Our results show that on small problems, due to the memory transfer overhead introduced by GPUs, matching is faster when performed on the usual CPU. On larger problems, involving either more variables or more classifiers, GPU-based matching can outperform the CPU-based implementation with a 3-12x speedup when the interval-based representation is applied to match real-valued inputs and a 20-50x speedup for the ternary-based representation.
execution. In addition, large cache memories are provided to reduce the instruction and data access latencies required in large complex applications.
On the other hand, the GPU design is optimized for the execution of a massive number of threads. It exploits the large number of executed threads to find work to do during long-latency memory accesses, minimizing the control logic required for each thread. Small cache memories are provided so that when multiple threads access the same memory data, they do not all need to access the DRAM. As a result, much more chip area is dedicated to floating-point calculations.
statements or to ensure that the branches executed will be the same across the
whole warp.
input s_t+1. The incoming reward r_t+1 is used to compute the estimated payoff P(t) as,

P(t) = r_t+1 + γ max_a P(s_t+1, a) (2)
Next, the parameters of the classifiers in [A] are updated [5]. At first, the prediction p is updated with learning rate β (0 < β ≤ 1) as,

p ← p + β (P(t) − p) (3)

Then, the prediction error ε and the fitness F are updated [17,5].
On a regular basis (dependent on the parameter θ_ga), the genetic algorithm is applied to the classifiers in [A]. It selects two classifiers, copies them, and with probability χ performs crossover on the copies; then, with probability μ it mutates each allele. The resulting offspring classifiers are inserted into the population and two other classifiers are deleted from the population to keep the population size N constant.
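The prediction update in Eq. (3) is a standard Widrow-Hoff (delta-rule) step; as a minimal sketch under the usual XCS notation (not code from the systems discussed here):

```cpp
#include <cassert>
#include <cmath>

// Widrow-Hoff update of the classifier prediction (Eq. 3):
// p <- p + beta * (P - p), with learning rate 0 < beta <= 1.
double updatePrediction(double p, double P, double beta) {
    return p + beta * (P - p);
}
```

Repeated updates move p geometrically towards the payoff estimate P; with beta = 1 the prediction simply tracks the latest payoff.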
// matching procedure
int pos = 0;
bool result = true;
first n values of lb contain the lower bounds of the first classifier condition in the population, while the first n values of ub contain the upper bounds of the same condition. The next n values in lb and ub contain the lower and upper bounds of the second classifier condition, and so on for all the N classifiers in the population. In contrast, when the representation by columns is used, the first N values of lb contain the lower bounds associated with the first input of the N classifiers in the population; similarly, the first N values of ub contain the corresponding upper bounds. The next N values in lb and ub contain the lower and upper bounds associated with the second input, and so on for all the n inputs of the problem.
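The two storage schemes differ only in how the i-th bound of classifier c is addressed; a small sketch of the index arithmetic (function names are illustrative):

```cpp
#include <cassert>

// Row layout: the n bounds of classifier c are contiguous, lb[c*n + i].
int rowIndex(int c, int i, int n) { return c * n + i; }

// Column layout: the i-th bounds of all N classifiers are contiguous,
// lb[i*N + c], so threads handling classifiers c and c+1 read adjacent
// addresses when they test the same input i.
int colIndex(int c, int i, int N) { return i * N + c; }
```

In the column layout, consecutive threads testing the same input read consecutive memory locations, which is the access pattern GPUs can coalesce into a single transaction.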
Fig. 2. Classifier conditions in the GPU global memory are represented as two matrices lb and ub which can be stored (a) by rows or (b) by columns; cl_i represents the variables in the classifier condition; s_i shows what variables should be matched in parallel by the kernel
Speeding Up Matching in Learning Classifier Systems Using CUDA 9
// CUDA kernel: interval matching, row representation.
// Each thread tidx matches one classifier condition against the input.
__global__ void match_intervals_rows(const float *input, const float *LB,
                                     const float *UB, int *matched,
                                     int N, int n)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int pos = tidx * n;  // start of this classifier's bounds
    if (tidx < N)
    {
        int has_matched = 1, i = 0;
        while (has_matched && i < n)
        {
            has_matched = (input[i] >= LB[pos+i]) && (input[i] <= UB[pos+i]);
            i++;
        }
        matched[tidx] = has_matched;
    }
}
be too distant and require the GPU to perform two separate memory accesses. Accordingly, the GPU will remain idle for a significant amount of time to access memory. In contrast, if lb and ub are represented by columns (Figure 2b), the same operations will access contiguous memory locations. In fact, at the first clock cycle, one kernel will now access the value in lb[0] (the first lower bound of cl0), while the second kernel will access the nearby memory position lb[1] where the first lower bound of cl1 is stored. As a result, the GPU can perform several operations using just one memory access, resulting in the maximum parallelization possible.
Kernels are the basic computation units in CUDA and they are the source of the parallelization. Kernels are executed in parallel on separate GPU cores, grouped into blocks whose size depends on the model of GPU used and must be properly set to achieve the best parallelization. As soon as a core completes the execution of a block of kernels, a new block is assigned to it.
In our case, a kernel is in charge of performing the matching of one classifier. Accordingly, the GPU will execute N kernels, one for each classifier in the population. We used blocks of 64 kernels, which we empirically found to be the best block size on the card models we tested.
Algorithm 3 shows the kernel for interval-based matching using CUDA when a representation of lb and ub by rows is used. Each kernel reads the condition of a classifier from the device shared memory and checks whether it matches the current input. The position of the matched array in the device memory corresponding to the classifier is set to one if a match is found, and to zero otherwise.
// CUDA kernel: interval matching, column representation.
// Thread tidx reads the i-th bounds of classifier tidx at LB[i*N+tidx],
// so adjacent threads access adjacent (coalesced) memory locations.
__global__ void match_intervals_cols(const float *input, const float *LB,
                                     const float *UB, int *matched,
                                     int N, int n)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tidx < N)
    {
        int has_matched = 1, i = 0;
        while (has_matched && i < n)
        {
            has_matched = (input[i] >= LB[i*N+tidx]) && (input[i] <= UB[i*N+tidx]);
            i++;
        }
        matched[tidx] = has_matched;
    }
}
// matching procedure (ternary conditions packed into unsigned integers)
int i = 0;
bool matched = true;
while ( matched && i < m )
{
    matched = ( ( (fp[i] ^ inputs[i]) & (sp[i]) ) == 0 );
    i++;
}
return matched;
// CUDA kernel: ternary matching, row representation.
// Each thread tidx matches one classifier condition against the input.
__global__ void match_ternary_rows(const unsigned int *input,
                                   const unsigned int *fp,
                                   const unsigned int *sp,
                                   int *matched, int N, int m)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int pos = tidx * m;  // start of this classifier's condition words
    if (tidx < N)
    {
        int has_matched = 1, i = 0;
        while (has_matched && i < m)
        {
            unsigned int sp_i = sp[pos+i];
            unsigned int input_i = input[i];
            unsigned int fp_i = fp[pos+i];
            has_matched = ( ( (fp_i ^ input_i) & sp_i ) == 0 );
            i++;
        }
        matched[tidx] = has_matched;
    }
}
of classifier systems research [7], and similar ones have recently been proposed to speed up the matching using standard CPUs [14].
In their famous classifier system Alecsys [7], Dorigo, Colombetti and colleagues implemented classifier conditions as arrays of bits packed inside unsigned integers. In Alecsys, a condition was represented by two arrays, fp and sp, of unsigned integers; a one in the condition was represented by a bit set to one in the same position of fp and sp; a zero was represented by a bit set to zero in the same position of fp and sp; a don't care (#) could be represented either by a 0 in fp and a 1 in sp or by a 1 in fp and a 0 in sp. Given the bit-encoded inputs i, a condition matches if (fp^i) & (sp^i) returns a set of zero bits, where ^ is the bitwise exclusive or and & is the bitwise logical and. Algorithm 5 shows the C++ implementation of the encoding used in Alecsys and the corresponding matching, taken from [3]. The condition is represented as two variables, fp and sp, using the Standard Template Library (STL) bitset class [11], which encodes a set of bits; the condition matches if the resulting bitset has all the bits set to zero, i.e., if result.none() returns true.
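With this encoding, the match test reduces to two XORs and one AND; a simplified C++ sketch in the spirit of Algorithm 5, using std::bitset with an illustrative fixed width and names of our own choosing:

```cpp
#include <bitset>
#include <cassert>

const std::size_t NBITS = 8;  // illustrative condition width

// Alecsys-style encoding: a '1' sets the bit in both fp and sp, a '0'
// clears it in both, and '#' sets it in exactly one of the two.
bool ternaryMatch(const std::bitset<NBITS>& fp,
                  const std::bitset<NBITS>& sp,
                  const std::bitset<NBITS>& input) {
    // For '#' bits, fp^input and sp^input always differ, so the AND
    // clears them; a specific bit survives only when the input disagrees.
    return ((fp ^ input) & (sp ^ input)).none();
}
```

For example, the condition "bit0 = 1, bit1 = #, bit2 = 0" is encoded as fp = 0b001 and sp = 0b011, and matches any input whose bit 0 is 1 and bit 2 is 0.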
We can apply the same approach we used for interval-based conditions to speed up the matching of ternary conditions using CUDA. For this purpose, we need to modify Alecsys's encoding as follows. A classifier condition is still represented using two arrays, fp and sp, each one representing part of the condition. In the case of the GPU representation, however, the first array fp encodes only the specific positions while the second array sp encodes only the general positions. As a result, this encoding reduces the number of bitwise operations
// CUDA kernel: ternary matching, column representation.
// Thread tidx reads word i of classifier tidx at fp[i*N+tidx], so
// adjacent threads access adjacent (coalesced) memory locations.
__global__ void match_ternary_cols(const unsigned int *input,
                                   const unsigned int *fp,
                                   const unsigned int *sp,
                                   int *matched, int N, int m)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tidx < N)
    {
        int has_matched = 1, i = 0;
        while (has_matched && i < m)
        {
            unsigned int fp_i = fp[i*N+tidx];
            unsigned int sp_i = sp[i*N+tidx];
            unsigned int input_i = input[i];
            has_matched = ( ( (fp_i ^ input_i) & sp_i ) == 0 );
            i++;
        }
        matched[tidx] = has_matched;
    }
}
6 Experimental Results
In this work, we used an experimental design similar to the one applied in [3], which was inspired by the earlier work of Llorà and Sastry [14]. We generated a population of N interval-based or ternary conditions of length n with different generality and 1000 random input configurations. For interval-based conditions, the generality of a random population was determined by setting an adequate value of the parameter r0 (see [19] for details); for ternary conditions, generality was set using the don't care probability P#. We matched each random input against the N conditions using one of the kernels previously discussed and measured the average time required to perform all the match operations using the functions provided by the CUDA distribution. We repeated this procedure 10 times. Overall, we tested two matching kernels (one using the representation by rows and one using the representation by columns) on the CPU2, on a Tesla C1060, and on a GeForce 9600 GT (see Appendix A). The performance was measured as the average CPU time to perform the 1000 matches over the N conditions. The reported average performance takes into account (i) the time to load each one of the 1000 inputs to be matched into the GPU; and (ii) the time to move the result vector from the GPU to main CPU memory.
Table 1 reports the average matching time for one condition using either the CPU, a Tesla C1060 GPU, or a GeForce 9600 GT GPU, when (i) the number of inputs n is 10, 100, or 1000; (ii) the generality is chosen in {0.25, 0.50, 0.75, 0.90, 0.95}; (iii) the population size N is either 1000 (Table 1a), 10000 (Table 1b), or 100000 (Table 1c); and (iv) the representation is either row-based or column-based.
As expected, the Tesla C1060 is always faster than the GeForce 9600 GT which, on the other hand, has only 512 MB of memory and cannot manage a large population of 100000 classifiers (Table 1c). As anticipated, the column-based representation results in superior performance on GPUs. However, on the CPU, the column-based representation can be 10 times slower than its row-based counterpart. In fact, on the CPU, row-based matching allows for the caching of contiguous data positions (both condition bounds and inputs), which significantly speeds up the matching process. In contrast, column-based matching accesses data positions in a scattered way with respect to the storage, resulting in slower matching on CPUs.
2 Experiments have been performed on a 2 quad-core Xeon (2.66 GHz) with 8 GB of RAM running Linux Fedora Core 6.
Table 1. Time (ms) required to match 1000 instances when the problem consists of 10, 100 or 1000 real inputs; the population size N is (a) 1000, (b) 10000, and (c) 100000; the population generality gen varies between 0.25 and 0.95. Entries report the mean ± standard deviation over 10 runs.
n gen CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
10 0.25 0.029±0.002 0.032±0.001 0.047±0.000 0.045±0.000 0.046±0.002 0.044±0.002
10 0.50 0.032±0.002 0.035±0.002 0.047±0.000 0.045±0.000 0.052±0.002 0.052±0.006
10 0.75 0.032±0.003 0.037±0.003 0.047±0.000 0.045±0.000 0.056±0.001 0.054±0.004
10 0.90 0.030±0.003 0.034±0.002 0.047±0.000 0.045±0.000 0.059±0.002 0.055±0.002
10 0.95 0.029±0.002 0.033±0.002 0.047±0.000 0.045±0.000 0.059±0.001 0.055±0.001
100 0.25 0.159±0.010 0.170±0.008 0.162±0.001 0.147±0.000 0.246±0.005 0.206±0.006
100 0.50 0.207±0.011 0.216±0.014 0.167±0.000 0.147±0.000 0.297±0.001 0.252±0.010
100 0.75 0.237±0.004 0.250±0.006 0.171±0.000 0.147±0.000 0.350±0.010 0.295±0.011
100 0.90 0.257±0.005 0.268±0.003 0.174±0.001 0.147±0.000 0.381±0.017 0.315±0.002
100 0.95 0.265±0.009 0.277±0.009 0.175±0.000 0.147±0.000 0.384±0.005 0.325±0.007
1000 0.25 1.654±0.029 7.487±0.121 1.463±0.006 1.148±0.001 3.009±0.056 1.678±0.026
1000 0.50 2.154±0.026 9.423±0.094 1.588±0.005 1.148±0.000 3.678±0.051 1.982±0.037
1000 0.75 2.537±0.013 11.002±0.073 1.659±0.005 1.148±0.000 4.133±0.038 2.222±0.039
1000 0.90 2.719±0.022 11.815±0.065 1.694±0.004 1.148±0.000 4.374±0.028 2.362±0.025
1000 0.95 2.779±0.017 12.101±0.049 1.703±0.003 1.148±0.001 4.444±0.033 2.392±0.012
(a)
n gen CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
10 0.25 0.282±0.009 0.321±0.009 0.136±0.001 0.091±0.001 0.258±0.006 0.204±0.005
10 0.50 0.312±0.017 0.353±0.015 0.153±0.001 0.091±0.001 0.316±0.002 0.241±0.001
10 0.75 0.300±0.003 0.345±0.018 0.164±0.001 0.091±0.001 0.365±0.004 0.272±0.005
10 0.90 0.284±0.008 0.324±0.011 0.168±0.001 0.091±0.001 0.390±0.007 0.291±0.006
10 0.95 0.277±0.009 0.318±0.013 0.169±0.001 0.091±0.001 0.399±0.005 0.294±0.004
100 0.25 2.442±0.041 3.651±0.094 0.935±0.004 0.367±0.001 2.244±0.009 1.468±0.008
100 0.50 2.663±0.046 4.106±0.077 1.067±0.004 0.368±0.000 2.829±0.007 1.878±0.005
100 0.75 2.799±0.039 4.506±0.080 1.155±0.002 0.369±0.001 3.307±0.008 2.207±0.005
100 0.90 2.869±0.026 4.643±0.068 1.202±0.002 0.368±0.001 3.576±0.006 2.395±0.009
100 0.95 2.884±0.013 4.701±0.067 1.218±0.001 0.369±0.001 3.657±0.005 2.448±0.003
1000 0.25 19.232±0.106 121.913±0.574 29.569±0.208 2.505±0.002 117.867±0.535 10.975±0.112
1000 0.50 23.703±0.083 158.491±0.753 41.444±0.132 2.512±0.001 164.823±0.680 12.382±0.211
1000 0.75 27.025±0.103 189.205±0.427 44.579±0.109 2.510±0.001 199.877±0.362 14.107±0.243
1000 0.90 28.793±0.094 205.408±0.642 44.872±0.076 2.511±0.001 216.719±0.288 14.912±0.324
1000 0.95 29.354±0.098 210.484±0.627 44.839±0.076 2.511±0.001 222.016±0.283 15.206±0.482
(b)
n gen CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
10 0.25 3.042±0.016 5.598±0.092 1.079±0.002 0.628±0.001 2.355±0.009 1.795±0.008
10 0.50 3.290±0.022 5.891±0.119 1.242±0.003 0.630±0.002 2.951±0.007 2.180±0.011
10 0.75 3.265±0.017 5.733±0.086 1.353±0.002 0.632±0.002 3.433±0.007 2.487±0.015
10 0.90 3.073±0.025 5.592±0.077 1.396±0.001 0.630±0.002 3.694±0.013 2.659±0.031
10 0.95 2.983±0.028 5.536±0.062 1.407±0.001 0.630±0.002 3.772±0.003 2.699±0.007
100 0.25 29.947±0.086 42.500±0.124 9.171±0.010 3.329±0.001 21.414±0.091 13.847±0.053
100 0.50 31.040±0.076 47.296±0.149 10.499±0.007 3.340±0.001 27.100±0.084 17.631±0.104
100 0.75 31.086±0.093 51.373±0.187 11.329±0.010 3.341±0.001 32.071±0.036 20.926±0.074
100 0.90 30.946±0.061 53.472±0.231 11.757±0.007 3.341±0.001 34.769±0.043 22.711±0.035
100 0.95 30.849±0.060 54.254±0.205 11.900±0.005 3.341±0.001 35.654±0.026 23.304±0.056
1000 0.25 192.730±0.724 1253.090±4.225 329.946±0.878 23.186±0.075 - -
1000 0.50 236.868±0.827 1640.120±3.299 423.478±0.421 23.286±0.047 - -
1000 0.75 270.308±0.831 1950.790±3.422 445.080±0.353 23.306±0.054 - -
1000 0.90 287.616±0.792 2120.160±2.686 449.720±0.333 23.255±0.052 - -
1000 0.95 292.706±0.729 2177.270±2.225 450.194±0.289 23.299±0.067 - -
(c)
experiments performed, the classifiers' generality ranges between 0.25 and 0.95; thus, at least one out of four classifiers will match. On the GPUs many matches are run in parallel and the overall matching time depends on the slowest match. Accordingly, even when the classifiers' generality is 0.25 the overall matching time is almost the same. However, this does not happen with the GeForce 9600 GT, where the results are very similar to those of the CPU, i.e., the matching time increases with the classifiers' generality (as in [3]). This is due to the stricter requirements that the GeForce 9600 GT poses on the memory access pattern. To maximize parallelization with the GeForce 9600 GT, cores need to access memory positions that are both contiguous and adequately aligned, whereas the Tesla C1060 only poses constraints on the former. As more and more matches are performed, the memory access pattern of the GeForce 9600 GT tends to diverge (accessed memory positions become more and more misaligned), resulting in a worsening of the overall performance.
Table 2. Time (ms) required to match 1000 instances when the problem size is 32, 512, 1024, 4096 or 10240 bits; the population size N is 1000; and the don't care probability P# is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Entries report the mean ± standard deviation over 10 runs.
n P# CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
32 0.0 0.003±0.000 0.004±0.000 0.034±0.000 0.034±0.001 0.030±0.003 0.028±0.001
32 0.25 0.003±0.000 0.004±0.000 0.035±0.000 0.034±0.000 0.029±0.002 0.029±0.002
32 0.5 0.004±0.000 0.004±0.000 0.035±0.000 0.035±0.000 0.029±0.001 0.029±0.001
32 0.75 0.004±0.000 0.004±0.000 0.035±0.001 0.035±0.000 0.029±0.001 0.029±0.002
32 0.99 0.006±0.000 0.004±0.000 0.035±0.001 0.035±0.001 0.029±0.001 0.029±0.002
32 1.0 0.004±0.000 0.004±0.000 0.035±0.001 0.035±0.000 0.029±0.001 0.029±0.002
512 0.0 0.006±0.000 0.005±0.000 0.037±0.000 0.035±0.000 0.032±0.001 0.031±0.001
512 0.25 0.006±0.000 0.005±0.000 0.037±0.001 0.035±0.000 0.033±0.001 0.033±0.002
512 0.5 0.006±0.000 0.005±0.000 0.037±0.000 0.035±0.000 0.033±0.002 0.032±0.002
512 0.75 0.006±0.001 0.005±0.000 0.038±0.000 0.037±0.000 0.033±0.002 0.032±0.001
512 0.99 0.034±0.002 0.038±0.003 0.051±0.000 0.050±0.000 0.052±0.001 0.050±0.001
512 1.0 0.041±0.003 0.047±0.004 0.061±0.000 0.050±0.001 0.085±0.002 0.077±0.001
1024 0.0 0.007±0.001 0.006±0.000 0.038±0.001 0.037±0.001 0.035±0.001 0.035±0.002
1024 0.25 0.007±0.001 0.006±0.000 0.037±0.000 0.037±0.001 0.037±0.003 0.034±0.001
1024 0.5 0.007±0.001 0.006±0.000 0.038±0.001 0.037±0.000 0.036±0.002 0.036±0.003
1024 0.75 0.008±0.001 0.006±0.000 0.039±0.000 0.038±0.001 0.038±0.002 0.037±0.002
1024 0.99 0.037±0.002 0.037±0.002 0.064±0.000 0.066±0.000 0.066±0.004 0.063±0.001
1024 1.0 0.080±0.005 0.091±0.007 0.069±0.001 0.067±0.001 0.152±0.011 0.133±0.004
4096 0.0 0.016±0.001 0.014±0.001 0.047±0.001 0.046±0.001 0.058±0.003 0.055±0.003
4096 0.25 0.015±0.001 0.014±0.001 0.047±0.001 0.045±0.001 0.057±0.002 0.054±0.002
4096 0.5 0.015±0.001 0.013±0.001 0.046±0.001 0.045±0.001 0.058±0.002 0.053±0.000
4096 0.75 0.016±0.001 0.013±0.001 0.048±0.001 0.046±0.001 0.057±0.002 0.056±0.003
4096 0.99 0.047±0.003 0.045±0.002 0.086±0.001 0.088±0.001 0.102±0.005 0.093±0.002
4096 1.0 0.312±0.004 0.350±0.011 0.276±0.001 0.170±0.001 0.633±0.002 0.436±0.002
10240 0.0 0.031±0.002 0.029±0.002 0.061±0.002 0.061±0.002 0.096±0.002 0.094±0.004
10240 0.25 0.031±0.002 0.029±0.002 0.062±0.002 0.062±0.002 0.098±0.005 0.093±0.002
10240 0.5 0.031±0.002 0.029±0.002 0.062±0.002 0.061±0.002 0.099±0.005 0.095±0.004
10240 0.75 0.032±0.002 0.029±0.002 0.063±0.002 0.061±0.002 0.097±0.003 0.095±0.003
10240 0.99 0.067±0.004 0.064±0.004 0.101±0.002 0.103±0.002 0.142±0.004 0.133±0.003
10240 1.0 0.768±0.002 3.491±0.020 0.435±0.001 0.373±0.001 1.335±0.002 0.969±0.003
Table 3. Time (ms) required to match 1000 instances when the problem size is 32, 512, 1024, 4096 or 10240 bits; the population size N is 10000; and the don't care probability P# is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs.
Table 4. Time (ms) required to match 1000 instances when the problem size is 32, 512, 1024, 4096 or 10240 bits; the population size N is 100000; and the don't care probability P# is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs.
report the average matching time for one condition when N is 1000 (Table 2), 10000 (Table 3), and 100000 (Table 4).
The results confirm several of the previous findings. Column-based matching outperforms row-based matching on GPUs. The Tesla C1060 is generally faster than the GeForce 9600 GT, as expected. Again, in the smaller population, the CPU is generally faster than both GPUs. In addition, also with 10000 classifiers, when conditions have only 32 to 512 binary inputs (i.e., when they are represented by one to 16 unsigned integers), the CPU is faster; as the population size or the number of inputs increases, the GPUs outperform the CPU on larger problems. When P# ≥ 0.99, the speedup provided by the Tesla C1060 with respect to the column-based implementation on the CPU can be close to 50x. Compared to the row-based implementation on the CPU, the results show that the Tesla C1060 implementation outperforms the CPU on medium and big problems (when n > 512 and N ≥ 10000) with a speedup near 20x.
As before, column-based matching outperforms row-based matching on GPUs. However, while with interval-based conditions the CPU performed best with the row-based implementation, in this case the column-based implementation always performs better except when classifiers are fully general (i.e., P# = 1.0). To understand this result, we need to consider the memory access patterns in the two implementations. When P# is not very high, i.e., P# < 0.99, the probability of matching is easily close to zero when more than a few dozen inputs are considered3. Accordingly, the matching process is very likely to stop very early, before the first 100 bits have been tested. This is why the average matching times of classifiers with P# in the range [0, 0.75] are very close. As a result, with the column-based representation only the small memory areas where the first bits are stored are accessed. Thus, in this case, the cache locality is exploited across the matching of several classifiers: as the matching involves few initial inputs for each classifier, once the data is loaded for matching one classifier, it is readily available for the following ones. In contrast, in the row-based implementation, the pattern of memory accesses spreads over the whole memory where the classifiers are allocated. On the other hand, when the classifiers are fully general (i.e., when P# = 1.0), the matching involves all the inputs for all the classifiers. Accordingly, the locality is fully exploited by the row-based implementation, because it performs a sequential memory access pattern. In contrast, in this case, the memory access pattern of the column-based representation is highly inefficient.
7 Conclusions
In this paper, we studied GPU-based parallelization of the matching in learning classifier systems for real inputs (using interval-based conditions) and binary
3 The probability that a classifier generated with a don't care probability P# matches an input with n bits is ((1+P#)/2)^n; thus, when P# = 0.75, the probability of matching an input of size n = 100 is lower than 10^-5.
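The bound stated in the footnote can be checked numerically; a small sketch of the footnote's formula (a sanity check only, not code from the paper):

```cpp
#include <cassert>
#include <cmath>

// Probability that a ternary condition drawn with don't-care probability
// pHash matches a uniformly random n-bit input: ((1 + pHash) / 2)^n.
double matchProbability(double pHash, int n) {
    return std::pow((1.0 + pHash) / 2.0, n);
}
```

For P# = 0.75 and n = 100 this evaluates to roughly 1.6e-6, consistent with the footnote's bound of 10^-5.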
References
1. Butz, M.V.: XCS (+ tournament selection) classifier system implementation in C, version 1.2. Technical Report 2003023, Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign (2003)
2. Butz, M.V.: Kernel-based, ellipsoidal conditions in the real-valued XCS classifier system. In: Beyer, H.-G., O'Reilly, U.-M. (eds.) GECCO 2005, pp. 1835-1842. ACM, New York (2005)
3. Butz, M.V., Lanzi, P.L., Llorà, X., Loiacono, D.: An analysis of matching in learning classifier systems. In: Ryan, C., Keijzer, M. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2008, Atlanta, GA, USA, July 12-16. ACM Press, New York (2008)
4. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Hyper-ellipsoidal conditions in XCS: rotation, linear approximation, and solution structure. In: Cattolico [6], pp. 1457-1464
5. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. Journal of Soft Computing 6(3-4), 144-153 (2002)
6. Cattolico, M. (ed.): Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2006, Seattle, Washington, USA, July 8-12. ACM, New York (2006)
7. Dorigo, M., Colombetti, M.: Robot Shaping: An Experiment in Behavior Engineering. MIT Press/Bradford Books (1998)
8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
9. Holland, J.H.: Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms Applied to Parallel Rule-Based Systems. In: Mitchell, Michalski, Carbonell (eds.) Machine Learning, an Artificial Intelligence Approach, vol. II, ch. 20, pp. 593-623. Morgan Kaufmann, San Francisco (1986)
10. Holland, J.H., Reitman, J.S.: Cognitive systems based on adaptive algorithms (1978); Reprinted in: Fogel, D.B. (ed.): Evolutionary Computation. The Fossil Record. IEEE Press, Los Alamitos (1998) ISBN: 0-7803-3481-7
11. Josuttis, N.M.: The C++ Standard Library: A Tutorial and Reference. Addison-Wesley Professional, Reading (1999)
12. Lanzi, P.L.: The XCS library (2002)
13. Lanzi, P.L., Wilson, S.W.: Using convex hulls to represent classifier conditions. In: Cattolico [6], pp. 1481-1488
14. Llorà, X., Sastry, K.: Fast rule matching for learning classifier systems via vector instructions. In: Cattolico [6], pp. 1513-1520
15. Stone, C., Bull, L.: For real! XCS with continuous-valued inputs. Evolutionary Computation 11(3), 298-336 (2003)
16. Wilson, S.W.: ZCS: A zeroth level classifier system. Evolutionary Computation 2(1), 1-18 (1994), http://prediction-dynamics.com/
17. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2), 149-175 (1995)
18. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209-222. Springer, Heidelberg (2000)
19. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS, vol. 1996, pp. 158-176. Springer, Heidelberg (2001)
A Device Specifications
Table 5. Specification of GeForce 9600GT
1 Introduction
Association rule mining [2] aims at extracting interesting associations among the
attributes, i.e., associations that occur with a certain frequency and strength,
of repositories of unlabeled data. Research conducted on association rule mining
was originally focused on extracting rules that identified strong relationships be-
tween the occurrence of two or more attributes or items in collections of binary
data, e.g., if item X occurs, then item Y will also occur [2,3,14]. Later on, sev-
eral researchers concentrated on extracting association rules from data described
by continuous attributes [10,22], which posed new challenges to the field. Several
algorithms applied a discretization method in advance to transform
the original data into binary values [16,18,22,24] and then used a binary associa-
tion rule miner. This led to further research on designing discretization procedures
that avoid losing useful information. Other approaches mined interval-based as-
sociation rules and permitted the algorithm to independently move the interval
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 21–37, 2010.
© Springer-Verlag Berlin Heidelberg 2010
22 A. Orriols-Puig and J. Casillas
bound of each rule's variable [17]. Also, fuzzy modeling was introduced to create
fuzzy association rules (e.g., see [13,15]).
Association rules are widely used in various areas such as telecommunication
networks, market and risk management, and inventory control. All these appli-
cations are characterized by generating data online, so that data may be made
available in the form of streams [1,19]. Nonetheless, all the aforementioned algo-
rithms were designed for static collections of data. Learning from data streams
has received special attention in the last few years, particularly in
supervised learning [1,19]. However, few proposals of online binary association
rule miners can be found in the literature, and most of them are only able to
deal with problems with categorical attributes (e.g., see [23]).
In this paper, we address the problem of mining association rules from streams
of examples online. We propose a learning classifier system (LCS) whose archi-
tecture is inspired by XCS [25,26] and UCS [6], which we refer to as the classifier
system for association rule mining (CSar). CSar uses an interval-based repre-
sentation for evolving quantitative association rules from data with continuous
attributes and a discrete representation for categorical attributes. The system
receives a stream of unlabeled examples, which are used to create new rules and
to tune the parameters of the existing ones, with the aim of evolving as many
interesting rules as possible. CSar is first compared with Apriori [3], one of the
most cited algorithms in the association rule mining realm, on a problem
defined only by categorical attributes. Apriori considers all the possible combinations
of attribute values to create all interesting association rules (notice that this
approach can only be used in domains with categorical data). The results on
this problem indicate that CSar can evolve rules of similar interest to those
created by Apriori. The experimentation is then extended by considering a collection of
real-world problems and by analyzing the behavior of different configurations of
CSar on these problems. The results show that CSar is able to create highly
supported and interesting interval-based association rules in which the intervals
have not been prefixed by a discretization algorithm.
The remainder of this paper is organized as follows. Section 2 provides the
basic concepts of association rules and reviews the main proposals in the lit-
erature for both binary and quantitative association rule mining. Section 3
describes our proposal in detail. Section 4 explains the methodology followed in
the experiments, and Section 5 analyzes the results of these experiments. Fi-
nally, Section 6 summarizes, concludes, and outlines future work.
2 Framework
Before proceeding with the description of our proposal, this section introduces
some important concepts of association rules. We first describe the problem of
extracting association rules from categorical data. Then, we extend the problem
to mining association rules from data with continuous attributes and review
different proposals that can be found in the literature.
Evolution of Interesting Association Rules 23
on algorithms that were able to extract association rules from databases that
contained quantitative attributes.
Srikant and Agrawal [22] designed an Apriori-like approach to mine quantita-
tive association rules. The authors used an equi-depth partitioning to transform
continuous attributes into categorical attributes. Moreover, the authors identified
the problem of the sharp boundary between discrete intervals, which highlighted
that quantitative mining algorithms may either ignore or over-emphasize the
items that lie near the boundary of intervals. Attempting to address this prob-
lem, several authors applied different clustering mechanisms to extract the best
possible intervals from the data [16,18]. A completely different approach was
taken in [17], where a genetic-algorithm-based technique was used to evolve
interval-based association rules without applying any discretization procedure
to the variables. The GA was responsible for creating new promising association
rules and for evolving the intervals of the variables of the association rules. The
problem associated with creating variables with unbounded intervals is that, in
general, the support for small intervals is smaller than the support for large in-
tervals, which pushes the system toward creating rules with large intervals that
cover nearly all the domain. To avoid this, the system penalized the fitness of
rules that had large intervals. In [20] a similar approach was followed. The authors
proposed a framework in which finding good intervals from which interesting
association rules could be extracted was addressed as an optimization problem.
As done in [17,20], CSar does not apply any discretization mechanism to the
original data and interval bounds are evolved by the genetic procedure. The
main novelty of our proposal is that association rules are not mined from static
databases but from streams of examples. This characteristic guides some parts
of the algorithm design, which is described in detail in the next section.
3 Description of CSar
CSar is a Michigan-style LCS for mining interval-based association rules from
data that contain both quantitative and categorical attributes. The learning
architecture of CSar is inspired by UCS [6] and XCS [25,26]. CSar aims at evolv-
ing populations of interesting association rules, i.e., rules with large support and
confidence. For this purpose, CSar evaluates a set of association rules online and
evolves this rule set by means of a steady-state genetic algorithm (GA) [11,12]
that is applied to population niches. In the following, a detailed description of the
system is provided, focusing on the differences in the knowledge representation
and learning process with respect to those of XCS and UCS.
values). This list is used by the mutation operator with the aim of preventing the
existence of intervals that cover the same examples but are slightly different. In
the following, we provide details about (1) the covering operator, (2) the procedures
to create association set candidates, (3) the association set subsumption mech-
anism, and (4) the parameter update procedure. The next section explains the
discovery component in more detail. It is worth noting that some of the operators
are similar to those of several existing systems, such as the ones described in [5,9].
where maxInt is the maximum interval length. Finally, one of the previously
unselected variables is randomly chosen to form the consequent of the rule,
which is initialized following the same procedure. Note that the association rule
created is supported by, at least, the sampled example.
Grouping by antecedent. This strategy considers that two rules are similar
if they have exactly the same variables in their antecedent, regardless of
their corresponding values Vi. Therefore, this grouping strategy creates Na
association set candidates, where Na is the number of distinct antecedent
variable sets among the rules in [M]. Each association set contains rules that
have exactly the same variables in the antecedent. The underlying idea is
that rules with the same antecedent may express similar knowledge. Note
that, under this strategy, rules with different variables in the consequent can
be grouped in the same association set.
Grouping by consequent. This strategy groups in the same association set
the classifiers in [M] that have the same variable in the consequent, with
where l1, l2, u1, and u2 are the lower and upper bounds of the con-
sequent variable of r1 and r2 for a continuous attribute, x1 and x2 are the
values of the consequent variable for a categorical attribute, and ord(xi)
maps each categorical value to a numeric value. It is worth noting that, given
two continuous variables with the same lower bound in the interval, we sort
first the rule with the most general variable (i.e., the rule with the larger ui).
We take this approach with the aim of forming association set candidates
with the largest number of overlapping classifiers, by using the procedure
explained in the following.
Once [M] has been sorted, the association set candidates are built as fol-
lows. At the beginning, an association set candidate is created and the first
classifier in [M] is added to it. Then, the following classifier is added if it has
the same variable in the consequent and its lower bound is smaller than the
minimum upper bound of the classifiers in the association set. This process
is repeated until finding the first classifier that violates this condition. In this
case, a new association set candidate is created, and the same process is
applied to add new classifiers to this association set. The underlying idea of
this association set strategy is that rules that explain the same region of the
consequent may denote the same associations among variables.
The cost of both methodologies for creating the association sets is driven by the
cost of sorting the population. We applied a quicksort strategy for this purpose,
which has an average cost of O(n log n), where n is the match set size.
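The two grouping strategies can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the rule encoding, with `antecedent` as a map from variable name to condition and `consequent` as a (variable, interval) pair, is an assumption:

```python
from collections import defaultdict

def group_by_antecedent(match_set):
    """One association set candidate per distinct set of antecedent variables."""
    groups = defaultdict(list)
    for rule in match_set:
        groups[frozenset(rule["antecedent"])].append(rule)
    return list(groups.values())

def group_by_consequent(match_set):
    """Sort [M] by consequent variable and lower bound (on ties, the more
    general rule, i.e. larger upper bound, comes first), then sweep: a rule
    joins the current candidate while it shares the consequent variable and
    its lower bound stays below the candidate's minimum upper bound."""
    ordered = sorted(match_set,
                     key=lambda r: (r["consequent"][0],
                                    r["consequent"][1][0],
                                    -r["consequent"][1][1]))
    candidates, current, var, min_upper = [], [], None, None
    for rule in ordered:
        v, (lo, hi) = rule["consequent"]
        if current and v == var and lo < min_upper:
            current.append(rule)
            min_upper = min(min_upper, hi)
        else:
            if current:
                candidates.append(current)  # close the previous candidate
            current, var, min_upper = [rule], v, hi
    if current:
        candidates.append(current)
    return candidates

# Three rules: the first two overlap on consequent y, the third does not.
match_set = [
    {"antecedent": {"x1": (0.0, 0.5)}, "consequent": ("y", (0.0, 0.5))},
    {"antecedent": {"x1": (0.0, 0.5), "x2": (0.1, 0.4)}, "consequent": ("y", (0.2, 0.8))},
    {"antecedent": {"x2": (0.3, 0.6)}, "consequent": ("y", (0.6, 0.9))},
]
```

As the text notes, the dominant cost in both strategies is the initial sort of the match set.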
is more general than rj. A rule ri is more general than rj if all the input and
output variables of ri are also defined in rj, each categorical variable of ri
has the same value as the corresponding variable in rj, and the interval [li, ui] of
each continuous variable in ri includes the interval [lj, uj] of the corresponding
variable in rj (i.e., li ≤ lj and ui ≥ uj).
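This generality relation translates directly into code; a minimal sketch, with the rule encoding assumed to be a map from variable name to either a categorical value or a (lower, upper) interval over all input and output variables:

```python
def is_more_general(ri, rj):
    """True if ri is more general than rj: every variable defined in ri is
    also defined in rj, categorical values match exactly, and each continuous
    interval of ri contains the corresponding interval of rj."""
    for var, cond in ri.items():
        if var not in rj:
            return False
        if isinstance(cond, tuple):        # continuous: (lower, upper)
            li, ui = cond
            lj, uj = rj[var]
            if not (li <= lj and ui >= uj):
                return False
        elif cond != rj[var]:              # categorical: values must be equal
            return False
    return True
```

Note the asymmetry: a rule with fewer, wider variables can subsume a more specific one, but not vice versa.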
The two parents are copied into offspring ch1 and ch2, which undergo crossover
and mutation if required.
The system applies uniform crossover with probability Pχ. First, it considers
each variable in the antecedent of both rules. If only one parent has the vari-
able, one child is randomly selected and the variable is copied to this child. If
both parents contain the variable, this variable is copied to each offspring. The
procedure ensures that, at the end of the process, each offspring has at least
one input variable. Then, the rule consequent is crossed by adding to the first
offspring the consequent of one of the parents (which is randomly selected) and
adding to the remaining offspring the consequent of the other parent.
Three types of mutation can be applied to a rule: (1) introduction/removal
of antecedent variables (with probability PI/R), (2) mutation of variables' val-
ues (with probability Pμ), and (3) mutation of the consequent variable (with
probability PC). The first type of mutation chooses randomly whether a new
antecedent variable has to be added to the rule or one of the antecedent variables
has to be removed from it. If a variable has to be added, one of the non-existing
variables is randomly selected and added to the rule. This operation can only be
applied if the rule does not have all the possible variables. If a variable has to be
removed, one of the existing variables is randomly selected and removed from the
rule. This operation can only be applied if the rule has at least two variables in
the antecedent. The second type of mutation selects one of the existing variables
of the rule and mutates its value. For continuous variables, two random amounts
ranging in [-m0, m0] are added to the lower bound and the upper bound respec-
tively, where m0 is a user-set parameter. If the interval surpasses the maximum
length or the lower bound becomes greater than the upper bound, the interval
is repaired. Finally, the lower and upper bounds of the mutated variable
are approximated to the closest value in the list of seen values for this variable.
This process is applied to avoid having rules in the population with very similar
interval bounds in their variables, since having all of them may not only provide no
additional knowledge, but also hinder human experts from reading the whole
population. For categorical variables, a new value for the variable is randomly
selected. The last type of mutation randomly selects one of the variables in the
antecedent and exchanges it with the output variable.
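The continuous-variable case of the second mutation type can be sketched as follows. This is a reconstruction under assumptions: `m0` and `max_len` stand for the user-set perturbation and maximum-interval-length parameters, and `seen_values` is the per-variable list of observed values mentioned earlier:

```python
import random

def mutate_interval(lower, upper, seen_values, m0=0.2, max_len=0.5):
    """Perturb both bounds by random amounts in [-m0, m0], repair the
    interval if it becomes inverted or exceeds the maximum length, then
    snap each bound to the closest value seen so far for this variable."""
    lower += random.uniform(-m0, m0)
    upper += random.uniform(-m0, m0)
    if lower > upper:                        # repair: inverted interval
        lower, upper = upper, lower
    if upper - lower > max_len:              # repair: over-long interval
        upper = lower + max_len
    snap = lambda v: min(seen_values, key=lambda s: abs(s - v))
    return snap(lower), snap(upper)
```

Snapping both bounds to previously seen values is what keeps nearly identical intervals from accumulating in the population.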
After crossover and mutation, the new offspring are introduced into the pop-
ulation. First, each classifier is checked for subsumption [26] with its parents.
To decide if any parent can subsume the offspring, the same procedure explained
for association set subsumption is followed. If any parent is identified as a possi-
ble subsumer for the offspring, the offspring is not inserted and the numerosity
of the parent is increased by one. Otherwise, we check [A] for the most general
rule that can subsume the offspring. If no subsumer can be found, the classifier
is inserted into the population.
If the population is full, excess classifiers are deleted from [P] with probabil-
ity proportional to their association set size estimate as. Moreover, if a classi-
fier k is sufficiently experienced (exp_k > θ_del) and its fitness F_k is significantly
lower than the average fitness F̄_[P] of the classifiers in [P] (F_k < δ F̄_[P], where
F̄_[P] = (1/N) Σ_{i∈[P]} F_i), its deletion probability is further increased. That is,
each classifier has a deletion probability p_k of

p_k = d_k / Σ_{j∈[P]} d_j ,    (10)

where

d_k = as_k num_k F̄_[P] / F_k    if exp_k > θ_del and F_k < δ F̄_[P],
d_k = as_k num_k                 otherwise.    (11)
Thus, the deletion algorithm balances the classifier allocation in the different
association sets by pushing toward the deletion of rules belonging to large correct
sets. At the same time, it biases the search toward highly fit classifiers, since the
deletion probability of rules whose fitness is much smaller than the average fitness
is increased.
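Equations (10) and (11) can be read as the following procedure. This is a sketch only; the classifier field names and the "significantly lower" fraction `delta` are assumptions in the spirit of XCS:

```python
def deletion_probabilities(population, theta_del=50, delta=0.1):
    """Compute each classifier's deletion probability: the vote d_k is the
    association set size estimate times numerosity, boosted by the ratio of
    mean fitness to the classifier's fitness when the classifier is
    experienced and markedly unfit; probabilities are the normalized votes."""
    n = sum(cl["num"] for cl in population)
    mean_f = sum(cl["F"] for cl in population) / n
    votes = []
    for cl in population:
        d = cl["as"] * cl["num"]
        if cl["exp"] > theta_del and cl["F"] < delta * mean_f:
            d *= mean_f / cl["F"]            # push unfit classifiers out faster
        votes.append(d)
    total = sum(votes)
    return [v / total for v in votes]
```

An experienced classifier whose fitness is far below average thus receives a vote inflated by the factor F̄/F, making its deletion much more likely.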
4 Experimental Methodology
After having carefully described the system, we are now in a position to exper-
imentally analyze the behavior of CSar. The aim of the experimental analysis
was to (1) study whether CSar could actually evolve a set of interesting associa-
tion rules and (2) examine the behavior of the system under different configurations.
With these objectives in mind, we performed the following two experiments.
Table 1. Properties of the data sets. The columns describe: the identifier of the data
set (Id.); the name of the data set (dataset); the number of instances (#Inst); the total
number of features (#Fea); the number of real features (#Re); the number of integer
features (#In); and the number of nominal features (#No).
Our first concern was to analyze whether CSar could evolve the most inter-
esting association rules despite having a fixed population size. Therefore,
we compared CSar with Apriori [3], probably the most influential association
rule miner, on the zoo problem [4]. We selected the zoo problem for this analysis
since Apriori only works on problems described by categorical attributes and
the zoo problem satisfies this requirement. More specifically, the zoo problem is
defined by (1) fifteen binary attributes which indicate whether the animal has
each of fifteen characteristics, such as a tail or hair, and (2) two
categorical attributes that can take more than two values and which represent
the number of legs and the type of animal.
Secondly, we studied the impact of using the two different procedures
to create association set candidates and of using progressively larger max-
imum intervals. For this purpose, we ran CSar (1) with both antecedent- and
consequent-grouping strategies to create association set candidates and (2) with
different maximum interval lengths on a collection of real-world problems ex-
tracted from the UCI repository [4] and from local repositories [7]. The charac-
teristics of these problems are reported in Table 1.
In all runs, CSar employed the following configuration: num iterations =
100 000, popSize = 6 400, conf0 = 0.95, ν = 10, mna = 10, {θ_del, θ_GA} = 50, exp
= 1000, Pχ = 0.8, {PI/R, Pμ, PC} = 0.1, m0 = 0.2. Association set subsumption
was activated in all runs.
32 A. Orriols-Puig and J. Casillas
With the aims of the experiments in mind, in what follows we discuss the
experimental results.
Fig. 1. Number of rules evolved with minimum support and confidence for the zoo
problem with (a) antecedent-grouping and (b) consequent-grouping strategies. The
curves are averages over five runs with different random seeds.
Fig. 2. Number of rules created by Apriori with minimum support and confidence for
the zoo problem. Lower confidence and support values are not shown since Apriori
creates all possible combinations of attributes, exponentially increasing the number
of rules.
Table 2. Comparison of the number of rules evolved by CSar with antecedent- and
consequent-grouping strategies to form the association set candidates with the number
of rules evolved by Apriori at high support and confidence values

                               Confidence
            antecedent grouping     consequent grouping         Apriori
Support      0.4     0.6     0.8     0.4     0.6     0.8     0.4   0.6   0.8
0.40       275±30  271±27  230±23   65±10   63±9    59±9    2613  2514  2070
0.50       123±4   123±4   106±3    61±8    61±8    58±8     530   523   399
That is, Apriori is a two-phase algorithm that exhaustively explores all the fea-
ture space, discovers all the itemsets with a minimum predefined support, and
creates all the possible rules with these itemsets. Therefore, some of the rules
supplied by Apriori are included in other rules. We consider that a rule r1 is
included in another rule r2 if r1 has, at least, the same variables with the same
values in the rule antecedent and the rule consequent as r2 (r1 may have more
variables). In the results provided herein, we removed from the final population
all the rules that were included in other rules. Thus, we provide an upper bound
of the number of different rules that can be generated.
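The inclusion test used to prune Apriori's output can be sketched as follows (the rule encoding, with `antecedent` and `consequent` as variable-to-value maps, is an assumption):

```python
def is_included(r1, r2):
    """True if r1 is included in r2: r1 has at least the same variable-value
    pairs, in both antecedent and consequent, as r2 (r1 may have more)."""
    return all(r1[part].get(var) == val
               for part in ("antecedent", "consequent")
               for var, val in r2[part].items())
```

Filtering a rule set with this test removes every rule that merely specializes another rule already in the set.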
Two important observations can be made from these results. Firstly, the re-
sults clearly show that Apriori can create a higher number of rules than CSar
(for the sake of clarity, Table 2 specifies the number of rules for support values
ranging from 0.4 to 1.0 and confidence values of {0.4, 0.6, 0.8}). This behavior
was expected, since CSar has a limited population size, while Apriori returns
all possible association rules. Nevertheless, it is worth noting that CSar and
Apriori found exactly the same number of highly interesting rules; that is, both
systems discovered two rules with both confidence and support higher than 0.8.
This highlights the robustness of CSar, whose mechanisms guide the system to
discover the most interesting rules.
Secondly, focusing on the results reported in Figure 1, we can see that the
populations evolved with the antecedent-grouping strategy are larger than those
built with the consequent-grouping strategy. This behavior will also be present,
and discussed in more detail, in the extended experimental analysis conducted
in the next subsection.
After showing that CSar can create highly interesting association rules in a
case-study problem characterized by categorical attributes, we now extend the
experimentation by running the system on 16 real-world data sets. We ran the
system with (1) antecedent-grouping and consequent-grouping strategies and (2)
Table 3. Average (± standard deviation of the) number of rules with support and
confidence greater than 0.60 created by CSar with antecedent- and consequent-grouping
strategies and with maximum interval sizes of MI = {0.10, 0.25, 0.50}. The average and
standard deviation are computed over five runs with different random seeds.
               antecedent                       consequent
        MI=0.10   MI=0.25   MI=0.50     MI=0.10   MI=0.25   MI=0.50
adl     135±3     294±15    567±66      46±1      74±3      147±23
ann     1736±133  1765±79   1702±135    478±86    525±112   489±34
aud     2206±80   2017±147  1999±185    1014±12   982±100   880±215
aut     84±14     192±7     710±106     25±6      58±3      188±6
bpa     11±4      174±15    365±42      17±2      100±4     123±22
col     134±14    188±7     377±64      180±13    191±7     198±8
gls     33±4      160±17    694±26      23±2      89±6      205±23
H-s     28±1      61±4      248±32      13±1      29±1      92±13
irs     0±0       0±0       50±5        0±0       0±0       28±8
let     0±0       113±17    991±40      0±0       103±6     205±13
pim     4±1       93±9      570±51      3±0       53±5      154±25
tao     0±0       0±0       8±1         0±0       0±0       5±2
thy     46±2      152±4     350±27      29±2      80±3      160±2
wdbc    0±0       419±43    1143±131    0±0       145±17    304±16
wne     116±9     273±48    536±34      26±3      65±9      137±17
wpbc    0±0       0±0       740±234     0±0       0±0       264±34
allowing intervals of maximum length maxInt = {0.1, 0.25, 0.5} for continuous
variables. Note that by using different grouping strategies we are changing the
way the system creates association set candidates; therefore, as competition
is held among rules within the same association set, the resulting rules can
differ in the two cases. On the other hand, an increasingly large interval
length for continuous variables enables the system to obtain more general rules.
Table 3 reports the number of rules, with confidence and support greater than
or equal to 0.6, created by the different configurations of CSar. All the reported
results are averages over five runs with different random seeds.
Comparing the results obtained with the two different grouping schemes, we
can see that the antecedent-grouping strategy yielded larger populations than
the consequent-grouping strategy, on average. This behavior was expected since
antecedent grouping creates smaller association sets and thus maintains
more diversity in the population. Nonetheless, a closer examination of the final
population indicates that the difference in the final number of rules decreases if
we only consider the rules with the highest confidence and support. For example,
considering all the rules with confidence and support greater than or equal to
0.60, the antecedent-grouping strategy results in populations 2.16 times bigger than
those of the consequent-grouping strategy. However, considering only the rules
with confidence and support greater than or equal to 0.85, the average difference
in population size is reduced to a factor of 1.12. This indicates that a large proportion
of the most interesting rules are discovered by both strategies. It is therefore worth
highlighting that the lower number of rules evolved by the consequent-
grouping strategy can be considered an advantage, since the strategy avoids
creating and maintaining uninteresting rules in the population, which implies a
lower computational time to evolve the population.
Focusing on the impact of varying the interval length, the results indicate that
for lower maximum interval lengths CSar tends to evolve rules with less support.
This behavior can be easily explained as follows. Large maximum interval lengths
enable the existence of highly general rules, which will have higher support.
Moreover, if both antecedent and consequent variables are maximally general,
rules will also have high confidence. Taking this idea to the extreme, rules that
contain variables whose intervals range from the minimum value to the maximum
value of the variable will have maximum confidence and support. Nonetheless,
these rules will be uninteresting for human experts. On the other hand, small
interval lengths may result in more interesting association rules, though too
small lengths may result in rules that denote strong associations but have less
support. This highlights a trade-off in the setting of this parameter, which should
be adjusted for each particular problem. As a rule of thumb, similarly to what can
be done with other association rule miners, the practitioner may start with
small interval lengths and increase them when the rules obtained do not have
enough support for the particular domain.
Acknowledgements
References
14. Houtsma, M., Swami, A.: Set-oriented mining of association rules. Technical Report
RJ 9567, Almaden Research Center, San Jose, California (October 1993)
15. Kaya, M., Alhajj, R.: Genetic algorithm based framework for mining fuzzy associ-
ation rules. Fuzzy Sets and Systems 152(3), 587–601 (2005)
16. Lent, B., Swami, A.N., Widom, J.: Clustering association rules. In: Proceedings of
the IEEE International Conference on Data Engineering, pp. 220–231 (1997)
17. Mata, J., Alvarez, J.L., Riquelme, J.C.: An evolutionary algorithm to discover
numeric association rules. In: SAC 2002: Proceedings of the 2002 ACM Symposium
on Applied Computing, pp. 590–594. ACM, New York (2002)
18. Miller, R.J., Yang, Y.: Association rules over interval data. In: SIGMOD 1997:
Proceedings of the 1997 ACM SIGMOD International Conference on Management
of Data, pp. 452–461. ACM, New York (1997)
19. Núñez, M., Fidalgo, R., Morales, R.: Learning in environments with unknown dy-
namics: Towards more robust concept learners. Journal of Machine Learning Re-
search 8, 2595–2628 (2007)
20. Salleb-Aouissi, A., Vrain, C., Nortet, C.: QuantMiner: A genetic algorithm for
mining quantitative association rules. In: Veloso, M.M. (ed.) Proceedings of the
2007 International Joint Conference on Artificial Intelligence, pp. 1035–1040 (2007)
21. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining as-
sociation rules in large databases. In: Proceedings of the 21st VLDB Conference,
Zurich, Switzerland, pp. 432–443 (1995)
22. Srikant, R., Agrawal, R.: Mining quantitative association rules in large relational
tables. In: Jagadish, H.V., Mumick, I.S. (eds.) Proceedings of the 1996 ACM
SIGMOD International Conference on Management of Data, Montreal, Quebec,
Canada, pp. 1–12 (1996)
23. Wang, C.-Y., Tseng, S.-S., Hong, T.-P., Chu, Y.-S.: Online generation of association
rules under multidimensional consideration based on negative border. Journal of
Information Science and Engineering 23, 233–242 (2007)
24. Wang, K., Tay, S.H.W., Liu, B.: Interestingness-based interval merger for numeric
association rules. In: Proceedings of the 4th International Conference on Knowledge
Discovery and Data Mining, KDD, pp. 121–128. AAAI Press, Menlo Park (1998)
25. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
149–175 (1995)
26. Wilson, S.W.: Generalization in the XCS classifier system. In: 3rd Annual Conf.
on Genetic Programming, pp. 665–674. Morgan Kaufmann, San Francisco (1998)
27. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolz-
mann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219.
Springer, Heidelberg (2000)
Coevolution of Pattern Generators and
Recognizers
Stewart W. Wilson
1 Introduction
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 38–46, 2010.
© Springer-Verlag Berlin Heidelberg 2010
vector (its index). Each position can be thought of as a place, and there is a
value there. An ordinary function is thus a mapping of values in places into
an outcome. Call it a place/value (PV) mapping. If you slide the values along
the places, or expand them from a point, the outcome is generally completely
different. The function depends on just which values are in which places.
Patterns, on the other hand, are relative place/relative value (RPRV) map-
pings. Often, a given instance can be transformed into another instance, but
with the same outcome, by a transformation that maintains the relative places
or values of the elements; for example, such transformations as scaling, trans-
lation, rotation, contrast, even texture. The RPRV property, however, makes
pattern recognition very difficult for machine learning methods that attach ab-
solute significance to input element positions and values.
There is considerable work on relative-value, or relational, learning systems,
e.g., in classifier systems [5,4], and in reinforcement learning generally [1]. But
for human-related pattern classes, what seems to be required is a method that
is intrinsically able to deal with both relative value and relative place. This
suggests that the method must be capable of transformations, both of its input
and in subsequent stages. The remainder of the paper lays out one proposal for
achieving this.
Fig. 1. S sends messages to F that are sniffed by E
output image. The results of evolving such trees of functions could be surprising
and beautiful. Sims' article gives a number of examples of the images, including
one (Figure 2) having the following symbolic expression:
(round (log (+ y (color-grad (round (+ (abs (round
(log (+ y (color-grad (round (+ y (log (invert y) 15.5))
x) 3.1 1.86 #(0.95 0.7 0.59) 1.35)) 0.19) x)) (log (invert
y) 15.5)) x) 3.1 1.9 #(0.95 0.7 0.35) 1.35)) 0.19) x).
Fig. 2. Evolved image from Sims [6]. Gray-scale rendering of color original. © 1991
Association for Computing Machinery, Inc. Reprinted with permission.
Such an image-generating program is a good starting point for us, except for
two missing properties. First, the program does not transform an input image;
its only inputs are x and y. Second, the program is deterministic: it is not able
to produce different outputs for the same image input, a property required in
order to produce image variants.
To transform an image, the program needs to take as input not only x and
y, but also the input image values. A convenient way to do this appears to be
to add the image to the function set. That is, add Im(x, y) to the function set,
where Im is a function that maps image points to image values of the current
input. For example, consider the expression
(* k (Im (- x x0) (- y y0))).
The effect is to produce an output that translates the input by x0 and y0 in the
x and y directions and alters its contrast by the factor k. It seems fairly clear
that adding the current input image, as a kind of function, to the function set
(it could apply at any stage), is quite general and would permit a great variety
of image transformations.
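To make the example concrete, the expression above can be mimicked with ordinary closures (Im, k, x0, and y0 as in the text; treating an image as a function of coordinates is an illustrative simplification, not the paper's implementation):

```python
def translate_and_scale(im, x0, y0, k):
    """Return a new image function: sample the input image at coordinates
    translated by (x0, y0) and scale the sampled value by k, mirroring the
    expression (* k (Im (- x x0) (- y y0)))."""
    return lambda x, y: k * im(x - x0, y - y0)

# A toy "input image" whose value is just its x coordinate.
im = lambda x, y: float(x)
out = translate_and_scale(im, x0=2, y0=0, k=0.5)
```

Because the output is again a function of (x, y), such transformations compose freely, which is what makes adding Im to the function set so general.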
44 S.W. Wilson
To allow different transformations from the same program is not difficult. One
approach is to include a switch function, Sw, in the function set. Sw would have
two inputs and would pass one or the other of them to its output depending on
the setting of a random variable at evaluation time (i.e., set when a new image is
to be processed and not reset until the next image). The random variable would
be a component of a vector of random binary variables, one variable for each
specific instance of Sw in the program. Then at evaluation time, the random
vector would be re-sampled and the resulting component values would define a
specific path through the program tree. The number of distinct paths is 2 raised
to the number of instances of Sw, and equals the number of distinct input image
variants that the program can create. If that number turns out to be too small,
other techniques for creating variation will be required.
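The switch mechanism can be illustrated with a toy program. This is a sketch under our own naming conventions: each Sw instance reads one component of a bit vector that is sampled once per image and then held fixed, so a program with k switches yields at most 2^k variants.

```python
import itertools
import random

# Minimal sketch (names are ours, not the paper's): each Sw node reads one
# component of a bit vector sampled once per image and then fixed.

def sw(bit, left, right):
    """Switch function: pass the left input if bit is 0, else the right."""
    return left if bit == 0 else right

def transform(x, bits):
    """A toy program containing two Sw instances, indexed by position."""
    a = sw(bits[0], x + 1, x * 2)      # first Sw node
    b = sw(bits[1], a - 3, -a)         # second Sw node
    return b

# With k = 2 instances of Sw there are 2**k = 4 distinct paths, i.e. at most
# four distinct variants of the same input.
variants = {transform(5, bits) for bits in itertools.product([0, 1], repeat=2)}
print(len(variants))   # 4 distinct outputs for this toy program

# At evaluation time the vector would be re-sampled for each new image:
bits = [random.randint(0, 1) for _ in range(2)]
```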
The transformation programs just described would be directly usable by S to
generate variants of A and B starting with archetypes of each. F and E would
also use such programs, but not alone. Recognition, in the present approach,
reverses generation: it takes a received image and attempts to transform it back
into an archetype. Since it does not know the identity of the received image, how
does the recognizer know which transformations to apply?
We suggest that a recognition program be a kind of Pittsburgh classifier
system [7] in which each classifier has a condition part intended to be matched
against the input, and an action part that is a transformation program of the kind
used by S (but without Sw). In the simplest case the classifier condition would
be an image-like array of reals to be matched against the input image; the best-
matching classifier's transformation program would then be applied to the image.
The resulting output would then be matched (by F) against archetypes A and B
and the better-matching character selected. E, as noted earlier, would compare
the average of the output image with a threshold. It might be desirable for
recognition to take more than one match-transform step; they could be chained
up to a certain number, or until a sufficiently sharp A/B decision (or difference
from threshold) occurred.²
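The chained match-transform loop just described can be sketched as follows. The condition arrays, transformation programs, distance measure, and stopping rule here are all illustrative stand-ins under our own assumptions, not the paper's specification.

```python
# Hedged sketch of chained match-transform recognition steps: conditions,
# transformation programs, and archetypes are illustrative stand-ins.

def distance(a, b):
    """Squared Euclidean distance between two image-like arrays of reals."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def recognize(image, classifiers, archetypes, steps=3, threshold=0.5):
    """Chain match-transform steps until one archetype matches sharply."""
    for _ in range(steps):
        # the best-matching condition selects the transformation to apply
        cond, program = min(classifiers, key=lambda c: distance(c[0], image))
        image = program(image)
        # F: compare the transformed image against both archetypes
        scores = {name: distance(arch, image) for name, arch in archetypes.items()}
        best, second = sorted(scores.values())
        if second - best > threshold:          # sufficiently sharp A/B decision
            return min(scores, key=scores.get)
    return min(scores, key=scores.get)

archetypes = {"A": [1.0, 0.0, 1.0], "B": [0.0, 1.0, 0.0]}
classifiers = [
    ([0.5, 0.0, 0.5], lambda im: [2 * v for v in im]),   # contrast doubling
    ([0.0, 0.5, 0.0], lambda im: [2 * v for v in im]),
]
print(recognize([0.5, 0.0, 0.5], classifiers, archetypes))   # -> A
```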
References
1. Dzeroski, S., de Raedt, L., Driessens, K.: Relational reinforcement learning. Machine
Learning 43, 7–52 (2001)
2. Hillis, W.D.: Co-evolving parasites improve simulated evolution as an optimiza-
tion procedure. Physica D 42, 228–234 (1990)
3. Holland, J.H.: Escaping Brittleness: The Possibilities of General-Purpose Learning
Algorithms Applied to Parallel Rule-Based Systems. In: Mitchell, Michalski, Car-
bonell (eds.) Machine Learning, an Artificial Intelligence Approach, vol. II, ch. 20,
pp. 593–623. Morgan Kaufmann, San Francisco (1986)
4. Mellor, D.: A first order logic classifier system. In: Beyer, H.-G., O'Reilly, U.-M.,
Arnold, D.V., Banzhaf, W., Blum, C., Bonabeau, E.W., Cantu-Paz, E., Dasgupta,
D., Deb, K., Foster, J.A., de Jong, E.D., Lipson, H., Llora, X., Mancoridis, S.,
Pelikan, M., Raidl, G.R., Soule, T., Tyrrell, A.M., Watson, J.-P., Zitzler, E. (eds.)
GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary
Computation, Washington DC, USA, June 25-29, vol. 2, pp. 1819–1826. ACM Press,
New York (2005)
5. Shu, L., Schaeffer, J.: VCS: Variable Classifier System. In: Schaffer, J.D. (ed.)
Proceedings of the 3rd International Conference on Genetic Algorithms (ICGA
1989), George Mason University, pp. 334–339. Morgan Kaufmann, San Francisco
(June 1989), http://www.cs.ualberta.ca/~jonathan/Papers/Papers/vcs.ps
6. Sims, K.: Artificial evolution for computer graphics. Computer Graphics 25(4),
319–328 (1991), http://doi.acm.org/10.1145/122718.122752. Also
http://www.karlsims.com/papers/siggraph91.html
7. Smith, S.F.: A Learning System Based on Genetic Adaptive Algorithms. PhD thesis,
University of Pittsburgh (1980)
8. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2),
149–175 (1995)
How Fitness Estimates Interact with
Reproduction Rates: Towards Variable Offspring
Set Sizes in XCSF
1 Introduction
Learning classifier systems were introduced over thirty years ago [1] as cognitive
systems. Over all these years, it has been clear that there is a strong interaction
between parameter estimation (be it by traditional bucket brigade techniques [2],
the Widrow-Hoff rule [3,4], or by recursive least squares and related linear
approximation techniques [5,6]) and the genetic algorithm, in which the
successful identification and propagation of better classifiers depends on the
accuracy of these estimates. Various control parameters have been used to balance
genetic reproduction with the reliability of the parameter estimation, but to the
best of our knowledge, there is no study that addresses the estimation problem
explicitly.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 47–56, 2010.
© Springer-Verlag Berlin Heidelberg 2010
48 P.O. Stalph and M.V. Butz
In the XCS classifier system [4], reproduction takes place by means of a steady-
state, niched GA. Reproductions are activated in current action sets (or match
sets in function approximation problems, as well as in the original XCS paper).
Upon reproduction, two offspring classifiers are generated, which are mutated
and recombined with certain probabilities. Reproduction is balanced by the GA
threshold θGA. It specifies that GA reproduction is activated only if the average
time of the last GA activation in the set lies longer in the past than θGA. It has
been shown that the threshold can slow down learning, but it also prevents the
neglect of rarely sampled problem niches in the case of unbalanced data sets [7].
Nonetheless, the reproduction of two classifiers seems to be rather arbitrary,
except for the fact that two offspring classifiers are needed for simple recombi-
nation mechanisms. Unless the Learning Classifier System has a hard time
learning the problem, the reproduction of more than two classifiers could speed
up learning. Thus, this study investigates the effect of modifying the number of
offspring classifiers generated upon GA invocation. We further focus our study
on the real-valued domain and thus on the XCSF system [8,9]. Besides, we use
the rotating hyperellipsoidal representation for the evolving classifier condition
structures [10].
This paper is structured as follows. Since we assume general knowledge of
XCS¹, we immediately start investigating the performance of XCSF on various
test problems and with various offspring set sizes. Next, we discuss the results
and provide some theoretical considerations. Finally, we propose a road-map for
further studying the observed effects and adapting the offspring set sizes according
to the perceived problem difficulty and learning progress, as well as to the
estimated reliability of available classifier estimates.
To study the effects of increased offspring set sizes, we chose four challenging
functions defined in [0, 1]², each with rather distinct regularities:
Function f1 has been used in various studies [10] and has a diagonal regularity. It
requires the evolution of stretched hyperellipsoids that are rotated by 45°. Func-
tion f2 is a radial sine function that requires a somewhat circular distribution of
¹ For details about XCS refer to [4,11].
Fig. 1. Final function approximations, including contour lines, are shown on the left-
hand side. The corresponding population distributions after compaction are shown on
the right-hand side. For visualization purposes, the conditions are drawn 80% smaller
than their actual size.
Fig. 2. Different selection strengths with fixed (left-hand side) or match-set-size-relative
(right-hand side) offspring set sizes can speed up learning significantly but potentially
increase the final error level reached. The vertical axis is log-scaled. Error bars represent
one standard deviation and the thin dashed line shows the target error ε0 = 0.01.
classifiers. Function f3 is a crossed ridge function, for which it has been shown
that XCSF performs competitively in comparison with deterministic machine
learning techniques [10]. Finally, function f4 twists two sine functions so that it
becomes very hard for the evolutionary algorithm to receive enough signal from
the parameter estimates in order to structure the problem space more effectively
for an accurate function approximation.
Figure 1 shows the approximation surfaces and spatial partitions generated
by XCSF with a population size of N = 6400 and with compaction [10] acti-
vated after 90k learning iterations.² The graphs on the left-hand side show the
actual function predictions and qualitatively confirm that XCSF is able to learn
accurate approximations for all four functions. On the right-hand side, the cor-
responding condition structures of the final populations are shown. In XCS and
² Other parameters were set to the following values: β = .1, τ = .5, α = 1, ε0 = .01,
ν = 5, θGA = 50, χ = 1.0, μ = .05, r0 = 1, θdel = 20, δ = 0.1, θsub = 20. All
experiments in this paper are averaged over 20 runs.
Fig. 3. While in the crossed ridge function larger offspring set sizes mainly speed up
learning, in the challenging sine-in-sine function larger offspring set sizes can strongly
affect the final error level reached
XCSF, two classifiers are selected for reproduction, crossover, and mutation. We
now investigate the influence of modified reproduction sizes.
Performance of the standard setting, where two classifiers are selected for re-
production (with replacement), is compared with four other reproduction size
choices. In the first experiment, the offspring set size was set to four and eight
classifiers, respectively. Thus, four (eight) classifiers are reproduced upon GA in-
vocation and crossover is applied twice (four times) before the mutation operator
is applied. In a second, more aggressive setting, the offspring set size is set rela-
tive to the current match set size, namely to 10% and 50% of the match set size.
Especially the last setting was expected to reveal that excessive reproduction
can deteriorate learning.
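The reproduction variants above can be sketched schematically. This is our own simplified stand-in (real XCSF uses fitness-proportionate or tournament selection within the match set, plus subsumption and deletion); it only illustrates how a fixed or match-set-relative offspring count changes one steady-state GA step.

```python
import random

# Simplified sketch of a steady-state GA reproduction step with a variable
# offspring set size; selection and operators are illustrative stand-ins.

def reproduce(match_set, n_offspring, mutate, crossover):
    """Select n_offspring parents with replacement, recombine pairwise, mutate."""
    parents = [max(random.sample(match_set, 2), key=lambda c: c["fitness"])
               for _ in range(n_offspring)]          # size-2 tournaments
    offspring = [dict(p) for p in parents]           # copy before variation
    for i in range(0, n_offspring - 1, 2):           # crossover applied pairwise
        offspring[i], offspring[i + 1] = crossover(offspring[i], offspring[i + 1])
    return [mutate(c) for c in offspring]

# fixed sizes (2, 4, 8) vs. match-set-relative sizes (10%, 50%)
match_set = [{"center": random.random(), "fitness": random.random()}
             for _ in range(40)]
for size in (2, 4, 8, int(0.1 * len(match_set)), int(0.5 * len(match_set))):
    kids = reproduce(match_set, size,
                     mutate=lambda c: c,
                     crossover=lambda a, b: (a, b))
    print(size, len(kids))
```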
Learning progress is shown in Figure 2 for functions f1 and f2. It can be seen
that in both cases standard XCSF with two offspring classifiers learns signif-
icantly slower than settings with a larger number of offspring classifiers. The
number of distinct classifiers in the population (so-called macro classifiers), on
the other hand, shows that initially larger offspring set sizes increase the popula-
tion sizes much faster. Thus, an initially higher diversity due to larger offspring
sets yields faster initial learning progress. However, towards the end of the run,
standard XCSF actually reaches a slightly lower error than the settings with
larger offspring sets. This effect becomes more pronounced the larger the offspring
set. In the radial sine function, this effect is not as strong as in the sine function.
Similar observations can also be made in the crossed ridge function, which
is shown in Figure 3(a). In the sine-in-sine function f4 (Figure 3(b)), larger
offspring set sizes degrade performance most severely. While a selection of four
offspring classifiers, as well as a selection of a size of 10% of the match set size, still
shows slight error decreases, larger offspring set sizes completely stall learning
despite large and diverse populations. It appears that the larger offspring set sizes
prevent the population from identifying relevant structures and thus prevent the
development of accurate function approximations.
3 Theoretical Considerations
What is the effect of increasing the number of offspring generated upon GA
invocation? The results indicate that initially, faster learning can be induced.
However, later on, learning potentially stalls.
Previously, learning in XCS was characterized as an interactive learning pro-
cess in which several evolutionary pressures [12] foster learning progress: (1) A
fitness pressure is induced since, usually, on average more accurate classifiers are
selected for reproduction than for deletion. (2) A set pressure, which causes an
intrinsic generalization pressure, is induced since, also on average, more general
classifiers are selected for reproduction than for deletion. (3) Mutation pressure
causes diversification of classifier conditions. (4) Subsumption pressure causes
convergence to maximally accurate, general classifiers, if found. Since fitness and
set pressure work on the same principle, increasing the number of reproductions
generally increases both pressures equally. Thus, their balance is maintained.
However, the fitness pressure only applies if there is a strong-enough fitness sig-
nal, which depends on the number of evaluations a classifier underwent before
the reproduction process. The mutation pressure also depends on the number of
reproductions; thus, a faster diversification can be expected given larger offspring
set sizes.
Another analysis estimated the reproductive opportunities a superior classifier
might have before being deleted [13]. Moreover, a niche support bound was
derived [14], which characterizes the probability that a classifier is sustained
in the population, given that it represents an important problem niche for the
final solution. Both of these bounds assume that the accuracy of the classifier
is accurately specified. However, the larger the offspring set size is, the faster
the classifier turnaround; thus, the shorter the average time a classifier
stays in the population, and thus the fewer the number of iterations available
to a classifier until it is deleted. The effect is that the GA in XCS has to work
with classifier parameter estimates that are less reliable since they underwent
fewer updates on average. Thus, larger offspring set sizes induce larger noise in
the selection process.
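The turnaround argument can be made concrete with a back-of-the-envelope calculation. This is our own simplification with made-up numbers, not a result from the paper: at maximum population size, each GA invocation inserts and deletes the same number of classifiers, so the expected lifetime of a classifier shrinks in proportion to the offspring set size, and so does the number of updates its estimates receive.

```python
# Back-of-the-envelope sketch (our simplification, not a result from the
# paper): at maximum population size N, each GA invocation inserts lam
# offspring and deletes lam classifiers, so a classifier survives roughly
# N / lam invocations on average.

def expected_updates(pop_size, offspring_per_ga, matches_per_iteration):
    """Rough average number of evaluations a classifier receives before deletion."""
    lifetime_in_ga_invocations = pop_size / offspring_per_ga
    return lifetime_in_ga_invocations * matches_per_iteration

N = 6400
for lam in (2, 4, 8):
    print(lam, expected_updates(N, lam, matches_per_iteration=0.01))
# larger offspring sets -> faster turnaround -> fewer updates per classifier,
# hence noisier fitness estimates at selection time
```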
As long as the fitness pressure leads in the right direction because the parameter
estimates have enough signal, learning proceeds faster. This latter reason
Fig. 4. When decreasing the number of generated offspring over the learning trial,
learning speed is kept high while the error convergence reaches the level that is reached
by always generating two offspring classifiers (a,b,c). However, in the case of the chal-
lenging sine-in-sine function, further learning would be necessary to reach a similarly
low error level (d).
5 Conclusions
This paper has shown that a fixed offspring set size does not necessarily yield
the best learning speed that XCSF can achieve. Larger offspring set sizes can
strongly increase the initial learning speed but do not necessarily reach maximum
accuracy. Adaptive offspring set sizes, if scheduled appropriately, can get the best
of both worlds in yielding high initial learning speed and a low final error. The
results, however, also suggest that a simple adaptation scheme is not generally
applicable. Furthermore, the theoretical considerations suggest that a signal-
to-noise estimate could be used to control the GA offspring schedule and the
offspring set sizes. Given a strong fitness signal, a larger set of offspring could
be generated.
Another consideration that needs to be taken into account in such an off-
spring generation scheme, however, is the fact that problem domains may be
Acknowledgments
The authors acknowledge funding from the Emmy Noether program of the Ger-
man Research Foundation (grant BU1335/3-1) and would like to thank their
colleagues at the Department of Psychology and the COBOSLAB team.
References
1. Holland, J.H.: Adaptation. In: Progress in Theoretical Biology, vol. 4, pp. 263–293.
Academic Press, New York (1976)
2. Holland, J.H.: Properties of the bucket brigade algorithm. In: Proceedings of the
1st International Conference on Genetic Algorithms, Hillsdale, NJ, USA, pp. 1–7.
L. Erlbaum Associates Inc., Mahwah (1985)
3. Widrow, B., Hoff, M.E.: Adaptive switching circuits. Western Electronic Show and
Convention, Convention Record, Part 4, 96–104 (1960)
4. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
149–175 (1995)
5. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update al-
gorithms for XCSF: RLS, Kalman filter, and gain adaptation. In: GECCO 2006:
Proceedings of the 8th Annual Conference on Genetic and Evolutionary Compu-
tation, pp. 1505–1512. ACM, New York (2006)
6. Drugowitsch, J., Barry, A.: A formal framework and extensions for function ap-
proximation in learning classifier systems. Machine Learning 70, 45–88 (2008)
7. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbal-
anced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on
Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
8. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolz-
mann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219.
Springer, Heidelberg (2000)
9. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1,
211–234 (2002)
10. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyper-
ellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions
on Evolutionary Computation 12, 355–376 (2008)
11. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L.,
Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp.
267–274. Springer, Heidelberg (2001)
12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of general-
ization and learning in XCS. IEEE Transactions on Evolutionary Computation 8,
28–46 (2004)
13. Butz, M.V., Goldberg, D.E., Tharakunnel, K.: Analysis and improvement of fit-
ness exploitation in XCS: Bounding models, tournament selection, and bilateral
accuracy. Evolutionary Computation 11, 239–277 (2003)
14. Butz, M.V., Goldberg, D.E., Lanzi, P.L., Sastry, K.: Problem solution sustenance
in XCS: Markov chain analysis of niche support distributions and the impact on
computational complexity. Genetic Programming and Evolvable Machines 8, 5–37
(2007)
15. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Bounding learning time in XCS. In: Deb,
K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 739–750. Springer, Heidelberg
(2004)
Current XCSF Capabilities and Challenges
1 Introduction
The increasing interest in Learning Classifier Systems (LCS) [1] has propelled
research, and LCS have proven their capabilities in various applications, includ-
ing multistep problems [2,3], datamining tasks [4,5], as well as robot applica-
tions [6,7]. The focus of this work is on the Learning Classifier System XCSF [8],
which is a modified version of the original XCS [2]. XCSF is able to approxi-
mate multi-dimensional, real-valued function surfaces from samples by locally
weighted, usually linear, models.
While XCS theory has been investigated thoroughly in the binary domain [5],
theory on real-valued input and output spaces remains sparse. There are two
important questions: When does the system work at all, and how does it scale
with increasing complexity? We will address these questions by first carrying
over parts of the XCS theory and, secondly, showing the results of a scalability
analysis, which suggests that XCSF scales optimally in the required population
size.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 57–69, 2010.
© Springer-Verlag Berlin Heidelberg 2010
2 Theory
We assume sufficient knowledge about the XCSF Learning Classifier System
and directly start with a theoretical analysis. We carry over preconditions for
successful learning known from binary XCS and propose a scalability model,
which shows how the population size scales with increasing function complexity
and dimensionality.
Given that all of the above challenges are overcome and the system is able to
learn an accurate approximation of the problem at hand, it is important to know
how changes in the function complexity or dimensionality affect XCSF's learning
performance. In particular, we model the relation between
In order to simplify the model, we assume a uniform function structure and uni-
form sampling¹. This also implies a uniform classifier structure, that is, uniform
shape and size. Without loss of generality, let the n-dimensional input space
be confined to [0, 1]ⁿ. Furthermore, we assume that XCSF evolves an optimal
solution [19]. This includes four properties, namely
1. completeness, that is, each possible input is covered in that at least one
classifier matches.
2. correctness, that is, the population predicts the function surface accurately
in that the prediction error is below the target error ε0.
3. minimality, that is, the population contains the minimum number of classi-
fiers needed to represent the function completely and correctly.
4. non-overlappingness, that is, no input is matched by more than one classifier.
In sum, we assume a uniform patchwork of equally sized, non-overlapping, ac-
curate, and maximally general classifiers. These assumptions reflect reality on
uniform functions except for non-overlappingness, which is almost impossible for
real-valued input spaces.
We consider a uniformly sampled function of uniform structure, where n is the
dimensionality of the input space and γ reflects the function complexity. Since
we fix neither the condition type nor the predictor used in XCSF, we have to
define the complexity via the prediction error. We define γ such that a linear
increase in this value results in the same increase in the prediction error. Thus,
saying that the function is twice as complex implies that the prediction error
is twice as high for the same classifiers. Since the classifier volume V influences
the prediction error in a polynomial fashion on uniform functions, we can
summarize the assumptions in the following equation:

ε = γ · V^(1/n).    (2)

We can now derive the optimal classifier volume and the optimal population size.
Using the target error ε0, we get an optimal volume of

V_opt = (ε0/γ)^n.    (3)

The volume of the input space to be covered is one, and it follows that the optimal
population size is

N_opt = 1/V_opt = (γ/ε0)^n.    (4)
To sum up, the dimensionality n has an exponential influence on the population
size, while the function complexity γ and the target error ε0 have a polynomial
influence. Increasing the function complexity will require a polynomial increase
of the population size in the order n.
¹ Non-uniform sampling is discussed elsewhere [18].
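Equations 3 and 4 can be illustrated numerically. In the following sketch the γ values are made up for illustration; the point is only the shape of the dependency: exponential in n, polynomial of order n in γ and 1/ε0.

```python
# Numeric illustration of Equations 3 and 4 (the gamma values are made up):
# eps = gamma * V**(1/n)  =>  V_opt = (eps0/gamma)**n,  N_opt = (gamma/eps0)**n

def optimal_population_size(gamma, eps0, n):
    v_opt = (eps0 / gamma) ** n      # optimal classifier volume (Eq. 3)
    return 1.0 / v_opt               # unit input-space volume gives Eq. 4

eps0 = 0.01
for n in (1, 2, 3):
    print(n, optimal_population_size(gamma=0.5, eps0=eps0, n=n))
# roughly 50, 2500, 125000 (up to floating-point rounding): exponential in n.
# Doubling gamma at n = 3 multiplies N_opt by 2**3 = 8, a polynomial of order n.
```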
Fig. 1. Comparative plots of the final population size after condensation (data points)
and the developed scalability theory (solid lines) for dimensions n = 1 to n = 6.
The number of macro classifiers is plotted against the function complexity, which is
modeled via the increasing gradient. The orders of the polynomials are equal to the
dimension n, which requires an exponential increase in population size. An increasing
function complexity results in a polynomial increase. Apart from an approximately
constant overhead due to overlapping classifiers, the scalability model fits reality.
Note that no assumptions are made about the condition type or the predic-
tor used. The intentionally simple Equations 3 and 4 hide a complex geometric
problem in the variable γ. For example, assume a three-dimensional non-linear
function that is approximated using linear predictions and rotating ellipsoidal
conditions. Calculating the prediction error is non-trivial for such a setup. When
the above bounds are required exactly, this geometric problem has to be solved
anew for any condition-prediction-function combination.
In order to validate the scalability model, we conducted experiments with
interval conditions and constant predictions on a linear function². XCSF with
constant predictions equals XCSR [20]; however, only one dummy action is avail-
able. As done before in [19] with respect to XCS, we analyze a restricted class of
problems for XCSF. On the one hand, the constant prediction makes this setup
a worst-case scenario in terms of required population size. On the other hand,
the simple setup allows for solving the geometric problem analytically; thus,
we can compare the theoretical population size bound from Equation 4 with the
actual population size that is required to approximate the respective function. A
so-called bisection algorithm runs XCSF with different population size settings
in a binary-search fashion. On termination, the bisection procedure returns the
approximately minimal population size N that is required for successful learning.
² Other settings: 500,000 iterations, ε0 = 0.01, β = 0.1, α = 1, δ = 0.1, ν = 5, χ = 1,
μ = 0.05, r0 = 1, θGA = 50, θdel = 20, θsub = 20. GA subsumption and uniform
crossover were applied.
For details of the bisection algorithm and how the geometric problem is solved,
please refer to [9].
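The bisection idea can be sketched as follows. This is our own reconstruction of the search skeleton (see [9] for the actual procedure): treat one XCSF run as a black-box predicate "learns successfully with population size N" and binary-search for the smallest successful N.

```python
# Sketch of the bisection idea (our reconstruction; see [9] for the real
# procedure): find the minimal N for which a run of XCSF reaches eps0.

def bisect_population_size(learns, lo, hi, tol=2):
    """learns(N) -> True if XCSF reaches the target error with population N."""
    assert not learns(lo) and learns(hi)
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if learns(mid):
            hi = mid                 # success: try a smaller population
        else:
            lo = mid                 # failure: need a larger population
    return hi

# stand-in for an actual XCSF run: succeeds iff N >= 900
print(bisect_population_size(lambda n: n >= 900, lo=100, hi=6400))   # -> 900
```

Because a single XCSF run is stochastic, the real predicate would average over several runs per candidate N, which the sketch omits.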
Figure 1 shows the results of the bisection experiments on the one- to six-
dimensional linear function f(x1, . . . , xn) = Σᵢ₌₁ⁿ xᵢ, where solid lines repre-
sent the developed theory (Equation 4) and the data shown represents the final
population size after condensation [21]. For each dimension n, the function dif-
ficulty was linearly increased by increasing the gradient of the linear function.
The polynomials are shown as straight lines on a log-log-scale plot, where the
gradient of a line equals the order of the corresponding polynomial.
We observe an approximately constant overhead from the scalability theory to
the actual population size. This overhead is expected, since the scalability model
assumes non-overlappingness. Most importantly, the prediction of the model lies
parallel to the actual data, which indicates that the dimension n fits the ex-
ponent of the theoretical model. Thus, the experiment confirms the scalability
model: Problem dimensionality has an exponential influence on the required pop-
ulation size (given full problem space sampling). Furthermore, a linear increase
in the problem difficulty (or a linear decrease of the target error ε0) induces a
polynomial increase in the population size.
Prediction Type. Typically, linear predictors are used for a good balance
of expressiveness and interpretability. However, others are possible, such as
constant predictors [8] or polynomial ones [25].
Learning Time. The number of iterations should be set high enough to assure
that the prediction error converges to a value below the desired ε0.
GA Frequency Threshold θGA. This threshold specifies that GA reproduc-
tion is activated only if the average time of the last GA activation in the
set lies longer in the past than θGA. Increasing this value delays learning,
but may also prevent forgetting and overgeneralization in unbalanced data
sets [18].
Mutation Rate μ. The probability of mutation is closely related to the avail-
able mutation options of the condition type, and thus it is also connected to
the dimensionality of the problem. It should be set according to the problem
at hand, e.g. μ = 1/m, where m is the number of available mutation options.
Initial Classifier Size r0. On the one hand, this value should be set high
enough to meet the covering challenge, that is, it should be set such that
simple covering with less than N classifiers is sufficient to cover the whole
input space. On the other hand, the initial size should be small enough to
yield a fitness signal upon crossover or mutation in order to prevent oversized
classifiers from taking over the population.
The other parameters can be set to their default values, thus ensuring a good
balance of the evolutionary pressures.
The strongest interdependencies can be found between the population size N,
the target error ε0, the condition structure, and the prediction type, as indicated
by the scalability model of Section 2.2. Changing either of these will affect XCSF's
learning performance significantly. For example, with a higher population size a
lower target error can be reached. An appropriate condition structure may turn
a polynomial problem into a linear one, thus requiring fewer classifiers. Advanced
predictors are able to approximate more complex functions and thus enable
coarse structuring of the input space, again reducing the required population
size. When tuning either of these settings, the related parameters should be
kept in mind.
Before running XCSF with some arbitrary settings on a particular problem, a few
things have to be considered. This concerns mainly the condition and prediction
structures, that is, XCSF's solution representation. The next two paragraphs
highlight some issues about different representations.
Selecting an Appropriate Predictor. The first step is to select the type of predic-
tion to be used for the function approximation. Linear predictions have a reason-
able computational complexity and good expressiveness, while the final solution
is well interpretable. In some cases, it might be required to invert the approx-
imated function after learning, which is easily possible with a linear predictor.
However, if prior knowledge suggests a special type of function (e.g. polynomials
Even the best condition and prediction structures do not necessarily guarantee
successful learning. This section discusses some issues where fine-tuning of some
parameters may help to reach the desired accuracy. Furthermore, we point out
when XCSF reaches its limits, so that simple parameter tuning cannot overcome
learning failures.
Ideally, given an unknown function, XCSF's prediction error quickly drops
below ε0 (see Figure 2(a) for a typical performance graph). When XCSF is not
able to accurately learn the function, there are four possible main reasons:
Given case 1, the learning time is too short to allow for an appropriate structuring
of the input space. Increasing the number of iterations will solve this issue.
In contrast, case 2 indicates that the function is too difficult to approximate
with the given population size, target error, predictor, and condition structure.
Figure 2(b) illustrates a problem in which the system does not reach the target
error. Increasing the learning time allows for a settling of the prediction error, but
the target error is only reached when the maximum population size is increased.
While in the previous examples XCSF just does not reach the target er-
ror, in other scenarios the system completely fails to learn anything due to
bad parameter choices. There are two major factors that may prevent learning
completely: covering-deletion cycles and flat fitness landscapes. Although case 3
Current XCSF Capabilities and Challenges 65
[Figure: prediction error, macro classifiers, and match set macro classifiers vs. number of learning steps (1000s); (a) sine 20D, too small r0, (b) sine 20D, too large r0]
seems strange, there is a simple explanation. If the population size and initial
classifier size are set such that the input space cannot be covered by the covering
mechanism, the system continuously covers and deletes classifiers without any
knowledge gain (the so-called covering-deletion cycle [10]). Typically, the average
match set size is one, the population size quickly reaches the maximum, and
the average prediction error is almost zero because the error during covering is
zero. As an example, we equip XCSF with a small initial classifier size r0 and run
the system on a 20-dimensional sine function as shown in Figure 3(a). Especially
high-dimensional input spaces are prone to this problematic cycle, because (1)
the initial classifier volume has to be high enough to allow for a complete coverage,
but (2) the initial volume may not exceed the size at which the GA no longer
receives a sufficient fitness signal.
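This coverage condition can be illustrated with a rough volume estimate. The sketch below assumes hypercube conditions of side 2·r0 in the unit cube; the function name and the simple volume heuristic are our own, not the exact bound derived in [10]:

```python
def covering_is_feasible(n_dims, r0, max_pop_size):
    """Heuristic check for the covering-deletion cycle risk.

    A classifier with radius r0 covers roughly (2*r0)**n_dims of the
    unit input cube; if max_pop_size such classifiers cannot cover the
    space, covering and deletion will cycle without knowledge gain.
    """
    classifier_volume = (2.0 * r0) ** n_dims
    return max_pop_size * classifier_volume >= 1.0
```

In 20 dimensions even a moderate r0 yields a tiny volume, e.g. (2 · 0.1)^20 ≈ 1e-14, so thousands of classifiers cover almost nothing of the input space.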
The latter may be the case when a single mutation of the initial covering
shape cannot produce a sufficiently small classifier that captures the (possibly
fine-grained) structure of the underlying function. Thus, the GA is missing a
fitness gradient and, due to higher reproductive opportunities, over-general
classifiers take over the population as shown in Figure 3(b). Typically, the prediction
error does not drop at all. Here XCSF reaches its limits and simple parameter
tuning may not help to overcome the problem with a reasonable population
size. Possibly, a refined initial classifier size hits a reasonable fitness signal and
prevents over-general classifiers. Otherwise, it might be necessary to reconsider the
condition structure or the corresponding evolutionary operators.
Acknowledgments
The authors acknowledge funding from the Emmy Noether program of the German
Research Foundation (grant BU1335/3-1) and would like to thank their colleagues
at the Department of Psychology and the COBOSLAB team.
References
1. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cambridge (1992)
2. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
3. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Gradient descent methods in learning classifier systems: Improving XCS performance in multistep problems. Technical report, Illinois Genetic Algorithms Laboratory (2003)
4. Bernadó-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems: Models, analysis, and applications to classification tasks. Evolutionary Computation 11, 209–238 (2003)
5. Butz, M.V.: Rule-Based Evolutionary Online Learning Systems: A Principled Approach to LCS Analysis and Design. Springer, Heidelberg (2006)
6. Butz, M.V., Herbort, O.: Context-dependent predictions and cognitive arm control with XCSF. In: GECCO 2008: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pp. 1357–1364. ACM, New York (2008)
7. Stalph, P.O., Butz, M.V., Pedersen, G.K.M.: Controlling a four degree of freedom arm in 3D using the XCSF learning classifier system. In: Mertsching, B., Hund, M., Aziz, Z. (eds.) KI 2009. LNCS, vol. 5803, pp. 193–200. Springer, Heidelberg (2009)
8. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1, 211–234 (2002)
9. Stalph, P.O., Llorà, X., Goldberg, D.E., Butz, M.V.: Resource management and scalability of the XCSF learning classifier system. Theoretical Computer Science (in press), http://dx.doi.org/10.1016/j.tcs.2010.07.007
10. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: How XCS evolves accurate classifiers. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pp. 927–934 (2001)
11. Wright, A.H.: Genetic algorithms for real parameter optimization. In: Foundations of Genetic Algorithms, pp. 205–218. Morgan Kaufmann, San Francisco (1991)
12. Goldberg, D.E.: Real-coded genetic algorithms, virtual alphabets, and blocking. Complex Systems 5, 139–167 (1991)
13. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5, 183–205 (1991)
14. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive models for the breeder genetic algorithm I. Continuous parameter optimization. Evolutionary Computation 1, 25–49 (1993)
15. Beyer, H.G., Schwefel, H.P.: Evolution strategies: a comprehensive introduction. Natural Computing 1(1), 3–52 (2002)
16. Bosman, P.A.N., Thierens, D.: Numerical optimization with real-valued estimation-of-distribution algorithms. In: Scalable Optimization via Probabilistic Modeling. SCI, vol. 33, pp. 91–120. Springer, Heidelberg (2006)
17. Stalph, P.O., Butz, M.V.: How fitness estimates interact with reproduction rates: Towards variable offspring set sizes in XCSF. In: Bacardit, J. (ed.) IWLCS 2008/2009. LNCS (LNAI), vol. 6471, pp. 47–56. Springer, Heidelberg (2010)
18. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
19. Kovacs, T., Kerber, M.: What makes a problem hard for XCS? In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 251–258. Springer, Heidelberg (2001)
20. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
21. Wilson, S.W.: Generalization in the XCS classifier system. In: Genetic Programming 1998: Proceedings of the Third Annual Conference, pp. 665–674 (1998)
22. Stone, C., Bull, L.: For real! XCS with continuous-valued inputs. Evolutionary Computation 11(3), 299–336 (2003)
23. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary Computation 12, 355–376 (2008)
24. Wilson, S.W.: Classifier conditions using gene expression programming. In: Bacardit, J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama, K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 206–217. Springer, Heidelberg (2008)
25. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond linear approximation. In: GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pp. 1827–1834 (2005)
26. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In: ICML 2000: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1079–1086 (2000)
27. Vijayakumar, S., D'Souza, A., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17(12), 2602–2634 (2005)
28. Stalph, P.O., Rubinsztajn, J., Sigaud, O., Butz, M.V.: A comparative study: Function approximation with LWPR and XCSF. In: GECCO 2010: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (in press, 2010)
Recursive Least Squares and Quadratic
Prediction in Continuous Multistep Problems
Abstract. XCS with computed prediction, namely XCSF, has been recently
extended in several ways. In particular, a novel prediction update
algorithm based on recursive least squares and the extension to polynomial
prediction led to significant improvements of XCSF. However,
these extensions have been studied so far only on single-step problems
and it is currently not clear whether these findings also extend to
multistep problems. In this paper we investigate this issue by analyzing
the performance of XCSF with recursive least squares and with quadratic
prediction on continuous multistep problems. Our results show that both
these extensions improve the convergence speed of XCSF toward an optimal
performance. As shown by the analysis reported in this paper,
these improvements are due to the capability of recursive least squares
and of polynomial prediction to provide a more accurate approximation
of the problem value function after the first few learning problems.
1 Introduction
Learning Classifier Systems are a genetics-based machine learning technique for
solving problems through the interaction with an unknown environment. The
XCS classifier system [16] is probably the most successful learning classifier system
to date. It couples effective temporal difference learning, implemented as
a modification of the well-known Q-learning [14], to a niched genetic algorithm
guided by an accuracy-based fitness to evolve accurate, maximally general
solutions. In [18] Wilson extended XCS with the idea of computed prediction to
improve the estimation of the classifiers' prediction. In XCS with computed prediction,
XCSF in brief, the classifier prediction is not memorized as a parameter
but computed as a linear combination of the current input and a weight vector
associated with each classifier. Recently, in [11] the classifier weight update has
been improved with a recursive least squares approach and the idea of computed
prediction has been further extended to polynomial prediction. Both the recursive
least squares update and the polynomial prediction have been effectively
applied to solve function approximation problems as well as to learn Boolean
functions. However, it is currently not clear whether these findings also extend
to continuous multistep problems, where Wilson's XCSF has
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 70–86, 2010.
© Springer-Verlag Berlin Heidelberg 2010
already been successfully applied [9]. In this paper we investigate this important
issue. First, we extend the recursive least squares update algorithm to multistep
problems with covariance resetting, a well-known approach to deal with a
non-stationary target. Then, to test our approach, we compare the usual Widrow-Hoff
update rule to the recursive least squares one (extended with covariance resetting)
on a class of continuous multistep problems, the 2D Gridworld problems [1].
Our results show that XCSF with recursive least squares outperforms XCSF with
the Widrow-Hoff rule in terms of convergence speed, although both finally reach an
optimal performance. Thus, the results confirm the findings of previous works
on XCSF with recursive least squares applied to single-step problems. In addition,
we performed a similar experimental analysis to investigate the effect of
polynomial prediction on the same set of problems. Also in this case, the results
suggest that quadratic prediction results in a faster convergence of XCSF
toward the optimal performance. Finally, to explain why recursive least squares
and polynomial prediction increase the convergence speed of XCSF, we show
that they improve the accuracy of the payoff landscape learned in the first few
learning problems.
where cl is a classifier, [M]|a represents the subset of classifiers in [M] with action
a, cl.F is the fitness of cl, and cl.p(st) is the prediction of cl computed in the state
st. In particular, when piecewise-linear approximation is considered, cl.p(st) is
computed as:

cl.p(st) = cl.w0 · x0 + Σ_{i>0} cl.wi · st(i)    (2)

where cl.wi is the weight wi of cl and x0 is a constant input. The values of
P(st, a) form the prediction array. Next, XCSF selects an action to perform.
The classifiers in [M] that advocate the selected action are put in the current
action set [A]; the selected action is sent to the environment and a reward P is
returned to the system.
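As an illustration, the computed prediction of Equation (2) and the fitness-weighted prediction array can be sketched as follows; the data layout (tuples of action, weights, fitness) and function names are our own simplifications, not the actual XCSF implementation:

```python
import numpy as np

X0 = 1.0  # constant input x0

def classifier_prediction(weights, s):
    """Linear computed prediction: cl.p(s) = w0*x0 + sum_{i>0} w_i*s(i)."""
    return weights[0] * X0 + np.dot(weights[1:], s)

def prediction_array(match_set, s):
    """Fitness-weighted average of classifier predictions per action.

    match_set: list of (action, weights, fitness) tuples, a simplified
    stand-in for the classifiers in the match set [M].
    """
    sums, norms = {}, {}
    for action, w, fitness in match_set:
        p = classifier_prediction(w, s)
        sums[action] = sums.get(action, 0.0) + fitness * p
        norms[action] = norms.get(action, 0.0) + fitness
    return {a: sums[a] / norms[a] for a in sums}
```
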
Reinforcement Component. XCSF uses the incoming reward P to update the
parameters of classifiers in the action set [A]. The weight vector w of the classifiers
in [A] is updated using a modified delta rule [15]. For each classifier cl ∈ [A],
each weight cl.wi is adjusted by a quantity Δwi computed as:

Δwi = (η / |st|²) (P − cl.p(st)) st(i)    (3)

where η is the correction rate and |st|² is the squared norm of the input vector st (see [18]
for details). Equation 3 is usually referred to as the normalized Widrow-Hoff
update or modified delta rule, because of the presence of the term |st|² [5].
The values Δwi are used to update the weights of classifier cl as:

cl.wi ← cl.wi + Δwi    (4)
Then the prediction error ε is updated as:

cl.ε ← cl.ε + β (|P − cl.p(st)| − cl.ε)    (5)

Finally, the classifier fitness is updated as in XCS.
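A minimal sketch of the update in Equations (3)–(5), assuming η and β denote the correction and error learning rates; the function name and parameter defaults are our own:

```python
import numpy as np

def widrow_hoff_update(w, s, x0, P, eta=0.2, beta=0.2, eps=0.0):
    """Normalized Widrow-Hoff (modified delta rule) update.

    w: weight vector [w0, w1, ..., wn]; s: input state; P: target payoff.
    Returns the updated weights and prediction error estimate.
    """
    x = np.concatenate(([x0], s))                  # augmented input
    pred = np.dot(w, x)                            # cl.p(s)
    delta = eta * (P - pred) * x / np.dot(x, x)    # Eq. (3), normalized
    w = w + delta                                  # Eq. (4)
    eps = eps + beta * (abs(P - pred) - eps)       # Eq. (5), error update
    return w, eps
```

Because the update is normalized by |x|², each step moves the prediction a fixed fraction η toward the target P, so repeated presentations of the same sample converge geometrically.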
Discovery Component. The genetic algorithm and subsumption deletion in
XCSF work as in XCSI [17]. On a regular basis depending on the parameter θga,
the genetic algorithm is applied to classifiers in [A]. It selects two classifiers with
probability proportional to their fitness, copies them, and with probability χ
performs crossover on the copies; then, with probability μ it mutates each allele.
Crossover and mutation work as in XCSI [17,18]. The resulting offspring are
inserted into the population and two classifiers are deleted to keep the population
size constant.
The matrix V(t) is usually initialized as V(0) = δrls · I, where δrls is a positive
constant and I is the n × n identity matrix. A higher δrls denotes that the initial
parametrization is uncertain; accordingly, the algorithm will initially use a higher,
thus faster, update rate (gain vector kt). A lower δrls denotes that the initial
parametrization is rather certain; accordingly, the algorithm will use a slower update.
It is worthwhile to note that the recursive least squares approach presented above
involves two basic underlying assumptions [5,4]: (i) the noise on the target payoff
P used for updating the classifier weights can be modeled as a unitary-variance
white noise and (ii) the optimal classifier weight vector does not change during
the learning process, i.e., the problem is stationary. While the first assumption is
often reasonable and usually has a small impact on the final outcome, the second
assumption is not justified in many problems and may have a big impact on the
performance. In the literature [5,4] many approaches have been introduced for
relaxing this assumption. In particular, a straightforward approach is the resetting
of the matrix V: every τrls updates, the matrix V is reset to its initial value δrls · I.
Intuitively, this prevents RLS from converging toward a fixed parameter estimate
by continually restarting the learning process. We refer the interested reader to
[5,4] for a more detailed analysis of recursive least squares and other related
approaches, like the well-known Kalman filter. The extension of XCSF with
recursive least squares is straightforward: we added to each classifier the matrix V
as an additional parameter and we replaced the usual update of the classifier
weights with the recursive least squares update described above and reported as
Algorithm 1.
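The following is a textbook RLS sketch of such a per-classifier update with covariance resetting; it is not the exact pseudocode of Algorithm 1 from the paper, and the function names are our own:

```python
import numpy as np

def rls_update(w, V, x, P):
    """One recursive least squares update of a classifier's weights.

    w: weight vector; V: covariance-like matrix; x: augmented input
    (x0 plus the state); P: target payoff.
    """
    k = V @ x / (1.0 + x @ V @ x)     # gain vector k_t
    w = w + k * (P - w @ x)           # correct weights by prediction error
    V = V - np.outer(k, x @ V)        # shrink the covariance estimate
    return w, V

def reset_covariance(n, delta_rls):
    """Covariance resetting: V <- delta_rls * I, applied every tau_rls
    updates to cope with a non-stationary target."""
    return delta_rls * np.eye(n)
```

On stationary, noiseless data this converges to the least squares solution after a handful of samples, which is exactly the faster early convergence exploited in the experiments below.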
With respect to the time required for each update, the Widrow-Hoff update rule
involves only n scalar multiplications and, thus, is O(n); instead, recursive least
squares requires a matrix multiplication, which is O(n²). Therefore, recursive
least squares is more complex than the Widrow-Hoff rule both in terms of memory
and time requirements.
cl.p(st) = w0 · x0 + w1 · st,

where x0 is a constant input and st is the current state. Thus, we can introduce
a quadratic term in the approximation evolved by XCSF:

cl.p(st) = w0 · x0 + w1 · st + w2 · st²

To learn the new set of weights we use the usual XCSF update algorithm (e.g.,
either RLS or Widrow-Hoff) applied to the input vector xt, defined as xt =
⟨x0, st, st²⟩.
When more variables are involved, so that st = ⟨st(1), ..., st(n)⟩, we define
xt = ⟨x0, st(1), ..., st(n), st(1)², ..., st(n)²⟩ and apply XCSF
to the newly defined input space. The same approach can
be generalized to allow the approximation of any polynomial of order k by
extending the input vector xt with higher-order terms. However, in this paper, for
the sake of simplicity, we will limit our analysis to quadratic prediction.
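The quadratic input extension amounts to a simple feature expansion, sketched below (the function name is ours):

```python
import numpy as np

def quadratic_features(s, x0=1.0):
    """Extend the input with squared terms: <x0, s(1)..s(n), s(1)^2..s(n)^2>.

    The same weight-update rules (Widrow-Hoff or RLS) then operate on
    this longer vector unchanged."""
    s = np.asarray(s, dtype=float)
    return np.concatenate(([x0], s, s ** 2))
```
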
4 Experimental Design
To study how recursive least squares and quadratic prediction affect the
performance of XCSF on continuous multistep problems we considered a well-known
class of problems: the 2D gridworld problems, introduced in [1]. They
are two-dimensional environments in which the current state is defined by a
pair of real-valued coordinates ⟨x, y⟩ in [0, 1]², the only goal is in position ⟨1, 1⟩,
and there are four possible actions (left, right, up, and down) coded with two
bits; each action corresponds to a step of size s in the corresponding direction;
actions that would take the system outside the domain [0, 1]² take the system to
the nearest position on the grid border. The system can start anywhere except in the
goal position, and it reaches the goal when both coordinates are equal to or
greater than one. When the system reaches the goal it receives 0; in all other
cases it receives -0.5. We called the problem described above empty gridworld,
76 D. Loiacono and P.L. Lanzi
Fig. 1. The 2D Continuous Gridworld problems: (a) the optimal value function of
Grid(0.05) when γ = 0.95; (b) the Puddles(0.05) environment; (c) the optimal value
function of Puddles(0.05) when γ = 0.95
dubbed Grid(s), where s is the agent step size. Figure 1a shows the optimal
value function associated with the empty gridworld problem, when s = 0.05 and
γ = 0.95.
A slightly more challenging problem can be obtained by adding some obstacles
to the empty gridworld environment, as proposed in [1]: each obstacle represents
an area in which there is an additional cost for moving. These areas are called
puddles [1], since they actually create a sort of puddle in the optimal value
function. Figure 1b depicts the Puddles(s) environment that is derived from
Grid(s) by adding two puddles (the gray areas). When the system is in a puddle,
it receives an additional negative reward of -2, i.e., the action has an additional
cost of -2; in the area where the two puddles overlap, the darker gray region, the
two negative rewards add up, i.e., the action has a total additional cost of -4.
We called this second problem puddle world, dubbed Puddles(s), where s is the
agent step size. Figure 1c shows the optimal value function of the puddle world,
when s = 0.05 and γ = 0.95.
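The Grid(s)/Puddles(s) dynamics described above can be sketched as follows; the class name and the puddle rectangle encoding are our own illustrative choices, not the exact layout used in the paper:

```python
import numpy as np

class Gridworld2D:
    """Minimal sketch of the continuous 2D gridworld Grid(s)/Puddles(s)."""

    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # L, R, U, D

    def __init__(self, step_size=0.05, puddles=()):
        self.s = step_size
        self.puddles = puddles  # list of (xmin, xmax, ymin, ymax)
        self.pos = np.random.uniform(0.0, 1.0, size=2)

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        # moves outside [0,1]^2 are clipped to the grid border
        self.pos = np.clip(self.pos + self.s * np.array([dx, dy]), 0.0, 1.0)
        if np.all(self.pos >= 1.0):          # goal: both coordinates >= 1
            return 0.0, True
        reward = -0.5
        for xmin, xmax, ymin, ymax in self.puddles:
            if xmin <= self.pos[0] <= xmax and ymin <= self.pos[1] <= ymax:
                reward -= 2.0                # overlapping puddle costs add up
        return reward, False
```
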
The performance is computed as the average number of steps to reach the
goal during the last 100 test problems. To speed up the experiments, problems
can last at most 500 steps; when this limit is reached, the problem stops even if
the system has not reached the goal. All the statistics reported in this paper are
averaged over 20 experiments.
5 Experimental Results
Our aim is to study how the RLS update and the quadratic prediction affect
the performance of XCSF on continuous multistep problems. To this purpose
we applied XCSF with different types of prediction, i.e., linear and quadratic,
and with different update rules, i.e., Widrow-Hoff and RLS, to the Grid(0.05)
and Puddles(0.05) problems. In addition, we also compared the performance
of XCSF to the one obtained with tabular Q-learning [13], a standard reference
in the RL literature. In order to apply tabular Q-learning to the 2D Gridworld
problems, we discretized the continuous problem space, using the step size
s = 0.05 as the resolution for the discretization process. In the first set of experiments
we investigated the effect of the RLS update on the performance of XCSF, while
in the second set of experiments we extended our analysis to quadratic
prediction. Finally, we analyzed the results obtained and the accuracy of the
action-value approximations learned by the different versions of XCSF.
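The tabular Q-learning baseline amounts to discretizing [0, 1]² at the agent's step size and applying the standard update; α and γ values below are our assumptions for illustration, not the paper's settings:

```python
import numpy as np

def make_q_table(step=0.05, n_actions=4):
    """Discretize [0,1]^2 with resolution equal to the step size."""
    n_bins = int(round(1.0 / step)) + 1
    return np.zeros((n_bins, n_bins, n_actions))

def q_update(Q, state, action, reward, next_state, alpha=0.2, gamma=0.95):
    """Standard tabular Q-learning update; states are (i, j) grid indices."""
    x, y = state
    nx, ny = next_state
    td_target = reward + gamma * np.max(Q[nx, ny])
    Q[x, y, action] += alpha * (td_target - Q[x, y, action])
    return Q
```
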
Fig. 2. The performance of Q-learning (reported as QL), XCSF with the Widrow-Hoff
update (reported as WH), and XCSF with the RLS update (reported as RLS) applied
to: (a) the Grid(0.05) problem, (b) the Puddles(0.05) problem. Curves are averages
over 20 runs.
Also in this case, XCSF with the RLS update is able to learn faster than XCSF with
the usual Widrow-Hoff update rule, and the difference with Q-learning is even
less evident.
Therefore, our results suggest that the RLS update rule is able to exploit the
collected experience more effectively than the Widrow-Hoff rule and confirm the
previous findings on single-step problems reported in [11].
experiments, the RLS update leads to a faster convergence also when quadratic
prediction is used. In addition, the results suggest that quadratic prediction also
affects the learning speed: both with the Widrow-Hoff update and with the RLS
update, quadratic prediction outperforms the linear one. In particular, XCSF
with quadratic prediction and the RLS update is able to learn even faster
than Q-learning on both the Grid(0.05) and Puddles(0.05) problems. However, as
Table 1b shows, all the systems reach an optimal performance. Finally, it can
be noticed that the number of macroclassifiers evolved (Table 1c) is very similar
for all the systems, suggesting that XCSF with quadratic prediction does not
evolve a more compact solution.
Fig. 3. Average absolute error of the value functions learned by XCSF on (a) the
Grid(0.05) problem and (b) the Puddles(0.05) problem. Curves are averages over 20
runs.
results is not surprising: (i) the RLS update exploits the collected experience
more effectively and learns an accurate approximation faster; (ii) the quadratic
prediction allows a broader generalization in the early stages that very quickly
leads to a rough approximation of the payoff landscape. Figure 3 reports the
error of the value function learned by the four XCSF versions during the learning
process. The error of a learned value function is measured as the absolute error
with respect to the optimal value function, computed as the average of the absolute
errors over a uniform grid of 100 × 100 samples of the problem space. For
each version of XCSF this error measure is computed at different stages of the
learning process and then averaged over the 20 runs to generate the error curves
reported in Figure 3. The results confirm our hypothesis: both quadratic prediction
and the RLS update lead very quickly to accurate approximations of the optimal value
function, although the final approximations are as accurate as the one evolved
by XCSF with the Widrow-Hoff rule and linear prediction. To better understand
how the different versions of XCSF approximate the value function, Figure 4,
Fig. 4. Examples of the value function evolved by XCSF with linear prediction and
Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after
500 learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
Fig. 5. Examples of the value function evolved by XCSF with linear prediction and
RLS update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500
learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
Fig. 6. Examples of the value function evolved by XCSF with quadratic prediction
and Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes,
(b) after 500 learning episodes, (c) at the end of the experiment (after 5000 learning
episodes)
Fig. 7. Examples of the value function evolved by XCSF with quadratic prediction and
RLS update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500
learning episodes, (c) at the end of the experiment (after 5000 learning episodes)
Figure 5, Figure 6, and Figure 7 show some examples of the value functions
learned by XCSF at different stages of the learning process. In particular,
Figure 4a and Figure 5a show the value function learned by XCSF with linear
prediction after a few learning episodes, using respectively the Widrow-Hoff
update and the RLS update. While the value function learned by XCSF with
Widrow-Hoff is flat and very uninformative, the one learned by XCSF with the RLS
update provides a rough approximation of the slope of the optimal value function,
although it is still far from being accurate. Finally, Figure 6 and Figure 7
report similar examples of value functions learned by XCSF with quadratic
prediction. Figure 7a shows how XCSF with both quadratic prediction and the RLS
update may learn a rough approximation of the optimal value function very
quickly, after very few learning episodes. A similar analysis can be performed
on the Puddles(0.05) problem, but it is not reported here due to space constraints.
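The error measure used above (average absolute deviation from the optimal value function over a uniform 100 × 100 grid of the problem space) can be sketched as follows; the function name and callable interface are our own:

```python
import numpy as np

def avg_abs_error(learned_v, optimal_v, n=100):
    """Average absolute error over a uniform n x n grid of [0,1]^2.

    learned_v/optimal_v: callables mapping (x, y) -> value.
    """
    xs = np.linspace(0.0, 1.0, n)
    errs = [abs(learned_v(x, y) - optimal_v(x, y)) for x in xs for y in xs]
    return float(np.mean(errs))
```
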
6 Conclusions
References
2. Butz, M.V., Pelikan, M.: Analyzing the evolutionary pressures in XCS. In: Spector, L., Goodman, E.D., Wu, A., Langdon, W.B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M.H., Burke, E. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), July 7-11, pp. 935–942. Morgan Kaufmann, San Francisco (2001)
3. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. Journal of Soft Computing 6(3-4), 144–153 (2002)
4. Goodwin, G.C., Sin, K.S.: Adaptive Filtering: Prediction and Control. Prentice-Hall Information and System Sciences Series (March 1984)
5. Haykin, S.: Adaptive Filter Theory, 4th edn. Prentice-Hall, Englewood Cliffs (2001)
6. Lanzi, P.L., Loiacono, D.: XCSF with neural prediction. In: IEEE Congress on Evolutionary Computation, CEC 2006, pp. 2270–2276 (2006)
7. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond linear approximation. In: Genetic and Evolutionary Computation, GECCO 2005, Washington DC, USA, pp. 1859–1866. ACM Press, New York (2005)
8. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: XCS with computed prediction for the learning of Boolean functions. In: Proceedings of the IEEE Congress on Evolutionary Computation CEC 2005, Edinburgh, UK, pp. 588–595. IEEE, Los Alamitos (September 2005)
9. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: XCS with computed prediction in continuous multistep environments. In: Proceedings of the IEEE Congress on Evolutionary Computation CEC 2005, Edinburgh, UK, pp. 2032–2039. IEEE, Los Alamitos (September 2005)
10. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1505–1512. ACM Press, New York (2006)
11. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Generalization in the XCSF classifier system: Analysis, improvement, and extension. Evolutionary Computation 15(2), 133–168 (2007)
12. Loiacono, D., Marelli, A., Lanzi, P.L.: Support vector regression for classifier prediction. In: GECCO 2007: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp. 1806–1813. ACM Press, New York (2007)
13. Watkins, C.J.C.H.: Learning from delayed reward. PhD thesis (1989)
14. Watkins, C.J.C.H., Dayan, P.: Technical note: Q-learning. Machine Learning 8, 279–292 (1992)
15. Widrow, B., Hoff, M.E.: Adaptive switching circuits. In: Neurocomputing: Foundations of Research, pp. 126–134. The MIT Press, Cambridge (1988)
16. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995), http://prediction-dynamics.com/
17. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (workshop organisers) Proceedings of the International Workshop on Learning Classifier Systems (IWLCS-2000), in the Joint Workshops of SAB 2000 and PPSN 2000, pp. 158–174 (2000)
18. Wilson, S.W.: Classifiers that approximate functions. Journal of Natural Computing 1(2-3), 211–234 (2002)
19. Wilson, S.W.: Classifier systems for continuous payoff environments. In: Deb, K., Poli, R., Banzhaf, W., Beyer, H.-G., Burke, E., Darwen, P., Dasgupta, D., Floreano, D., Foster, J., Harman, M., Holland, O., Lanzi, P.L., Spector, L., Tettamanzi, A., Thierens, D., Tyrrell, A. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 824–835. Springer, Heidelberg (2004)
Use of a Connection-Selection Scheme in Neural XCSF
1 Introduction
Two main theories to explain the emergence of complexity in the brain are constructivism
(e.g., [1]), where complexity develops by adding neural structure to a simple
network, and selectionism [2], where an initial amount of over-complexity is gradually
pruned over time through experience. We are interested in the feasibility of combining
both approaches to realize flexible learning within Learning Classifier Systems
(LCS) [3], exploiting their Genetic Algorithm (GA) [4] foundation in particular. In
this paper we present a form of neural LCS [5] based on XCSF [6] which includes the
use of self-adaptive search operators to exploit both constructivism and selectionism
during reinforcement learning.
The focus of this paper is the impact of a form of feature selection that
we apply to the neural classifiers, allowing a more granular exploration of the network
weight space. Unlike traditional feature selection, which acts only on input
channels, we allow every connection in our networks to be enabled or disabled. We
term this addition connection selection, and evaluate in detail the effects of its inclusion
in our LCS, in terms of solution size, internal knowledge representation and
stability of evolved solutions in two evaluation environments; the first a discrete maze
and the second a continuous maze.
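The core idea of connection selection, a per-connection enable flag that mutation can toggle, can be sketched as follows; the class name, network shapes, and mutation rate are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

class MLPWithConnectionMask:
    """A small MLP where every weight carries a boolean enable flag,
    so individual connections (not just input channels) can be
    switched off by the GA."""

    def __init__(self, n_in, n_hidden, n_out, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.w1 = self.rng.normal(size=(n_hidden, n_in))
        self.w2 = self.rng.normal(size=(n_out, n_hidden))
        self.m1 = np.ones_like(self.w1, dtype=bool)  # connection flags
        self.m2 = np.ones_like(self.w2, dtype=bool)

    def forward(self, x):
        # disabled connections contribute zero to the weighted sums
        h = np.tanh((self.w1 * self.m1) @ x)
        return (self.w2 * self.m2) @ h

    def mutate_connections(self, p=0.05):
        """Toggle each connection flag independently with probability p."""
        self.m1 ^= self.rng.random(self.m1.shape) < p
        self.m2 ^= self.rng.random(self.m2.shape) < p
```
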
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 87–106, 2010.
© Springer-Verlag Berlin Heidelberg 2010
88 G.D. Howard, L. Bull, and P.-L. Lanzi
For clarity's sake, we shall refer to the system without connection selection as
N-XCSF, and the version with connection selection as N-XCSFcs. Applications of this
type of learning system are varied, including (but not limited to) agent navigation,
data mining and function approximation; we are interested in the field of simulated
agent navigation. The rest of this paper is organized as follows: Section 2 details
background research, Section 3 introduces the evaluation environments used, and
Section 4 shows the implementation of neural XCSF. Section 5 describes connection
selection, Section 6 provides the results of the experiments conducted, and Section 7
provides a brief discussion and suggests further avenues of research.
2 Background
removes a single, fully-connected hidden-layer neuron from the classifier condition. The author then proceeds to define the use of NCS in continuous-valued environments using a bounded-range representation, which reduces the number of neurons required by each MLP. This constructivist LCS was then modified to include parameter self-adaptation in [11]. The probabilities of constructivism events occurring are self-adaptive in the same way as the mutation rate μ in [20], where an Evolution Strategy-inspired implementation is used to control the amount of genetic mutation that occurs within each GA niche in a classifier system. This allows classifiers that match in suboptimal niches to search more broadly within the solution space when μ is large, and allows the mutation rate to decrease once an optimal solution has been found, maintaining stability within the niche. In both cases it is reported that networks of different structure evolve to handle different areas of the problem space, thereby identifying the underlying structure of the task.
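The self-adaptive mutation scheme described above can be illustrated with a minimal sketch. The function name and the step sizes tau and sigma are illustrative assumptions, not the authors' parameter values; the log-normal perturbation of the rate itself is the standard Evolution Strategy form.

```python
import math
import random

def self_adapt_and_mutate(weights, mu, tau=0.5, sigma=0.1, rng=random):
    """Sketch of [20]-style self-adaptation: the rule's own mutation rate mu
    is first perturbed log-normally (Evolution Strategy style), then the new
    rate governs how likely each network weight is to be mutated."""
    # Step 1: mutate the mutation rate itself.
    new_mu = mu * math.exp(tau * rng.gauss(0.0, 1.0))
    new_mu = min(max(new_mu, 1e-4), 1.0)   # keep the rate in a sane range
    # Step 2: each weight is perturbed with probability new_mu.
    new_weights = [w + rng.gauss(0.0, sigma) if rng.random() < new_mu else w
                   for w in weights]
    return new_weights, new_mu
```

Because each classifier carries its own rate, rules matching in well-solved niches can settle to low mutation while rules in poorly-solved niches keep exploring.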
Constructivism leads us to the field of variable-length neural representations. Traditional genetic crossover operators are of questionable utility when applied to the variable-length genomes that constructivism generates, as they all rely on randomly picking points within the genome at which to perform crossover. This can have the effect of breaking the genome in areas that rely on spatial proximity to provide high utility. A number of methods, notably Harvey's Species Adaptive Genetic Algorithm (SAGA) [21] and Hutt and Warwick's Synapsing Variable-Length Crossover (SVLC) [22], provide ways of crossing variable-length genetic strings, with SVLC reporting superior performance to SAGA on a variable-length test problem. SVLC also eliminates the main weakness of SAGA: that the initial crossover point on the first genome is still chosen randomly, with only the second subject to a selection heuristic. It should be noted that neither N-XCSF nor N-XCSFcs uses any version of crossover during a GA cycle; the reasoning behind this omission is twofold. First, directly addressing the problem would require increasing the complexity of the system (adding SVLC-like functionality, for example). Second, and more importantly, experimental evidence suggests that sufficient solution-space exploration can be obtained via a combination of GA mutation, self-adaptive mutation and neural constructivism to produce optimal solutions in both discrete and continuous environments. This view is reinforced elsewhere in the literature, e.g. [23].
Aside from GA-based crossover difficulties, there are also problems related to creating novel network structures of high utility. For example, the competing conventions problem (e.g. [24]) demonstrates how two networks of different structure but identical utility may compete with each other for fitness, despite being essentially the same network. NeuroEvolution of Augmenting Topologies (NEAT) [25] presents a method for addressing this problem under constructivism. Each gene under the NEAT scheme encodes a connection, specifying the input neuron and output neuron, the connection weight, and a Boolean flag indicating whether the connection is currently enabled or disabled. Each gene also has a marker that corresponds to that gene's first appearance in the population, with markers passed down from parents to children during a GA event; the scheme is based on the assumption that genes from the same origin are more likely to encode similar functions. The marker is retained to make it more likely that homologous genes will be selected during crossover. NEAT has been applied to evolve robot controllers [26].
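The NEAT connection gene described above can be sketched as a small record type. Field and function names are illustrative; this is a sketch of the idea, not the NEAT reference implementation.

```python
from dataclasses import dataclass
import itertools

# Global innovation counter: each structurally new connection receives a
# fresh historical marker; copies inherited by children keep the parent's.
_innovation = itertools.count()

@dataclass
class ConnectionGene:
    in_neuron: int
    out_neuron: int
    weight: float
    enabled: bool     # Boolean flag: connection currently active or not
    innovation: int   # historical marker (gene's first appearance)

def new_connection(in_neuron, out_neuron, weight):
    """Create a structurally novel connection with a fresh marker."""
    return ConnectionGene(in_neuron, out_neuron, weight, True, next(_innovation))

def align_genes(parent_a, parent_b):
    """Historical markers let crossover line up homologous genes by origin
    rather than by position in the genome; unmatched genes pair with None."""
    by_marker = {g.innovation: g for g in parent_b}
    return [(g, by_marker.get(g.innovation)) for g in parent_a]
```

Aligning by marker rather than by genome position is what sidesteps the competing conventions problem during crossover.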
Feature selection is a method of streamlining the data input to a process, where the input data can be imagined as a vector of inputs with dimension greater than one. This can be done manually (by a human with relevant domain knowledge), although this process can be error-prone, costly in terms of both time and potentially money, and, of course, requires expert domain knowledge. A popular alternative in the machine learning community is automatic feature selection.
The use of feature selection brings two major benefits: first, the amount of data being input to a process can be reduced (increasing computational efficiency), and second, noisy connections (or those otherwise inhibitory to the successful performance of the system) can be disabled. Useful features within the input vector are preserved, as the performance of the system can be expected to drop if they are disabled, with the converse being true for disabling noisy/low-fitness connections. This is especially useful when considering the case of mobile robot control, where sensors are invariably subject to a certain level of noise that can be automatically filtered out by the feature selection mechanism. This description of the concept of feature selection displays a strong relationship with the MLP (and indeed any connectionist neural network) paradigm, which uses a collection of clearly discretised input channels to produce an output. It can be demonstrated that disabling connections within the input layer of an MLP can have a (sometimes drastic) effect on the output of the network [27].
Related work on the subject of feature selection in neural networks can be found in [28] and [29], which explore the use of feature selection in a variety of neural networks. Also especially pertinent is the implementation of feature selection within the NEAT framework (FS-NEAT) [30], whose authors apply their system to a double pole balancing task with 256 inputs. FS-NEAT performs feature selection by giving each input feature a small chance (1/I, where I is the dimension of the input vector) of being connected to every output node. An unaltered NEAT mutation sequence then allows these connections to connect to nodes in the hidden layers of the networks, as well as providing the ability to add further input nodes to the networks, again with a small probability of input addition.
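The 1/I sparse initialisation described above can be sketched as follows. This is a hypothetical helper illustrating the idea, not FS-NEAT's actual code.

```python
import random

def fs_neat_init(num_inputs, num_outputs, rng=random):
    """FS-NEAT-style sparse start: each input feature is wired to every
    output with small probability 1/I, so most inputs begin disconnected
    and are only recruited later by mutation if they prove useful."""
    p = 1.0 / num_inputs
    links = []
    for i in range(num_inputs):
        if rng.random() < p:                  # feature i selected...
            for o in range(num_outputs):      # ...and connected to every output
                links.append((i, o, rng.uniform(-1.0, 1.0)))
    return links
```

With 256 inputs, only about one feature is expected to be connected at the start; evolution then grows the useful subset.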
The authors make the point that NEAT, following a constructivist methodology, tends to evolve small networks without superfluous connections. They observe both quicker convergence to optimality and networks with only around 32% of the available input nodes connected in the best-performing network, a reduction from 256 inputs to an average useful subset size of 83.6 enabled input nodes. Also highly relevant is the derivative FD-NEAT (Feature Deselection NEAT) [31], where all connections are enabled by default, and pruning rather than growing of connections takes place (it should be noted that FS-NEAT and neural constructivism [1] are similar, as are FD-NEAT and Edelman's theory of neural Darwinism [2]). Consistent across all four papers mentioned above is that they perform input feature selection only (in other words, only input connections are viable candidates for enabling/disabling).
A comparative study into neuroevolution for both classification and regression
tasks (supervised) can be found in [32], where the authors compare purely heuristic
approaches with an ensemble of evolutionary neural networks (ENNs), whose MLPs
Use of a Connection-Selection Scheme in Neural XCSF 91
3 Environments
Discrete maze experiments are conducted on a real-valued version of the Maze4 environment [34] (Figure 1). In the diagram, O represents an obstacle that the agent cannot traverse, G is the goal state that the agent must reach to receive reward, and * is a free space that the agent can occupy. The environmental discount rate is γ = 0.71. The environmental representation was altered to loosely approximate a real robot's sensor readings: the binary string normally used to represent a given input state st is replaced with a real-valued counterpart in the same way as [5]. That is, each exclusive object type the agent could encounter is represented by a random real number within a specified range ([0.0, 0.1] for free space, [0.4, 0.5] for an obstacle and [0.9, 1.0] for the goal state). In the discrete environment, the input state st consists of the cell contents of the 8 cells directly surrounding the agent's current position, and the boundedly-random numeric representation attempts to emulate the sensory noise that real robots encounter. Performance is gauged by a steps-to-goal count: the number of discrete movements required to reach the goal state from a random starting position in the maze; in Maze4 the optimal figure is 3.5. Upon reaching the goal state, the agent receives a reward of 1000. Action calculation is covered in Section 4. The test environment for the continuous experiments is the 2-D continuous grid world Grid(0.05) (Figure 2) [35]. This is a two-dimensional environment where the agent's current state, st, consists of the x and y components of the agent's current location within the environment; to emulate sensory noise, both the x and y location of the
agent are subject to random noise of +/- [0%-5%] of the agent's true position. Both x and y are bounded in the range [0, 1]; any movement outside of this range takes the agent to the nearest grid boundary. The environmental discount rate is γ = 0.95. The agent moves a predetermined step size (in this case 0.05) within this environment. The only goal state is in the top-right corner of the grid, where (x + y > 1.90). The agent can start anywhere except the goal state, and must reach the goal state in the fewest possible movements, whereupon it receives a reward of 1000. Again, action calculation is covered in Section 4.
O O O O O O O O
O * * O * * G O
O O * * O * * O
O O * O * * O O
O * * * * * * O
O O * O * * * O
O * * * * O * O
O O O O O O O O

Fig. 1. The discrete Maze4 environment
Fig. 2. The continuous Grid (0.05) environment
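The boundedly-random sensor encoding described in this section can be sketched as follows; the three ranges are those given in the text, while the helper's name is illustrative.

```python
import random

# Ranges from the text: each object type maps to a random real drawn from
# its band on every observation, emulating bounded sensor noise.
RANGES = {'*': (0.0, 0.1),   # free space
          'O': (0.4, 0.5),   # obstacle
          'G': (0.9, 1.0)}   # goal state

def encode_state(surrounding_cells, rng=random):
    """Turn the 8 cells around the agent into a noisy real-valued input s_t."""
    return [rng.uniform(*RANGES[c]) for c in surrounding_cells]
```

Because a fresh number is drawn per observation, two visits to the same cell never yield exactly the same input vector, yet the bands remain separable.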
match set if it has activation greater than 0.5. This is necessary as the action of the classifier must be re-calculated for each state the classifier encounters, so each classifier sees each input. The outputs at the other two neurons (real numbers) are mapped to a single discrete movement, which varies between the discrete and continuous environments. In the discrete case, the outputs at the other two neurons are mapped to a movement in one of eight compass directions (N, NE, E, etc.). This takes place in a way similar to [5], where three ranges of discrete output are possible for each node: 0.0 < x < 0.4 (low), 0.4 < x < 0.6 (medium), and 0.6 < x < 1.0 (high). The unequal partitioning is used to counteract the insensitivity of the sigmoid function to values within the extreme reaches of its range. A discrete movement is mapped from these continuous action outputs: (high, high) = north, (high, medium) = northeast, (high, low) = east, and so on. It should be noted that the final two motor pairings, (low, medium) and (low, low), both produce a move to the northwest.
In the continuous environment, movement is constrained to one of four compass directions (north, east, south, west). This takes place similarly to the discrete environment, except that here there are four possible directions and only two ranges of discrete output are possible: 0.0 < x < 0.5 (low) and 0.5 < x < 1.0 (high). The combined actions of each motor translate to a discrete movement according to the two motor output strengths: (high, high) = north, (high, low) = east, (low, high) = south, and (low, low) = west.
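The two output-to-action mappings can be sketched as follows. The text only gives the first three and last two discrete pairings explicitly; the intermediate entries here are inferred by continuing the compass order and should be read as an assumption. Boundary handling at exactly 0.4/0.6/0.5 is likewise assumed.

```python
def band_discrete(x):
    """Unequal partitioning for the discrete maze, countering sigmoid
    insensitivity near the extremes of its range."""
    if x < 0.4:
        return 'low'
    return 'medium' if x < 0.6 else 'high'

# (high, high) = N, (high, medium) = NE, (high, low) = E from the text;
# the middle entries continue the compass order (an assumption); the final
# two pairings both map to NW, as the text notes.
DISCRETE_ACTIONS = {
    ('high', 'high'): 'N', ('high', 'medium'): 'NE', ('high', 'low'): 'E',
    ('medium', 'high'): 'SE', ('medium', 'medium'): 'S', ('medium', 'low'): 'SW',
    ('low', 'high'): 'W', ('low', 'medium'): 'NW', ('low', 'low'): 'NW',
}

def discrete_move(o1, o2):
    return DISCRETE_ACTIONS[(band_discrete(o1), band_discrete(o2))]

def band_continuous(x):
    """Only two bands in the continuous grid world."""
    return 'high' if x >= 0.5 else 'low'

CONT_ACTIONS = {('high', 'high'): 'N', ('high', 'low'): 'E',
                ('low', 'high'): 'S', ('low', 'low'): 'W'}

def continuous_move(o1, o2):
    return CONT_ACTIONS[(band_continuous(o1), band_continuous(o2))]
```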
At each time-step, XCSF builds a match set, [M], from [P], consisting of all classifiers whose conditions match the current input state st. In neural XCSF, every action must be present in each [M]. If this is not the case, covering is used to generate classifiers that advocate the missing action(s); covering repeatedly generates random networks until the network action matches the desired output for the given input state. Once [M] is formed, a prediction array is created. In XCSF, each classifier prediction (cl.p) is calculated as a product of the environmental input (or state, st) and the prediction weight vector (w) associated with each classifier. This vector has one element for each input (8 in the discrete case, 2 in the continuous case), plus an additional element w0 which corresponds to x0, a constant input that is set as a parameter of XCSF. A classifier's prediction is calculated as shown in equation (1):
cl.p(s_t) = cl.w · x = w_0 · x_0 + Σ_{i>0} w_i · s_t(i)    (1)
The prediction array is the fitness-weighted average of the calculated predictions for each possible action. An action selection policy is used to decide which action should be taken (in [6], a random action selection policy is used on explore trials, and a deterministic one on exploit trials). All classifiers that advocate the selected action form the action set [A]. The action is taken and, if the goal state is reached, a reward is returned from the environment that is used to update the parameters of the classifiers in [A]. A discounted reward is propagated to the previous action set [A]-1 if it exists. The prediction weight vector of each classifier in the action set is updated using a version of the delta rule, rather than updating the classifier's prediction value (equation (2)). Each prediction weight is then updated (equation (3)) and the prediction error is calculated (equation (4)). Here, the vector x is the state st augmented by the parameter x0.
Δw_i = (η / |x|²) · (P − cl.p(s_t)) · x_i    (2)

w_i ← w_i + Δw_i    (3)

ε ← ε + β · (|P − cl.p(s_t)| − ε)    (4)
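Equations (1)-(4) can be sketched in code as follows. This is a minimal sketch of the standard XCSF update from [6]: η is the correction rate, β the error learning rate, and whether equation (4) uses the pre- or post-update prediction is an assumption here (the pre-update error is used for simplicity).

```python
def predict(w, s, x0=1.0):
    """Equation (1): cl.p(s) = w0*x0 + sum_i w_i * s_i, with x the state
    augmented by the constant input x0."""
    x = [x0] + list(s)
    return sum(wi * xi for wi, xi in zip(w, x))

def update(w, s, P, eps, eta=0.2, beta=0.2, x0=1.0):
    """Delta-rule correction (2), weight update (3), error update (4)."""
    x = [x0] + list(s)
    norm = sum(xi * xi for xi in x)        # |x|^2, normalises the correction
    err = P - predict(w, s, x0)            # P - cl.p(s_t)
    w = [wi + (eta / norm) * err * xi      # equations (2) and (3)
         for wi, xi in zip(w, x)]
    eps = eps + beta * (abs(err) - eps)    # equation (4), pre-update error
    return w, eps
```

Because the correction is divided by |x|², a single update moves the prediction a fixed fraction η of the way toward the target P regardless of input magnitude.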
Further details of the update procedure used in XCSF can be found in [6]. The GA may then fire if the average time since the last GA application to the classifiers in [A] exceeds a threshold θ_GA. Our GA is modified to be a two-stage process. Stage 1 (Section 4.1) controls the rates of mutation and constructivism/connection selection that occur within the system, with stage 2 (Section 4.2 and Section 5) controlling the evolution of neural architecture in terms of both neurons and connections. Deletion occurs as in [6]. This cycle of [P] → [M] → [A] → reward is called a trial. Each experiment consists of 50,000 trials (20,000 in the continuous case). Each trial is either in exploration mode (roulette-wheel action selection) or exploitation mode (deterministic action selection). We employ roulette-wheel action selection on exploration trials to discourage potentially time-wasting agent movements, especially as the agent's payoff landscape becomes more accurate.
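The prediction array and the two action-selection policies described above can be sketched as follows. Function names are illustrative, and the roulette wheel assumes non-negative predictions (true here, since rewards are 0 or 1000).

```python
import random

def prediction_array(match_set):
    """Fitness-weighted average of classifier predictions, per action.
    match_set: list of (action, prediction, fitness) triples."""
    num, den = {}, {}
    for a, p, f in match_set:
        num[a] = num.get(a, 0.0) + f * p
        den[a] = den.get(a, 0.0) + f
    return {a: num[a] / den[a] for a in num}

def select_action(pa, explore, rng=random):
    if not explore:                    # exploit: deterministic best action
        return max(pa, key=pa.get)
    total = sum(pa.values())           # explore: roulette wheel over the
    r = rng.uniform(0.0, total)        # prediction array, biasing search
    acc = 0.0                          # away from low-payoff moves
    for a, v in pa.items():
        acc += v
        if r <= acc:
            return a
    return a
```

The roulette wheel, unlike uniformly random exploration, increasingly avoids time-wasting moves as the payoff landscape sharpens.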
4.1 Self-adaptation
Implementation of NC in this system is based on the work of Bull [5]. Each rule has a varying number of hidden-layer neurons (initially 1, and always > 0), with additional neurons being added to or removed from the hidden layer depending on the constructivism element of the system. Constructivism takes place during a GA cycle, after mutation. Two new self-adaptive parameters governing the constructivism process are added.
disabled (set to 0.0, and not a viable target for connection weight mutation). If the flag was false but is then flipped to true, the connection weight is randomly initialised uniformly in the range [-1, 1]. All flags are initially set to true for newly initialised classifiers and classifiers created via covering. During a node addition event, the flags representing the new node's connections are set probabilistically, with P(connection enabled) = 0.5. Sharing something of a middle ground between FS-NEAT [30] and FD-NEAT [31], we grow whole neurons (using constructivism, feature selection at the granularity of whole neurons), but tend to prune connections from those neurons (using our network-wide feature selection implementation, feature deselection at the granularity of single connections). The exception to this is node addition, which produces neurons that are on average 50% connected.
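The connection-selection operators described above can be sketched as follows. These are hypothetical helpers operating on flat weight/flag lists, not the authors' implementation.

```python
import random

def mutate_flags(weights, flags, p_flip, rng=random):
    """Connection selection: each connection's Boolean flag may flip.
    A re-enabled connection gets a fresh weight uniform in [-1, 1]; a
    disabled one is zeroed and excluded from weight mutation elsewhere."""
    weights, flags = list(weights), list(flags)
    for i in range(len(flags)):
        if rng.random() < p_flip:
            flags[i] = not flags[i]
            weights[i] = rng.uniform(-1.0, 1.0) if flags[i] else 0.0
    return weights, flags

def add_node(weights, flags, n_connections, rng=random):
    """Constructivism event: a new hidden node whose connections are each
    enabled with probability 0.5, so new nodes are ~50% connected."""
    weights, flags = list(weights), list(flags)
    for _ in range(n_connections):
        on = rng.random() < 0.5
        flags.append(on)
        weights.append(rng.uniform(-1.0, 1.0) if on else 0.0)
    return weights, flags
```

Growing half-connected nodes and then flipping individual flags gives the system both the coarse (whole-neuron) and fine (single-connection) moves discussed in the text.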
6 Experimentation
Following this brief introduction is a comparison of neural XCSF with (N-XCSFcs) and without (N-XCSF) connection selection, in both the discrete (Maze4) and continuous (Grid(0.05)) environments. In Figure 3(a), as in all other steps-to-goal graphs (4(a), 5(a), 6(a)), the red dashed line represents optimal performance.
Fig. 3. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-adaptive parameter values in N-XCSF (with no connection selection) in Maze4
maze, and the steps-to-goal count recorded. Under the standard maze scenario used above, and in the LCS literature, it is not possible to perform standard statistical tests for significant differences in performance, as performance is plotted as a 50-point moving average due to the random start location. Using an extra exploit trial from a fixed position eases statistical comparison and allows us to define stability. Stability is defined as follows: a solution can be said to be stable if, for each of 50 consecutive knowledge-test trials interspersed between standard explore and exploit trials, from the constant location in the maze, the solution always finds the optimal path to the goal. The first trial at which each run of a system reaches stability is recorded, and this set of 10 numbers is compared to the sets produced by the other variants of the system using a standard T-test. We also record various other indicators, namely the average self-adaptive mutation rate, μ, and the average number of connected hidden-layer nodes.
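The stability criterion and the comparison of first-stable trials can be sketched as follows. The Welch form of the t statistic is an assumption (the paper says only "standard T-test"), and the helper names are illustrative.

```python
import math

def first_stable_trial(optimal_flags, window=50):
    """Return the first knowledge-test trial from which `window` consecutive
    tests all found the optimal path, or None if stability is never reached."""
    run = 0
    for t, ok in enumerate(optimal_flags):
        run = run + 1 if ok else 0
        if run == window:
            return t - window + 1
    return None

def welch_t(xs, ys):
    """Two-sample t statistic on the two sets of first-stable trials
    (one value per run, 10 runs per system variant)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)
```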
Fig. 4. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-adaptive parameter values (d) average enabled connections per classifier (%) in N-XCSFcs (with connection selection) in Maze4
Table 1. Detailing the average time to stability and T-Test results when comparing N-XCSF
and N-XCSFcs in Maze4
Average P value
Connection selection 8838.40 0.64
No connection selection 10358.8
Table 2. Detailing the average self-adaptive mutation rate and T-Test results when comparing
N-XCSF and N-XCSFcs in Maze4
Average P value
Connection selection 0.11 2.99E-09
No connection selection 0.058
Table 3. Detailing the average number of hidden layer nodes and T-Test results when compar-
ing N-XCSF and N-XCSFcs in Maze4
Average P value
Connection selection 2.81 9.16E-08
No connection selection 1.50
reverse is true for the action set size estimate: 225.7 is the average when connection selection is enabled, 143.8 without. So even though match set generation is quicker with connection selection, all action set-based operations (overall action determination, parameter updates, reinforcement, GA activation) can be expected to be computationally less efficient with a connection selection scheme applied. In terms of actual enabled connections within the population, we can observe that the average number of connected nodes in the hidden layers of the classifiers (Figures 3(b) and 4(b)) does not favour connection selection (2.7 connected nodes with connection selection vs. 1.5 without). However, connection selection has only 60% enabled connections on average (Figure 4(d)). We can then calculate the number of connections enabled in the entire population as the product of the number of classifiers, the average number of connected hidden-layer nodes per classifier, the number of connections per node, and the average fraction of enabled connections. Hence, even though there are more connections per network with connection selection (as there are more hidden-layer nodes on average in those networks), the lower required population means that fewer calculations are necessary. For a neural representation to function, it is postulated that information from all surrounding locations would be needed to make an accurate decision with regard to movement in the environment (i.e. keeping a Markov problem structure). Observations of the final networks agree with this, showing that connections are more frequently cut between the hidden and output neurons.
Average P value
Connection selection 11453.5 0.45
No connection selection 13453.7
Fig. 5. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-adaptive parameter values in N-XCSF (with no connection selection) in Grid (0.05)
Fig. 6. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-adaptive parameter values (d) average enabled connections per classifier (%) in N-XCSFcs (with connection selection) in Grid (0.05)
The final self-adaptive parameter is the most pronounced of these differences, differing between the continuous and discrete environments by a factor of ten (0.3 and 0.03 respectively). Table 5 shows the impact on the self-adaptive mutation rate when connection selection is added in the continuous environment. Both connection selection and non-connection selection versions share similar average values, although the P value reveals that the difference is close to statistically significant. Table 6 indicates that, in contrast to the discrete case (Table 3), the average numbers of hidden-layer nodes evolved by the connection selection and non-connection selection versions in the continuous case are not statistically significantly different. These results indicate that the impact of connection selection is smaller in the continuous environment than in the discrete environment. A possible explanation is that, since there are fewer connections per node in the continuous environment (5 as opposed to 11), the ability of a connection selection scheme to alter the functionality of a network is reduced.
Table 5. Detailing the average self-adaptive mutation rate and T-Test results when comparing
N-XCSF and N-XCSFcs in the continuous Grid (0.05) environment
Average P value
Connection selection 0.45 0.02
No connection selection 0.47
Table 6. Detailing the average number of hidden layer nodes and T-Test results when compar-
ing N-XCSF and N-XCSFcs in the continuous Grid (0.05) environment
Average P value
Connection selection 1.36 0.46
No connection selection 1.29
These results show that, even with 20% more network connectivity in the continuous case (Figure 4(d) vs. Figure 6(d)), the reduced population requirements of N-XCSFcs in a continuous environment provide a greater efficiency enhancement: not only does each network contain fewer connections, but the overall number of networks in the final solution is also significantly reduced.
7 Discussion
This paper has detailed the implementation of an XCSF system for simulated agent navigation, together with the addition of various other elements, namely a neural classifier representation, self-adaptive mutation rates, and neural constructivism. The effects of a network-wide feature-selection derivative have been examined, with particular emphasis placed on computational efficiency and final solution parsimony. Furthermore, it has been shown that such a scheme can have a significant impact on both of these factors in solving both discrete and continuous agent navigation tasks. The research presented here could be extended in a number of ways, including comparison of other network types or classifier representations on the same tasks. We also aim to investigate the effects of different methods of performing constructivism (see e.g. [36]).
References
[1] Quartz, S.R., Sejnowski, T.J.: The Neural Basis of Cognitive Development: A Constructivist Manifesto. Behavioral and Brain Sciences 20(4), 537–596 (1997)
[2] Edelman, G.: Neural Darwinism: The Theory of Neuronal Group Selection. Basic Books, New York (1987)
[3] Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical Biology, vol. 4, pp. 263–293. Academic Press, New York (1976)
[4] Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
[5] Bull, L.: On Using Constructivism in Neural Classifier Systems. In: Merelo Guervós, J.J., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 558–567. Springer, Heidelberg (2002)
[6] Wilson, S.W.: Function Approximation with a Classifier System. In: Spector, L., Wu, A., Langdon, W.B., Voigt, H.-M., Gen, M., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pp. 974–981. Morgan Kaufmann, San Francisco (2001)
[7] Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
[8] Bull, L., Hurst, J.: A Neural Learning Classifier System with Self-Adaptive Constructivism. In: IEEE Congress on Evolutionary Computation. IEEE Press, Los Alamitos (2003)
[9] Buhmann, M.D.: Radial Basis Functions: Theory and Implementations. Cambridge University Press, Cambridge (2003)
[10] Bull, L., O'Hara, T.: Accuracy-based Neuro and Neuro-Fuzzy Classifier Systems. In: Langdon, W.B., Cantú-Paz, E., Mathias, K., Roy, R., Davis, D., Poli, R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller, J.F., Burke, E., Jonoska, N. (eds.) GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 905–911. Morgan Kaufmann, San Francisco (2002)
[11] Hurst, J., Bull, L.: A Neural Learning Classifier System with Self-Adaptive Constructivism for Mobile Robot Control. Artificial Life 12(3), 353–380 (2006)
[12] Giani, A., Baiardi, F., Starita, A.: PANIC: A Parallel Evolutionary Rule Based System. In: Proceedings of the Fourth Annual Conference on Evolutionary Programming, EP 1995 (1995)
[13] O'Hara, T., Bull, L.: Prediction Calculation in Accuracy-based Neural Learning Classifier Systems. Tech. Report UWELCSG04-004 (2004)
[14] Lanzi, P.L., Loiacono, D.: XCSF with Neural Prediction. In: IEEE Congress on Evolutionary Computation, CEC 2006, pp. 2270–2276 (2006)
[15] Dam, H.H., Abbass, H.A., Lokan, C., Yao, X.: Neural-Based Learning Classifier Systems. IEEE Transactions on Knowledge and Data Engineering 20(1), 26–39 (2008)
[16] O'Hara, T., Bull, L.: Building Anticipations in an Accuracy-based Learning Classifier System by use of an Artificial Neural Network. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 2046–2052. IEEE Press, Los Alamitos (2005)
[17] Pérez-Uribe, A., Sanchez, E.: FPGA Implementation of an Adaptable-Size Neural Network. In: Vorbrüggen, J.C., von Seelen, W., Sendhoff, B. (eds.) ICANN 1996. LNCS, vol. 1112, pp. 383–388. Springer, Heidelberg (1996)
[18] Watkins, C.J.C.H.: Learning from Delayed Rewards. PhD thesis, Psychology Department, University of Cambridge, England (1989)
[19] Wilson, S.W.: ZCS: A Zeroth-level Classifier System. Evolutionary Computation 2(1), 1–18 (1994)
[20] Bull, L., Hurst, J., Tomlinson, A.: Self-Adaptive Mutation in Classifier System Controllers. In: Meyer, J.-A., Berthoz, A., Floreano, D., Roitblat, H., Wilson, S.W. (eds.) From Animals to Animats 6: The Sixth International Conference on the Simulation of Adaptive Behaviour. MIT Press, Cambridge (2000)
[21] Harvey, I., Husbands, P., Cliff, D.: Seeing the Light: Artificial Evolution, Real Vision. In: Cliff, D., Husbands, P., Meyer, J.-A., Wilson, S.W. (eds.) From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behaviour, pp. 392–401. MIT Press, Cambridge (1994)
[22] Hutt, B., Warwick, K.: Synapsing Variable-Length Crossover: Meaningful Crossover for Variable-Length Genomes. IEEE Transactions on Evolutionary Computation 11(1), 118–131 (2007)
[23] Rocha, M., Cortez, P., Neves, J.: Evolutionary Neural Network Learning. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 24–28. Springer, Heidelberg (2003)
[24] Schaffer, J.D., Whitley, D., Eshelman, L.J.: Combinations of genetic algorithms and neural networks: A survey of the state of the art. In: Whitley, D., Schaffer, J. (eds.) Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN 1992), pp. 1–37. IEEE Press, Piscataway (1992)
[25] Stanley, K.O., Miikkulainen, R.: Evolving Neural Networks Through Augmenting Topologies. Evolutionary Computation 10(2), 99–127 (2002)
[26] Stanley, K.O., Miikkulainen, R.: Competitive Coevolution through Evolutionary Complexification. Journal of Artificial Intelligence Research 21, 63–100 (2004)
[27] Basheer, I.A., Hajmeer, M.: Artificial neural networks: fundamentals, computing, design, and application. Journal of Microbiological Methods 43(1) (2000)
[28] Belue, L.M., Bauer Jr., K.W.: Determining input features for multilayer perceptrons. Neurocomputing 7, 111–121 (1995)
[29] Basak, J., Mitra, S.: Feature selection using radial basis function networks. Neural Computing & Applications 8, 297–302 (1999)
[30] Whiteson, S., Stone, P., Stanley, K.O., Miikkulainen, R., Kohl, N.: Automatic feature selection in neuroevolution. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, Washington DC, USA, June 25-29 (2005)
[31] Tan, M., Hartley, M., Bister, M., Deklerck, R.: Automated feature selection in neuroevolution. Evolutionary Intelligence 1(4), 271–292 (2009)
[32] Rocha, M., Cortez, P., Neves, J.: Evolution of neural networks for classification and regression. Neurocomputing 70(16-18), 2809–2816 (2007)
[33] Howard, D., Bull, L., Lanzi, P.-L.: Self-Adaptive Constructivism in Neural XCS and XCSF. In: Keijzer, M., et al. (eds.) GECCO 2008: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, New York (2008)
[34] Lanzi, P.L.: An Analysis of Generalization in the XCS Classifier System. Evolutionary Computation 7(2), 125–149 (1999)
[35] Boyan, J.A., Moore, A.W.: Generalization in reinforcement learning: Safely approximating the value function. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Processing Systems 7, pp. 369–376. MIT Press, Cambridge (1995)
[36] Schlessinger, E., Bentley, P.J., Lotto, R.B.: Analysing the Evolvability of Neural Network Agents through Structural Mutations. In: Capcarrère, M.S., Freitas, A.A., Bentley, P.J., Johnson, C.G., Timmis, J. (eds.) ECAL 2005. LNCS (LNAI), vol. 3630, pp. 312–321. Springer, Heidelberg (2005)
Building Accurate Strategies in Non Markovian
Environments without Memory
1 Introduction
Classifier systems based on genetic algorithms are rule-based systems whose diagnostic ability is known to be useful for parameter optimisation problems [10]. Nevertheless, this kind of classifier system needs to perform a learning step before being used in a production and/or diagnostic context. Most often, this learning stage is performed on a sample of data representing the available and validated/expert-assessed data of the considered environment. Tendencies contained in this data set are assimilated by the classifier system using reinforcement learning. To this end, the system is continuously exposed to signals created using the learning sample. At this point, the action performed by the classifier in reaction to the incoming signal is rewarded according to the fitness function. This function is defined depending on the learning problem considered: its aim is to maintain accurate classifiers within the population by preventing them from being deleted or lost through genetic pressure.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 107–126, 2010.
© Springer-Verlag Berlin Heidelberg 2010
108 . Gilles and P. Mathias
surrounding its position. The animat can move only to an empty cell among these neighbouring cells, moving step by step through the maze in order to fulfil its goal.
A cognitive system studied in this kind of environment has to pilot an animat through the maze in order to reach the food. The problem given to this cognitive system is to attempt to adopt a policy of moves inside this environment. This strategy must allow the animat to complete its goal within an accurate and finite number of steps.
Maze environments offer plenty of parameters that allow one to evaluate the complexity of a given maze and the efficiency of a given learning method. Aliasing
positions are described in [2] as positions with identical perceptions for the animat. According to this study, there exist three types of aliasing positions. Type
I aliasing positions are located at different distances from the food but require the same actions to get closer to it (Action_{x,y} = Action_{x',y'} ∧ D(x,y) ≠ D(x',y')). Type II aliasing positions are located at different distances from the
food and require different actions to get closer to this objective (Action_{x,y} ≠ Action_{x',y'} ∧ D(x,y) ≠ D(x',y')). At last, type III aliasing squares, which also
require different actions to reach the goal, are located at the same distance from
it (Action_{x,y} ≠ Action_{x',y'} ∧ D(x,y) = D(x',y')).
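The three cases above can be condensed into a small helper (a hypothetical sketch, not code from the study; the action and distance encodings are our assumptions):

```python
def aliasing_type(action_a, action_b, dist_a, dist_b):
    """Classify a pair of perceptually identical maze positions,
    following the typology of [2]:
      Type I:   same optimal action, different distances to the food.
      Type II:  different optimal actions, different distances.
      Type III: different optimal actions, same distance.
    Returns None when the pair poses no aliasing problem."""
    if action_a == action_b and dist_a != dist_b:
        return "I"
    if action_a != action_b and dist_a != dist_b:
        return "II"
    if action_a != action_b and dist_a == dist_b:
        return "III"
    return None  # same action and same distance
```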
The following chart (fig. 1), extracted from the same study, presents most
of the mazes available in the literature. It was built considering both the
type of the aliasing squares contained in each maze and the ratio of the average
number of steps Q_m done by a Q-Learning algorithm to reach the food to the
average distance to the food measured for this maze.
We have chosen to focus our experimental part on the study of two mazes:
the Woods101 and the Maze E2. The Woods101 maze is easy to represent
(fig. 2) and offers a high complexity (Q_m = 402.3 for an average distance to the food of 2.7, i.e. a ratio of 149 [18]).
On the other hand, the E2 maze, which is also easy to represent, has a higher
complexity than the Woods101 (Q_m = 710.23 for an average distance of 2.33, i.e. a ratio of 304.81 [18]).
Moreover, this maze also presents both type II and type III aliasing squares.
In order to determine the quality of the results obtained on a given maze, there
are two situations that we need to test and establish clearly: the performances
observed when using a random method, and those that would be obtained considering the optimal choices. We refer to as optimal the best choices that could
be made by a cognitive system which is not able to distinguish two perceptually
similar situations.
We have chosen to perform the same measures on the random walk and on
the optimal choice case as on the performances of the classifier systems. Concerning the random walk, the animat is randomly placed in the maze and at each
time step it chooses a random direction among the eight directions available.
We measure the number of steps done by the animat and the final distance of the
animat to the food (these choices are clearly established in Section 4.2, please
refer to it for more details).
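A random-walk baseline of this kind can be sketched as follows (an illustrative sketch, not the authors' code; the grid encoding with '.', '#' and 'F', and the treatment of blocked moves as wasted steps, are assumptions):

```python
import random

# The 8 directions, clockwise from North, as (row, col) offsets.
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def random_walk_trial(maze, start, max_steps=50, rng=random):
    """One baseline trial: at each time step pick one of the 8 directions
    uniformly at random; the animat moves only onto non-obstacle cells.
    Stops on the food or after max_steps; returns (steps, final_position)."""
    r, c = start
    for step in range(max_steps):
        if maze[r][c] == 'F':          # food reached
            return step, (r, c)
        dr, dc = rng.choice(DIRECTIONS)
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]) and maze[nr][nc] != '#':
            r, c = nr, nc              # blocked or out-of-grid moves are ignored
    return max_steps, (r, c)
```

The final distance to the food can then be measured from the returned position.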
To calculate the optimal number of steps that should be done by the animat
to fulfil its goal, we had to consider the fact that the mazes we have chosen
contain aliasing squares. As the learning classifier systems we study do not use
memory mechanisms, we must take into account that neither the XCS nor
the APCS is able to tell the difference between two squares of a given aliasing
situation. As a consequence, we need to re-evaluate the optimal number of steps
for each maze considering the optimal policy that these classifier systems would
be able to adopt to solve the problem.
Concerning the Woods101 environment, each aliasing square (dashed in fig.
3a) imposes on the system the need to maintain 2 different moves in order to reach the
food. If the animat is placed on the left-side aliasing position, it must go south-east. On the contrary, if the animat is placed on the right-side aliasing position,
it must go south-west to reach the food. As a consequence, the optimal number
of steps done when starting from those positions should be set to 3 instead of 2
(fig. 3b), which in turn increases the number of steps to reach the food
for squares situated behind the aliasing positions. These modifications raise the
average number of steps for this maze to 3.5 instead of 2.7 [2] considering this
new optimal policy.
The impact of the aliasing positions is harder to evaluate when considering
the maze E2. Sixteen aliasing squares are at the same distance to the food (2
squares) and offer the same perception (8 empty squares). As a consequence,
when we consider each one of these squares, we notice that at least 2 opposite
available diagonal directions can be kept in order to reach the food (see fig. 3c).
Consequently, in order to establish a metric for this maze, we consider the
following statement to evaluate the optimal number of steps for those squares:
we exclude action chains that allow moves from an aliasing position to another
of the same type. These modifications raise the average number of steps when
using an optimal policy on this maze to 3.16 instead of 2.33.
3 Principles of Studied CS
3.1 The Adapted Pittsburgh Classifier System
The Adapted Pittsburgh Classifier System (APCS) is derived from the original
work of Smith on LS-1 [16]. Like Michigan classifier systems (ZCS, XCS) [15],
Pittsburgh classifier systems rely on two basic elements: the classifier, which is
the container of the knowledge acquired by the system, and the genetic algorithm,
which allows this knowledge to evolve. Nevertheless, instead of considering a single
global collection of classifiers, Pittsburgh classifier systems co-evolve multiple
small collections of classifiers in parallel.
In the next sub-sections, we detail how the structure, evaluation and
evolution mechanisms of the APCS differ from the original work done by Smith on
LS-1 and from other existing Pittsburgh approaches.
In the introduction of this study, we announced that the covering mechanism
is newly implemented in the APCS. Directly inspired by Wilson's work on
XCS [17], this mechanism consists here in replacing the sensor part of a classifier
when no classifier matches a given signal from the environment. To enhance this
mechanism, we add a parameter to each classifier, called the covering time Ct, in
order to measure the number of generations during which a classifier has not been activated
before it should be covered. This gives every classifier a chance
of being useful to its cognitive pool, i.e. the individual, before being covered.
The covering mechanism replaces the condition part of a classifier with the message
from the environment, adding wildcards depending on the wildcard probability
P# (see Section 3.1). Nevertheless, the action part of the covered classifier is kept
as it was before covering.
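Under these assumptions, the covering operator can be sketched as follows (illustrative only; the classifier representation as a dict and the field names are ours):

```python
import random

WILDCARD = '#'

def cover(classifier, message, p_wildcard, rng=random):
    """Rebuild the condition part from the current environment message,
    turning each bit into a wildcard with probability p_wildcard (P#).
    The action part is deliberately left untouched."""
    classifier['condition'] = ''.join(
        WILDCARD if rng.random() < p_wildcard else bit for bit in message)
    classifier['unmatched_generations'] = 0   # reset the covering clock
    return classifier

def maybe_cover(classifier, message, p_wildcard, covering_time, rng=random):
    """Cover a classifier only once it has stayed inactive for at least
    `covering_time` generations (the Ct parameter)."""
    if classifier['unmatched_generations'] >= covering_time:
        return cover(classifier, message, p_wildcard, rng)
    return classifier
```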
Now that we have presented the main features of our system, the APCS, we will
focus on the other learning classifier system used in this study: the eXtended
Classifier System (XCS).
The eXtended Classifier System is a classifier system issued from the Michigan approach,
first built by Wilson in 1995 [17]. It started to mature around
2000 thanks to Butz, Lanzi and Kovacs, who carried out rigorous performance and
accuracy studies of this system, allowing it to evolve through new mechanisms:
improvements on the main algorithm and the addition of memory (XCSM1, XCSM2 [12]).
As a consequence, this system shows quite good results, even on maze-type
environments with aliasing squares (see [13]).
4 Experimental Results
4.1 Experimental Settings
For each experiment, we submitted 20,000 consecutive problems to the system:
for each problem, the animat is randomly put on a free square of the maze, and
the trial stops when one of these two conditions is fulfilled:
1. the position of the animat in the maze is equal to the position of the food
2. the number of steps done by the animat surpasses a certain threshold
(MaxSteps, equal to 50 steps for every presented result)
When the problem is solved, we record the starting distance of the animat to the
food, its final distance to the food and its total number of steps. Both APCS
and XCS results shown in this paper are averaged over 10 experiments.
As done in [5] and in [13], the signal received by the system consists of a 16-bit string that represents each of the 8 squares surrounding the animat. Those
squares are encoded clockwise, starting from North: (00) stands for an empty cell,
(11) for food and (10) for an obstacle. As a consequence, the sensor part of the
classifier also contains 16 positions. Each position in the sensor can be randomly
occupied by 0, 1 or a wildcard (#). The effector part, coded by a string
of 3 bits, stands for one of the eight directions available to the animat, coded
clockwise, like the sensor part.
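This encoding can be sketched as follows (a hypothetical illustration; the cell characters '.', 'F', '#' and reading off-grid cells as obstacles are our assumptions):

```python
# Clockwise neighbour offsets, starting at North: N, NE, E, SE, S, SW, W, NW.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
CELL_CODES = {'.': '00', 'F': '11', '#': '10'}  # empty / food / obstacle

def encode_perception(maze, pos):
    """Build the 16-bit sensor string for the animat at `pos`:
    two bits per surrounding square, clockwise, starting at North."""
    r, c = pos
    bits = []
    for dr, dc in OFFSETS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]):
            bits.append(CELL_CODES[maze[nr][nc]])
        else:
            bits.append(CELL_CODES['#'])  # outside the grid reads as an obstacle
    return ''.join(bits)
```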
Specific settings used for XCS are the same as those used by Lanzi in 1999
[13]; please refer to this experiment for more details.
Concerning the APCS, each evaluation group, i.e. individual, controls an animat. As a consequence, during the experiment, each group is submitted to the
20,000 problems and solves them asynchronously. The experiment stops when all
evaluation groups have solved at least 20,000 problems. The number of moves is
measured during each of the K trials (see Section 3.1).
During each trial, the individual moves the animat of the group; if it reaches the food, it receives a reward of 1.0/K and
the animat of this group is randomly placed in the maze. Otherwise, the individual
receives a negative reward of 0.5/K and the animat of the group is not moved.
During the K trials, for each individual, the random selection method is used to
select a classifier from [M] (see Section 3.1).
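Assuming the per-trial rewards are 1.0/K and -0.5/K (the exact normalisation is not fully legible in the source), the evaluation of one individual can be sketched as:

```python
def evaluate_individual(reached_food_flags, k_trials):
    """Accumulate the fitness of one APCS individual over its K trials:
    +1.0/K when its animat reached the food during a trial, -0.5/K otherwise.
    The reward scheme is an assumption based on the text."""
    fitness = 0.0
    for reached in reached_food_flags:
        fitness += 1.0 / k_trials if reached else -0.5 / k_trials
    return fitness
```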
Concerning the GA step, the mutation mechanism reinforces the
exploratory ability of the system by creating new classifiers. These classifiers
are created by modifying existing classifiers locally and randomly according
to a certain rate, expressed by the chosen mutation probability P_Mut.
However, two consecutive positions of the considered mazes rarely differ from one
another by more than 4 bits, and the non-sense sequence (01) can, in the
present case, invalidate the activation of a classifier. As a consequence, the higher
the number of mutated bits, the more the system may lose classifiers that could
have allowed it to find the food. Experiments performed in [8] have
also confirmed that a high mutation rate (P_Mut > 0.2) prevents the system from
keeping optimal classifiers.
The other important parameter of the GA step, the cross-over rate, has a great
influence on the homogeneity and stabilisation of the system: combined with
elitism, it allows the system to preserve the best behaviours expressed inside the
population. Énée measured in [8] that a low cross-over rate (P_Cross < 0.6)
slows the convergence of the system by preventing good genetic precursors from being
replicated inside the population.
Elitism was set to 60% in order to keep the best parents and to stabilise the
population more rapidly.
As shown in [9,8], stable results are obtained using a mutation probability
P_Mut set to 0.005 and a cross-over probability P_Cross set to 0.75. As a consequence, we have chosen these parameter values to perform our experiments.
Table 3. Measure of the influence of the variation of the number of individuals on the
average number of steps (Nc = 30, NI = 20, ..., 50)
Table 4. Measure of the influence of the variation of the number of classifiers on the
average number of steps (NI = 30, Nc = 20, ..., 50)
When the number of individuals changes, it also modifies the number of evaluation groups (see Section 3.1). As, in this experiment, each group controls the
moves of an animat, the classifier system is able to learn on NI different situations for each trial. As a consequence, raising the number of individuals
also raises the exploratory ability of the system, which accelerates and improves
its convergence.
When considering the evolution of the average final distance to the food
(Table 3, column Avg. final dist.), we can conclude that raising the number
of individuals contributes to the system's stabilisation by diffusing a common
accurate strategy more efficiently.
Let us now consider the influence of the variation of the number of classifiers
contained in an individual on the performance of the system on this problem.
The following experiments (Table 4) have been conducted with a fixed number
of individuals (NI = 30) and various numbers of classifiers (Nc between 20 and
50).
Each classifier determines the answer of the individual to one or several given
signals coming from the environment. As a consequence, the number of classifiers
contained in an individual can influence the number of signals which may
trigger an answer from a given individual.
As shown by Smith [16], if the potential information contained in an individual is too high, the excess information generates noise that disturbs
the answer of the system and the evolution of fitter classifiers. In addition to this phenomenon, as non-Markovian situations may induce a rise in the
average wildcard rate of the classifiers [2], each additional classifier may bring
more unnecessary information. This side effect emphasises the fact that there exists
a potential information threshold on the information contained in an individual.
When considering the obtained measures, we can conclude that this threshold
depends on the considered problem.
In conclusion, although initially providing the individuals of the APCS with
additional cognitive capacity can improve the quality of the answer of the system,
additional classifiers may contain useless precursors that disturb the convergence
of the system. As said earlier, this issue is successfully addressed in studies of
other Pittsburgh approaches as the bloat effect [1].
Table 5. Measure of the influence of the variation of the parameter Ct on the obtained
results
As shown by the results of the experiments presented in this paper, the APCS manages to evolve classifiers allowing it to adopt a stable moving policy whose quality
is greatly improved by the recently added covering mechanism.
We will now focus on the comparative study of the best results we obtained
with the XCS (figs. 7 and 8) against the best results obtained with the APCS.
Parameters used for the XCS during those experiments are those used in
the experiment conducted by Lanzi in 1999 on this type of environment [13],
except for the number of individuals, which is 2000 for the Woods101
and 8000 for the Maze E2.
We will centre our discussion on the policies built by the classifiers. When
we consider the differences between the best results obtained by the XCS and
the best results obtained by the APCS, we notice two differences. The first,
significant on the Maze E2 but not on the Woods101, is that the average number
of steps done by the APCS is closer to the optimum than the average number of
steps done by the XCS. This observation finds its foundation in the second
difference noticed: the average final distance to the food for the APCS on the
two mazes is lower than the one measured for the XCS. This difference means
that the APCS accurately finds the food more often than the XCS, which implies
that the policy built by the APCS is more stable and accurate.
However, the learning strategies employed by the two systems are quite different from one another: the XCS learning stage focuses on the value function, and
the policy deployed by this system is strongly dependent on this value function.
In addition, in order to keep its predictions accurate, this
mechanism requires maintaining most of the actions available for a given signal.
XCS was built to find Markov chains to solve a problem, so it is not supposed
to maintain, at the same level of prediction, classifiers with identical condition
parts and different actions. Reward in XCS is given to each classifier that is part of the
Markov chain to the solution. This is what we pointed out while describing both
classifier systems. The APCS maintains knowledge structures with several classifiers. As the reward concerns the whole knowledge structure, it is possible to have two or more
classifiers with the same condition part but with different actions upon the environment. Due to those facts, even if the XCS succeeds in keeping an almost stable
policy for the Woods101 environment, it fails when facing an environment with
numerous aliasing situations.
(a) XCS (8000 individuals); (b) APCS (NI = 40, Nc = 50, Ct = 11)
On the other hand, the learning mechanism used by the APCS relies on its cognitive capacity (Nc), which is highly problem dependent but allows it to develop
strategies regarding its past actions. The most accurate classifiers will tend to
stay in the population because they allow their owner to reach the food
more accurately and more often than the other individuals of the APCS. As
a consequence, in a multistep environment, the classifiers contained in those
strong individuals allow them to build action chains which are reward dependent. Moreover, instead of solving one problem at a time, the APCS tries to solve
NI problems at the same time (see Section 3.1.2), which allows it to explore a
higher number of situations in a short simulation time. Due to all those facts,
the system tends to find a sub-optimal solution that allows it to be rewarded
accurately and more often.
We observe the formation of policies (see fig. 9 for the policy obtained with the APCS
in maze E2) that make it possible to reach the food from every position, even
when the environment contains aliasing squares. This policy evolves with the
frequency of the reward encountered by the individuals, which allows them to
adapt and modify (via the genetic algorithm) their classifiers.
6 Conclusions
Throughout this paper, we have shown and studied results which indicate that,
without any knowledge of its environment, even when facing non-Markovian
positions, the Adapted Pittsburgh Classifier System, improved with the covering
mechanism and with fitted parameters, is able to adopt an almost stable
policy in maze environments containing aliasing squares. This policy allows it to
reach the food accurately with a low but not optimal number of steps.
When studying the number of classifiers contained in an individual, we have
shown that raising this local cognitive capacity can benefit the system if it
remains under a problem-dependent threshold (see also [9]). Cognitive capacity
provided over this threshold led to the conservation of defective precursors and
a strong disturbance of the system's answer due to them. Fortunately, as shown in
the experiments, those precursors are assimilated or eliminated by the system in
a number of trials depending on the amount of useless information they carry.
We have also shown that the covering mechanism we propose has a noticeable influence on the performance of the system: classifiers eliminated by this
mechanism carry information whose usability decreases with
the number of evaluation steps during which those classifiers are not triggered
by a signal from the environment.
Some interesting further work remains to be finalised, especially concerning
the precise construction of the policy evolved by the APCS along the experiment.
References
1. Bacardit, J., Garrell-Guiu, J.M.: Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach learning classifier system. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 59–79. Springer, Heidelberg (2007)
2. Bagnall, A.J., Zatuchna, Z.: On the classification of maze problems. In: Bull, L., Kovacs, T. (eds.) Applications of Learning Classifier Systems. Studies in Fuzziness and Soft Computing, vol. 183, pp. 307–316. Springer, Heidelberg (2005)
3. Bernadó-Mansilla, E., Llorà, X., Garrell-Guiu, J.M.: XCS and GALE: a comparative study of two learning classifier systems on data mining. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132. Springer, Heidelberg (2002)
4. Bull, L.: Lookahead and latent learning in ZCS. In: GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, New York, July 9-13, pp. 897–904. Morgan Kaufmann Publishers, San Francisco (2002)
5. Butz, M., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 253–272. Springer, Heidelberg (2001)
6. Butz, M.V.: Documentation of XCS+TS c-code 1.2. IlliGAL Report 2003023, Illinois Genetic Algorithms Laboratory (October 2003)
7. De Jong, K.A., Spears, W.M., Gordon, D.F.: Using genetic algorithms for concept learning. Machine Learning 13(3), 161–188 (1993)
8. Énée, G.: Systèmes de Classeurs et Communication dans les Systèmes Multi-Agents. PhD thesis, École Doctorale STIC, Université de Nice Sophia-Antipolis (January 2003)
9. Énée, G., Barbaroux, P.: Adapted Pittsburgh-style classifier system: case study. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 2661, pp. 30–45. Springer, Heidelberg (2003)
10. Holmes, J.H., Lanzi, P.L., Stolzmann, W., Wilson, S.W.: Learning classifier systems: new models, successful applications. Inf. Process. Lett. 82(1), 23–30 (2002)
11. Lanzi, P.L.: Adding memory to XCS. In: Proceedings of the IEEE Conference on Evolutionary Computation (ICEC 1998). IEEE Press, Los Alamitos (1998), http://ftp.elet.polimi.it/people/lanzi/icec98.ps.gz
12. Lanzi, P.L.: An analysis of the memory mechanism of XCSM. In: Proceedings of the Third Genetic Programming Conference, pp. 643–651. Morgan Kaufmann, San Francisco (1998), http://ftp.elet.polimi.it/people/lanzi/gp98.ps.gz
13. Lanzi, P.L., Wilson, S.W.: Optimal classifier system performance in non-Markovian environments. Technical Report 99.36, Illinois Genetic Algorithms Laboratory, Milan, Italy (1999)
14. Sigaud, O.: Les systèmes de classeurs: un état de l'art. Revue d'Intelligence Artificielle, RSTI série RIA, Lavoisier, vol. 21 (February 2007)
15. Sigaud, O., Wilson, S.W.: Learning classifier systems: a survey. Soft Comput. 11(11), 1065–1078 (2007)
16. Smith, S.F.: A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, University of Pittsburgh (1980)
17. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 148–175 (1995)
18. Zatuchna, Z.V.: AgentP: A Learning Classifier System with Associative Perception in Maze Environments. PhD thesis, School of Computing Sciences, UEA (2005)
Classification Potential vs. Classification Accuracy:
A Comprehensive Study of Evolutionary Algorithms
with Biomedical Datasets
Abstract. Biomedical datasets pose a unique challenge for machine learning and
data mining techniques to extract accurate, comprehensible and hidden knowl-
edge from them. In this paper, we investigate the role of a biomedical dataset
on the classification accuracy of an algorithm. To this end, we quantify the com-
plexity of a biomedical dataset in terms of its missing values, imbalance ratio,
noise and information gain. We have performed our experiments using six well-
known evolutionary rule learning algorithms XCS, UCS, GAssist, cAnt-Miner,
SLAVE and Ishibuchi on 31 publicly available biomedical datasets. The results
of our experiments and statistical analysis show that GAssist gives better classification results on the majority of biomedical datasets among the compared schemes,
but cannot be categorized as the best classifier. Moreover, our analysis reveals
that the nature of a biomedical dataset, not the selection of the evolutionary algorithm, plays a major role in determining the classification accuracy of a dataset.
We further show that noise is a dominating factor in determining the complex-
ity of a dataset and it is inversely proportional to the classification accuracy of
all evaluated algorithms. Towards the end, we provide researchers with a meta-
classification model that can be used to determine the classification potential of a
dataset on the basis of its complexity measures.
1 Introduction
Recent advancements in the fields of bioinformatics and computational biology are increasing the complexity of the underlying biomedical datasets. The use of sophisticated
equipment like mass spectrometers and magnetic resonance imaging (MRI) scanners
generates large amounts of data that pose a number of issues regarding electronic storage and efficient processing. One of the major challenges in this context is to automatically extract accurate, comprehensible, and hidden knowledge from large amounts of
raw data. The discovered knowledge can then help medical experts in the classification of
anomalies in these datasets.
Well-known data mining techniques for knowledge extraction and classification in-
clude probabilistic methods, neural networks, support vector machines, decision trees,
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 127–144, 2010.
© Springer-Verlag Berlin Heidelberg 2010
128 A.K. Tanwani and M. Farooq
instance-based learners, rough sets and evolutionary algorithms. Evolutionary algorithms, inspired by the evolution process in biological species, show a number
of desirable properties like self-adaptation, robustness and collective learning, which
make them suitable for challenging real-world problems. The Evolutionary Computation (EC) paradigm has been successfully used in several data mining techniques including, but not limited to, genetics-based machine learning systems (GBML), learning
classifier systems (LCS), ant colony inspired classifiers, and hybrid variants of evolutionary fuzzy systems and neural networks. Evolutionary classifiers are becoming
popular for data mining of medical datasets because of their ability to find hidden patterns in electronic records that are not otherwise obvious even to physicians [1].
However, it is not easy for a researcher working on the classification of biomedical datasets to choose a suitable classifier. Consequently, the common methodology
adopted by researchers is to empirically evaluate their dataset with a few well-known
machine learning techniques and select the one that gives better results. As a result,
no attempt is made to systematically investigate the factors that define the accuracy of
a classifier. An important contribution of this paper is to show that the accuracy of a classifier
depends on the complexity of a dataset. We define the complexity of a dataset in terms
of missing values, imbalance ratio, noise and information gain. Moreover, we evaluate the performance of six well-known evolutionary rule learning classifiers XCS,
UCS, GAssist, cAnt-Miner, SLAVE and Ishibuchi on 31 publicly available biomedical
datasets. The results of our experiments provide two valuable insights: (1) classification
accuracy strongly depends on the complexity of a biomedical dataset, and (2) the noise of a
dataset predominantly defines its complexity. To conclude, we propose that researchers
should first evaluate the complexity of their medical dataset and then use our proposed
meta-model to determine its classification potential.
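Two of these complexity measures, the imbalance ratio and the information gain, can be computed with a short sketch (our illustration, not the paper's code; the exact definitions used in the study may differ):

```python
import math
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature_values, labels):
    """Class entropy minus the entropy remaining after splitting
    on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```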
The remainder of the paper is organized as follows: we introduce the evolutionary algorithms used in our study in Section 3. In Section 4, we quantify the complexity of the
biomedical datasets. We report the results of our experiments, followed by
statistical analysis and discussion, in Section 5. Finally, we conclude the paper with an
outlook on our future work.
2 Related Work
We now present a brief overview of different studies that analyze the performance of
evolutionary algorithms on various biomedical domains. In [2], Wong et al. applied evolutionary algorithms to discover knowledge in the form of rules and causal structures
from fracture and scoliosis databases. Their results suggest that evolutionary algorithms
are useful in finding interesting patterns. John Holmes in [3] presented his stimulus-response learning classifier system, EpiCS, to enhance classification accuracy in an
imbalanced class dataset; he, however, used an artificially created liver cancer dataset.
Bernadó-Mansilla in [4] characterized the complexity of the classification problem by a
set of geometrical descriptors and analyzed the competence of XCS in this domain. The
authors in [5] compared XCS with a Bayesian network, SMO and C4.5 for mining breast
cancer data and showed that XCS provides significantly higher accuracy, followed by
C4.5. However, its rules are considered more comprehensible and descriptive by the
domain experts. The work in [6] evaluates two competitive learning classifier systems,
XCS and UCS, for extracting knowledge from imbalanced data using both fabricated
and real-world problems. The results of their study prove the robustness of these algorithms compared with IBk, C4.5 and SMO. In [7], the authors compared the Pittsburgh
and Michigan style classifiers using XCS and GAssist on 13 publicly available datasets
to reveal important differences between the two systems. The comparative study performed in [8] between evolutionary algorithms (XCS and GALE) and non-evolutionary
algorithms (instance-based learners, decision trees, rule learning, statistical models and support
vector machines) on several datasets suggests that evolutionary algorithms are more suitable
for data mining and classification. The results of the experiments carried out in [9] show
better classification accuracy for the well-known ant colony inspired classifier, Ant-Miner, compared
with C4.5 on 4 biomedical datasets. The authors in [10] have analyzed several strategies of evolutionary fuzzy models for data mining and knowledge discovery. In our
earlier work [11], we provide several guidelines to select a suitable machine learning
scheme for the classification of biomedical datasets; however, that work is limited to non-evolutionary algorithms.
A common theme observed in various studies is that they are inclined towards particular classifier(s) instead of the biomedical dataset(s). In contrast, our study uses a
novel methodology to quantify the complexity of a dataset, which, we show, defines the
accuracy of a classifier. Moreover, we also build a meta-model of our findings that can
be used to determine the classification potential of a biomedical dataset.
3 Evolutionary Algorithms
We have selected a diverse set of well-known evolutionary rule learning algorithms
for our empirical study. The selected algorithms are: (1) reinforcement learning based
Michigan style XCS [12], (2) supervised learning based Michigan style UCS [13], (3)
Pittsburgh style GAssist [14], (4) Ant Colony Optimization (ACO) inspired cAnt-Miner
[15], (5) genetic fuzzy iterative learner SLAVE [16], and (6) genetic fuzzy classifier
Ishibuchi [17]. In all our experiments, the parameters are selected to achieve the best
operating point on the ROC (Receiver Operating Characteristic) curve [18].
3.1 XCS
XCS is a reinforcement learning based Michigan-style classifier that evolves a set of
rules as a population of classifiers (P). Each rule consists of a condition, an action and
three performance parameters: (1) payoff prediction (p), (2) prediction error (ε), and
(3) fitness (F). The first step in classification is to build a match set (M) that consists
of the rules whose conditions are satisfied. The payoff prediction of each rule is computed
and its corresponding action set (A) is created. Online learning is made possible
with a reward (r), returned by the environment, that is subsequently used to tune the
performance parameters of the rules in the action set. The updated fitness is inversely
proportional to the prediction error. Finally, a genetic algorithm (GA), with crossover and
mutation probabilities χ and μ respectively, is applied to the rules in the action set, and
consequently new rules are added to the population. Some rules are also deleted from
the population depending on their experience.
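The decision part of this cycle can be condensed into a sketch (a simplified illustration of the standard XCS prediction array, not the exact implementation used in the study):

```python
def matches(condition, state):
    """A ternary condition matches when every non-wildcard bit equals the state bit."""
    return all(c in ('#', s) for c, s in zip(condition, state))

def xcs_step(population, state):
    """One simplified XCS decision step: build the match set [M], form the
    fitness-weighted payoff prediction for each action, pick the best action
    and return it together with its action set [A]."""
    match_set = [cl for cl in population if matches(cl['condition'], state)]
    predictions = {}
    for cl in match_set:
        num, den = predictions.get(cl['action'], (0.0, 0.0))
        predictions[cl['action']] = (num + cl['p'] * cl['F'], den + cl['F'])
    best = max(predictions, key=lambda a: predictions[a][0] / predictions[a][1])
    return best, [cl for cl in match_set if cl['action'] == best]
```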
3.2 UCS
UCS is an accuracy based Michigan-style classifier which is in principle quite similar
to XCS. However, it uses a supervised learning scheme to compute fitness instead of
reinforcement learning employed by XCS. UCS like XCS also evolves a population
of rules (P ). Each rule has two parameters: (1) accuracy (acc), and (2) fitness (F ).
During the training phase, for every instance the set of rules whose conditions are satisfied
becomes part of its match set (M). The rules that perform a correct classification become
part of the correct set (C), and the others become part of the incorrect set (!C). Finally,
the genetic algorithm GA is applied to the correct set to update its population. Every
instance during testing is classified through weighted voting, on the basis of fitness, to
select the action.
We have used the following parameter settings: N = 6400, number of iterations = 100,000
and acc0 = 0.99. The other tuning parameters of the GA are kept the same as in XCS.
3.3 GAssist
GAssist (Genetic Algorithms based claSSIfier sySTem), in contrast to XCS and UCS,
is a Pittsburgh-style learning classifier in which the rules are assembled in the form
of a decision list. GAssist-ADI uses Adaptive Discretization Intervals (ADI) rule rep-
resentation. In such systems, the continuous space is discretized into fixed intervals
for developing rules. Generalization is introduced by deleting and selecting rule sets as
a function of their accuracy and length. The crossover between two rules takes place
across attribute boundaries rather than within attribute intervals.
The GAssist parameter setting is as follows: crossover probability = 0.6, number of
iterations = 500, minimum number of rules for rule deletion = 12, and a set of uniform
discretizers with 4, 5, 6, 7, 8, 10, 15, 20 and 25 bins.
3.4 cAnt-Miner
Ant Miner, inspired by the behavior of real ant colonies, uses Ant Colony Optimization
(ACO) to construct classification rules from the training data. The rule discovery process
consists of three steps: rule generation, rule pruning and rule updating. In the rule
generation step, an ant starts with an empty rule and adds one term at a time based
on the probability of that attribute-value pair. It continues to add terms to the rule,
without duplicating attributes, until all the attributes are exhausted or a new term would
make the rule more specific than a user-specified threshold allows. In the rule pruning
step, the terms that degrade the accuracy of the rule are removed from it one by one.
While updating rules, the pheromone values of terms are increased or decreased on the basis of
their usage in the rule discovery process. cAnt-Miner is a variant of Ant Miner for real
valued attributes.
The parameters of cAnt-Miner are: the number of ants = 3000, minimum cases per
rule = 5, maximum number of uncovered cases = 10 and convergence test size = 10.
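The probabilistic term selection and pheromone update described above can be sketched roughly as follows. The function names, the roulette-wheel selection and the evaporation rate `rho` are illustrative assumptions, not the exact cAnt-Miner implementation:

```python
import random

def choose_term(candidates, pheromone, heuristic, rng=random.Random(0)):
    """Roulette-wheel selection of the next (attribute, value) term: the
    probability of a term is proportional to pheromone * heuristic value."""
    weights = [pheromone[t] * heuristic[t] for t in candidates]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for t, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return t
    return candidates[-1]

def update_pheromone(pheromone, used_terms, quality, rho=0.1):
    """Evaporate all trails, then reinforce the terms used by the discovered
    rule in proportion to the rule's quality."""
    for t in pheromone:
        pheromone[t] *= (1.0 - rho)
    for t in used_terms:
        pheromone[t] += rho * quality
```
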
Classification Potential vs. Classification Accuracy 131
3.5 SLAVE
SLAVE (Structural Learning Algorithm in Vague Environment) is totally different from
the classical Michigan-style and Pittsburgh-style rule learning algorithms. In this ap-
proach, every entity in the population represents a unique rule. But during an iteration
of a genetic algorithm, only the best individual is added to the final set of rules which
is eventually used for classification. In this way, SLAVE combines its iterative learning
approach with the fuzzy models. The fitness of the rules is determined by their com-
pleteness and consistency.
In our experiments, the parameter configuration of SLAVE is: the number of labels
= 5, population size = 100, number of iterations allowed without change = 500 and
mutation probability = 0.01.
3.6 Ishibuchi
Ishibuchi et al. proposed a fuzzy rule learning method for multidimensional pattern
classification problem with continuous attributes. The classification is done with the
help of a fuzzy-rule base in which each fuzzy if-then rule is handled as an individual,
and a fitness value is assigned to each rule. The criterion for assigning a class label
is based on a simple heuristic procedure that assigns a grade of certainty to each
fuzzy if-then rule. Because it uses linguistic values with fixed membership functions
as antecedent fuzzy sets, a linguistic interpretation of each fuzzy if-then rule is easily
obtained which greatly helps in comprehending the generated solution.
The experiments are carried out with the following parameters: the number of labels =
5, population size = 100, number of evaluations = 10,000, and crossover and
mutation probabilities of 1.0 and 0.9 respectively.
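The fixed linguistic labels mentioned above are commonly realized as evenly spaced triangular membership functions. The following sketch, with five labels over a normalized [0, 1] attribute range, is an illustrative assumption rather than the exact setup of [17]:

```python
def triangular_labels(n_labels=5):
    """Return a membership function mu(label, x) for n evenly spaced
    triangular fuzzy sets over the normalized attribute range [0, 1]."""
    step = 1.0 / (n_labels - 1)
    def membership(label, x):
        center = label * step          # label 0 sits at 0.0, the last at 1.0
        return max(0.0, 1.0 - abs(x - center) / step)
    return membership
```

For any x in [0, 1] at most two adjacent labels fire and their membership degrees sum to one, which is what makes the resulting if-then rules linguistically interpretable.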
4.3 Noise
Noise is of two types: (1) attribute noise, and (2) class noise. Research has shown that
the impact of class noise on classification accuracy is significantly greater than that of
attribute noise [20]; hence, we only quantify class noise in our study. The
common sources of class noise are inconsistent and mislabeled instances. A number
of research efforts have been made to quantify the level of noise in a dataset, but its
definition still remains subjective. Brodley and Friedl characterized noise as the pro-
portion of incorrectly classified instances by a set of trained classifiers [21]. We use a
similar approach to quantify noise but utilize confusion matrices for a set of classifiers
to determine noisy instances. Noise is then quantified as the sum of all off-diagonal
entities (incorrectly classified instances) where each entity is the minimum of all the
corresponding elements in a set of confusion matrices. The defined criterion is based
upon two assumptions: (1) an inconsistent or misclassified instance is likely to confuse
every classifier, and (2) the bias of an algorithm towards particular class instances can
be factored out by using a set of classifiers. The advantage of our approach is that we
separately identify misclassified instances of every class and only categorize those as
noisy which are misclassified by all the classifiers.
The confusion matrix of the nth classifier in a set of n classifiers can in general be
represented as:

\[
C_n = \begin{pmatrix}
i^{n}_{11} & i^{n}_{12} & \cdots & i^{n}_{1j} \\
i^{n}_{21} & i^{n}_{22} & \cdots & i^{n}_{2j} \\
\vdots & \vdots & \ddots & \vdots \\
i^{n}_{i1} & i^{n}_{i2} & \cdots & i^{n}_{ij}
\end{pmatrix}
\]
where the diagonal elements in Cn represent the correctly classified instances and off-
diagonal elements are the incorrectly classified instances. The percentage of class noise
in a dataset of In instances can be computed as below:
\[
\text{Noise} = \frac{1}{I_n} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c} \min\bigl(C_1(i,j), C_2(i,j), \ldots, C_n(i,j)\bigr) \times 100 \qquad (2)
\]

where i ≠ j and min(C1(i,j), C2(i,j), ..., Cn(i,j)) is the entry for corresponding i
and j that represents the minimum number of class instances misclassified by all the classi-
fiers. We have used five well-known and diverse machine learning algorithms as a set of
classifiers in our study: Naive Bayes (probabilistic), SMO (support vector machines),
J48 (decision trees), Ripper (inductive rule learner) and IBk (instance based learner). We
use the standard implementations of these schemes in the Waikato Environment for
Knowledge Analysis (WEKA) [22]. It is evident from Table 1 that biomedical datasets are
generally associated with high percentage of noise levels.
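The noise measure of Eq. (2) can be computed directly from a list of confusion matrices; a minimal sketch (the function name is ours):

```python
def class_noise_percentage(confusion_matrices, n_instances):
    """Class noise as in Eq. (2): for every off-diagonal cell (i, j), take the
    minimum count over all classifiers' confusion matrices (instances
    misclassified the same way by every classifier), sum, and normalize."""
    n_classes = len(confusion_matrices[0])
    noisy = 0
    for i in range(n_classes):
        for j in range(n_classes):
            if i != j:
                noisy += min(cm[i][j] for cm in confusion_matrices)
    return 100.0 * noisy / n_instances
```
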
The average and total information gain of a biomedical dataset, shown in Table 1, give
a direct measure of the quality of its attributes for classification.
We now present the results of our experiments that we have done to analyze the nature
of 31 biomedical datasets with six evolutionary algorithms. We have used the standard
ACO framework, MYRA [23], for cAnt-Miner and Knowledge Extraction based on
Evolutionary Learning (KEEL) [24] for other evolutionary classifiers to remove any
implementation bias in our study. We evaluate the classification accuracy of the evolutionary
algorithms using standard ten-fold stratified cross-validation in order to ensure
systematic and unbiased analysis. The results summarized in Table 1 show the nature
of a dataset in terms of its quantified parameters, along with the resulting classification
accuracies of all the algorithms. We now provide insights into the obtained results
using statistical procedures to analyze the effect of the evolutionary learning paradigm,
and then discuss in detail the role of the nature of a biomedical dataset on classification accuracy.
In this section, we provide the statistical analysis of the results obtained in Table 1 to
systematically quantify the performance of evolutionary algorithms. The common approach
used by many researchers in such cases is to make pairwise comparisons between
all the classifiers using statistical tests such as the paired t-test or the Wilcoxon
Table 1. The table shows: (1) summary of the datasets used, in alphabetical order: number of instances, classes, attributes (continuous, binary, nominal),
percentage of missing values in the attributes, noise, imbalance ratio, average information gain (Avg Info Gain) and total information gain (Net Info Gain); (2) classification
accuracies of the evolutionary rule-learning algorithms; bold entries in every row represent the best accuracy.
Nature of Dataset | Evolutionary Rule Learning Classifiers
Columns: Dataset | Instances | Classes | Attributes (Con, Bin, Nom) | Missing Values | Noise | Imb Ratio | Avg Info Gain | Net Info Gain | XCS | UCS | GAssist | cAnt-Miner | SLAVE | Ishibuchi | Mean | Std Dev
Ann-Thyroid 7200 3 6 15 0 0 0.11 8.37 0.037 0.78 97.08 96.99 94.67 99.15 93.29 92.61 95.63 2.52
Breast Cancer 699 2 1 0 9 0.23 2.72 1.21 0.451 4.51 96.14 96.57 94.56 93.56 94.70 94.71 95.04 1.11
Breast Cancer Diagnostic 569 2 31 0 0 0 2.11 1.14 0.303 9.39 93.67 92.44 95.43 93.15 91.56 92.09 93.06 1.38
Breast Cancer Prognostic 198 2 33 0 0 0.06 13.64 1.76 0.004 0.15 65.76 72.82 70.29 73.82 74.29 76.29 72.21 3.72
Cardiac Arrhythmia 452 16 272 7 0 0.32 11.28 1.57 0.047 13.06 - 61.31 54.86 67.92 65.49 - 62.39 5.72
Cleveland-Heart 303 5 10 3 0 0.15 17.82 1.37 0.115 1.49 58.09 52.15 57.41 57.74 48.85 54.44 54.78 3.71
Contraceptive Method 1473 3 2 3 4 0 31.98 1.04 0.041 0.36 53.43 47.32 55.54 50.92 25.46 43.58 46.04 10.95
Dermatology 366 6 1 1 32 0.06 0.82 1.05 0.442 15.02 94.84 96.99 92.64 91.00 3.83 30.60 68.32 40.53
Echocardiogram 132 2 8 2 2 4.67 6.06 1.24 0.084 1.01 88.63 84.78 96.21 83.19 92.47 93.24 89.75 5.10
E-Coli 336 8 7 0 1 0 6.55 1.25 0.678 5.42 90.51 93.73 74.74 79.17 82.72 67.89 81.46 9.68
Habermans Survival 306 3 3 0 0 0 16.67 1.57 0.023 0.07 74.23 74.20 69.96 71.53 73.18 73.20 72.72 1.67
Hepatitis 155 2 6 0 13 5.67 10.97 2.05 0.058 1.10 81.33 81.29 91.50 80.00 81.96 80.04 82.69 4.39
Horse Colic 368 2 8 4 15 19.39 11.96 1.15 0.061 1.64 84.23 81.47 93.73 83.97 67.33 63.05 78.96 11.54
Hungarian Heart 294 5 10 3 0 20.46 13.61 1.74 0.079 1.02 65.98 62.26 75.14 62.95 64.60 63.95 65.81 4.75
Hyper Thyroid 3772 5 7 21 1 2.17 0.34 28.81 0.012 0.36 97.35 97.88 98.57 98.12 97.43 97.30 97.77 0.51
Hypo-Thyroid 3163 2 7 18 0 6.74 0.54 9.99 0.024 0.60 97.16 97.85 99.43 98.96 95.51 95.23 97.36 1.74
Liver Disorders 345 2 6 0 0 0 21.88 1.05 0.011 0.06 63.26 67.27 61.18 65.48 58.54 58.27 62.33 3.67
Lung Cancer 32 3 0 0 56 0.28 9.86 1.02 0.152 8.50 30.83 44.99 41.67 45.83 - - 40.83 6.90
Lymph Nodes 148 4 3 9 6 0 10.81 1.46 0.138 2.48 79.19 81.09 78.57 77.81 69.57 72.90 76.52 4.36
Mammographic Masses 961 2 1 0 4 3.37 14.15 1.01 0.193 0.97 80.75 82.21 83.25 81.06 65.56 66.60 76.57 8.18
New Thyroid 215 3 5 0 0 0 2.79 1.78 0.602 3.01 94.93 92.60 92.19 90.24 91.23 86.15 91.22 2.94
Pima Indians Diabetes 768 2 8 0 0 0 20.18 1.20 0.064 0.52 73.71 74.76 72.15 75.00 72.67 68.62 72.81 2.34
Post Operative Patient 90 3 0 0 8 0.44 30.00 1.90 0.016 0.13 70.00 63.33 61.11 60.00 70.00 71.11 65.93 5.00
Promoters Genes Sequence 106 2 0 0 58 0 4.72 1.00 0.078 4.51 2.82 76.27 62.91 75.45 27.27 - 48.94 32.56
Protein Data 21618 3 0 0 1 0 45.48 1.19 0.065 0.07 51.41 51.21 54.52 54.46 54.52 54.52 53.44 1.65
Sick 2800 2 7 21 1 2.24 0.71 7.72 0.013 0.37 93.89 97.57 97.32 97.18 93.86 93.89 95.62 1.91
Splice-Junction Gene Sequence 3190 3 0 0 61 0 4.6 1.15 0.022 3.94 5.60 57.30 92.45 83.80 52.55 - 58.34 34.01
Statlog Heart 270 2 7 3 3 0 15.19 1.03 0.092 1.19 80.74 83.33 81.11 75.19 72.22 73.33 77.65 4.65
Switzerland Heart 123 5 10 3 0 17.07 32.52 1.14 0.023 0.30 31.67 31.67 65.83 30.19 31.79 42.37 38.92 13.91
Thyroid0387 9172 32 7 21 1 5.50 1.35 2.99 0.091 2.64 74.13 81.92 79.83 85.47 76.46 74.02 78.64 4.59
VA-Heart 200 5 10 3 0 26.85 27.00 1.04 0.023 0.30 32.00 28.99 58.50 29.00 20.00 33.50 33.66 13.04
Mean (dataset properties): 1930 instances, 4.5 classes, 15 Con, 4 Bin, 9 Nom, 3.73 missing, 12.49 noise, 2.97 Imb Ratio, 0.13 Avg Info Gain, 2.74 Net Info Gain
Mean accuracy (rank): XCS 70.11 (5), UCS 74.34 (3), GAssist 77.33 (1), cAnt-Miner 74.56 (2), SLAVE 66.96 (6), Ishibuchi 70.87 (4)
Std deviation: XCS 26.71, UCS 19.95, GAssist 16.69, cAnt-Miner 18.76, SLAVE 24.92, Ishibuchi 19.21
Average ranks: XCS 3.29 (3), UCS 3.02 (2), GAssist 2.71 (1), cAnt-Miner 3.35 (4), SLAVE 4.35 (6), Ishibuchi 4.27 (5)
signed rank test and to report significant differences between the pairs [6][8]. Demsar
has criticized the misuse of these approaches for multiple classifier comparisons be-
cause: (1) none of them reasons about comparing the means of more than two random
variables, and (2) a certain portion of the null hypotheses is always rejected purely due
to random chance when doing so [25]. In this paper, we use more specialized methods for comparing
the average ranks of evolutionary classifiers (see Table 1) as suggested by Demsar [25]
and Garcia [26].
Global Comparison of Evolutionary Classifiers. We use the two most widely used non-
parametric tests for the comparison of multiple hypotheses among the classifiers: (1) the
Friedman Test [27], and (2) the Iman and Davenport Test [28]. These tests utilize the χ² and F
distributions respectively to check whether the distributions of observed and expected frequencies
differ from each other.
Friedman and Iman and Davenport tests perform a global analysis to check whether
the measured average ranks of all the classifiers are significantly different from the
mean rank (3.5 in our case). The corresponding statistics χ²F and FF are calculated as
explained by Friedman and by Iman and Davenport:

χ²F = 19.94, FF = 4.44
The critical values for the corresponding tests, χ²C and FC, obtained from the χ² and F
distribution tables at α = 0.05 with 5 and (5, 150) degrees of freedom respectively, are:

χ²C(5) = 11.07, FC(5, 150) = 2.27
Since the critical values are lower than the test statistics, the null hypothesis can be
rejected and the post-hoc tests can be applied to detect significant differences between
classifiers.
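The global statistics can be reproduced from the average ranks. The sketch below follows the standard Friedman and Iman-Davenport formulas; plugging in the rounded ranks from Table 1 yields values close to, but not identical to, the reported ones because of rounding:

```python
def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic from the average ranks of k classifiers
    over N datasets, plus the Iman-Davenport F correction."""
    k = len(avg_ranks)
    chi2_f = (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    f_f = (n_datasets - 1) * chi2_f / (n_datasets * (k - 1) - chi2_f)
    return chi2_f, f_f
```
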
Comparison with the Control Classifier GAssist. It can be seen from the results in
Table 1 that GAssist provides the best overall classification accuracy of 77.33% and the
lowest standard deviation of 16.69. Moreover, it also outperformed the other classifiers on 13
biomedical datasets. To compare the performance of GAssist with other evolutionary
algorithms, we now establish the multiple hypothesis where every other evolutionary
classifier is statistically compared with GAssist.
We use two post-hoc tests to determine the statistical significance of results: (1)
Bonferroni-Dunn Test [29], and (2) Holm Test [30]. In general, these post-hoc tests
vary in adjusting the threshold of significance level in accordance with their multiple
hypothesis. Bonferroni-Dunn Test controls the family-wise error rate in a single step by
dividing with the number of comparisons (k 1). Holms Test is a step-down pro-
cedure in which the hypothesis is tested on the p-values arranged in ascending order.
Starting from the lowest p-value, all the hypothesis are rejected for which pi <= /ki
while all the other remaining hypothesis are retained. Holms Test is more powerful as
it makes no assumptions about the hypothesis and in general, rejects more hypothesis
than Bonferroni-Dunns Test. The corresponding probability of the test statistic from
the normal distribution table is obtained from the z-value by comparing ith and j th
classifier. If the probability is less than the appropriate significance level, the null hy-
pothesis is rejected. The results of comparison with control classifier GAssist are shown
in Table 2.
Table 2. Test statistics for comparison with the control classifier GAssist (α = 0.05, k = 6,
N = 31 and Rj = 2.71). The null hypothesis is rejected for bold entries in the p column.

Columns: i | Algorithm | Z-Value = (Ri − Rj)/√(k(k+1)/6N) | p | Bonferroni-Dunn critical value α/(k−1) | Holm critical value α/(k−i)

1  SLAVE       3.462  5.36E-4  0.01  0.01
2  Ishibuchi   3.292  9.93E-4  0.01  0.0125
3  cAnt-Miner  1.358  0.174    0.01  0.017
4  XCS         1.222  0.222    0.01  0.025
5  UCS         0.645  0.519    0.01  0.05
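The z-values and the Holm step-down procedure behind these comparisons can be sketched as follows. The function names are ours; for the five control-classifier hypotheses the Holm thresholds reproduce the α/(k − i) column of Table 2:

```python
import math

def z_statistic(ri, rj, k, n):
    """z = (Ri - Rj) / sqrt(k(k+1) / 6N) for two average ranks over N datasets."""
    return (ri - rj) / math.sqrt(k * (k + 1) / (6.0 * n))

def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: visit p-values in ascending order and
    reject hypothesis at step s while p <= alpha / (m - s + 1), where m is
    the number of hypotheses; retain everything after the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (m - step + 1):
            rejected[idx] = True
        else:
            break
    return rejected
```
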
Table 3. Test statistics for pairwise comparisons (α = 0.05, k = 6, N = 31). The null hypothesis is
rejected for bold entries in the p column.

Columns: i | Algorithms | Z-Value = (Ri − Rj)/√(k(k+1)/6N) | p | Nemenyi critical value 2α/k(k−1) | Holm critical value α/(k−i)

1   GAssist vs SLAVE         3.462  5.36E-4  0.003  0.003
2   GAssist vs Ishibuchi     3.292  9.93E-4  0.003  0.004
3   UCS vs SLAVE             2.817  0.004    0.003  0.004
4   UCS vs Ishibuchi         2.647  0.008    0.003  0.004
5   XCS vs SLAVE             2.240  0.025    0.003  0.004
6   cAnt-Miner vs SLAVE      2.104  0.035    0.003  0.005
7   XCS vs Ishibuchi         2.070  0.038    0.003  0.0055
8   cAnt-Miner vs Ishibuchi  1.935  0.053    0.003  0.006
9   GAssist vs cAnt-Miner    1.358  0.174    0.003  0.007
10  XCS vs GAssist           1.222  0.222    0.003  0.008
11  UCS vs cAnt-Miner        0.713  0.476    0.003  0.01
12  UCS vs GAssist           0.645  0.519    0.003  0.0125
13  XCS vs UCS               0.577  0.564    0.003  0.017
14  SLAVE vs Ishibuchi       0.170  0.865    0.003  0.025
15  XCS vs cAnt-Miner        0.136  0.892    0.003  0.05
The last column gives the critical values of the used tests. If the p-value is less than
or equal to this critical value, the null hypothesis is rejected for the corresponding test.
It can be seen that the results of GAssist are statistically significant compared to SLAVE
and Ishibuchi and hence, the null hypothesis can be rejected, while nothing much can
be said about other algorithms with the given results.
Pairwise Comparisons. As GAssist could not be termed the best classifier against all
the other classifiers in the last section, we now make pairwise comparisons to analyze
the statistical differences between all the classifiers. Along with Holm's Test,
we use the pairwise counterpart of the Bonferroni-Dunn Test, called the Nemenyi Test [31],
for comparing all classifiers with each other. The Nemenyi Test is more conservative than
the Bonferroni-Dunn Test as it divides the significance level by the number of pairwise
comparisons (k(k − 1)/2 instead of (k − 1)). The results in Table 3 show that the Nemenyi
Test rejects the hypothesis of GAssist against SLAVE and Ishibuchi, while
Holm's method also allows rejection of the hypothesis for UCS vs SLAVE.
The use of statistical analysis provides deeper insight into the obtained results than simply
averaging the classification accuracies, which is a raw measure of ranking performance.
Michigan-Style UCS and XCS. The Michigan-style learning classifiers UCS and
XCS use online learning to evolve a set of condition-action rules from each training
instance. Thus, they can be more useful in identifying hidden patterns and generating
information-rich rules compared with the simple and generic rules of GAssist and cAnt-Miner.
We therefore suggest that, if medical experts are available to refine rules, Michigan-style
classifiers can prove to be useful for knowledge extraction.
Genetic Fuzzy SLAVE and Ishibuchi. The results show that the genetic fuzzy rule
learning classifiers are not generally suitable for classification of biomedical datasets.
The fuzzy rules so generated, however, can be particularly useful for evaluating the
uncertainty associated with the prognosis.
Role of Multiple Classes. It can be inferred from Table 1 that for multi-class problems,
UCS gives significantly better accuracy compared with other classifiers. The reason is
that it evolves only those highly-rewarded classifiers of the match set in the correct
set, which predict the same class as that of the training example [33]. In comparison,
GAssist has serious problems in dealing with multi-class problems, especially when the
number of output classes is more than five. On these datasets, the average accuracy of
UCS is 83.49% compared with 75.52% of GAssist.
Role of Attributes. The attributes of a dataset vary in three aspects: (1) number, (2)
type (continuous, binary and nominal), and (3) quality. We see in Table 1 that number
and type of attributes have little role in defining the classification potential of a dataset.
The very poor performance of XCS on the Splice-Junction Gene Sequence, Promoters Genes
Sequence and Lung Cancer datasets came as a surprise to us. Our analysis reveals that
the large number of nominal attributes in these datasets (61, 58 and 56, respectively) is the
main cause of their poor performance with XCS. Our conclusion is that XCS is unable
to cater for a large number of nominal attributes in a dataset.
Recall that we quantify the quality of attributes with information gain. The graph in
Figure 1 clearly shows that the classification accuracy of a dataset increases with an
increase in the information gain of its attributes.
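Information gain, used here to quantify attribute quality, is the reduction in class entropy after splitting on an attribute; a minimal sketch for a nominal attribute (function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Class entropy minus the weighted entropy remaining after splitting
    the instances on the attribute's values."""
    n = len(labels)
    split = {}
    for v, y in zip(attribute_values, labels):
        split.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder
```
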
Role of Missing Values. Missing or incomplete data degrades the accuracy of
learning algorithms. Therefore, a number of imputation methods, such as Wild-to-Wild,
the mean or mode method, random assignment, the InGrimputation Model and listwise
deletion, have been proposed to increase the accuracy of a classifier. Figure 2 reveals that
GAssist is relatively more resilient to missing values compared with other algorithms.
GAssist replaces a missing value with the mean of its class for real-valued attributes;
for nominal attributes, it replaces a missing value with the mode of its class.
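The class-conditional mean/mode imputation attributed to GAssist can be sketched as follows; the function signature and the use of None as the missing-value sentinel are our own assumptions:

```python
from collections import Counter

def impute_missing(values, labels, numeric, missing=None):
    """Replace each missing value with the mean (numeric attribute) or mode
    (nominal attribute) of the observed values belonging to the same class."""
    fill = {}
    for cls in set(labels):
        observed = [v for v, y in zip(values, labels)
                    if y == cls and v is not missing]
        if numeric:
            fill[cls] = sum(observed) / len(observed)
        else:
            fill[cls] = Counter(observed).most_common(1)[0][0]
    return [fill[y] if v is missing else v for v, y in zip(values, labels)]
```
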
Role of Noise. The results in Table 1 show that the classification potential of a dataset
is inversely proportional to its level of noise. Consequently, the accuracy on noisy
datasets is very low (see Figure 4). GAssist shows more resilience to noise because of
its added generalization pressure, with bloat control based on the MDL principle. The
MDL principle forces GAssist to reduce the size and length of its individuals. In short,
its simple evolution policy makes it resilient to noise.
Fig. 5. Relationship between Classification Accuracy and Nature of Dataset: x-axis contains
biomedical datasets in increasing order of their classification accuracies; y-axis contains nor-
malized parameters of datasets, 1-Average Information Gain, 2-Missing Values, 3-Noise, 4-
Classification Accuracy
We categorize the output class, classification potential, into three categories based on the
classification accuracy: good (greater than 0.8), satisfactory (0.6-0.8) and bad (less than
0.6). The interesting patterns lying in this meta-dataset are extracted using two classifiers:
(1) GAssist, which gives good classification results, and (2) Boosted J48 [22], to
compare the results with a well-known non-evolutionary algorithm.
The classification rules generated by both classifiers prove our thesis that a noise
level greater than 0.25 severely degrades the classification potential of a dataset. As
expected, GAssist is able to generate more generic and comprehensible rules. For ex-
ample, if noise level is above 0.667, the classification potential is bad irrespective of
other parameters. The knowledge extracted by both algorithms provides the same generalization.
Hence, our proposed meta-model can be effectively used to determine the
true classification potential of a biomedical dataset. We believe it can prove to be a
very effective tool for analyzing the inherent complexities of a dataset and its
pre-processing needs.
6 Conclusion
In this paper, we have quantified the complexity of biomedical datasets in terms of
missing values, noise, imbalance ratio and information gain. The effect of complexity
on classification accuracy is evaluated using six well-known evolutionary rule learning
algorithms. The results of our experiments show that GAssist provides better classification
accuracy than the other algorithms on most of the datasets. Our analysis
reveals that the classification accuracy of a biomedical dataset is, however, a function
of the nature of biomedical dataset rather than the choice of a particular evolutionary
learner. The major contribution of this paper is a unique methodology to determine the
classification potential of a dataset using a meta-model framework. In the future, we
would like to present the generated rules of different classifiers to the medical experts
for their feedback.
Acknowledgements
The authors of this paper are supported, in part, by the National ICT R&D Fund, Min-
istry of Information Technology, Government of Pakistan. The information, data, comments,
and views detailed herein may not necessarily reflect the endorsements or views
of the National ICT R&D Fund.
References
1. Peña-Reyes, C.A., Sipper, M.: Evolutionary computation in medicine: an overview. Journal of Artificial Intelligence in Medicine 19(1), 1–23 (2000)
2. Wong, M.L., Lam, W., Leung, K.S., Ngan, P.S., Cheng, J.C.V.: Discovering knowledge from medical databases using evolutionary algorithms. IEEE Engineering in Medicine and Biology 19(4), 45–55 (2000)
3. Holmes, J.H.: Learning classifier systems applied to knowledge discovery in clinical research databases. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 243–261. Springer, Heidelberg (2001)
4. Bernadó Mansilla, E.: Domain of competence of XCS classifier system in complexity measurement space. IEEE Transactions on Evolutionary Computation 9(1), 82–104 (2005)
5. Kharbat, F., Bull, L., Odeh, M.: Mining breast cancer data with XCS. In: Genetic and Evolutionary Computation Conference (GECCO), pp. 2066–2073, UK (2007)
6. Puig, A.O., Mansilla, E.B.: Evolutionary rule-based systems for imbalanced data sets. Soft Computing - A Fusion of Foundations, Methodologies and Applications 13(3), 213–225 (2009)
7. Bacardit, J., Butz, M.V.: Data mining in learning classifier systems: comparing XCS with GAssist. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 282–290. Springer, Heidelberg (2007)
8. Bernadó, E., Llorà, X., Garrell, J.M.: XCS and GALE: a comparative study of two learning classifier systems with six other learning algorithms on classification tasks. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132. Springer, Heidelberg (2002)
9. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: An ant colony based system for data mining: applications to medical data. In: Int. Conf. on Knowledge Discovery and Data Mining, Boston, pp. 55–62 (2000)
10. Galea, M., Shen, Q., Levine, J.: Evolutionary approaches to fuzzy modelling for classification. Knowledge Engineering Review 19(1), 27–59 (2004)
11. Tanwani, A.K., Afridi, J., Shafiq, M.Z., Farooq, M.: Guidelines to select machine learning scheme for classification of biomedical datasets. In: Pizzuti, C., Ritchie, M.D., Giacobini, M. (eds.) EvoBIO 2009. LNCS, vol. 5483, pp. 128–139. Springer, Heidelberg (2009)
12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation 8(1), 28–46 (2004)
13. Bernadó-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation 11(3), 209–238 (2006)
14. Bacardit, J., Garrell, J.M.: Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach Learning Classifier System. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 59–79. Springer, Heidelberg (2007)
15. Otero, F.E.B., Freitas, A.A., Johnson, C.J.: cAnt-Miner: an ant colony classification algorithm to cope with continuous attributes. In: Ant Colony Optimization and Swarm Intelligence, Belgium, pp. 48–59 (2008)
16. González, A., Pérez, R.: SLAVE: a genetic learning system based on an iterative approach. IEEE Transactions on Fuzzy Systems 7(2), 176–191 (1999)
17. Ishibuchi, H., Nakashima, T., Murata, T.: Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Transactions on Systems, Man, and Cybernetics 29(5), 601–618 (1999)
18. Fawcett, T.: ROC graphs: notes and practical considerations for researchers. TR HPL-2003-4, HP Labs, USA (2004)
19. UCI repository of machine learning databases, University of California-Irvine, Department of Information and Computer Science, www.ics.uci.edu/mlearn/MLRepository.html (last accessed: June 25, 2010)
20. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study of their impacts. Artificial Intelligence Review 22(3), 177–210 (2004)
21. Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. Journal of Artificial Intelligence Research 11, 131–167 (1999)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
23. Otero, F.E.B.: Ant Colony Optimization Framework, MYRA, http://sourceforge.net/projects/myra/ (last accessed: June 27, 2010)
24. Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C., Bacardit, J., Rivas, V.M., Fernández, J.C., Herrera, F.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing 13, 307–318 (2008)
25. Demšar, J.: Statistical comparisons of classifiers over multiple datasets. Journal of Machine Learning Research 7, 1–30 (2006)
26. García, S., Herrera, F.: An extension on "Statistical comparisons of classifiers over multiple datasets" for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
27. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics 11, 86–92 (1940)
28. Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic. Communications in Statistics, 571–595 (1980)
29. Dunn, O.J.: Multiple comparisons among means. Journal of the American Statistical Association 56, 52–64 (1961)
30. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70 (1979)
31. Nemenyi, P.B.: Distribution-free multiple comparisons. PhD Thesis, Princeton University (1963)
32. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation 6(4), 321–332 (2002)
33. Orriols-Puig, A., Bernadó-Mansilla, E.: Revisiting UCS: description, fitness sharing and comparison with XCS. In: Bacardit, J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama, K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 96–116. Springer, Heidelberg (2008)
34. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. SIGKDD Explorations Newsletter 6(1), 40–49 (2004)
35. Tanwani, A.K., Farooq, M.: The role of biomedical dataset in classification. In: Combi, C., Shahar, Y., Abu-Hanna, A. (eds.) Artificial Intelligence in Medicine. LNCS (LNAI), vol. 5651, pp. 370–374. Springer, Heidelberg (2009)
Supply Chain Management Sales Using XCSR
1 Introduction
Supply chain management embodies the management of all the processes and
information that move along the supply chain from the supplier to the
manufacturer right through to the retailer and the final customer. Nowadays,
supply chain management is one of the most important industrial activities.
Planning the activities through the supply chain is vital to the competitiveness
of manufacturing enterprises. According to [6], while today's supply chains are
essentially static, relying on long-term relationships among key trading partners,
more flexible and dynamic practices offer the prospect of better matches between
suppliers and customers as market conditions change.
The Trading Agent Competition of Supply Chain Management (TAC SCM) [6]
was designed to expose the participants to the typical challenges presented in a
dynamic supply chain. These challenges include competing for the components
provided by the suppliers, managing the inventory, transforming components
into final products and competing for the customers. These problems can be
classified into three main areas: purchases, production and sales.
Pardoe and Stone conducted experiments applying different learning techniques to
the sales decisions of TAC SCM agents [11]. One of their main conclusions was that
winning offers in TAC SCM is a very complex problem because the winning prices
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 145–165, 2010.
© Springer-Verlag Berlin Heidelberg 2010
146 M. Franco, I. Martínez, and C. Gorrín
may vary very quickly. Therefore, this work affirms that making decisions based
on previous states of the current game is inaccurate, while using information
taken from many previous games shows better results.
The goal of this work is to present an approach to the TAC SCM problem
using an evolutionary reinforcement learning system. We specifically use XCSR
to solve one of the most important sales problems: pricing the components in
order to compete over the market and maximize the profit at the same time.
2 TAC SCM
The TAC SCM competition [6] was designed by a team of researchers from the
e-Supply Chain Management Lab at Carnegie Mellon University in collaboration
with the Swedish Institute of Computer Science (SICS). In this contest, each
team has to develop an intelligent agent capable of handling the main supply chain
management problems (which orders to accept, deciding the sale price for products,
competing over the market, among others).
Agents compete against each other in a simulation that lasts 220 days and
includes customers and suppliers to deal with. The main goal of the competitors
is to maximize the final profit by selling assembled computers to the customers.
The profit of an agent is calculated by subtracting production costs from the
income. This profit is reflected in the amount of money the agents have at the
end of the game, which determines which agent is the winner.
Each TAC SCM simulation has three actors: customers who buy computers,
manufacturers (agents) who produce and sell computers, and suppliers who
provide the unassembled components to the manufacturers. A detailed description
of these actors can be found in [6].
At the beginning of each day, the agent receives requests for quotes (also
known as RFQs) from the customers. Afterwards, the agent decides which RFQs
should be accepted and what the final offer price should be. After sending
the offers, the agent waits for the orders from the customers. Only the best-priced
offers are accepted and turn into orders. If the agent receives an order,
it decides when to produce and deliver it and, even more important, how many
components it should buy to fulfil the production schedules. In order to
buy the components, the agent sends the suppliers RFQs for the spare parts. In
response, the suppliers send offers to the agent, who has to decide whether or
not to accept them.
Each team competing in a TAC SCM game should develop a manufacturer
agent that deals with the main decisions of supply chain management:
how many components to buy, when to produce an order and which
RFQs to accept. Moreover, when accepting an RFQ, the agent should decide
what the final price for these goods should be. In this work, these three
problems will be referred to as the purchase problem, the production problem and
the sales problem.
More than 30 agents participate in this competition each year. Among the
most successful solutions to the TAC SCM problem we find TacTex-06,
3 XCSR
XCS is a Michigan-style Learning Classifier System first described by Wilson [14].
This system is based on the work proposed by Holland [9] but uses the accuracy
instead of the payoff as the measure of goodness of a classifier. In our
implementation we used XCSR [15], a version of XCS that accepts real numbers as
inputs. To do this, the features in the condition are represented by lower and
upper bounds while the action remains discrete.
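As a minimal illustration of this interval representation (the class and field names below are ours, not from the paper's implementation), an XCSR rule can be sketched as a pair of bound vectors plus a discrete action, matching an input when every real value falls inside its interval:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IntervalClassifier:
    """One XCSR rule: per-feature [lower, upper] bounds plus a discrete action."""
    lowers: List[float]
    uppers: List[float]
    action: int

    def matches(self, inputs: List[float]) -> bool:
        # A rule matches when every real-valued input lies inside its interval.
        return all(l <= x <= u for l, x, u in zip(self.lowers, inputs, self.uppers))

rule = IntervalClassifier(lowers=[0.0, 0.2], uppers=[0.5, 0.9], action=3)
print(rule.matches([0.3, 0.4]))  # True
print(rule.matches([0.6, 0.4]))  # False
```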
The reason why we decided to use this approach is that all the inputs of
the decision we wanted to make were real-valued and the decisive thresholds needed
to be found dynamically. We also decided to use XCSR because the rule system
can constantly adapt to new environments using a fixed rate of exploration [3]
and the rules that it generates are interpretable by human beings [10].
4 TicTACtoe
TicTACtoe is our approach to the TAC SCM problem. TicTACtoe has three
modules: Purchase, Production and Sales (see Figure 1). Each module manages
one of the sub-problems in supply chain management. Every module makes its
own decisions using information taken from the environment and the other modules.
In the next subsections, we focus on the details of these modules.
In addition to these modules, we provided the agent with memory through an
organizer structure. This structure keeps track of: orders scheduled for production,
possible order commitments, actually produced orders¹ and the possible future
inventory². This memory allows the agent to record the decisions taken each day
and to consider events that will happen in the future, which are used to
make further decisions.
¹ Production schedules may vary due to the lack of components.
² The future inventory is based on the component orders placed by the agent.
4.1 Purchases
The purchase module is in charge of sending RFQs to the suppliers in order
to buy the necessary components for production. This module has two tasks:
a) creating the RFQs to get the current component prices and b) deciding which
supplier offers to accept.
Suppliers RFQ creation. First, the agent calculates how many components
are needed for production within the next ten days. These calculations are based
on the current inventory, the orders scheduled for the next ten days and the
component orders that have already been placed. The agent always sends the RFQ
to its favourite supplier for that particular component, which is the one who has
given the best prices lately. There is only one favourite supplier for each component.
However, the agent also asks the other suppliers for their current prices in order
to update the favourite supplier if necessary.
The favourite supplier is preferred in order to get lower prices. This is based on
the assumption that the state of a supplier does not change drastically. Therefore,
if a supplier gives an agent the best price, it will probably continue giving good
prices for some time.
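A hypothetical sketch of this favourite-supplier bookkeeping (the data structure and function names are ours, chosen only to illustrate the idea of keeping the supplier with the best recent price per component):

```python
# Last quoted price per supplier for one component; updated as quotes arrive.
recent_prices = {}

def update_quote(supplier: str, price: float) -> None:
    recent_prices[supplier] = price

def favourite_supplier() -> str:
    # The favourite is the supplier with the best (lowest) recent price.
    return min(recent_prices, key=recent_prices.get)

update_quote("supplierA", 120.0)
update_quote("supplierB", 110.0)
print(favourite_supplier())  # supplierB
```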
Accepting the offers. When the agent asks for components, the suppliers might
not be able to comply with the agent's requirements. When a supplier is not
able to deliver the products the agent asks for, it sends two types of adjusted
offers instead: offers that vary the quantity and offers with a later due date. If
this happens, the priority of TicTACtoe is to accept first the complete offers
and then the ones that vary the quantity. Once an order is set, the agent adds
a record of the components' arrival to calculate the future inventory.
Furthermore, the agent keeps a historical record of the base price for each
component. The component base price is calculated every day as a weighted
average, as shown in equation (1).
4.2 Production
The production module is in charge of scheduling the production of the active
orders (the orders that are waiting to be produced and delivered). This module
prioritizes the active orders with earlier due dates and higher penalties (in case
the orders are behind schedule). The agent loops over the orders, checking if
there is enough inventory of products to deliver them. If the agent has enough
products, the order is delivered. This strategy is used by PhantAgent [13] to avoid
extra storage charges. In case there are not enough components to deliver the
order, the agent verifies if the order is beyond the latest possible delivery day³.
In case it is already too late, the customer will not receive the order anymore,
so the agent cancels it and frees all the associated components and products in
order to be able to use them to fulfil other orders.
If there are not enough products to fulfil the order but the customer can still
wait for it, the agent tries to produce it. To produce an order scheduled for a
specific day, the agent checks if there are enough components. When there are
not enough components to produce the desired quantity, the agent produces the
maximum quantity allowed. If the agent cannot produce an order completely, it
continues producing it the next day.
At the end of the day, the production module determines the number of late
orders and the number of active orders. This information is used by the sales
module to adjust the quantity of free cycles the agent can offer. This forces the
agent to save cycles for the production of late orders.
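The prioritization described above (earlier due dates first, higher penalties breaking ties) can be sketched as a sort over the active orders. The order records and field names here are illustrative assumptions, not the paper's actual data structures:

```python
# Hypothetical active orders: earlier due dates first; among equal due
# dates, the order with the higher penalty is produced first.
orders = [
    {"id": 1, "due": 12, "penalty": 500},
    {"id": 2, "due": 10, "penalty": 200},
    {"id": 3, "due": 10, "penalty": 800},
]

# Sort key: ascending due date, then descending penalty.
queue = sorted(orders, key=lambda o: (o["due"], -o["penalty"]))
print([o["id"] for o in queue])  # [3, 2, 1]
```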
4.3 Sales
The sales module is in charge of pricing the products and dealing with the
customers. This module checks the customer RFQs every day and sends offers to the
ones that meet the following characteristics: (a) a reserve price higher than
the product's base price and (b) a due date earlier than the end of the simulation.
The agent calculates the base price for a product as the sum of the estimated prices
of all its spare parts. This estimates how profitable the order would be.
Afterwards, the agent uses the set of rules generated by XCSR to determine
the discount factor over the reserve price of each RFQ. The reserve price is
the maximum price a customer is willing to pay for an order. The agent that
offers the lowest price wins the bid. The implementation of XCSR will be explained
in greater detail in Section 5.
³ The latest possible delivery day is determined when the customer sends the RFQ.
The final offer price is determined by equation (2), where BasePrice is the
calculated cost of the product based on recent experiences, ReservePrice is
the reference price determined by the customer and d is the discount factor
determined by the XCSR.
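Equation (2) itself is not reproduced here, but one formula consistent with the surrounding description (the XCSR action fixes a discount d over the possible revenue, i.e. the difference between the reserve price and the base price) would be the following sketch; the exact form of the paper's equation is an assumption:

```python
def offer_price(base_price: float, reserve_price: float, d: float) -> float:
    # Apply the discount factor d to the possible revenue
    # (reserve price minus base price); d = 0 offers at the reserve price.
    return reserve_price - d * (reserve_price - base_price)

print(offer_price(base_price=1000.0, reserve_price=1500.0, d=0.3))  # 1350.0
```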
Once the agent calculates the offer price for each RFQ, a production schedule
is generated including these possible orders⁴. The orders that involve higher
revenues have higher priority. In order to save production cycles for future orders
that would need to be delivered earlier, the agent always tries to produce an
order as late as possible according to its due date. This strategy is very similar
to the one used in [12]. If there are no free cycles, the agent checks the inventory
to see if there are enough products to deliver these orders the next day. If
none of this is possible, the less profitable RFQs are discarded.
Moreover, the daily free cycles are multiplied by a factor between 0 and 1,
inversely proportional to the quantity of late orders the agent has. This helps
the agent to get back on schedule by leaving some cycles for the production of
late orders.
Our agent remembers all the placed offers as possible commitments. However,
customers only accept the best-priced offers. In case a customer rejects an offer,
the commitment is removed and all the associated components and cycles are
released.
In the following sections, we explain the structure used to represent the TAC
SCM sales problem using real inputs and discrete actions.
⁴ This helps the agent to calculate how many free cycles are left for the production of
further orders.
Condition. There are simulation values known by the agent that provide important
information for its future decisions. Including all these values in the classifier
structure decreases the efficiency of the GA in terms of execution time. To avoid
this, we selected the most important features for the decision we wanted to make.
Preliminary experiments showed that the most suitable features for the classifier
are:
x1 Rate of late orders⁵ over the total of active orders. This determines how much
work is late and how convenient it is to make a good offer when the agent is
already behind schedule.
x1 = lateOrders / totalOrders    (4)
x2 Rate of the factory cycles that remained unused the day before. This helps the
agent determine if it should raise or lower the price discount. For example,
if the factory is full, the agent should give low discounts in order to try to
finish its active orders before getting new ones.
x2 = freeFactoryCapacity / totalFactoryCapacity    (5)
x3 Rate of the base price over the reserve price indicated by the customer. This
represents how profitable an order would be. The agent discards the cases
where the base price is higher than the reserve price.
x3 = basePrice / reservePrice    (6)
x4 The number of days between the current date and the day the order should
be delivered. This indicates how much time the agent has to produce and
deliver an order. This value is scaled between 0 and 1, considering that the
due dates are, at most, 12 days after the current date.
x4 = (dueDate − day) / 12    (7)
x5 The current day of the simulation normalized by the maximum number of
days a game has. This value is very important because different situations
arise as the days go by. For example, in the middle of a simulation
components start to be scarce and their prices start rising. This feature
helps the agent to identify the different stages of the simulation that require
specific behaviours.
x5 = day / 220    (8)
All the features are normalized between 0 and 1 so that these values can be used
as upper and lower bounds. This aspect will be better explained in Section 6.1.
⁵ The late orders are the active orders that are producing penalties because they are
going to be delivered after their due date.
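The five features of equations (4)-(8) can be computed directly from the quantities named above. The function signature and the sample values below are illustrative, not taken from the paper:

```python
def features(late, total, free_cycles, total_cycles, base, reserve, due, day):
    # x1..x5 as defined in equations (4)-(8); every value lies in [0, 1].
    return [
        late / total,                # x1: share of late orders
        free_cycles / total_cycles,  # x2: unused factory capacity yesterday
        base / reserve,              # x3: base price over reserve price
        (due - day) / 12,            # x4: days left; due dates are at most 12 days away
        day / 220,                   # x5: progress through the 220-day game
    ]

print(features(late=2, total=8, free_cycles=300, total_cycles=2000,
               base=900.0, reserve=1200.0, due=110, day=104))
```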
Action. Our implementation of XCSR has 10 actions that represent the different
discounts over the possible revenue. The revenue is computed as the difference
between the base price and the reserve price determined by the customers. The
different discounts go from 0% to 90% in 10% steps.
6 Implementation Details
The don't care in our library was implemented as the absence of a lower or upper
bound, depending on the allele we wanted to modify. To implement this don't
care, we had to put a restriction on the data: all the features should be bounded
between 0 and 1. Putting a don't care in an allele is equivalent to putting 0 or 1,
depending on whether it is a lower or an upper bound. In this way, we open the
range to its maximum limit so the allele matches all the states.
Since Butz's library was oriented to boolean features, we had to implement
different subsumption rules adapted to our classifier structure, where all the features
are bounded between 0 and 1. The rules used were the same rules used by Wilson
in [16], where a classifier is more general than another if all the ranges of the
first classifier contain those of the second one. For example, (li, ui) subsumes (lj, uj)
if ui ≥ uj and li ≤ lj. The actions of the classifiers should be the same for the
subsumption to occur.
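A small sketch of these two mechanisms, the don't care as a maximally widened bound and the interval-containment subsumption test (helper names are ours; the additional same-action check described above is left out for brevity):

```python
def with_dont_care(lower=None, upper=None):
    # A missing bound widens the interval to its limit: "don't care" on that side.
    return (0.0 if lower is None else lower, 1.0 if upper is None else upper)

def subsumes(general, specific):
    # (li, ui) subsumes (lj, uj) when every interval of the first rule
    # contains the corresponding interval of the second rule.
    return all(li <= lj and ui >= uj
               for (li, ui), (lj, uj) in zip(general, specific))

a = [with_dont_care(), (0.2, 0.9)]   # first allele is fully general
b = [(0.1, 0.5), (0.3, 0.8)]
print(subsumes(a, b))  # True
print(subsumes(b, a))  # False
```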
Additional adaptations were necessary to include XCSR in our TAC SCM agent
due to the characteristics of the problem.
Blocking classifiers. In our classifier system, the reward of an action set is given
based on the amount of money the agent wins or loses when making the
corresponding offer. This value merely depends on the discount given by the agent in
the offer. This is the reason why this problem is modelled as a single-step problem.
Nevertheless, the agent only knows the reward a few days after making the decision.
This differs from the classic problems used as benchmarks (e.g., the boolean
multiplexer), in which the reward arrives immediately after applying an action.
Considering the delayed reward, it is necessary to save the action set associated
with each order, so that these classifiers are given a reward when the agent gets
the final result. Since we are interested in continuing to learn while a classifier
waits for its reward, classifiers are used in multiple learning iterations parallel to
each other. This aspect of online learning, in addition to the delayed reward,
presents a new problem. The problem occurs when a classifier that is waiting
for a reward is selected for deletion or subsumption. Since these mechanisms
can be executed by any learning iteration, they could erase this classifier based
on information that is not up to date. Consequently, the knowledge represented by
this classifier and its upcoming rewards are lost.
In order to avoid the deletion of classifiers that are expecting a reward based on
incomplete information, we implemented a simple counting semaphore. Each
classifier has a counter that indicates the number of rewards it is expecting.
A single classifier participates in many decisions each day and needs to wait for
a reward for each one of them. Therefore, we only consider for deletion the
classifiers that are not blocked, i.e., the ones that have their counter at zero.
We also had to add another important restriction in the subsumption
mechanism. A classifier cannot be subsumed if it is blocked, because its information
is not entirely up to date to become part of another classifier. The blocked
classifiers may participate in all the other mechanisms, like crossover and mutation.
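The counting-semaphore idea above can be sketched as follows; the class and method names are ours, intended only to illustrate the increment-on-offer, decrement-on-reward, delete-only-when-zero protocol:

```python
class BlockingClassifier:
    """Sketch of the counting semaphore that blocks deletion/subsumption."""

    def __init__(self):
        self.pending = 0      # rewards still outstanding (the semaphore counter)
        self.fitness = 0.0

    def on_offer_sent(self):
        self.pending += 1     # block: one more outcome to wait for

    def on_reward(self, reward):
        self.fitness += reward
        self.pending -= 1     # unblock once the delayed outcome arrives

    def deletable(self):
        # Deletion (and subsumption) is allowed only when no reward is pending.
        return self.pending == 0

c = BlockingClassifier()
c.on_offer_sent()
print(c.deletable())  # False
c.on_reward(10.0)
print(c.deletable())  # True
```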
Since all the decisions taken by the XCSR affect the final result, regardless
of whether they were determined by exploration or exploitation, we changed the
algorithm in order to reward the classifiers in both cases.
Considering the dynamic characteristics of the simulation, we decided to use
an ε-greedy action selection policy. This consists of selecting the best possible
action with probability 1 − ε and exploring the rest of the time. However, we
made a slight modification so that ε starts at 1 and decreases linearly until it
reaches a threshold. This forces the system to explore more at the beginning
of the simulation and less by the end of it. When ε reaches the threshold, its
value remains constant, allowing the agent to perform some exploration that
facilitates its adaptation to changes in the simulation.
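A possible shape for this decaying-ε schedule is sketched below; the decay rate and the floor value 0.3 are illustrative assumptions, since the paper's exact schedule parameters are not given here:

```python
def epsilon(day: int, total_days: int = 220, floor: float = 0.3) -> float:
    # Epsilon starts at 1 and decreases linearly with the simulation day
    # until it reaches a constant floor (both slope and floor are assumed).
    return max(floor, 1.0 - day / total_days)

print(epsilon(0))    # 1.0
print(epsilon(110))  # 0.5
print(epsilon(220))  # 0.3
```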
(a) Final result: the final amount of money in the agent's bank account. This
indicates how much money the agent earned and how profitable its investments
were.
(b) Received orders: the number of orders placed by the customers. This value
indicates the percentage of the market the agent served. This is directly
linked to the decisions taken by the XCSR, because if the agent gives a better
price, it receives more orders.
⁷ These agents come along with the TAC SCM library. They are used for testing
purposes and they use simple but coherent strategies to handle the different problems.
⁸ The standard parameters for the games are 220 simulation days with a duration of
15 seconds.
(a) Factory usage: the percentage of usage of the factory capacity. This value
indicates how many factory cycles are used on average. This represents the
productive capacity of the agents, and it should be used to the maximum.
(b) Penalties: the amount of money paid to the customers for late deliveries.
This indicates how many late orders the agent had.
(c) Interests: the amount of money paid to the bank for having a negative
balance in the bank account.
(d) Total income: the total amount of money earned by the agents without
considering the losses.
(e) Component costs: the amount of money spent buying components.
(f) Storage costs: the amount of money spent storing components to be used
in future production.
The component costs, storage costs, penalties and interests are represented as
percentages of the total revenue, while the final result and the total income
are represented in US dollars. The combination of these measures with the main
ones shows how effective the learning was, considering that we wanted to
learn a discount strategy that maximizes the revenue of the agent by winning
profitable and manageable orders. However, these performance measures are
shown only as support for the two main measures. Therefore, no statistical tests
were performed on them.
The parameters used in our implementation of XCSR for the calculation of
price discounts are α = 0.1, β = 0.2, δ = 0.1, ν = 5, θGA = 25, ε0 = 10,
θdel = 20, χ = 0.8, μ = 0.04, P# = 0.1, pI = 10.0, εI = 0, FI = 0.01,
θsub = 20, θmna = 1, s0 = 0.05 and N = 1000. The meaning of these
parameters is explained in [5]. Moreover, the sources of TicTACtoe can be found
at http://www.gia.usb.ve/~maria/tictactoe.
Fig. 2. Comparison of the dummy agent and the TicTACtoe agent using different
pricing strategies in terms of final result and received orders
Table 1 shows the p-values of the statistical comparisons among the agents.
This table shows that L-TicTACtoe is significantly better than the other
solutions presented. Moreover, the learning agent performs better than the Random
agent in 99.9% of the cases, supporting the statements above.
Even though we could expect L-TicTACtoe to manage more orders than
the other agents, Figure 2(b) reveals that Random and Static win more offers.
However, Table 2 indicates that Random and Static deliver more orders
late and therefore incur more penalties. These results show that the
pricing strategy of these agents is less advantageous because they commit to
orders which they cannot deliver on time and hence are penalized.
We can also observe in this table that Static gets negative interests. In other
words, this agent had to pay the bank for having a negative balance in its bank
Table 1. Statistical comparison of the TicTACtoe agent using different pricing
strategies. Column Kt shows the p-value for the Kruskal-Wallis test and column
Wilcox. test shows the p-values for the Wilcoxon tests.
Table 2. Results in terms of penalties, interest and factory usage of the TicTACtoe
agent using different pricing strategies and the dummy agent
account. This indicates that the strategy taken by Static is deficient, because it
incurs negative balances on most of the simulation days. On the other hand,
L-TicTACtoe is the agent that earns the most interest from the bank and presents
the lowest variance. This shows that this agent has a more stable behaviour in
terms of bank account balances.
Regarding factory utilization, we can see that the agents Random
and Static achieve a higher factory utilization. High factory utilization suggests
a proficient management of the productive capacity. However, the penalties
obtained by these agents demonstrate that they are surpassing their
production capacity. L-TicTACtoe does not use the factory as much as these agents,
but still presents a better solution to this problem because it efficiently served
a considerable portion of the market.
Finally, through this experiment we can confirm that the strategy used by
L-TicTACtoe improves the global performance of our solution to the TAC SCM
problem. Furthermore, the static and random strategies show poor results as
a consequence of their incapacity to adapt to new situations. These
Fig. 3. Comparison of the performance of TicTACtoe with and without the blocking
classifiers technique in terms of final result and received orders
Table 3 shows that t-block (L-TicTACtoe with blocking) receives 312 more
orders than t-noblock (L-TicTACtoe without blocking). This difference is small
and is not strong enough to make any assumptions about the performance of the
agents, as shown in Table 3. However, Figure 3(a) shows that t-block more
frequently obtains a better final result than t-noblock. According to Table 3, this
difference is not statistically significant using a confidence interval of 0.05.
However, we could say that t-block behaves better than t-noblock in 94.4% of
the cases. This difference in the final balance is explained by the high penalties
obtained by t-noblock, as shown in Table 4. These penalties indicate that this
agent does not develop an appropriate set of rules to determine the final sale
price for an RFQ. Moreover, t-noblock makes offers at very low prices for orders
that have a high penalty and are very difficult to produce because of the lack
of the required components. When this agent offers products at low prices, it
obtains plenty of orders, but most of them do not represent a profitable portion
of the market considering its penalties.
Moreover, we can observe in Table 4 that t-noblock gets negative interest
from the bank, while t-block gets positive interest. This implies that, on
Table 3. Statistical results from the comparison of both agents using and not using
the blocking classifier technique. The columns W (p-value) show the p-value of the
Wilcoxon test between agents.
Table 4. Results in terms of interest, penalties, component costs and storage costs of
the TicTACtoe agent using and not using the blocking classifiers technique
average, the agent that does not block classifiers incurs debts, while the other
agent maintains a positive balance in its bank account. This factor, in addition
to the penalties, explains why in Figure 3(a) agent t-noblock ends with less
money than agent t-block.
To determine the impact of the blocking technique, it is also important to
analyze the experience of the XCSR system in each agent. The experience is a
measure of classifier usage; it indicates how many times a classifier has been used.
In Figure 4, we can observe that the mean experience of the population of
t-block is higher than that of t-noblock. This pattern occurs
because t-noblock allows classifiers to be erased at any time based on incomplete
and inaccurate information. These rules are still waiting for a reward that will
determine whether they performed well. Consequently, classifiers that could lead
to good decisions are erased before the reward arrives, and their knowledge is
completely lost.
It is interesting to notice that the ratio between the average experience
and the day of the simulation is approximately 0.05. This means that each
classifier is used at most during 5% of the simulation. Considering that the simulation
has 220 days, 5% corresponds to 11 days. Our explanation for this behaviour
is that the generated rules are, in fact, detecting different stages during the
simulation, and not all the classifiers are used in the same stages.
The blocking classifier technique increases the global experience of the
population and the probability of survival of possibly good sub-solutions.
Nevertheless, the trade-off of using this mechanism is that the system could also
block bad solutions, and the probability of erasing good rules that have not
been activated gets higher.
The results of this experiment show that agents using the blocking classifiers
technique inside XCSR preserve important information in the classifiers. This
might lead to better performance in environments with single-step tasks and
Fig. 4. Mean experience of the XCSR population during 8800 days (40 simulations)
The aim of this experiment is to determine the best exploitation rate, or value
of (1 − ε), for this particular problem. We tested the performance of the agent
using different exploitation rates (0.9, 0.7, 0.5, 0.3) to determine which one is
the most suitable for the problem we are trying to solve. We also included
two extra exploitation rates, 0 and 1, as controls. Afterwards, we analyse the
two most interesting cases and compare them with the results of their dummy
competitors⁹. The rest of the parameters of the algorithm and the agent remained
the same. For this experiment we used a population of 1000 individuals and the
blocking mechanism.
The TicTACtoe and dummy agents involved in this experiment will be referred
to as tx and dx respectively, where x stands for the final threshold exploitation
rate, or 1 − ε (see Section 6.4).
Figure 5 shows the results according to the main measures of performance: the
final result and the number of received orders. In Figure 5(a) we can observe that
the agents with the smallest final balance in the bank account at the end of the game
are t0, followed by t100. The same behaviour can be observed in Figure 5(b).
This means that constant exploration (t0) (always giving the price discount
⁹ In this experiment we ran our base agent only with dummy competitors, using
different policies each time.
[Fig. 5: box plots of the final result (US$) and the number of received orders for
agents t0, t10, t30, t50, t70, t90 and t100]
in a random manner) produces the worst results. On the other hand, pure
exploitation (t100) does not achieve a good performance either, because it is
incapable of adapting to new environments. Agents that combine exploitation
and exploration during the whole learning process obtain the best results, due
to the dynamic characteristics of the environment. According to Table 5, we
can say that the agents t0 and t100 are significantly worse than the rest of the
agents in terms of final result and received orders.
It is worth noticing the curve in these two figures. It suggests that the
exploration rate does, in fact, affect the strategy developed, and that a balance between
exploitation and exploration is necessary to achieve good performance.
In Figure 5(b) we can see that the agent that serves the most orders is t70.
According to Table 5, there are no significant differences between t30, t50 and
t70 in terms of the final result, but there are differences in terms of the orders.
Moreover, the agent with the highest final result on average turns out to be t30.
This situation is clarified by Table 6, which compares the performance of
these two agents against their dummy competitors. Despite the efforts of t70 to
serve the largest portion of the market, this agent gets plenty of penalties for
late deliveries. Moreover, although agent t30 does not receive as many orders
as t70, this helps it to fulfil the orders that it already has. In the
end, t30 does not incur as many penalties as t70, producing a steadier
behaviour (lower variance). We could say that agent t30 is learning how to
handle a number of orders that minimizes the obtained penalties and maximizes
the final revenue.
We can also notice in this table that the implementation of TicTACtoe, no
matter the exploitation rate used (t30 or t70), gets a higher final revenue and handles
a larger portion of the market than the dummy competitors. Furthermore, there
is also a difference in the behaviour of the two dummy agents, since the performance
of the agents is relative to the competitors' behaviour. We can notice that agent
t70 makes it more difficult for the competitor d70 to obtain customers.
Regarding the factory usage, it is considered that a good agent uses its factory
capacity as much as possible to complete orders [13]. This helps the agent to
obtain higher revenues at the end of the game. Even though both configurations
Table 5. Statistical results from the comparisons of the agents using different values
for the exploitation rate. Column Kt shows the p-value for the Kruskal-Wallis test and
column Wilcox. test shows the p-values for the Wilcoxon tests.
Table 6. Comparison between the agents using 30% and 70% exploitation rates in
terms of penalties, factory usage and total income
t30 and t70 have the same production strategy, t70 makes more use of these
resources than t30. This behaviour is explained by the fact that agent t70 has
more orders to attend to. Consequently, considering this performance measure,
agent t70 learns a better strategy. Nevertheless, the production and purchase
strategies are still very simple, which makes it harder for this agent to deliver
these orders on time.
Regarding the total income, we can notice that both TicTACtoe agents
have incomes proportional to the number of received orders. Also, both agents
have higher incomes than their competitors. This is evidence that the developed
strategies give competitive prices according to the cost of the products and do
not offer the products below the production costs.
164 M. Franco, I. Martínez, and C. Gorrín
8 Conclusion
We designed and implemented a supply chain management agent for the TAC
SCM problem. Our agent solves the production and the purchases sub-problems
using static strategies, while it solves the sales sub-problem using a dynamic
strategy.
Moreover, the purchase strategy is based on the acquisition of components
considering production commitments for the next simulation days. The produc-
tion strategy is based on manufacturing goods prioritizing orders according to
their expected prots and due dates.
In addition, we implemented a dynamic sales strategy built on Wilson's XCSR
classifier system. Through the XCSR mechanism, we obtained a suitable set of
rules for the TAC SCM sales problem. This set of rules performed better than the
strategies used as controls.
As our initial solution for the TAC SCM sales problem encountered an issue
when handling delayed rewards in a single-step environment, we introduced a
blocking classifier technique. We showed that the use of this technique yields
more experienced populations and improves the quality of the generated strategies
in this scenario. However, more experimentation needs to be carried out
regarding this matter.
References
1. Trading agent competition - TAC SCM game description,
http://www.sics.se/tac/page.php?id=13
2. Benisch, M., Sardinha, A., Andrews, J., Sadeh, N.: CMieux: adaptive strategies
for competitive supply chain trading. In: ICEC 2006: Proceedings of the 8th
International Conference on Electronic Commerce, pp. 47–58. ACM Press, New York
(2006)
3. Bull, L.: Applications of Learning Classifier Systems. Springer, Heidelberg (2004)
4. Butz, M.: IlliGAL Java-XCS - LCS Web (2006)
5. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L.,
Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, p.
253. Springer, Heidelberg (2001)
6. Collins, J., Arunachalam, R., Sadeh, N., Eriksson, J., Finne, N., Janson, S.: The
Supply Chain Management Game for the 2007 Trading Agent Competition,
Pittsburgh, Pennsylvania (2006)
7. Conover, W.J.: Practical Nonparametric Statistics. John Wiley & Sons, Chichester
(December 1998)
8. Franco, M., Gorrín, C.: Diseño e implementación de un agente de corretaje en una
cadena de suministros en un ambiente simulado. Universidad Simón Bolívar (2007)
9. Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical
Biology IV, pp. 263–293. Academic Press, New York (1976)
10. Lanzi, P.: Learning classifier systems: then and now. Evolutionary Intelligence 1(1),
63–82 (2008)
11. Pardoe, D., Stone, P.: Bidding for customer orders in TAC SCM. In: Faratin, P.,
Rodríguez-Aguilar, J.-A. (eds.) AMEC 2004. LNCS (LNAI), vol. 3435, pp. 143–157.
Springer, Heidelberg (2006)
12. Pardoe, D., Stone, P.: An autonomous agent for supply chain management. In: Adomavicius,
G., Gupta, A. (eds.) Handbooks in Information Systems Series: Business
Computing, vol. 3, pp. 141–172. Emerald Group (2009)
13. Stan, M., Stan, B., Florea, A.M.: A dynamic strategy agent for supply chain management.
In: Proceedings of the Eighth International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing, pp. 227–232. IEEE Computer Society,
Los Alamitos (2006)
14. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
149–175 (1995)
15. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann,
W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, p. 209.
Springer, Heidelberg (2000)
16. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W.,
Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 158–174. Springer,
Heidelberg (2001)
17. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, San Francisco (2005)
Identifying Trade Entry and Exit Timing Using
Mathematical Technical Indicators in XCS
Richard Preen
Abstract. This paper extends current LCS research into financial time series
forecasting by analysing the performance of agents utilising mathematical tech-
nical indicators for both environment classification and in selecting actions to
be executed. It compares these agents with traditional models which only use
such indicators to classify the environment and exit at the close of the next day.
It is proposed that XCS agents utilising mathematical technical indicators for
exit conditions will not only outperform similar agents which close the trade at
the end of the next day, but also result in fewer trades and consequently lower
commissions paid. The results show that in four of five assets, agents using in-
dicator exit conditions outperformed those exiting at the close of the next day,
before commissions were factored in. After commissions are factored in, the
performance gap between the two agent classes further widens.
1 Introduction
The primary objective of this paper is to extend the current research into the use of the
XCS Learning Classifier System [28] within the domain of financial time series fore-
casting. Recent work (e.g., [9], [21], [13], and [24]) has demonstrated the successful
application of XCS in this area. However, in each of the studies, agents are trained on
daily price data to evolve trade entry rules composed of mathematical technical indi-
cators in conjunction with a fixed rule to close the trade the following day, i.e., the
exit timing is not evolved. It is posited that by utilising mathematical technical indica-
tors to identify the timing of the market exit, as opposed to simply exiting on the next
day, not only are the associated transaction costs reduced, but the excess returns are
increased due to an inherent noise reduction, since less prediction accuracy is required.
Initially, several XCS agents are produced to replicate the traditional model and
demonstrate their application to financial time series forecasting. In extending this
work, the agents additionally evolve mathematical technical indicators to identify
appropriate exit conditions. These two models are then compared and the agents are
furthermore benchmarked against a buy-and-hold strategy to evaluate whether market
beating excess returns can be generated.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 166–184, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Brock, Lakonishock and LeBaron [4] investigated two of the most popular trading
rules from technical analysis (moving averages and trading range breakout) on the
Dow Jones Industrial Average over the period 1897-1986. They generated typical
returns of 0.8% over a 10-day period compared to a normal 10-day upward drift of
0.17%. After the buy signals were generated, the market increased at a rate of 12%
per year. Following the sell signals, a decrease of 7% per year was noted. Subsequently,
Detry and Grégoire [6] successfully replicated the results for the moving
average tests on a series of formally selected European indexes. Moreover, technical
analysis has been shown useful in the foreign exchange markets by Dooley and
Schaffer [8], Sweeney [25], Levich and Thomas [12], Neely et al. [15], Dewachter
[7], Okunev and White [16], and Olson [17].
The primary benefit from the use of mathematical technical indicators in financial
time series forecasting is that the algorithms are precisely defined. This means that the
signals they produce are free from errors of subjective human judgement and emotion,
are replicable, and can easily be tested over large amounts of data and varying assets
to quantify performance. Learning Classifier Systems (LCS) [10] can easily co-evolve
different combinations of these indicators to form entry/exit rules for financial trad-
ing, and even to evolve the technical indicators themselves.
2 Related Work
There has been widespread research on Artificial Neural Networks (ANN) and Ge-
netic Programming (GP) for financial time series forecasting. GP examples include
Neely et al. [15], Allen and Karjalainen [1], and Chen [5]. Examples of ANN fore-
casting financial time series include Tsibouris and Zeidenberg [25], Steiner and
Wittkemper [23], Kalyvas [11], and Srinivasa, Venugopal and Patnaik [22]. In con-
trast to ANN and GP, comparatively little research has been conducted into the use of
LCS for financial time series forecasting. Early examples of LCS research in this area
include Beltrametti et al. [2] using LCS to predict currencies, and Mahfoud and Mani
[14] and Schulenburg and Ross ([18], [19], and [20]) predicting stocks.
More recently, Stone and Bull [24] created a single-step ZCS [27] agent to forecast
long or short positions on the Foreign Exchange (FX) Market, trading with the full
amount of the balance each time. The architecture was modified by utilising the
NewBoole update mechanism, tweaking the covering algorithm, and introducing a
new specialize operator. The agent was required to always be in the market. Daily
price and interest rate data was used, covering the period of January 1974 to October
1995 for the U.S. Dollar (USD), German Deutsche Mark (DEM), British Pound
(GBP), Japanese Yen (JPY), and Swiss Franc (CHF). These were then used to create
the currency pairs USD/GBP, USD/DEM, USD/CHF, USD/JPY, DEM/JPY, and
GBP/CHF.
The mathematical technical indicators used were based on four primitive functions
of the time series which could return either the average price over a specified period,
the minimum price over a specified period, the maximum price over a specified pe-
riod, or the price at a specified day. ZCS was used to generate the indicators, where an
indicator is a ratio of two of the primitive functions. For example, a log indicator:
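The four primitives and a ratio indicator built from them can be sketched as follows. The exact log-indicator formula is not reproduced here, so the log-of-a-ratio-of-averages form, the function names, and the inclusive window convention are illustrative assumptions rather than the formula from [24]:

```python
# Sketch of the four time-series primitives described above and one
# possible ratio indicator. The log-ratio form below is an assumed
# illustration; the original indicator definition is not reproduced.
import math

def avg(prices, t, m):
    """Average price over the m days ending at day t."""
    return sum(prices[t - m + 1 : t + 1]) / m

def minimum(prices, t, m):
    """Minimum price over the m days ending at day t."""
    return min(prices[t - m + 1 : t + 1])

def maximum(prices, t, m):
    """Maximum price over the m days ending at day t."""
    return max(prices[t - m + 1 : t + 1])

def price_at(prices, t):
    """Price at day t."""
    return prices[t]

def log_indicator(prices, t, m1, m2):
    """Ratio of two primitives on a log scale (assumed form)."""
    return math.log(avg(prices, t, m1) / avg(prices, t, m2))
```

A classifier condition would then test such an indicator value against evolved thresholds.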
The payoff (or feedback) given to the agents for executing a particular action was
decided based upon whether the following days price closed above or below the
current day. A payoff of zero was awarded for executing a wrong action and a con-
stant non-zero value was awarded for executing a correct action. The agents were
assessed using daily price data for IBM, EXXON, Ford, CitiGroup, Coca-Cola, and
Banco Santander Cent Hispano. The training period ran from January 1990 to De-
cember 2003 and then an evaluation phase took place on data from January 2005 to
June 2006.
The results found that the Meta Agents usually outperformed the individual technical
agents and that the Micro Agents could not outperform both the buy and hold and
bank strategies. Further, the Meta Agent always outperformed the Random
Agent. However, in terms of accuracy, the Meta Agents performed the same as or worse
than the Micro Agents. In summary, the major finding of this model was that a hierarchical
XCS using multiple agents can produce better results than a single-agent
XCS. The fact that the Meta Agents always outperformed the Random Agent
also illustrates that the system is capable of learning useful rules, even though in this
case they were not able to outperform the relevant real-world benchmarks.
Schulenburg and Wong [21] explored Portfolio Allocation using a HXCS. Agents
received inputs from technical indicators and attempted to learn profitable rules to
trade the market data provided. In addition to a Technical Analysis (TA) Agent, a
Market (Mkt) Agent and an Options Agent were created to provide further informa-
tion to the decision making process.
The TA Agent incorporated rules based upon inputs from the following four
mathematical technical indicators: Rate of Change (ROC), Relative Strength Index
(RSI), Ultimate Oscillator (ULTOSC), and On Balance Volume (OBV). The Mkt
Agent integrated rules from the following four general market indicators: the daily
percent return of the S&P500 Index, the daily S&P500 Index volume, the daily 10-year
T-note yield, and the daily 3-month T-bill yield. The Options Agent
included rules from the following 5 Options market indicators: Delta (i.e., the meas-
urement of the sensitivity of an Option value to the underlying stock price), Gamma
(i.e., the measurement of the second order sensitivity of the Option value to the under-
lying stock price), Vega (i.e., the measurement of the sensitivity of Option value to
the stock price volatility), Theta (i.e., the measurement of the sensitivity of Option
value to the passage of time), and implied volatility (i.e., the stock volatility estimate
given by the Black Scholes formula).
The daily stock data tested was for CitiGroup, IBM, General Motors, Eastman
Kodak, and Exxon Mobil over the period 4th January 1996 to 28th April 2006. A
commission fee of 0.5% of the transaction value was set. In contrast to Gershoff's HXCS,
the agents attempted to predict the movement of tomorrow's stock price and the
percentage of total wealth to invest, instead of just buy or sell signals. The agents
were given the choice between investing in the risky stock and investing in safe
treasury bills, which returned a variable interest rate based on real-world values.
The input data from the indicators was first divided into nine discrete cut points by
using leave-one-out-cross-validation. The target series then underwent two phases of
discretization. The first phase quantized the data using the unsupervised method of
histogram equalisation in order to add class label information to the target series.
Subsequently, the supervised method of entropy-based discretization was used to split
the series into intervals in order to maximise the information gain. Once quantization
had been completed, a binary vector was mapped to the intervals so that it could be
used by an XCS agent.
Next, the cumulative performance of the Meta Agent was evaluated. If its predic-
tion accuracy was less than the specified threshold value, all agents (including the
Meta Agent itself) were destroyed and a new set of agents with a new discretization
process was launched. The new set of cut points was based on the preceding ten
days. All new agents then started their training phases by exploring
the new training environment. After completing training, they were placed back into
the real-world environment.
The best results of the agents were compared against four benchmarks: buy and
hold, bank, price trend, and a Random Agent. In the case of CitiGroup, all of the XCS
agents outperformed all four of the benchmark agents. Moreover, in all five stocks, all
XCS agents outperformed the Random Agents. The authors suggest that there is a
mere 0.00003% probability that this occurred by chance and that it provides solid
proof that stock prices have a rational component.
Further, the XCS agents discovered a famous 1960s trading rule.¹ This last discovery
highlights one of the major benefits of using XCS (as opposed to other alternatives
such as an ANN) to forecast financial time series. The ability to have the rules in
an easily human readable form enables the researcher to evaluate the logic of any
discovered rule and decide whether it makes sense. This is important because if the
rule does not make any logical sense to a trader then it is quite possible that the rule
has been derived from over-fitting the data and its use in the future is questionable.
Interestingly, in contrast to Gershoff's findings, the Meta Agents here did not
perform very well in comparison to the single agents. In 3 of the 5 stocks, the Meta
Agents underperformed all three of the single agents. If we are to use the best results
as indicative of performance (as suggested by [21]), this provides mixed information
on the effectiveness of HXCS as opposed to standard XCS agents.
Liu and Nagao [13] conducted a further study of the application of HXCS to
financial time series forecasting. Here, performance was evaluated on the prediction
accuracy of the direction of the next day. Two Meta Agents were used and their
binary perceptions set solely according to comparisons between various moving aver-
ages. The moving averages used were of the form MAt,m where the average is calcu-
lated from time t back to time t-m. Agent1 consisted of a bitstring of length 24 where
each bit was set according to the evaluation of 24 pairs of successive moving averages
with an interval length of 20. Agent2 consisted of a bitstring of length 18 where the
first 6 moving averages used an interval length of 10 and a further 12 moving
averages with an interval length of 5; e.g., bit 18 is set to logical 1 if MAt−4,5 < MAt,5.
Furthermore, a fuzzy matching mechanism was used, where a classifier is said to have
matched the environment state even if up to 10% of the bits are non-matching. For each
environment state received by the HXCS, each Meta Agent receives the input, con-
structs a match set, and then calculates an average prediction value for the set. The
agent with the highest average match set prediction value is then chosen to advocate
¹ If the ultimate oscillator is greater than 70, and the previous stock price change is within 2 to
3%, then tomorrow's stock price change will be −2.5 to −3.5%.
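The MAt,m perception bits and the fuzzy matching rule just described can be sketched as follows. The helper names, the inclusive window convention, and the treatment of '#' (don't-care) symbols are assumptions for illustration:

```python
# Sketch of the moving-average bit perception and fuzzy matching
# described above. ma(prices, t, m) averages prices from day t back to
# day t - m (inclusive); bit naming and '#' handling are assumptions.
def ma(prices, t, m):
    """Moving average from day t back to day t - m (m + 1 values)."""
    return sum(prices[t - m : t + 1]) / (m + 1)

def ma_bit(prices, t, lag, m):
    """1 if the lagged moving average is below the current one,
    e.g. lag=4, m=5 for the bit-18 example in the text."""
    return 1 if ma(prices, t - lag, m) < ma(prices, t, m) else 0

def fuzzy_match(condition, state, tolerance=0.10):
    """A classifier matches the state even if up to 10% of its
    non-# positions disagree, per the fuzzy matching scheme."""
    mismatches = sum(1 for c, s in zip(condition, state)
                     if c != '#' and c != s)
    return mismatches <= tolerance * len(condition)
```

With an 18-bit string, the 10% tolerance allows one mismatching bit before a classifier stops matching.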
an action in the normal XCS procedure and parameters are updated for the action set
of that Meta Agent.
Experiments were run on four indexes (NIKKEI, NASDAQ, TOPIX, HSI) and 11
other stocks selected from the NIKKEI using daily closing price data from January
2000 to December 2004. The direction hit-rate of both Meta Agents always provided
superior performance to a trend-following strategy that predicted the direction of the
next day based on the change from the previous day. In addition, HXCS outperformed
the Meta Agents by 2-3%. For example, the trend following strategy correctly pre-
dicted the direction 56% of the time for the NASDAQ, whereas Agent1 was correct
66.9% of the time, Agent2 70.8%, and HXCS 73.8%.
3 Learning Framework
Perhaps the biggest limitation shared by [9], [21], [13], and [24] is that
they all attempt to use daily data to forecast the next day's price. Since the accuracy
of agents' predictions depends largely on how well the problem is represented [21],
we should adopt an approach that mimics how real trading is conducted as closely as
possible.
Figure 1 shows the daily price chart of the EURUSD currency pair with the vertical
dotted line in the centre marking August 15th 2007. At the close of this day, the
Relative Strength Index (RSI) indicator set to 14 periods (i.e., calculating the RSI over the
previous fourteen daily open, high, low, close bars), RSI(14), produces a value of
31.2109. For Agent 1 in [9], this value would set bit 6 (RSI(14) ≤ 35) to 1. On the
following day the price closed lower (at 1.3426) from its open (of 1.3442). Supposing
that the agent had identified that this rule was part of a buy signal, it would have re-
sulted in a loss under the model and negative feedback would have been given.
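The bit-setting step above can be illustrated with a simple RSI sketch. This uses plain averages of gains and losses over the window; Wilder's original RSI smooths them exponentially, and the bit-6 condition RSI(14) ≤ 35 is taken here as an assumption:

```python
# Simple (non-smoothed) RSI sketch for the bit-setting step above.
# Production RSI implementations usually apply Wilder's exponential
# smoothing to the gain/loss averages; this version is illustrative.
def rsi(closes, period=14):
    """Relative Strength Index over the last `period` closes."""
    deltas = [closes[i] - closes[i - 1]
              for i in range(len(closes) - period, len(closes))]
    gains = sum(d for d in deltas if d > 0)
    losses = sum(-d for d in deltas if d < 0)
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)

def bit6(closes):
    """Set bit 6 when RSI(14) <= 35 (assumed threshold)."""
    return 1 if rsi(closes, 14) <= 35 else 0
```

A falling market drives the RSI toward 0 and sets the bit; a rising market drives it toward 100 and clears it.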
However, if we look at the bigger picture in Figure 2 we can see that in fact this
would have been an excellent place to enter the market. In Figure 2, the vertical dot-
ted line highlights the same day as in Figure 1 but illustrates that, in the bigger pic-
ture, the EUR continued to climb in value against the USD during the subsequent
months following the RSI signal. Clearly, this method of evaluating and providing
feedback to the model is far too short-sighted and demands far too much accuracy.
Real traders utilise Stop Losses (SL), which are triggers set a certain distance from the
entry price that exit the market at a loss. This value exists in part because markets
are infamous for swaying noisily whilst actually moving towards a logical target (as
in the drunken-man analogy). Furthermore, most real traders would never attempt to
predict the closing price of the next bar (e.g., next day when using daily data) because
it is asking for far too much accuracy within a widely acknowledged noisy system.
They would simply exit the market at their SL, or attempt to exit the market in profit
at some multiple of the initial risk (i.e., SL). Through such a method, successful trad-
ers can lose half, or more, of their trades whilst still finishing profitably.
If the models are intended to replicate real traders, we must adopt a more real-
world approach. Such an approach must seek to avoid pre-specifying the exact bar to
exit the trade and provide feedback. One approach commonly used in real trading is to
define the exit conditions in terms of fixed price levels. For example, if the agent
discovers a buy signal, the SL is set $5 below the entry price and a Take Profit (TP)
(i.e., a price level at which a trade is considered a winner and profit is taken) is set $10 above the
entry price. A more sophisticated technique would be to test combinations of
SL and TP to find the optimal pair in addition to the entry signal. However, this might
easily lead towards curve-fitting the model too specifically to the training set.
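The fixed SL/TP exit described above can be sketched as a simple scan forward from the entry, assuming a long position and that fills occur exactly at the SL/TP level (i.e., ignoring gaps); the function name and levels are illustrative:

```python
# Minimal sketch of a fixed stop-loss/take-profit exit for a long
# entry at `entry`, with dollar distances sl and tp as in the example
# above ($5 below, $10 above). Assumes fills exactly at the level.
def sl_tp_exit(prices, entry, sl=5.0, tp=10.0):
    stop = entry - sl
    target = entry + tp
    for p in prices:
        if p <= stop:
            return stop      # stopped out at a loss
        if p >= target:
            return target    # profit taken
    return None              # still in the trade
```

Because the loss is capped at sl and the win targets tp, a trader can lose more trades than they win and still finish profitably, as noted above.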
Perhaps the most widely used method to identify when to exit a trade is the same as
that used to enter the trade in the first place: technical analysis. For example, if a rule
to buy an asset is "if RSI(14) < 30 then buy", then a suitable exit rule might be "if
RSI(14) > 70 then exit". This risks complicating the model and exponentially increasing
the search space, but it is the only way to provide a real-world measurement of
success.
4 Implementation
4.1 Data
The data used is the daily price/volume information over the period of February 3rd
1992 to December 14th 2007 for Exxon Mobil Corp. (XOM) (Figure 3.a) the Dow
Jones Industrial Average (DJI) (Figure 3.b), General Motors Corporation (GM)
(Figure 3.c), Intel Corp. (INTC) (Figure 3.d). In addition, data over the period of
December 26th 1991 to December 14th 2007 is used for 30-Year Treasury Bonds
(TYX) (Figure 3.e). They were chosen to include one index (DJI), two ranging assets
(GM and INTEL), one falling asset (TYX), and one increasing asset (XOM).
Moreover, the assets represent diverse market sectors: automobiles, technology,
bonds, oil, and an index average. For DJI, the adjusted closing price is divided by
1000 to enable the agents to purchase shares with a balance of $10,000 or less. In all
cases, 4000 data points (i.e., days) are used. The first 3000 data points form a training
set used to evolve new rules and the most recent 1000 data points are used as a trading
set to evaluate these rules.
4.2 XCS
The traditional ternary representation is used, where the environment inputs are dis-
cretized as outlined in the following sections. A fixed reward of 1000 is given to prof-
itable actions and 0 to actions which result in no profit or a loss. XCS parameters used
are as follows (taken from [3] and not further optimised so as not to bias the results
used to compare the models): γ=1, β=0.2, α=0.1, θGA=25, θdel=20, θsub=20, P#=0.6,
ν=5, χ=0.8, ε0=10, μ=0.04. Each agent is shown the training set only once before being
evaluated on the trading set. The alternation between exploring and exploiting rules
is modified as in [21] to:
(1)
Running the equation above over 1000 iterations (i.e., the length of the trading set)
produced a range of 896 to 932 exploit steps being executed. Thus, over 1000 itera-
tions, exploits are conducted approximately 89.6 - 93.2% of the time. This produces
an increasing bias towards exploiting the knowledge acquired as the rules become
more evolved, which is important since the system will perform a single pass through
the data.
Agent 1 utilises three stochastic indicators with the periods (8, 3, 3), (32, 12, 12), and
(128, 48, 48). The (8, 3, 3) was chosen simply because it is the most commonly used
configuration, then the two subsequent combinations are each four times greater,
thereby providing a short-term trend, intermediate-term trend, and long-term trend.
The direction of the stochastic indicators and their position (i.e., the value between 0
and 100) is used to classify the environment. The signal line was used for the (8,3,3)
parameters to smooth the line to reduce noise whereas the (32,12,12) and (128,48,48)
main lines are already sufficiently smoothed.
The real-numbered indicators are discretized through a simple mechanism. A 9-bit
binary string is composed where the first two bits are used to classify the (8,3,3)
signal line's position. The third and fourth bits are used to classify the (32,12,12) main
line's position, and the fifth and sixth bits are used to classify the (128,48,48) main
line's position. The binary encoding for each indicator's position is summarised
below in Figure 4.
Indicator Binary
0 - 24 00
25 - 49 01
50 - 74 10
75 - 100 11
Lastly, three bits are used to classify the direction of each of the stochastic lines as in
Figure 5.
The second agent is a trend-following agent comprised mostly of Exponential Moving
Averages (EMA). A 20-, 50- and 100-period EMA is constructed. Each EMA's direction
(i.e., rising or falling) and the position of the current price relative to the EMA (i.e.,
above or below) are used to classify the environment. In addition, the direction of the
Moving Average Convergence Divergence (MACD) (12, 26, 9) main line and the
direction of the Stochastic (32, 12, 12) main line are used to provide additional trend
information. The encoding is summarised below in Figure 6.
There are three sets of exit conditions for each agent. Firstly, there is the traditional
model where the next day is used as the only exit condition, meaning that any trade
entered today is exited at tomorrow's closing price. In addition to this, there are two
sets of technical indicator exit conditions: a simple set with only 4 exit conditions (see
Figure 8) and a more advanced set comprising 16 exit conditions (see Figure 9). To
keep the current study simple, the agents were only allowed to buy or hold, with selling
not permitted. In both the 4- and 16-exit sets, one of the actions causes the agent to
move to the next day without trading (i.e., holds for one day), where reward is given if
the price remained unchanged or decreased.
This is implemented by moving forward each day in the index and comparing the
indicators' values with the exit conditions (as would happen in live trading). When
a match is found, the result of the action is calculated, the balance updated, and
reward given. The comparison of the indicator values was implemented by individually
checking each rule. This was done for simplicity and to ensure that the rules
were functioning correctly. However, with a bigger set of exit conditions to test (since
we are testing every applicable combination), one would assign bits to each condition
in the same manner the environment conditions are constructed, and then any invalid
actions (e.g., EMA(20) cannot be rising and falling simultaneously) would be
removed by forcing XCS to choose another action.
1. Do not enter any trades today (i.e., hold for one day).
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when Stochastic (32,12,12) decreases.
5. Buy today and exit when EMA (50) decreases.
6. Buy today and exit when MACD (12,26,9) and EMA (20) decrease.
7. Buy today and exit when MACD (12,26,9) and Stochastic (32,12,12) decrease.
8. Buy today and exit when MACD (12,26,9) and EMA (50) decrease.
9. Buy today and exit when EMA (20) and Stochastic (32,12,12) decrease.
10. Buy today and exit when EMA (20) and EMA (50) decrease.
11. Buy today and exit when Stochastic (32,12,12) and EMA (50) decrease.
12. Buy today and exit when MACD (12,26,9) and EMA (20) and Stochastic
(32,12,12) decrease.
13. Buy today and exit when MACD (12,26,9) and EMA (20) and EMA (50) de-
crease.
14. Buy today and exit when MACD (12,26,9) and Stochastic (32,12,12) and EMA
(50) decrease.
15. Buy today and exit when EMA (20) and Stochastic (32,12,12) and EMA (50)
decrease.
16. Buy today and exit when EMA (20) and Stochastic (32,12,12) and EMA(50)
and MACD (12,26,9) decrease.
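The walk-forward settlement described above can be sketched as a scan from the day after entry until the chosen exit condition fires. Here `exit_fires` is a hypothetical predicate standing in for the indicator checks, and settling at the matching day's close with the fixed 1000/0 reward follows the scheme stated earlier:

```python
# Walk-forward sketch of the exit-matching procedure: advance day by
# day from entry until the exit condition fires, settle at that day's
# close, and pay the fixed reward (1000 for profit, 0 otherwise).
# `exit_fires` is a hypothetical stand-in for the indicator checks.
def settle_trade(closes, entry_day, exit_fires):
    entry_price = closes[entry_day]
    for day in range(entry_day + 1, len(closes)):
        if exit_fires(day):
            profit = closes[day] - entry_price
            reward = 1000 if profit > 0 else 0
            return closes[day], reward
    return None, 0  # condition never fired within the data
```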
5 Experimentation
Tables 1 to 5 present a comparison between the agents with the next day as the exit
condition, 4 technical indicator exits as the exit conditions, and with 16 technical
indicator exits as the exit conditions. Each agent starts with an initial balance of
$10,000. The results presented are the best run and the average run of 100 experi-
ments. The highest performing result in each category is highlighted in bold.
The results from the experiments comparing the next-day-exit agents with the
agents using technical indicator exit conditions, after being shown the training set
only once (Tables 1-5), show that for XOM, the agent with the highest balance
($25,648.75) and highest average balance ($15,899.56) was Agent 2 with 16 technical
indicator exits. For DJI, Agent 1 with 4 technical indicator exits produced the highest
balance ($15,120.46) and Agent 3 with 16 technical indicator exits achieved the high-
est average balance ($12,102.06). For INTEL, Agent 2 with 4 technical indicator exits
produced the highest balance ($21,000.59) and the highest average balance
($10,522.50). In the case of GM, again Agent 2 with 4 technical indicator exits pro-
duced both the highest balance ($20,116.72) and the highest average balance
($9,645.54). Lastly, for TYX, Agent 1 with next-day-exit conditions produced both
the highest balance ($15,671.20) and highest average balance ($11,389.56).
The results have shown that in all cases except TYX, an agent using technical
indicator exits was superior to one exiting at the close of the next day, in terms of both
the highest achievable balance and the average balance over its experiments. Moreover,
since commissions are not factored in at this stage, it is highly likely that the gap
between the two agent classes would widen further.
Table 1. XOM
Table 2. DJI
Table 3. INTEL
Table 4. GM
Table 5. TYX
Table 6. t-Stats of Tech Exits vs. Next Day (N.D.) exits. Two-Sample Assuming Unequal
Variances. Results in bold are statistically significant at the 95% confidence level.
However, in the case of TYX, the best performing agent was Agent 1 with next-
day-exit conditions. Furthermore, all next-day-exit agents surpassed the technical
indicator exit agents in terms of both highest balance and average balance, showing
that for some assets next-day exits can be the best. However, introducing commis-
sions would likely reduce this gap and perhaps even allow the technical indicator exit
agents to overtake the next-day-exit agents. The fact that the next-day-exit agents
beat the technical indicator exits is perhaps explainable by the split between the train-
ing and trading set, since the training set for TYX primarily decreases but the trading
set for TYX moves in a sideways range.
Table 6 presents the t-Stats for the three agent types where exiting at the close of
the next day is compared with both the 4 and 16 technical indicator exit sets. It is
shown that almost all of the results are statistically significant at the 95% confidence
level. In particular, for XOM and DJI, all agents utilising technical indicator exits
surpassed the same agents when exiting at the close of the next day, and these results
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators 181
were statistically significant. Additionally, Agent 2, when using 4 indicator exits,
provided statistically significant and superior results compared to exiting at the
close of the next day in all cases except for TYX.
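The test reported in Table 6 is the two-sample t-test assuming unequal variances, i.e. Welch's test. A minimal sketch of how such a comparison can be computed is shown below; the two balance samples are hypothetical stand-ins, not taken from the paper's tables.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t-statistic assuming unequal variances (Welch's test),
    with the Welch-Satterthwaite approximation for degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)   # sample variances
    se2 = va / na + vb / nb                           # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical final balances over repeated experiments (illustrative only)
tech_exit = [15890.0, 14210.5, 16102.3, 15005.9, 14788.2]
next_day = [12050.0, 11820.4, 12930.7, 11410.2, 12215.8]

t, df = welch_t(tech_exit, next_day)
# Compare |t| against the critical value of the t-distribution at the chosen
# confidence level (roughly 2.36 for df near 7, two-tailed, at 95%).
print(round(t, 2), round(df, 1))
```

The same statistic and its p-value can be obtained with `scipy.stats.ttest_ind(a, b, equal_var=False)` when SciPy is available.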
Finally, when comparing the best performing agents with a buy and hold strategy,
we observe that for INTEL, GM, and TYX, all of the agents using technical indicator
exits beat this strategy. Further, the best performing agents on all assets were always
able to beat the buy and hold balance; however, the average of the agents' balances did
not. Furthermore, should commissions be introduced (the cost would vary from bro-
ker to broker), these results compared to a buy and hold strategy would deteriorate
to some extent.
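As a sketch of the benchmark being compared against, assuming the buy and hold strategy simply invests the full starting balance at the first price of the trading set and liquidates at the last (no commissions); the prices and agent balance below are invented for illustration.

```python
def buy_and_hold_balance(prices, starting_cash=10000.0):
    """Final balance from buying at the first price of the trading set
    and holding until the last price (commissions ignored)."""
    shares = starting_cash / prices[0]
    return shares * prices[-1]

# Hypothetical daily closes for a declining asset (illustrative only)
prices = [50.0, 48.5, 47.2, 44.9, 46.1, 43.0]

benchmark = buy_and_hold_balance(prices)
agent_final = 10450.0  # hypothetical agent balance over the same period
# A declining series pulls the benchmark below the starting cash, which is
# why agents that can stay out of the market beat it more easily.
print(round(benchmark, 2), agent_final > benchmark)
```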
However, the agents' average balances only outperformed a buy and hold strategy
when the stocks declined. An explanation for this is that when the agent wrongly exits
the market, although there is no actual loss, there is an opportunity cost be-
cause the market increases and the agent underperforms its benchmark. Thus, stocks
which generally decline over the period analysed are much easier to beat because
agents have the choice to be in or out of the market, while it is much harder to beat
those that are generally going up.
Table 7 shows the average number of trades executed over 100 tests of each asset
by Agent 2. Again, the agent is shown the training set only once before being assessed
in the trading set. The table shows that when using 4 technical indicator exits, the
agent always trades fewer times than with next-day-exit conditions. Further, this is
statistically significant (as shown in Table 8). In some cases 40% fewer trades are exe-
cuted, which would result in substantial transaction fee savings. When utilising 16
technical indicator exits, Agent 2 trades a similar number of times as the agents using
next-day-exit conditions. This is a result of adding more exit conditions, which in-
creases the probability of closing the trade after a short period of time. Thus, the 16
technical indicator exit agents tested do not offer any transaction fee savings in com-
parison to the traditional model.
Table 8. t-Stats of Number of trades Executed by Agent 2 with Tech Exits vs. Next Day (N.D.)
exits. Two-Sample Assuming Unequal Variances. Results in bold are statistically significant at
the 95% confidence level.
6 Conclusions
Agents utilising mathematical technical indicators for the exit conditions outper-
formed similar agents which used the next day as the exit condition in all cases except
for TYX (30-Year Treasury bond), even before taking commissions into account,
which would penalise the most active agents (i.e., the agents using next-day-exit).
Moreover, these results were achieved with generic XCS parameters that were not
tuned to improve performance.
The reason TYX was anomalous is attributable to either the position of the cut-off
point between the training and trading set, or the TYX data being inherently noisier
than the other assets, which were all stocks. The cut point in this asset is particularly
important because it resulted in a training set which primarily declined and a trading
set that ranged sideways. Thus, the agents would have adapted rules to trade within
this downward environment but were not prepared for the environment within which
they were assessed.
An analysis of the number of trades executed by each agent showed that, on aver-
age, 31.73% fewer trades were executed when using 4 technical indicator exit condi-
tions; this would result in substantial transaction savings and further boost the
performance of these agents in comparison to the agents using next-day-exit condi-
tions. However, the agents using 16 mathematical technical indicator exits executed
with approximately the same frequency as the agents using next-day-exit conditions.
This was a result of having more rules with different exit conditions that could be
triggered, so the agents were closing the trades with greater frequency.
184 R. Preen
1 Introduction
The assumption that a properly trained classifier will be able to predict the
behavior of unseen data from the same problem is at the core of any automatic
classification process. However, this hypothesis tends to prove unreliable when
dealing with biological data (or other experimental sciences), especially when
such data is provided by more than one laboratory, even if they are following
the same protocols to obtain it.
This paper presents an example of such a case, a prostate cancer diagnosis
problem where a classifier built using the data of the first laboratory performs
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 185–197, 2010.
© Springer-Verlag Berlin Heidelberg 2010
186 J.G. Moreno-Torres et al.
very accurately on the test data from that same laboratory, but comparatively
poorly on the data from the second one. It is assumed that this behavior is due to
a fracture between the data of the two laboratories, and a Genetic Programming
(GP) method is developed to homogenize the data in subsequent subsets. We
consider this method a form of feature extraction because the new dataset is
constructed with new features which are functional mappings of the old ones.
The method presented in this paper attempts to optimize a transformation
over the data from the second laboratory, in terms of classifier performance.
That is, the data from the second lab is transformed into a new dataset where
the classifier, trained on the data from the first lab, performs as accurately as
possible. If the performance achieved by the classifier on this new, transformed
dataset is equivalent to the one obtained on the data from the first lab, we
understand the data has been homogenized.
More formally, the classifier f is trained on data from one laboratory (dataset
A), such that y = f(x_A) is the class prediction for one instance x_A of dataset
A. For the data from the other lab (dataset B), it is assumed that there exists
a transformation T such that f(T(x_B)) is a good classifier for instances x_B
of dataset B. The goodness of the classifier is measured by the loss function
l(f(T(x_B)), y), where y is the class associated with x_B, and l(·,·) is a measure
of distance between f(T(x_B)) and y. The aim is to find a transformation T such
that the average loss over all instances in B is minimized.
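To make the loss formulation concrete, here is a minimal Python sketch with hand-made stand-ins: the threshold classifier f, the 0.3 shift undone by T, and the four instances are all illustrative assumptions, not the paper's classifier or data.

```python
def zero_one_loss(prediction, label):
    """l(f(T(x_B)), y): 0 when the prediction matches the class, 1 otherwise."""
    return 0 if prediction == label else 1

def average_loss(classifier, transform, dataset_b):
    """Average loss of the fixed classifier f over transformed instances of B."""
    total = sum(zero_one_loss(classifier(transform(x)), y) for x, y in dataset_b)
    return total / len(dataset_b)

# Illustrative stand-ins: f was 'trained' on dataset A and thresholds one
# feature; dataset B's features are shifted by +0.3 relative to A, and the
# candidate transformation T undoes that shift.
f = lambda x: 1 if x[0] > 0.5 else 0
T = lambda x: [x[0] - 0.3]
dataset_b = [([0.9], 1), ([0.7], 0), ([1.2], 1), ([0.6], 0)]

identity = lambda x: x
# The untransformed data gives average loss 0.5; the transformed data 0.0.
print(average_loss(f, identity, dataset_b), average_loss(f, T, dataset_b))
```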
The remainder of this paper is organized as follows: In Section 2, some prelimi-
naries about the techniques used and some approaches to similar problems in the
literature are presented. Section 3 has a description of the proposed algorithm.
Section 4 details the real-world biological dataset that motivates this paper. Sec-
tion 5 includes the experimental setup, along with the results obtained, and an
analysis. Finally, some concluding remarks are made in Section 6.
2 Preliminaries
This section is divided in the following way: In Section 2.1 we introduce the
notation that has been used in this paper. Then we include a brief summary of
what has been done in feature extraction in Section 2.2, and a short review of
the different approaches we found in the specialized literature on the use of GP
for feature extraction in Section 2.3.
2.1 Notation
When describing the problem, datasets A, B and S correspond to:
A: The original dataset, provided by the first lab, that was used to build the
classifier.
B: The problem dataset, from the second lab. The classifier is not accurate
on this dataset, and that is what the proposed algorithm attempts to solve.
S: The solution dataset, the result of applying the evolved transformation to the
samples in dataset B. The goal is to have the classifier performance be as
high as possible on this dataset.
On the Homogenization of Data from Two Laboratories 187
The problem we are attempting to solve is the design of a method that can create
a transformation from a dataset (dataset B), where a classification model built
using the data from a different dataset (dataset A) is not accurate, into a new
dataset (dataset S), where the classifier is more accurate. Said classifier is kept
unchanged throughout the process.
We decided to use GP to solve the problem for a number of reasons:
1. It is well suited to evolve arbitrary expressions because its chromosomes are
trees. This is useful in our case because we want to have the maximum possi-
ble flexibility in terms of the functional expressions of these transformations.
2. GP provides highly interpretable solutions. This is an advantage because our
goal is not only to have a new dataset where the classifier works, but also to
analyze what the problem was in the first dataset.
Once GP was chosen, we needed to decide what terminals and operators to use,
how to calculate the fitness of an individual, and which evolutionary parameters
(population size, number of generations, selection and mutation rates, etc.) are
appropriate for the problem at hand.
The fitness evaluation procedure is probably the most treated aspect of design
in the literature when dealing with GP-based feature extraction. As has been
stated before, the idea is to have the provided classifier's performance drive
the evolution. To achieve that, our method calculates fitness as the classifier's
accuracy over the dataset obtained by applying the transformations encoded in
the individual (training-set accuracy).
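A sketch of this fitness computation, assuming the multi-tree encoding described later (one evolved expression per feature, represented here as plain callables); the classifier, trees, and instances below are invented for illustration.

```python
def fitness(individual, classifier, dataset_b):
    """Fitness of a GP individual: the fixed classifier's accuracy on
    dataset B after applying the individual's transformations.
    An individual holds one expression tree (here, a callable) per feature."""
    correct = 0
    for x, y in dataset_b:
        transformed = [tree(x) for tree in individual]  # new feature values
        if classifier(transformed) == y:
            correct += 1
    return correct / len(dataset_b)

# Hypothetical classifier trained on lab A, whose decision boundary is
# x0 + x1 = 1.0; lab B's features are assumed inflated by +0.2 each.
classify = lambda z: 1 if z[0] + z[1] > 1.0 else 0
shrink = [lambda x: x[0] - 0.2, lambda x: x[1] - 0.2]  # candidate individual
identity = [lambda x: x[0], lambda x: x[1]]

dataset_b = [([0.9, 0.9], 1), ([1.0, 0.6], 1), ([0.8, 0.4], 0), ([0.7, 0.5], 0)]
print(fitness(identity, classify, dataset_b), fitness(shrink, classify, dataset_b))
```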
This section details the choices made for selection, crossover and mutation op-
erators. Since the objective of this work is not to squeeze the maximum possible
performance from GP, but rather to show that it is an appropriate technique for
the problem and that it can indeed solve it, we did not pay special attention to
these choices, and picked the most common ones in the specialized literature.
3.5 Parameters
Table 1 summarizes the parameters used for the experiments.
Parameter Value
Number of trees nv
Population size 400 × nv
Duration of the run 100 generations
Selection operator Tournament without replacement
Tournament size log2(nv) + 1
Crossover operator One-point crossover
Crossover probability 0.9
Mutation operators Replacement & swap mutations
Replacement mutation probability 0.001
Swap mutation probability 0.01
Maximum depth of the swapped-in subtree 5
Function set {+, −, ×, ÷, cos, exp}
Terminal set {x0, x1, ..., x_{nv-1}, e}
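The selection scheme in the table can be sketched as follows; the integer truncation of log2(nv) + 1 and all names are our assumptions, not the paper's implementation.

```python
import math
import random

def tournament_select(population, fitnesses, nv, rng=random):
    """Tournament selection without replacement: sample the contenders
    without repeats and return the fittest one. The tournament size follows
    the parameter table, log2(nv) + 1 (truncated to an integer here, which
    is an assumption about how a fractional size would be handled)."""
    size = int(math.log2(nv)) + 1
    contenders = rng.sample(range(len(population)), size)  # no replacement
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

# With nv = 8 features the tournament size is 4; on a population of four
# individuals the tournament covers everyone, so the global best must win.
population = ["ind_a", "ind_b", "ind_c", "ind_d"]
fitnesses = [0.10, 0.90, 0.40, 0.20]
winner = tournament_select(population, fitnesses, nv=8, rng=random.Random(0))
print(winner)
```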
data. However, the classifier built from the data obtained from one laboratory
proved remarkably inaccurate when applied to classify data from a different
hospital. Since the whole experimental procedure was identical, using the same
machine, measuring and post-processing, and the exact same lab protocols,
both for tissue extraction and staining, there was no factor that could explain
this discrepancy.
What we attempt to do with this work is develop an algorithm that can
evolve a transformation over the data from the second laboratory, creating a new
dataset where the classifier built from the first lab is as accurate as possible.
5 Experimental Study
This section is organized in the following way: To begin with, a general de-
scription of the experimental procedure, along with the parameters used for
the experiment, is presented in Section 5.1. The results obtained are presented
in Section 5.2, a statistical analysis is shown in Section 5.3, and lastly some
sample transformations are shown in Section 5.4.
2. From dataset A, build a classifier. We chose C4.5 [26], but any other classifier
would work exactly the same, since the proposed method uses the learned
classifier as a black box.
3. Apply our method to dataset B in order to evolve a transformation that will
create a solution dataset S. Use 5-fold cross-validation over dataset S, so
that training and test set accuracy results can be obtained.
4. Check the performance of the step 2 classifier on dataset S. Ideally, it should
be close to the one on dataset A, meaning the proposed method has success-
fully discovered the hidden transformation and inverted it.
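A runnable sketch of steps 2-4, with deliberate stand-ins: a one-feature threshold classifier instead of C4.5, and a random search over a single shift parameter instead of GP. The method only requires the classifier to be a black box, which is the property this sketch exercises; all data and names are invented.

```python
import random

def train_threshold_classifier(dataset_a):
    """Stand-in for C4.5: pick the threshold on feature 0 that best
    separates dataset A. Any trained classifier would do, since the
    method only queries it as a black box."""
    best_t, best_acc = None, -1.0
    for x, _ in dataset_a:
        t = x[0]
        acc = sum((1 if xi[0] > t else 0) == yi
                  for xi, yi in dataset_a) / len(dataset_a)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: 1 if x[0] > best_t else 0

def search_transformation(f, dataset_b, rng):
    """Stand-in for the GP step: random search over a shift that maximizes
    the fixed classifier's accuracy on the transformed dataset B."""
    def acc(shift):
        return sum(f([x[0] + shift]) == y for x, y in dataset_b) / len(dataset_b)
    best = max((rng.uniform(-1, 1) for _ in range(200)), key=acc)
    return lambda x: [x[0] + best]

# Dataset A (first lab) and dataset B (second lab, features shifted by +0.4)
dataset_a = [([0.2], 0), ([0.3], 0), ([0.7], 1), ([0.9], 1)]
dataset_b = [([0.6], 0), ([0.7], 0), ([1.1], 1), ([1.3], 1)]

f = train_threshold_classifier(dataset_a)                 # step 2
T = search_transformation(f, dataset_b, random.Random(0)) # step 3
dataset_s = [(T(x), y) for x, y in dataset_b]             # solution dataset S
accuracy_s = sum(f(x) == y for x, y in dataset_s) / len(dataset_s)  # step 4
print(accuracy_s)
```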
This section presents the results for the Prostate Cancer problem, in terms of
classifier accuracy. The results obtained can be seen in Table 2.
The performance results are promising. First and foremost, the proposed
method was able to find a transformation over the data from the second labora-
tory that made the classifier work just as well as it did on the data from the first
lab, effectively finding the fracture in the data (that is, the difference in data
distribution between the datasets provided by the two labs) that prevented the
classifier from working accurately.
6 Concluding Remarks
Acknowledgments
José García Moreno-Torres was supported by a scholarship from Obra Social
"la Caixa" and is currently supported by an FPU grant from the Ministerio de
Educación y Ciencia of the Spanish Government and the KEEL project. Rohit
Bhargava would like to acknowledge collaborators over the years, especially Dr.
Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for
numerous useful discussions and guidance. Funding for this work was provided in
part by University of Illinois Research Board and by the Department of Defense
Prostate Cancer Research Program. This work was also funded in part by the
National Center for Supercomputing Applications and the University of Illinois,
under the auspices of the NCSA/UIUC faculty fellows program.
References
1. Wyse, N., Dubes, R., Jain, A.: A critical evaluation of intrinsic dimensionality algorithms. In: Gelsema, E.S., Kanal, L.N. (eds.) Pattern Recognition in Practice, Amsterdam, pp. 415–425. Morgan Kaufmann Publishers, Inc., San Francisco (1980)
2. Kim, K.A., Oh, S.Y., Choi, H.C.: Facial feature extraction using PCA and wavelet multi-resolution images. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, p. 439. IEEE Computer Society, Los Alamitos (2004)
3. Podolak, I.T.: Facial component extraction and face recognition with support vector machines. In: FGR 2002: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, p. 83. IEEE Computer Society, Los Alamitos (2002)
4. Pei, M., Goodman, E.D., Punch, W.F.: Pattern discovery from data using genetic algorithms. In: Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery & Data Mining, PAKDD 1997 (1997)
5. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective. SECS, vol. 453. Kluwer Academic, Boston (1998)
6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
7. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.): Feature Extraction, Foundations and Applications. Springer, Heidelberg (2006)
8. Tackett, W.A.: Genetic programming for feature discovery and image discrimination. In: Proceedings of the 5th International Conference on Genetic Algorithms, pp. 303–311. Morgan Kaufmann Publishers Inc., San Francisco (1993)
9. Sherrah, J.R., Bogner, R.E., Bouzerdoum, A.: The evolutionary pre-processor: Automatic feature extraction for supervised classification using genetic programming. In: Proc. 2nd International Conference on Genetic Programming (GP 1997), pp. 304–312. Morgan Kaufmann, San Francisco (1997)
10. Kotani, M., Ozawa, S., Nakai, M., Akazawa, K.: Emergence of feature extraction function using genetic programming. In: KES, pp. 149–152 (1999)
11. Bot, M.C.J.: Feature extraction for the k-nearest neighbour classifier with genetic programming. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tetamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 256–267. Springer, Heidelberg (2001)
12. Zhang, Y., Rockett, P.I.: A generic optimal feature extraction method using multiobjective genetic programming. Technical Report VIE 2006/001, Department of Electronic and Electrical Engineering, University of Sheffield, UK (2006)
13. Guo, H., Nandi, A.K.: Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 39(5), 980–987 (2006)
14. Zhang, Y., Rockett, P.I.: A generic multi-dimensional feature extraction method using multiobjective genetic programming. Evolutionary Computation 17(1), 89–115 (2009)
15. Harris, C.: An Investigation into the Application of Genetic Programming Techniques to Signal Analysis and Feature Detection. University College London (September 26, 1997)
16. Smith, M.G., Bull, L.: Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines 6(3), 265–281 (2005)
17. Wang, K., Zhou, S., Fu, C.A., Yu, J.X., Jeffrey, F., Yu, X.: Mining changes of classification by correspondence tracing. In: Proceedings of the 2003 SIAM International Conference on Data Mining, SDM 2003 (2003)
18. Yang, Y., Wu, X., Zhu, X.: Conceptual equivalence for contrast mining in classification learning. Data & Knowledge Engineering 67(3), 413–429 (2008)
19. Cieslak, D.A., Chawla, N.V.: A framework for monitoring classifiers' performance: when and why failure occurs? Knowledge and Information Systems 18(1), 83–108 (2009)
20. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
21. American Cancer Society: How many men get prostate cancer? http://www.cancer.org/docroot/CRI/content/CRI_2_2_1X_How_many_men_get_prostate_cancer_36.asp
22. Fernandez, D.C., Bhargava, R., Hewitt, S.M., Levin, I.W.: Infrared spectroscopic imaging for histopathologic recognition. Nature Biotechnology 23(4), 469–474 (2005)
23. Levin, I.W., Bhargava, R.: Fourier transform infrared vibrational spectroscopic imaging: integrating microscopy and molecular recognition. Annual Review of Physical Chemistry 56, 429–474 (2005)
24. Llorà, X., Reddy, R., Matesic, B., Bhargava, R.: Towards better than human capability in diagnosing prostate cancer using infrared spectroscopic imaging. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO 2007, pp. 2098–2105. ACM, New York (2007)
25. Llorà, X., Priya, A., Bhargava, R.: Observer-invariant histopathology using genetics-based machine learning. Natural Computing: An International Journal 8(1), 101–120 (2009)
26. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
27. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
28. García, S., Herrera, F.: An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
29. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability. Soft Computing 13(10), 959–977 (2009)
30. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10), 2044–2064 (2010)
31. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945)
32. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman & Hall/CRC (2007)
Author Index
Énée, Gilles 107
Orriols-Puig, Albert 21